Including File Properties and Metadata in a U-SQL Script

Kenn A
2 min read · Dec 30, 2020


Customers sometimes want information about the files they process, such as the full URI path, the size, or the creation and modification dates. They also want to use that information to filter the set of files being processed.

U-SQL adds support for computed file property columns in EXTRACT expressions. It provides the following capabilities:

  1. Provide built-in file property functions that expose specific file properties.
  2. Provide a way to assign the properties to “virtual” columns in the EXTRACT’s schema. This is done using a “calculated column” syntax.
  3. Allow constant-foldable comparisons on these “virtual” columns in subsequent query predicates to limit the files processed by the EXTRACT expression.

U-SQL provides the following file property functions in the EXTRACT expression’s computed column definition:

| File Property Function | Return type | Description |
| --- | --- | --- |
| FILE.URI() | string | Returns the file’s absolute URI, or the full local path if used in a local run. |
| FILE.MODIFIED() | DateTime? | Returns the file’s last modification time stamp in UTC. |
| FILE.CREATED() | DateTime? | Returns the file’s creation time stamp in UTC. |
| FILE.LENGTH() | long? | Returns the file’s size in bytes. |

Table. File property functions

N.B: The EXTRACT expression’s schema definition needs to contain at least one non-virtual, non-computed column.

The following U-SQL script extracts the data together with the file properties and a virtual column, filename, which is bound to the {filename} part of the input path.

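// Extract the data columns plus the virtual column and the file properties.
@SampleData =
    EXTRACT
        EID int
      , FirstName string
      , LastName string
      , DoB DateTime                 // DateTime is U-SQL's built-in date/time type
      , JoiningDate DateTime
      , filename string              // virtual column, bound to {filename} in the path
      , uri = FILE.URI()
      , modified_date = FILE.MODIFIED()
      , created_date = FILE.CREATED()
      , file_sz = FILE.LENGTH()
    FROM "/Data/{filename}.csv"
    USING Extractors.Csv();

// Keep only rows from files modified after 2020-09-18.
// modified_date is DateTime?, so it is compared against a DateTime, not a string.
@RefinedData =
    SELECT *
    FROM @SampleData
    WHERE modified_date > DateTime.Parse("2020-09-18");

OUTPUT @RefinedData
TO "/output/SampleData_" + DateTime.Now.ToString("yyyyMMdd") + ".csv"
USING Outputters.Csv(outputHeader : true);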

Including file properties and other metadata in the result set makes troubleshooting easier, for example when tracking down duplicate rows or identifying files that were not processed, especially files containing invalid JSON.
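
As a rough illustration of that kind of troubleshooting, the sketch below counts the rows each file contributed and lists the EID values that appear more than once. It reuses the input path and column names from the sample script; the rowset names @RowsPerFile and @DuplicateIds and both output paths are hypothetical, and treating EID as the key that identifies a duplicate is an assumption.

```usql
// Sketch only: input path and columns mirror the sample script;
// rowset names and output paths are made up for illustration.
@SampleData =
    EXTRACT
        EID int
      , FirstName string
      , LastName string
      , DoB DateTime
      , JoiningDate DateTime
      , filename string              // virtual column
      , uri = FILE.URI()
      , file_sz = FILE.LENGTH()
    FROM "/Data/{filename}.csv"
    USING Extractors.Csv();

// Number of rows contributed by each input file.
@RowsPerFile =
    SELECT uri
         , file_sz
         , COUNT(*) AS row_cnt
    FROM @SampleData
    GROUP BY uri, file_sz;

// EIDs that occur more than once (assumed duplicate key),
// with the number of distinct files they appear in.
@DuplicateIds =
    SELECT EID
         , COUNT(*) AS row_cnt
         , COUNT(DISTINCT uri) AS file_cnt
    FROM @SampleData
    GROUP BY EID
    HAVING COUNT(*) > 1;

OUTPUT @RowsPerFile
TO "/output/RowsPerFile.csv"
ORDER BY row_cnt DESC
USING Outputters.Csv(outputHeader : true);

OUTPUT @DuplicateIds
TO "/output/DuplicateIds.csv"
USING Outputters.Csv(outputHeader : true);
```

A file that contributed no rows will not appear in the first output, so comparing that list against the expected set of input files is one way to spot files that were skipped. A file_cnt greater than 1 in the second output suggests the same record arrived in more than one file.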

Originally published at https://techfura.com on Sep 18, 2020.
