U-SQL adds support for computed file property columns on EXTRACT
statement. Sometimes customers would like to get information about the files that they process, such as the full URI path or information about size, creation or modification dates. Likewise customers would like to use that information to filter the set of files that being processed.
U-SQL adds support for computed file property columns on EXTRACT
expressions. It provides the following capabilities:
- Provide built-in file property functions that expose specific file properties.
- Provide a way to assign the properties to “virtual” columns in the
EXTRACT
‘s schema. This is done using a “calculated column” syntax. - Allow constant-foldable comparisons on these “virtual” columns in subsequent query predicates to limit the files being processed by the
EXTRACT
expression.
U-SQL provides the following file property functions in the EXTRACT
expression’s computed column definition:
File Property FunctionReturn typeDescriptionFILE.URI()string
Returns the file’s absolute URI or the full local path if used in a local run.FILE.MODIFIED()DateTime?
Returns the file’s last modification time stamp in UTC-0 timezone.FILE.CREATED()DateTime?
Returns the file’s creation time stamp in UTC-0 timezone.FILE.LENGTH()long?
Returns the file’s size in bytes.Table. File properties function
N.B: The EXTRACT
expression’s schema definition needs to contain at least one non-virtual, non-computed column.
The following U-SQL script extracts data, and file properties, as well as virtual column filename.
@SampleData =
EXTRACT
EID int
, FirstName string
, LastName string
, DoB Date
, JoiningDate DateTime
, filename string // virtual column
, uri = FILE.URI()
, modified_date = FILE.MODIFIED()
, created_date = FILE.CREATED()
, file_sz = FILE.LENGTH()
FROM "/Data/{filename}.csv"
USING Extractors.Csv(); @RefinedData =
SELECT *
FROM @sampleData
WHERE modified_date > "2020-09-18"; OUTPUT @RefinedData
TO "/output/SampleData_" + DateTime.Now.ToString("yyyyMMdd") + ".csv"
USING Outputters.Csv(outputHeader : true);
By adding file properties and other metadata within the data results, we can easily troubleshoot issues (handling duplicate rows , or files that’re not processed — specially invalid JSON format files).
Originally published at https://techfura.com on Sep 18, 2020.