Extract only required files in U-SQL

1.2k views Asked by At

Is it possible to extract files only for 3 days, without extracting all the files.

 DROP VIEW IF EXISTS dbo.Read;  
 CREATE VIEW IF NOT EXISTS dbo.Read AS


EXTRACT     
    Statements
FROM  
"adl://Test/{date:yyyy}/{date:M}/{date:d}/Testfile.csv"
USING Extractors.Csv(silent:true,quoting : true, nullEscape : "/N");

@res =
SELECT * FROM dbo.Read
WHERE date BETWEEN DateTime.Parse("2015/07/01") AND DateTime.Parse("2015/07/03");

OUTPUT @res
TO "adl://test/Testing/loop.csv"
USING Outputters.Csv();
2

There are 2 answers

2
Alexandre Gattiker On BEST ANSWER

Partition elimination already ensures for your query that only files matching predicates will actually be read (you can confirm that in the job graph).

See also my previous answer for How to implement Loops in U-SQL

If you have remaining concerns about performance, the job graph can also help you nail down where they originate.

1
OmidA On

You can use the pattern identifiers in the fileset specification in parts of the path or even parts of the name (see https://msdn.microsoft.com/en-us/library/azure/mt771650.aspx). You can do lists of files, so if you only have one file in each directory you can do;

EXTRACT ...
FROM "adl://Test/2015/07/1/Testfile.csv"
     , "adl://Test/2015/07/2/Testfile.csv"
USING ...;

If there is more than one file in each directory you can do individual extracts for each day and then union the result. Something like:

@a = EXTRACT .... 
     FROM "adl://Test/2015/07/1/{*}.csv"
     USING ...;
@b = EXTRACT ....          
     FROM "adl://Test/2015/07/2/{*}.csv"
     USING ...;
@fullset = SELECT * FROM @a UNION SELECT * FROM @b;

Unfortunately I believe there is no list of filesets at the moment allowing you to do above case in one EXTRACT statement.