How do I partition a large file into files/directories using only U-SQL and certain fields in the file?

1.9k views Asked by At

I have an extremely large CSV, where each row contains customer and store ids, along with transaction information. The current test file is around 40 GB (about 2 days worth), so partitioning is an absolute must for any reasonable return time on select queries.

My question is this: When we receive a file, it contains multiple store's data. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".

I haven't yet gotten it to work with the OUTPUT statement. The statement use was thus:

// Output to file
OUTPUT @dt
TO @"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();

It gives the following error:

Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d

Has anyone attempted the same kind of thing? I tried to concatenate the outputpath from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two ID's and filters the whole dataset, but that seems terribly inefficient.

Thanks in advance for reading/responding!

1

There are 1 answers

2
saveenr On BEST ANSWER

Currently U-SQL requires that all the file outputs of a script must be understood at compile time. In other words, the output files cannot be created based on the input data.

Dynamic outputs based on data are something we are actively working for release sometime later in 2017.

In the meanwhile until the dynamic output feature is available, the pattern to accomplish what you want requires using two scripts

The first script will use GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and write that to a file.

Then through the use of scripting or a tool written using our SDKs, download the previous output file and then programmatically create a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber