U-SQL Extract Statement - working with hundreds of columns

1.2k views Asked by At

Is there any way in the U-SQL extract statement to only specify the input columns that I care about? I'm working with a legacy database that exports several tables to csv that has about 200 columns. I only care about 10 of those fields. I was hoping to do something like:

EXTRACT CustomerID:0   string,
    StoreNumber:5    double,
    ReceiptNumber:20   double,
    Timestamp:125   int
FROM "somefile.csv"
USING Extractors.Csv();

So the idea would be to specify an index that corresponds to which column the data comes from.

Is there anything like that in the works? Or do I always need to define all 200 columns? I'm open to other solutions as well. Thanks!

2

There are 2 answers

2
Alexandre Gattiker On BEST ANSWER

You can write your custom extractor code. Use the sample extractors at https://github.com/Azure/usql/tree/master/Examples/DataFormats/Microsoft.Analytics.Samples.Formats as guidance.

0
wBob On

There isn't an easy way to do this as you describe, but it would not take long to script. For example I often use Excel to help with scripting, eg paste the known metadata for the 200 column csv then map it to U-SQL data types, eg

Excel U-SQL helper

You could create an initial U-SQL view that had all the columns using the method above, then create another view based on the initial view for the columns you want:

CREATE VIEW IF NOT EXISTS dbo.view1
    AS
EXTRACT col1 int,
        col2 string,
        col3 string,
        col4 string,
        col5 string
FROM "/input/input44.txt"
USING Extractors.Csv();


CREATE VIEW IF NOT EXISTS dbo.view2
    AS
SELECT  col1,
        col2
FROM dbo.view1;


@input =
    SELECT *
    FROM dbo.view2;

I also sometimes use Visual Studio Find and Replace with RegEx to do similar things.