Combine 65,000 small XML files in Blob Storage and copy them into Data Lake as big CSV files


I have around 65,000 small XML files (around 1 to 3 KB each) arriving in Azure Blob Storage every hour. It's telemetry data. I want to combine them into big XML files (one file per hour or so), convert that into a big CSV file, and then copy it into Azure Data Lake Store so that I can do further analysis on it using U-SQL scripts. Please tell me the correct way to do this.

Right now we are using the Azure Batch service, which does this by keeping all the file names in Azure Table Storage; each task in the Batch job reads its file names from Table Storage and updates the status once it completes.
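For reference, the tracking pattern described above might look roughly like this with the azure-data-tables SDK; the table name, partition scheme, and entity fields are illustrative assumptions, not the poster's actual schema:

```python
# Sketch of the current tracking pattern: each Batch task reads its assigned
# file names from Azure Table Storage and marks them complete afterwards.
from azure.data.tables import TableClient

conn_str = "<storage-connection-string>"  # assumption: supplied via config
table = TableClient.from_connection_string(conn_str, table_name="FileTracking")

def mark_complete(hour_partition: str, file_name: str) -> None:
    """Upsert the tracking entity for one processed XML file."""
    table.upsert_entity({
        "PartitionKey": hour_partition,  # e.g. "2024010112" for one hourly batch
        "RowKey": file_name,
        "Status": "Complete",
    })

def pending_files(hour_partition: str):
    """Yield file names in this hour's partition that are not yet complete."""
    query = "PartitionKey eq @pk and Status ne 'Complete'"
    for entity in table.query_entities(query, parameters={"pk": hour_partition}):
        yield entity["RowKey"]
```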

I think we can do better than this using the WebJobs SDK and Azure Service Bus topics and subscriptions. For each hour we could have one topic and one subscriber.
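A minimal sketch of that idea with the azure-servicebus SDK, assuming one pre-created topic per hour and a subscription named "combiner" (both names are hypothetical):

```python
# Sketch of the proposed flow: publish the name of each arriving XML blob to
# this hour's topic, and have one consumer drain the subscription.
from datetime import datetime, timezone
from azure.servicebus import ServiceBusClient, ServiceBusMessage

conn_str = "<service-bus-connection-string>"      # assumption
hour = datetime.now(timezone.utc).strftime("%Y%m%d%H")
topic_name = f"telemetry-{hour}"                  # assumption: one topic per hour

def publish_file_name(blob_name: str) -> None:
    """Send the name of a newly arrived XML blob to this hour's topic."""
    with ServiceBusClient.from_connection_string(conn_str) as client:
        with client.get_topic_sender(topic_name) as sender:
            sender.send_messages(ServiceBusMessage(blob_name))

def consume_file_names(subscription_name: str = "combiner"):
    """Receive blob names for this hour; the consumer combines them later."""
    with ServiceBusClient.from_connection_string(conn_str) as client:
        with client.get_subscription_receiver(topic_name, subscription_name) as receiver:
            for msg in receiver:                  # iterates until the receiver closes
                yield str(msg)
                receiver.complete_message(msg)
```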

Let me know whether my thinking is correct, or whether there is a better technique for this.


1 Answer

Answered by Brando Zhang:

In my opinion, you could directly combine the XML files and convert them to CSV in the WebJob (your data is not very large). After the operation completes, you could upload the converted file to the Data Lake Store.
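A minimal sketch of that combine-and-convert step in Python, assuming a container named "telemetry", hour-based blob prefixes, a flat XML schema with deviceId/timestamp/value elements, and a service principal for Data Lake Store (Gen1, the store U-SQL runs against); all of those names are assumptions:

```python
# Sketch: download one hour's XML blobs, flatten each into a CSV row, and
# upload the combined file to Azure Data Lake Store.
import csv
import xml.etree.ElementTree as ET
from azure.storage.blob import ContainerClient
from azure.datalake.store import core, lib, multithread

blob_conn_str = "<storage-connection-string>"     # assumption
container = ContainerClient.from_connection_string(blob_conn_str, "telemetry")

def build_hourly_csv(hour_prefix: str, out_path: str) -> None:
    """Merge all XML blobs under e.g. '2024/01/01/12/' into one CSV file."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["deviceId", "timestamp", "value"])   # assumed schema
        for blob in container.list_blobs(name_starts_with=hour_prefix):
            xml_bytes = container.download_blob(blob.name).readall()
            root = ET.fromstring(xml_bytes)
            writer.writerow([
                root.findtext("deviceId"),        # assumed element names
                root.findtext("timestamp"),
                root.findtext("value"),
            ])

def upload_to_adls(local_path: str, remote_path: str) -> None:
    """Push the combined CSV into Data Lake Store for U-SQL to pick up."""
    token = lib.auth(tenant_id="<tenant>", client_id="<app-id>",
                     client_secret="<secret>")    # service principal, assumed
    adls = core.AzureDLFileSystem(token, store_name="<adls-account>")
    multithread.ADLUploader(adls, rpath=remote_path, lpath=local_path)
```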

But if you use this approach, the data will not be uploaded to the Data Lake Store in real time.

You could use an Azure WebJobs timer trigger (running the conversion code each hour) to combine and convert the data.
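The WebJobs SDK's TimerTrigger is a .NET feature; as a hedged, language-neutral stand-in, a continuously running (always-on) WebJob can sleep until the top of each hour and then process the hour that just closed. This sketch reuses the hypothetical build_hourly_csv and upload_to_adls helpers from the earlier example:

```python
# Sketch of an hourly loop in a continuous WebJob: wake at each hour boundary
# and combine/convert/upload the previous hour's files.
import time
from datetime import datetime, timedelta, timezone

def next_hour(now: datetime) -> datetime:
    """Top of the next hour in UTC."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)

while True:
    now = datetime.now(timezone.utc)
    time.sleep((next_hour(now) - now).total_seconds())
    closed = datetime.now(timezone.utc) - timedelta(hours=1)
    prefix = closed.strftime("%Y/%m/%d/%H/")      # assumed blob path layout
    stamp = closed.strftime("%Y%m%d%H")
    # build_hourly_csv / upload_to_adls: hypothetical helpers, see earlier sketch
    build_hourly_csv(prefix, f"/tmp/telemetry-{stamp}.csv")
    upload_to_adls(f"/tmp/telemetry-{stamp}.csv", f"/telemetry/{stamp}.csv")
```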

If you need the data in near real time, I suggest you consider using a WebJob with a blob trigger to push the data to Service Bus first; you could then use Stream Analytics to convert the data to CSV in the Data Lake Store.
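A sketch of that blob-triggered step, written here as an Azure Functions Python blob trigger (a close analogue of a WebJobs blob trigger, since the WebJobs SDK itself is .NET-only); the container path, topic name, and connection settings are all assumptions:

```python
# Sketch: fire once per new XML blob and forward its name to Service Bus,
# so a downstream job (or Stream Analytics) can handle the CSV conversion.
import azure.functions as func
from azure.servicebus import ServiceBusClient, ServiceBusMessage

app = func.FunctionApp()
sb_conn_str = "<service-bus-connection-string>"   # assumption: from app settings

@app.blob_trigger(arg_name="blob", path="telemetry/{name}",
                  connection="AzureWebJobsStorage")
def on_new_xml(blob: func.InputStream) -> None:
    """Publish each newly arrived XML blob's name for near-real-time processing."""
    with ServiceBusClient.from_connection_string(sb_conn_str) as client:
        with client.get_topic_sender("telemetry") as sender:
            sender.send_messages(ServiceBusMessage(blob.name))
```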