How to sort text qualified CSV file that doesn't fit in memory?

757 views Asked by At

I have some large (20GB+) CSV files which are double quote " text qualified that I need to sort and output to a new file.

Some files are just sorted on one column numerically, whilst others are on two columns, the first numerically and the second by string.

So far I have tried Pythons csv sort, which failed as it eventually ran out of memory. And also CoreUtils for Windows, though the sort doesn't seem to handle the text qualifier and gives incorrect results.

Are there any recommended/existing solutions which will handle this kind of sort? Platform is Windows Server 2008 R2.

1

There are 1 answers

0
Aditya On BEST ANSWER

Here you need some external sorting tricks. The idea is to create smaller sorted files which are then sorted one by one and saved in a new file. Here's a quick summary.

  1. Split 20GB file into 100-1000 files (depending on your system's performance)
  2. Sort each of the 100 files (chunks) individually using your conventional CSV sorting methods
  3. Create another file which would be finally the sorted output. Let's call it RESULT.
  4. Read first file and RESULT line by line. Iteratively add the lines to another file (say, TEMP). RESULT will not be blank after first iteration. Let's say current chunk file contains [1,3,5] and RESULT contains [2,4,6], compare their lowest unread elements. Here, (1,2). Write '1' to the TEMP. In next iteration, you'll have to compare (3,2) and write '2' to TEMP. Go on. Basic idea is core of the regular merge sort. Rename TEMP as result and clear TEMP.
  5. Repeat this for each chunk and RESULT.

enter image description here

So slowly RESULT grows as you keep iterating over chunks and is kept sorted all the time. This file is the final sorted CSV after iterations are over.

You can try several variations of the algorithm to suit your need. Check https://en.wikipedia.org/wiki/External_sorting for more details.

I was able to thus sort a 40GB file in 2-3 hours on a 8GB machine that also had several other processes running.