I have some large (20GB+) CSV files, text-qualified with double quotes ("), that I need to sort and output to a new file.
Some files are just sorted on one column numerically, whilst others are on two columns, the first numerically and the second by string.
So far I have tried Python's csv sort, which eventually ran out of memory, and CoreUtils for Windows, whose sort doesn't seem to handle the text qualifier and gives incorrect results.
Are there any recommended/existing solutions which will handle this kind of sort? Platform is Windows Server 2008 R2.
You need an external sort here. The idea is to create smaller sorted files, which are then merged one by one into a new file. Here's a quick summary.
RESULT slowly grows as you iterate over the chunks, and it stays sorted the whole time. Once the iterations are over, this file is the final sorted CSV.
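A minimal Python sketch of this chunk-by-chunk approach might look like the following. It is an illustration under assumptions, not a tuned implementation: the function names, the `chunk_rows` parameter and the `.tmp` suffix are made up for the example, the input is assumed to have no header row, and `sort_key` assumes the two-column case from the question (column 0 numeric, column 1 string). The `csv` module takes care of the double-quote text qualifier.

```python
import csv
import itertools
import os


def sort_key(row):
    # Assumption: column 0 holds an integer, column 1 is a string tie-breaker.
    return (int(row[0]), row[1])


def merge_chunk_into_result(sorted_chunk, result_path):
    """Merge one sorted in-memory chunk with the sorted RESULT file on disk."""
    tmp_path = result_path + ".tmp"  # hypothetical scratch file name
    chunk_iter = iter(sorted_chunk)
    pending = next(chunk_iter, None)
    with open(tmp_path, "w", newline="") as out:
        writer = csv.writer(out, quoting=csv.QUOTE_ALL)
        with open(result_path, newline="") as old:
            for row in csv.reader(old):
                # Emit every chunk row that sorts strictly before this RESULT row.
                while pending is not None and sort_key(pending) < sort_key(row):
                    writer.writerow(pending)
                    pending = next(chunk_iter, None)
                writer.writerow(row)
        # Whatever is left in the chunk sorts after everything in RESULT.
        while pending is not None:
            writer.writerow(pending)
            pending = next(chunk_iter, None)
    os.replace(tmp_path, result_path)


def external_sort(input_path, result_path, chunk_rows=100_000):
    """Sort a CSV too large for memory by merging sorted chunks into RESULT."""
    open(result_path, "w").close()  # start with an empty RESULT file
    with open(input_path, newline="") as src:
        reader = csv.reader(src)
        while True:
            # Read at most chunk_rows rows so only one chunk is in memory.
            chunk = list(itertools.islice(reader, chunk_rows))
            if not chunk:
                break
            chunk.sort(key=sort_key)
            merge_chunk_into_result(chunk, result_path)
```

Something like `external_sort("big.csv", "sorted.csv")` would then produce the sorted file, with `chunk_rows` chosen to fit the available memory. Note that RESULT is rewritten once per chunk, so larger chunks mean fewer passes over the disk.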
You can try several variations of the algorithm to suit your need. Check https://en.wikipedia.org/wiki/External_sorting for more details.
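One common variation, sketched here as an illustration rather than as the method used above, is to write every sorted chunk to its own temporary file and then merge all of them in a single pass with `heapq.merge`. The function names and `chunk_rows` are hypothetical, the input is again assumed to have no header row, and `sort_key` assumes a numeric first column and a string second column.

```python
import csv
import heapq
import itertools
import os
import tempfile


def sort_key(row):
    # Assumption: column 0 holds an integer, column 1 is a string tie-breaker.
    return (int(row[0]), row[1])


def write_sorted_runs(input_path, chunk_rows, tmp_dir):
    """Split the input into sorted temporary files and return their paths."""
    run_paths = []
    with open(input_path, newline="") as src:
        reader = csv.reader(src)
        while True:
            chunk = list(itertools.islice(reader, chunk_rows))
            if not chunk:
                break
            chunk.sort(key=sort_key)
            fd, path = tempfile.mkstemp(suffix=".csv", dir=tmp_dir)
            with os.fdopen(fd, "w", newline="") as run:
                csv.writer(run, quoting=csv.QUOTE_ALL).writerows(chunk)
            run_paths.append(path)
    return run_paths


def merge_runs(run_paths, output_path):
    """Merge all sorted runs into the output file in one pass."""
    files = [open(p, newline="") for p in run_paths]
    try:
        readers = [csv.reader(f) for f in files]
        with open(output_path, "w", newline="") as out:
            writer = csv.writer(out, quoting=csv.QUOTE_ALL)
            # heapq.merge reads lazily, holding only one row per run in memory.
            writer.writerows(heapq.merge(*readers, key=sort_key))
    finally:
        for f in files:
            f.close()


def external_sort_kway(input_path, output_path, chunk_rows=100_000):
    with tempfile.TemporaryDirectory() as tmp_dir:
        runs = write_sorted_runs(input_path, chunk_rows, tmp_dir)
        merge_runs(runs, output_path)
```

This reads and writes each row only twice, but it keeps every run file open during the merge, so the chunk size has to be large enough to stay under the limit on simultaneously open files.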
This way I was able to sort a 40GB file in 2-3 hours on an 8GB machine that also had several other processes running.