I have a handful of ASCII files containing around 17 million lines in total, and within each (well, most) of the lines is a fixed 36-byte identifier. So my data is rectangular: I have a lot of rows of fixed width. Using Haskell, I want to read all the lines in, use a regex to extract the identifier (I'm fine up to there), then sort them and count the number of unique identifiers (very close to `grep | sort | uniq`). (I'm already parallelising by reading from each file in parallel.) Sounds like a simple problem, but...
I'm finding it hard to get decent performance out of this, even before the sorting stage. Here's as far as I've got. `String` is overkill for 36 bytes of ASCII, so I figured on using `ByteString`. But a (linked) list of 17 million elements seems like a bad idea, so I tried `IOVector ByteString`. But this seems to be quite slow. I believe the garbage collector is suffering because I retain millions of small `ByteString`s in the vector: GC is taking at least 3 times as long as the actual work (according to `+RTS -s`), and I think it only gets worse as the program keeps running.
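Here's roughly what I'm doing now (simplified; the real code pulls the identifier out with a regex rather than a fixed slice, and the names here are illustrative):

```haskell
import Control.Monad (forM_)
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Mutable as VM

-- Illustrative only: read the lines, take the 36-byte identifier from each
-- (the real code uses a regex), and store every identifier as its own
-- ByteString in an IOVector. n is the expected number of lines. Millions of
-- small heap objects like this are what the GC ends up chasing.
loadIds :: FilePath -> Int -> IO (VM.IOVector B.ByteString)
loadIds path n = do
    vec <- VM.new n
    ls  <- B.lines <$> B.readFile path
    forM_ (zip [0 .. n - 1] ls) $ \(i, l) ->
        VM.write vec i (B.copy (B.take 36 l))
    return vec
```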
I was thinking that I should maybe use Repa or some sort of single giant `ByteString`/`IOVector Char8`/whatever (since I know the exact width of each row is 36) to store the data in one massive rectangular array for each thread, which should alleviate the GC problem. However, I do still need to sort the rows afterwards, and Repa seems to have no support for sorting, and I don't want to be writing sort algorithms myself. So I don't know how to have a giant rectangular array and yet still sort it.
Suggestions for libraries to use, GC flags to set, or anything else? The machine has 24 cores and 24GB of RAM, so I'm not constrained on hardware. I want to remain in Haskell because I have lots of related code (that is also parsing the same files and producing summary statistics) that is working fine, and I don't want to rewrite it.
So, how fast can we be? Let's do some tests with a file generated by @ja.'s code:
The same in Haskell?
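A minimal sketch (assuming the input sits in a single file; `data` and `out` are placeholder names, and the real listing did the identifier extraction as well):

```haskell
import qualified Data.ByteString.Char8 as B

-- Read the whole file as one strict ByteString and write it straight back
-- out.  "data" and "out" are placeholder file names.
main :: IO ()
main = B.readFile "data" >>= B.writeFile "out"
```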
Timing?
Takes twice as long, but I suppose it's not too bad. Using a strict `ByteString` because we want to chunk it up in a second.
Next, can we use `Vector`, or is it too slow? Let's build a `Vector` of chunks and put them back together again. I use the `blaze-builder` package for optimized output. How long does it take?
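The chunking step, roughly (a sketch rather than the exact listing; the file names and the 37-byte row width, i.e. 36 bytes plus a newline, are assumptions):

```haskell
import Blaze.ByteString.Builder (fromByteString, toByteString)
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector as V

-- Assumed row width: the 36-byte identifier plus a trailing newline.
chunkSize :: Int
chunkSize = 37

-- Slice the input into fixed-width rows and collect them in a boxed Vector.
chunks :: B.ByteString -> V.Vector B.ByteString
chunks = V.unfoldr step
  where
    step s
      | B.null s  = Nothing
      | otherwise = Just (B.splitAt chunkSize s)

-- Rebuild the file from the Vector via a Builder and write it back out.
main :: IO ()
main = do
    raw <- B.readFile "data"          -- placeholder input file
    let v = chunks raw
    B.writeFile "out" . toByteString . mconcat . map fromByteString $ V.toList v
```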
Not bad at all! Even though you are using a regex to read the lines instead of my fixed chunk positions, we can still conclude that you can put your chunks in a `Vector` with no serious performance hit.

What's left? Sorting. Theoretically, a bucket sort should be an ideal algorithm for this kind of problem. I tried implementing one myself:
Testing it gave a time of about 1:50 minutes, which is probably acceptable. We are talking about an O(c*n) algorithm for n in the range of some millions and a constant c of 36*something. But I'm sure you can optimize it further.
Or you can just use the `vector-algorithms` package. Testing with a heap sort:
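A sketch of the driver (the file name and line-based splitting are assumptions; the sort call itself is the part that matters):

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector as V
import qualified Data.Vector.Algorithms.Heap as Heap

-- Load one ByteString per row into a boxed Vector and sort it in place with
-- vector-algorithms' heap sort.  V.modify copies the vector, runs the
-- in-place sort on the mutable copy, and freezes the result.
main :: IO ()
main = do
    raw <- B.readFile "data"                 -- placeholder input file
    let rows   = V.fromList (B.lines raw)
        sorted = V.modify Heap.sort rows
    mapM_ B.putStrLn (V.toList sorted)
```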
This does the job in about 1:20 minutes on my machine, so right now it's faster than my bucket sort. Both of the final solutions consume something in the range of 1-1.2 GB of RAM.
Good enough?