What is the best way to store and sort rectangular data in Haskell?

522 views Asked by At

I have a handful of ASCII files containing around 17 million lines in total, and within each/most lines is a fixed 36-byte identifier. So my data is rectangular: I have a lot of rows of fixed width. Using Haskell, I want to read all the lines in, use a regex to extract the identifier (I'm fine up to there), then sort them and count the number of unique identifiers (very close to grep | sort | uniq). (I'm already parallelising by reading from each file in parallel.) Sounds like a simple problem , but...

I'm finding it hard to get decent performance out of this problem, even before the sorting stage. Here's as far as I've got. String is overkill for 36-bytes of ASCII, so I figured on using ByteString. But a (linked) list of size 17 million seems like a bad idea, so I tried IOVector ByteString. But this seems to be quite slow. I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector: the GC is taking at least 3 times as long as the code (according to +RTS -s) and I think it only gets worse as the program keeps running.

I was thinking that I should maybe use Repa or some sort of single giant ByteString/IOVector Char8/whatever (since I know the exact width of each row is 36) to store the data in one massive rectangular array for each thread, which should alleviate the GC problem. However, I do still need to sort the rows afterwards, and Repa seems to have no support for sorting, and I don't want to be writing sort algorithms myself. So I don't know how to have a giant rectangular array and yet still sort it.

Suggestions for libraries to use, GC flags to set, or anything else? The machine has 24 cores and 24GB of RAM, so I'm not constrained on hardware. I want to remain in Haskell because I have lots of related code (that is also parsing the same files and producing summary statistics) that is working fine, and I don't want to rewrite it.

5

There are 5 answers

0
firefrorefiddle On

So, how fast can we be? Let's do some tests with a file generated by @ja.'s code:

cat data.txt > /dev/null
  --> 0.17 seconds

The same in Haskell?

import qualified Data.ByteString as B

f = id

main = B.readFile "data.txt" >>= return . f >>= B.putStr

Timing?

time ./Test > /dev/null
  --> 0.32 seconds

Takes twice as long but I suppose it's not too bad. Using a strict bytestring because we want to chunk it up in a second.

Next, can we use Vector or is it too slow? Let's build a Vector of chunks and put them back together again. I use the blaze-builder package for optimized output.

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid

recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length 

substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) recordLen

mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)

mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
     where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
           foldToBS = mappend . BB.fromWrite . BB.writeByteString

main = B.readFile "data.txt" >>= return . mkBS . mkVector >>= L.putStr

How long does it take?

time ./Test2 > /dev/null
  --> 1.06 seconds

Not bad at all! Even though you are using a regex to read the lines instead of my fixed chunk positions, we still can conclude that you can put your chunks in a Vector with no serious performance hits.

What's left? Sorting. Theoretically a bucket sort should be an ideal algorithm for this kind of problem. I tried implementing one myself:

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Data.Vector.Generic.Mutable as MV
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid
import Control.Monad.ST
import Control.Monad.Primitive

recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length 

substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) (recordLen+lineEndingLen)

mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)

mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
     where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
           foldToBS = mappend . BB.fromWrite . BB.writeByteString

bucketSort :: Int -> V.Vector B.ByteString -> V.Vector B.ByteString
bucketSort chunkSize v = runST $ emptyBuckets >>= \bs -> 
                                 go v bs lastIdx (chunkSize - 1)
           where lastIdx = V.length v - 1

                 emptyBuckets :: ST s (V.MVector (PrimState (ST s)) [B.ByteString])
                 emptyBuckets = V.thaw $ V.generate 256 (const [])

                 go :: V.Vector B.ByteString -> 
                       V.MVector (PrimState (ST s)) [B.ByteString] -> 
                       Int -> Int -> ST s (V.Vector B.ByteString)
                 go v _ _ (-1) = return v
                 go _ buckets (-1) testIdx = do
                    v' <- unbucket buckets
                    bs <- emptyBuckets
                    go v' bs lastIdx (testIdx - 1)
                 go v buckets idx testIdx = do
                    let testChunk = v V.! idx
                        testByte = fromIntegral $ testChunk `B.index` testIdx
                    b <- MV.read buckets testByte
                    MV.write buckets testByte (testChunk:b)
                    go v buckets (idx-1) testIdx

                 unbucket :: V.MVector (PrimState (ST s)) [B.ByteString] -> 
                             ST s (V.Vector B.ByteString)
                 unbucket v = do 
                          v' <- V.freeze v
                          return . V.fromList . concat . V.toList $ v'

main = B.readFile "data.txt" >>= return . process >>= L.putStr
     where process =  mkBS . 
                      bucketSort (recordLen) . 
                      mkVector

Testing it gave a time of about 1:50 minutes, which is probably acceptable. We are talking of an O(c*n) algorithm for n in the range of some millions and a constant c of 36*something. But I'm sure you can optimize it further.

Or you can just use the vector-algorithms package. Testing with a heap sort:

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Vector.Algorithms.Heap (sort)
import Data.Monoid
import Control.Monad.ST

recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length 

substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) (recordLen+lineEndingLen)

mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)

mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
     where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
           foldToBS = mappend . BB.fromWrite . BB.writeByteString

sortIt v = runST $ do 
       mv <- V.thaw v
       sort mv
       V.freeze mv

main = B.readFile "data.txt" >>= return . process >>= L.putStr
     where process =  mkBS . 
                      sortIt .
                      mkVector

This does the job in about 1:20 minutes on my machine, so right now it's faster than my bucket sort. Both of the final solutions consume something in the range of 1-1.2 GB of RAM.

Good enough?

2
Joachim Breitner On

The Array family of types has built-in support for multi-dimensional arrays. The indices can be anything with an Ix instance, in particular for your case (Int, Int). It also does not support sorting, unfortunately.

But for your use case, do you really need sorting? If you have a map from identifiers to Int you can just increase the count as you go, and later select all keys with value 1. You can check out the bytestring-trie package, although for your use case it suggests to use hashmap.

Another algorithm would be to carry two sets (e.g. HashSet), one with identifiers seen exactly once, and one with identifiers seen more than once, and you update these sets while you go through the list.

Also, how do you read your file: If you read it as one large ByteString and carefully construct the small ByteString objects from it, they will actually be just pointers into the big chunk of memory with the large file, possibly eliminating your GC problems. But to assess that we’d need to see your code.

0
Levi Pearson On

There are a couple of wrappers around mmap available that can give you either Ptrs to data in your file or a big ByteString. A ByteString is really just a pointer,offset,length tuple; splitting that big ByteString into a bunch of small ones is really just making a bunch of new tuples that point to subsets of the big one. Since you say each record is at a fixed offset in the file, you should be able to create a bunch of new ones without actually accessing any of the file at all via ByteString's take.

I don't have any good suggestions for the sorting part of the problem, but avoiding the copying of the file data in the first place ought to be a good start.

0
ja. On

A trie should work. This code takes 45 mins on a file of 18 million lines, 6 million unique keys, on a dual-core laptop with 4 gig RAM:

--invoked:  test.exe +RTS -K3.9G -c -h
import qualified Data.ByteString.Char8 as B
import qualified Data.Trie as T

file = "data.txt"
main = ret >>= print
ret = do  -- retreive the data
    ls <- B.readFile file >>= return.B.lines
    trie <- return $ tupleUp ls
    return $ T.size trie 
tupleUp:: [B.ByteString] -> T.Trie Int
tupleUp l = foldl f T.empty l
f acc str = case T.lookup str acc 
            of Nothing -> T.insert str 1 acc
               Just n ->  T.adjust (+1) str acc

Here's the code used to generate the data file (6MM keys, then 3 copies into 1 file to get the 18MM keys:

import qualified Data.ByteString.Char8 as BS
import System.Random
import Data.List.Split

file = "data.txt"
numLines = 6e6 --17e6
chunkSize = 36
charSet = ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9']

-- generate the file
gen = do
  randgen <- getStdGen
  dat <- return $ t randgen
  writeFile file (unlines dat)

t gen = take (ceiling numLines) $ charChunks
    where
      charChunks = chunksOf chunkSize chars
      chars = map (charSet!!) rands
      rands = randomRs (0,(length charSet) -1) gen

main = gen
0
leventov On

I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector

Suspicious. Retaining ByteStrings should not be collected. Maybe there is excessive data copying somewhere in your code?

  • ByteString is a header (8 bytes) + ForeignPtr Word8 ref (8 bytes) + Int offset (4 bytes) + Int length (4 bytes)

  • ForeignPtr is a header (8 bytes) + Addr# (8 bytes) + PlainPtr ref (8 bytes)

  • PlainPtr is a header (8 bytes) + MutableByteArray# ref (8 bytes)

(Revised according to https://stackoverflow.com/a/3256825/648955)

All in all, ByteString overhead is at least 64 bytes (correct me, of some fields are shared).

Write your own data management: big flat Word8 array and adhoc offset wrapper

newtype ByteId = ByteId { offset :: Word64 }

with Ord instance.

Overhead would be 8 bytes per identifier. Store offsets in unboxed Vector. Sort with something like this: http://hackage.haskell.org/packages/archive/vector-algorithms/0.5.4.2/doc/html/Data-Vector-Algorithms-Intro.html#v:sort