So I've finished my Lempel-Ziv compression/decompression, but compared to C++ its unbelievably slow, I've tried it on http://norvig.com/big.txt file, but my program can't process it, while in C++ it took only 1 second. Could any of you Haskell Gurus look at my code and tell me if there are some obvious flaws?
After adding prepending instead of appending, I managed to reduce the time from 16 seconds to 0.4!
Haskell's laziness is so deceiving, it was printing 'compression finished' almost immediately, but in fact it was the compression that made the program run so slow
import System.IO
import Control.Monad
import qualified Data.Map as Map
import Debug.Trace
main = do
contents <- readFile "plik.txt"
let compressed = reverse $ compress contents Map.empty 1 "" []
let decompressed = reverse $ decompress compressed Map.empty 1 ""
--print $ contents
print $ length compressed
print $ length decompressed
--print $ contents == decompressed
compress :: String -> Map.Map String Int -> Int -> String -> [(Int,Char)]-> [(Int,Char)]
compress (s:x) m i c out = do
if Map.member (c++[s]) m == False
then do
if c == ""
then do
let newMap = Map.insert [s] i m
compress x newMap (i+1) c ((0,s):out)
else do
let newMap = Map.insert (c++[s]) i m
compress x newMap (i+1) "" ((newMap Map.! c, s):out)
else compress x m i (c++[s]) out
compress s m i c out = compress2 out
compress2 :: [(Int,Char)]-> [(Int,Char)]
compress2 out = trace("COMPRESSION FINISHED") out
decompress :: [(Int,Char)] -> Map.Map Int String -> Int -> String -> String
decompress (s:x) m i out = do
if fst s == 0
then do
let newMap = Map.insert i [snd s] m
decompress x newMap (i+1) ((snd s):out)
else do
let newMap = Map.insert i ((m Map.! fst s)++[snd s]) m
decompress x newMap (i+1) ((snd s):(reverse (newMap Map.! fst s))++out)
decompress s m i out = out
decompress2 :: String -> String
decompress2 out = trace("DECOMPRESSION FINISHED") out
As others have already said in the comments:
Avoid using
String
; it uses a lot of RAM and it's quite slow to process. UseByteString
(if you're expecting to process raw binary data) orText
(if you're expecting Unicode text).Don't append to lists. (
String
is also a list — but don't useString
in the first place.) Prepend if it's easy. If not, useData.Sequence
,Data.Set
orData.Map
as appropriate. Or maybe evenText
,ByteString
orVector
.If you see a Haskell program that's really slow and uses buckets of RAM even though the input and output files are tiny, you probably have something somewhere being too lazy.
As an example, I had a program to produce a byte histogram. It took about 20 minutes and consumed 8 GB of RAM. I changed a data constructor to be strict, simply by adding a single
!
to it. Now the program takes a fraction of a second! Laziness can make an absurdly large difference if you get it wrong.Looking at the code you've posted, that probably covers it all. (I'm not sure exactly what changes you've tried and where you're up to now.)