The following piece of code experiences a stack overflow for large inputs:
{-# LANGUAGE DeriveDataTypeable, OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L

genTweets :: L.ByteString -> L.ByteString
genTweets text | L.null text = ""
               | otherwise   = L.intercalate "\n\n" $ genTweets' $ L.words text
  where genTweets' txt = foldr p [] txt
          -- merge the current word into the next chunk when the combined
          -- length (plus the joining space) stays within 140 characters
          where p word []           = [word]
                p word words@(w:ws)
                  | L.length word + L.length w <= 139 =
                      (word `L.append` " " `L.append` w) : ws
                  | otherwise = word : words
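For what it's worth, the driver is nothing special; it's roughly the following (the file name is just an example of a large input):

-- hypothetical driver: read a large file lazily and print the generated tweets
main :: IO ()
main = L.putStrLn . genTweets =<< L.readFile "large-input.txt"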
I assume my predicate is building a list of thunks, but I'm not sure why, or how to fix it.
The equivalent code using foldl' runs fine, but takes forever, since it appends constantly, and it uses a ton of memory.
import Data.List (foldl')

genTweetsStrict :: L.ByteString -> L.ByteString
genTweetsStrict text | L.null text = ""
                     | otherwise   = L.intercalate "\n\n" $ genTweetsStrict' $ L.words text
  where genTweetsStrict' txt = foldl' p [] txt
          -- same packing, but each word is appended to the last chunk
          -- of the accumulator with (++)
          where p [] word = [word]
                p words word
                  | L.length word + L.length (last words) <= 139 =
                      init words ++ [last words `L.append` " " `L.append` word]
                  | otherwise = words ++ [word]
What is causing the first snippet to build up thunks, and can it be avoided? Is it possible to write the second snippet so that it doesn't rely on (++)?
The (++) at the end of the accumulator in your second snippet is the problem. On each iteration, you're traversing the accumulator list and then appending at the end. Obviously this is going to take a long time (proportional to the length of the accumulator list). A better solution is to generate the output list lazily, interleaving processing with reading the input stream (you don't need to read the whole input to output the first 140-character tweet).
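(As an aside, if you only want to get rid of the (++), you can cons onto the front of the accumulator and reverse it once at the end. A rough sketch, reusing the pragma and imports from your snippets; it removes the quadratic appends but still consumes the whole input before producing any output:)

-- sketch: same greedy packing as genTweetsStrict, but the accumulator is
-- kept in reverse, so each step is O(1); reverse it once at the end
genTweetsRev :: L.ByteString -> L.ByteString
genTweetsRev text
  | L.null text = ""
  | otherwise   = L.intercalate "\n\n" . reverse . foldl' p [] $ L.words text
  where p [] word = [word]
        p acc@(w:ws) word
          | L.length w + L.length word <= 139 =
              (w `L.append` " " `L.append` word) : ws
          | otherwise = word : acc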
The following version of your program processes a relatively large file (/usr/share/dict/words) in under a second, while using O(1) space: