Avoiding thunks in sparsely evaluated list generated by monadic unfold

133 views Asked by At

I have a simulation library that uses the FFI wrapped in a monad M, carrying a context. All the foreign functions are pure, so I've decided to make the monad lazy, which is normally convenient for flow-control. I represent my simulation as a list of simulation-frames, that I can consume by either writing to a file, or by displaying the frame graphically.

simulation :: [(Frame -> M Frame)] -> Frame -> M [Frame] 
simulation [] frame = return [frame]
simulation (step:steps) frame 
   = step frame >>= fmap (frame:) . simulation steps 

Each frame consists of a tuple of newtype-wrapped ForeignPtrs that I can lift to my Haskell representation with

lift :: Frame -> M HFrame

Since the time-steps in my simulation are quite short, I only want to look at every n frames, for which I use

takeEvery n l = foldr cons nil l 0 where
    nil _ = []
    cons x rest 0 = x : rest n
    cons x rest n = rest (n-1)

So my code looks something like

main = consume 
     $ takeEvery n 
     $ runM 
     $ simulation steps initialFrame >>= mapM lift

Now, the problem is that as I increase n, a thunk builds up. I've tried a couple of different ways to try to strictly evaluate each frame in simulation, but I have yet to figure out how to do so. ForeignPtr doesn't appear to have a NFData instance, so I can't use deepseq, but all my attempts with seq, including using seq on each element in the tuple, have been without noticeable effect.

EDIT:

Upon request, I have included more specifics, that I initially excluded since I think they are probably mostly noise for this question.

The monad

newtype FT c a = FT (Context -> a)

instance Functor (FT c) where
    fmap f (FT a) = FT (f.a)

instance Applicative (FT c) where
    pure a = FT (\_ -> a)
    (<*>) (FT a) (FT b) = FT (\c -> a c $ b c)

instance Monad (FT c) where
    return = pure
    (>>=) (FT a) f = FT (\c -> (\(FT b) -> b c) $ f $ a c)

runFTIn :: Context -> (forall c. FT c a) -> a
runFTIn context (FT a) = a context


runFTWith :: [ContextOption] -> (forall c. FT c a) -> a
runFTWith options a
    = unsafePerformIO
    $ getContext options >>= \c -> return $ runFTIn c a
runFT = runFTWith []

unsafeLiftFromIO :: (Context -> IO a) -> FT c a
unsafeLiftFromIO a = FT (\c -> unsafePerformIO $ a c)

All the foreign functions are lifted from IO with unsafeLiftFromIO

newtype Box c = Box (ForeignPtr RawBox)
newtype Coordinates c = Coordinates (ForeignPtr RawCoordinates)
type Frame c = (Box c, Coordinates c)

liftBox :: Box c -> FT c HBox
liftCoordinates :: Coordinates c -> FT c HCoordinates
liftFrame (box, coordinates) = do
    box' <- liftBox box
    coordinates' <- liftCoordinates coordinates
    return (box', coordinates') 

The steps themselves are supposed to be arbitrary (Frame c -> FT c (Frame c)), so strictness should preferably be in the higher level code.

EDIT2:

I have now tried to use Streamly, however the problem persists, so I think the issue really is finding a way to strictly evaluate ForeignPtrs.

current implementations:

import Streamly
import qualified Streamly.Prelude as S
import qualified Streamly.Internal.Data.Stream.Serial as Serial

takeEvery n = Serial.unfoldrM ((fmap.fmap) (\(h, t) -> (h, S.drop (n-1) t)) . S.uncons)

(#) = flip ($)
simulation
    :: (IsStream t)
    => Frame c 
    -> t (FT c) (Frame c -> FT c (Frame c)) 
    -> t (FT c) (Frame c)
simulation frame = S.scanlM' (#) frame

EDIT3:

To clarify the symptoms and how I have diagnosed the problem.

The library calls OpenCL functions running on a GPU. I am sure that the freeing of the pointers is handled correctly - the ForeignPtrs have the correct freeing functions, and memory use is independent of total number of steps as long as this number is larger than n. What I find is that memory use on the GPU is basically linearly correlated to n. The consumer I've been using for this testing is

import qualified Data.ByteString.Lazy as BL
import Data.Binary
import Data.Binary.Put

writeTrajectory fn = fmap (BL.writeFile fn . runPut) . S.foldr ((>>).putFrame) (pure ()) . serially

For my streamly implementation, and

writeTrajectory fn = BL.writeFile fn . runPut . MapM_ putFrame

For the original implementation. Both should consume the stream continuously. I've generated the steps for testing with replicate.

I am unsure of how to more precisely analyze the memory-use on the GPU. System memory use is not an issue here.

Update: I am starting to think it's not a matter of strictness, but of GC-problems. The run-time system does not know the size of the memory allocated on the GPU and so does not know to collect the pointers, this is less of an issue when there is stuff going on CPU-side as well, as that will produce allocations too, activating the GC. This would explain the slightly non-determinstic memory usage, but linear correlation to n that I've seen. How too solve this nicely is another issue, but I suspect there will be a substantial overhaul to my code.

1

There are 1 answers

2
Li-yao Xia On

I think the issue really is finding a way to strictly evaluate ForeignPtrs

If that is really the issue, one way to do that is to change the second clause of simulation:

{-# LANGUAGE BangPatterns #-}

simulation :: [(Frame -> M Frame)] -> Frame -> M [Frame] 
simulation [] frame = return [frame]
simulation (step:steps) frame@(!_, !_)  -- Evaluate both components of the pair
   = step frame >>= fmap (frame:) . simulation steps