I'm writing code to do a subset product: it takes a list of elements and a list of indicator variables (of the same length). The product is computed in a tree, which is crucial to our application. Each product is expensive, so my goal was to compute each level of the tree in parallel, evaluating consecutive levels in sequence. Thus there isn't any nested parallelism going on.
I only have repa code in ONE function, near the top level of my overall code. Note that subsetProd is not monadic.
The steps:
- chunk up the lists into pairs (no parallelism)
- zip the chunked lists (no parallelism)
- map the product function over this list (using Repa map), creating a Delayed array
- call computeP to evaluate the map in parallel
- convert the Repa result back to a list
- make a recursive call (on lists half the size of the inputs)
The code:
{-# LANGUAGE TypeOperators, FlexibleContexts, BangPatterns #-}
import System.Random
import System.Environment (getArgs)
import Control.Monad.State
import Control.Monad.Identity (runIdentity)
import Data.Array.Repa as Repa
import Data.Array.Repa.Eval as Eval
import Data.Array.Repa.Repr.Vector
force :: (Shape sh) => Array D sh e -> Array V sh e
force = runIdentity . computeP
chunk :: [a] -> [(a,a)]
chunk [] = []
chunk (x1:x2:xs) = (x1,x2):(chunk xs)
slow_fib :: Int -> Integer
slow_fib 0 = 0
slow_fib 1 = 1
slow_fib n = slow_fib (n-2) + slow_fib (n-1)
testSubsetProd :: Int -> Int -> IO ()
testSubsetProd size seed = do
let work = do
!flags <- replicateM size (state random)
!values <- replicateM size (state $ randomR (1,10))
return $ subsetProd values flags
value = evalState work (mkStdGen seed)
print value
subsetProd :: [Int] -> [Bool] -> Int
subsetProd [!x] _ = x
subsetProd !vals !flags =
let len = (length vals) `div` 2
!valpairs = Eval.fromList (Z :. len) $ chunk vals :: (Array V (Z :. Int) (Int, Int))
!flagpairs = Eval.fromList (Z :. len) $ chunk flags :: (Array V (Z :. Int) (Bool, Bool))
!prods = force $ Repa.zipWith mul valpairs flagpairs
mul (!v0,!v1) (!f0,!f1)
| (not f0) && (not f1) = 1
| (not f0) = v0+1
| (not f1) = v1+1
| otherwise = fromInteger $ slow_fib ((v0*v1) `mod` 35)
in subsetProd (toList prods) (Prelude.map (uncurry (||)) (toList flagpairs))
main :: IO ()
main = do
args <- getArgs
let [numleaves, seed] = Prelude.map read args :: [Int]
testSubsetProd numleaves seed
The entire program is compiled with
ghc -Odph -rtsopts -threaded -fno-liberate-case -funfolding-use-threshold1000 -funfolding-keeness-factor1000 -fllvm -optlo-O3
per these instructions, on GHC 7.6.2 x64.
I ran my program (Subset) using
$> time ./Test 4096 4 +RTS -sstderr -N4
8 seconds later later:
672,725,819,784 bytes allocated in the heap
11,312,267,200 bytes copied during GC
866,787,872 bytes maximum residency (49 sample(s))
433,225,376 bytes maximum slop
2360 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1284212 colls, 1284212 par 174.17s 53.20s 0.0000s 0.0116s
Gen 1 49 colls, 48 par 13.76s 4.63s 0.0946s 0.6412s
Parallel GC work balance: 16.88% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 497.80s (448.38s elapsed)
GC time 187.93s ( 57.84s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 685.73s (506.21s elapsed)
Alloc rate 1,351,400,138 bytes per MUT second
Productivity 72.6% of total user, 98.3% of total elapsed
gc_alloc_block_sync: 8670031
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 571398
My code does get slower as I increase the -N parameter, (7.628 seconds for -N1, 7.891 seconds for -N2, 8.659 seconds for -N4) but I'm getting 0 sparks created, which seems like a prime suspect as to why I'm not getting any parallelism. Also, compiling with a whole slew of optimizations helps with the runtime, but not the parallelism.
Threadscope confirms that no serious work is being done on three HECs, but the garbage collector seems to be using all 4 HECs.
So why isn't Repa making any sparks? My product tree has 64 leaves, so even if Repa made a spark for every internal node, there should be ~63 sparks. I feel like it could have something to do with my use of the ST monad encapsulating the parallelism, though I'm not quite sure why this would cause an issue. Perhaps sparks can only be created in an IO monad?
If this is the case, does anyone have an idea of how I could perform this tree product where each level is done in parallel (without resulting in nested parallelism, which seems unnecessary for my task). In general, perhaps there is a better way to parallelize the tree product or make better use of Repa.
Bonus points for explaining why the runtime increases as I increase the -N parameter, even when no sparks are created.
EDIT I changed the code example above to be a compiling example of my problem. The program flow almost perfectly matches my real code: I randomly choose some inputs, and then do a subset product on them. I am now using the identity monad. I have tried lots of small changes to my code: inlining or not, bang patterns or not, variations on using two Repa lists and a Repa zipWith vs zipping the lists sequentially and using a Repa map, etc, none of which helped at all.
Even if I'm running into this problem in my example code, my real program is much larger.
Why is there no parallelism?
The main reason (at least for your now simplified and working) program for there being no parallelism is that you're using
computeP
on an array ofV
representation, and normal vectors aren't strict in their element types. So you aren't actually doing any real work in parallel. The easiest fix is to use an unboxedU
array as the result, by changingforce
to this definition:I do recall that in your original code you claimed you're working with a complicated datatype that isn't unboxed. But is it really impossible to make it so? Perhaps you can extract the data you actually need into some unboxable representation? Or make the type an instance of the
Unbox
class? If not, then you can also use the following variant offorce
that works for aV
-array:The idea here is that we first compute the
V
-array structure, and then we compute aU
-array of()
type from it by mappingrnf
over the array. The resulting array is uninteresting, but each of theV
-array's elements will be forced in the process1.Either of these changes brings runtime for a problem size of
4096
from ~9 down to ~3 seconds with-N4
on my machine.In addition, I think it's strange that you convert between lists and arrays in every step. Why not make
subsetProd
take two arrays? Also, at least for the values, using an intermediateV
array for the pairs seems unnecessary, you could just as well use aD
array. But in my experiments these changes didn't have a significant beneficial effect on runtime.Why are there no sparks?
Repa does never create sparks. Haskell has many different approaches to parallelism, and sparks are one particular mechanism that has special support in the run-time system. However, only some libraries, for example the
parallel
package and one particular scheduler of themonad-par
package, actually make use of the mechanism. Repa, however, does not. It usesforkIO
, i.e., threads, internally, but provides a pure interface to the outside. So the absence of sparks is in itself nothing to worry about.1. I originally had no idea how to do that, so I asked Ben Lippmeier, the author of Repa. Thanks a lot to Ben for pointing out the option of mapping
rnf
to produce a different array, and the fact that there's anUnbox
instance for()
, to me.