Following on from Read large txt file multithreaded?, I am wondering whether it is equivalent to pass each thread a sliced chunk of a seq, and whether that handles the parallelism safely; that is, is StreamReader thread-safe?
Here is the code I am using to test this (any advice or criticism of the pattern used is welcome :) )
let nthreads = 4

let Data = seq {
    use sr = new System.IO.StreamReader (filePath)
    while not sr.EndOfStream do
        yield sr.ReadLine ()
}

let length = Data |> Seq.length
let packSize = length / nthreads

let groups =
    [ for i in 0 .. nthreads - 1 ->
        if i < nthreads - 1 then
            Data |> Seq.skip (packSize * i) |> Seq.take packSize
        else
            Data |> Seq.skip (packSize * i) ]

let f = some_complex_function_modifying_data

seq { for a in groups -> f a }
|> Async.Parallel
|> Async.RunSynchronously
Your Data value has the type seq<string>, which means that it is lazy. When you perform some computation that accesses it, the lazy sequence will create a new instance of StreamReader and read the data independently of other computations. You can easily see this by adding a printfn to the seq { .. } block: it runs again on every enumeration.

As a result, your parallel processing is actually fine. It creates a new computation for every parallel thread, so the StreamReader instances are never shared.

Another question is whether this is actually a useful thing to do. Reading data from disk is often the bottleneck, so it might be faster to just do everything in one loop. Even if this works, using
Seq.length is a slow way to get the length (because it needs to read the whole file), and the same goes for Seq.skip. A better (but more complex) solution would probably be to use the underlying stream's Seek method.
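The re-evaluation described above can be demonstrated without touching the file system; the sketch below uses a mutable counter in place of the StreamReader construction (the counter and sample lines are stand-ins for illustration, not part of the original code):

```fsharp
// A lazy seq re-runs its body once per enumeration, so each consumer
// gets its own "reader". A mutable counter stands in for the
// StreamReader constructor so the sketch runs without a file.
let mutable opens = 0

let data = seq {
    opens <- opens + 1               // stands in for: new StreamReader(filePath)
    yield! [ "line1"; "line2" ]      // stands in for: reading lines
}

// Two independent enumerations -> the body runs twice.
data |> Seq.length |> ignore
data |> Seq.length |> ignore
printfn "body ran %d times" opens    // prints: body ran 2 times
```

This is why the parallel computations in the question each end up with their own StreamReader: every Seq.skip/Seq.take pipeline enumerates Data from scratch.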