The task I have at hand is to read the lines of a large file, process them, and return ordered results.
My algorithm is:
- start with a master process that evaluates the workload (written on the first line of the file)
- spawn worker processes: each worker reads part of the file using `pread/3`, processes that part, and sends its results to the master
- the master receives all sub-results, sorts them, and returns them, so basically no communication is needed between workers.
My questions:
- How do I find the optimal balance between the number of Erlang processes and the number of cores? If I spawn one process per core, would that under-utilize my CPU?
- How does `pread/3` reach the specified line; does it iterate over all lines in the file? And is `pread/3` a good plan for parallel file reading?
- Is it better to send one big message from process A to B, or N small messages? I have found part of the answer in the link below, but I would appreciate further elaboration:
erlang message passing architecture
Erlang processes are cheap. You're free (and encouraged) to use more than however many cores you have. There might be an upper limit to what is practical for your problem (loading 1 TB of data with one process per line is asking a bit much, depending on line size).
The easiest way to do it when you don't know is to let the user decide. This means you could decide to spawn `N` workers, distribute work between them, and wait to hear back. Re-run the program while changing `N` if you don't like how it runs.

Trickier ways to do it are to benchmark a bunch of times, pick what you think makes sense as a maximal value, stick it in a pool library (if you want to; some pools go for preallocated resources, some for a resizable amount), and settle for what would be a one-size-fits-all solution.

But really, there is no easy 'optimal number of cores'. You can run it on 50 processes as well as on 65,000 of them if you want; if the task is embarrassingly parallel, the VM should be able to make use of most of them and saturate the cores anyway.
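As a rough illustration of the spawn-`N`-workers-and-wait approach, here is a minimal sketch; the module, message shapes, and `process_chunk/1` are all invented for the example:

```erlang
-module(nworkers).
-export([run/2]).

%% Spawn N workers, hand out the chunks, and wait for all results.
run(N, Chunks) ->
    Parent = self(),
    Pids = [spawn_link(fun() -> worker(Parent) end) || _ <- lists:seq(1, N)],
    dispatch(Chunks, Pids),
    [Pid ! stop || Pid <- Pids],
    collect(length(Chunks), []).

%% Round-robin the chunks over the available workers.
dispatch([], _Pids) -> ok;
dispatch([C | Cs], [P | Ps]) ->
    P ! {chunk, C},
    dispatch(Cs, Ps ++ [P]).

worker(Parent) ->
    receive
        {chunk, C} ->
            Parent ! {result, process_chunk(C)},
            worker(Parent);
        stop ->
            ok
    end.

process_chunk(C) -> C.  %% stand-in for the real per-chunk work

%% The caller knows how many chunks it sent, so it knows how many
%% results to wait for.
collect(0, Acc) -> Acc;
collect(Left, Acc) ->
    receive
        {result, R} -> collect(Left - 1, [R | Acc])
    end.
```

Re-running with a different `N` is then just a matter of changing the first argument.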
-
Parallel file reads are an interesting question. They may or may not be faster (as direct comments have mentioned), and may only represent a speedup if the work on each line is small enough that reading the file is the dominant cost.
The tricky bit is really that functions like `pread/2-3` take a byte offset. Your question is worded such that you are worried about lines of the file. The byte offsets you hand off to workers may therefore end up straddling a line. If your block boundary falls at the word `my` in `this is my line\nhere it goes\n`, one worker will see it has an incomplete line, while the other will report only on `my line\n`, missing the prior `this is`.

Generally, this kind of annoying stuff is what will lead you to have the first process own the file and sift through it, only handing off bits of text to workers to process; that process will then act as some sort of coordinator.
The nice aspect of this strategy is that if the main process knows everything that was sent as a message, it also knows when all responses have been received, making it easy to know when to return the results. If everything is disjoint, you instead have to trust both the starter and the workers to report "we're all out of work" as a distinct set of independent messages.
In practice, you'll probably find that what helps the most is doing file operations in a way that is easy on your hardware, more than asking "how many processes can read the file at once". There's only one hard disk (or SSD); all data has to go through it anyway, so parallel access may be limited in the end.
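A sketch of that coordinator pattern, assuming a `Workers` list of pids that answer `{line, Seq, From, Line}` messages with `{done, Seq, Result}` (all names here are invented for illustration):

```erlang
%% One process owns the file and hands whole lines to workers,
%% so no worker ever sees half a line.
coordinate(Path, Workers) ->
    {ok, Io} = file:open(Path, [read, raw, binary, {read_ahead, 64*1024}]),
    Sent = feed(Io, Workers, 0),
    file:close(Io),
    gather(Sent, []).

%% Read lines one at a time, round-robin them to workers, and count
%% how many were sent.
feed(Io, Workers, Sent) ->
    case file:read_line(Io) of
        {ok, Line} ->
            Pid = lists:nth((Sent rem length(Workers)) + 1, Workers),
            Pid ! {line, Sent, self(), Line},
            feed(Io, Workers, Sent + 1);
        eof ->
            Sent
    end.

%% The coordinator counted everything it sent, so it knows exactly
%% how many replies to wait for before returning.
gather(0, Acc) -> lists:sort(Acc);
gather(N, Acc) ->
    receive
        {done, Seq, Result} -> gather(N - 1, [{Seq, Result} | Acc])
    end.
```

Tagging each line with a sequence number and sorting on it at the end also restores the original line order, which matches the "return ordered results" requirement.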
-
Use messages that make sense for your program. The most performant program would have a lot of processes able to do work without ever needing to pass messages, communicate, or acquire locks.
A more realistic very performant program would use very few messages of a very small size.
The fun thing here is that your problem is inherently data-based. So there are a few things you can do:

- open the file in `raw` or `ram` mode; other modes use a middle-man process to read and forward data (this is useful if you read files over a network in clustered Erlang nodes); `raw` and `ram` modes give the file descriptor directly to the calling process and are a lot faster.

I hope this helps.
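For example, reading directly in `raw` mode might look like this (the file name is made up):

```erlang
%% raw mode hands the descriptor to the calling process directly,
%% skipping the middle-man process used by the default mode.
{ok, Fd} = file:open("data.txt", [read, raw, binary, {read_ahead, 64*1024}]),
{ok, FirstLine} = file:read_line(Fd),
ok = file:close(Fd).
```

Note that a `raw` file descriptor can only be used by the process that opened it, which fits the single-owner/coordinator approach from the previous section.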
P.S. You can try the really simple stuff at first. Either:

- `{ok, Bin} = file:read_file(Path)` and split lines (with `binary:split(Bin, <<"\n">>, [global])`),
- `{ok, Io} = file:open(File, [read,ram])` and then use `file:read_line(Io)` on the file descriptor repeatedly, or
- `{ok, Io} = file:open(File, [read,raw,{read_ahead,BlockSize}])` and then use `file:read_line(Io)` on the file descriptor repeatedly.

Then call `rpc:pmap({?MODULE, Function}, ExtraArgs, Lines)` to run everything in parallel automatically (it will spawn one process per line), and call `lists:sort/1` on the result.

Then from there you can refine each step if you identify it as problematic.
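Glued together, the simple version might look like this; `process_line/1` is a placeholder for your actual per-line work:

```erlang
-module(simple).
-export([run/1, process_line/1]).

run(Path) ->
    %% Read the whole file at once and split it into lines.
    {ok, Bin} = file:read_file(Path),
    Lines = binary:split(Bin, <<"\n">>, [global]),
    %% rpc:pmap/3 spawns one process per line and applies
    %% ?MODULE:process_line/1 to each, collecting the results.
    Results = rpc:pmap({?MODULE, process_line}, [], Lines),
    %% Order the combined results before returning.
    lists:sort(Results).

process_line(Line) ->
    Line.  %% replace with the real processing step
```

Once this works end to end, you can profile it and swap out whichever step turns out to be the bottleneck.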