I'm using worker threads in Node.js to process a massive ndjson file. I'm trying not to be too specific about what I currently have as far as code because that can be changed depending on the answer to this question. The basic design of what I have is (pseudo code):
//Pseudo code:
await pipeline(
1: Use 'readline' on an fs.createReadStream(),
// (Each line passed as an ArrayBuffer)
2: Use Piscina worker thread pool to use workers to do processing of text,
// Between 2 and 3 is where I am unable to find online or figure out a solution/pattern.
3: Use a single writable stream to append the results of the workers to several files.
)
The problem I'm having is that I can't find a design pattern which allows me to send the results of each worker thread in the pool to the next stream in the pipeline (#3) as soon as they finish without 'awaiting' a specific worker from inside of #2 because of how I understand the 'messaging' between worker threads and the main thread to work. This would defeat the entire purpose of having a thread pool and using streams. I'm properly awaiting everything inside of the worker functions themselves of course, I just don't want to wait for the result of each worker after I send it off.
(Clarification) The order of the worker results do not matter. I just want to have a pool of 16 threads, use backpressure to keep that pool busy at all times, and as workers within the pool finish, pipe those results onto the next stream in the pipeline. Obviously there needs to be some sort of glue code that sits between the worker and the stream that uses messages or message events to do this.
I haven't been able to try anything that I actually expect to work yet because I think I'm just missing something. I understand the limitations with messages between the main and worker threads, but it just seems like what I am trying to do - gather a bunch of results from workers and pipe them into a single stream - would be a common pattern people use with workers and streams, but all I can find when I search is a bunch of contrived examples of people doing things that you would never actually do in a real program like go off and do a bunch of simple arithmetic in a single worker, or just console.log() things using workers.
I've read the Node.js worker_thread API documentation, which explains everything great. It makes sense. The problem is just that I don't know how to use messaging to send all worker results to a single place, rather than awaiting each and every worker manually.
I have tried using a 'new MessageChannel()' so that I can use port1 and port2 in hopes that maybe I could just pass things from multiple different workers back to the same 'port' and use an on 'message' listener to stick that into the writable stream (#3), but the problem with that is that it seems like you are supposed to instantiate a new MessageChannel() for each worker(?), which is transferred via transferList, so you can't reference it again from the main thread after that.
What am I missing here? Thank you!!