At a high level, my code takes a Python dictionary that maps database column names to their content, binarizes the columns (struct.pack, array.array, str.encode) and sends them using socket.sendall.
To improve speed, I wrote the binarization part as a generator function that yields the binary chunks. The generator is handed to a worker thread, which produces the chunks and puts them in a queue; the main thread collects them from the queue and sends them away.
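For concreteness, here is a minimal sketch of that arrangement; the names (binarize, fill_queue, send_record) and the chunk format are hypothetical, not my actual code:

    import queue
    import struct
    import threading

    def binarize(record):
        # Hypothetical serializer: yields one length-prefixed chunk per column.
        for name, value in record.items():
            data = name.encode() + b"=" + str(value).encode()
            yield struct.pack("!I", len(data)) + data

    def fill_queue(gen, q):
        # Worker thread: exhaust the generator and hand chunks to the main thread.
        for chunk in gen:
            q.put(chunk)
        q.put(None)  # sentinel: no more chunks

    def send_record(sock, record):
        # Main thread: drain the queue and push each chunk onto the socket.
        q = queue.Queue()
        t = threading.Thread(target=fill_queue, args=(binarize(record), q))
        t.start()
        while True:
            chunk = q.get()      # blocks until the worker has produced a chunk
            if chunk is None:
                break
            sock.sendall(chunk)
        t.join()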
However, I still don't get the speed improvement I expected. I figured I'd try using an auxiliary process instead of an auxiliary thread. The problem is that I can't pass a generator to a process: generators are not picklable.
I would be grateful for any suggestions / feedback / insights on how to go about this kind of mechanism.
EDIT: When profiling the code with snakeviz (cProfile with graphics), I saw that socket.recv takes 3/4 of the time, and time.sleep (waiting for chunks in the main thread) takes the other 1/4. That 1/4 is what I thought I could mitigate with another thread/process, since I've read that both blocking socket operations and time.sleep are supposed to release the GIL.
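For reference, this is roughly how such a profile is produced; send_all_records is a placeholder for whatever entry point the real code uses:

    import cProfile

    # Write profiling stats to a file that snakeviz can open.
    cProfile.run("send_all_records()", "profile.out")
    # then view it graphically with:  snakeviz profile.out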
I don't see how you would get any performance improvement by doing this in another process. Producing a binary representation (serialization) of your dictionary is completely CPU bound -- and, relatively speaking, should be very fast -- while sending it to another system is I/O bound, and will almost certainly be slower and the ultimate bottleneck.
In fact, I wouldn't be surprised if delivering these binary chunks through a queue from one thread to another takes more time than simply running the serializer directly in the socket-sending thread, once you factor in thread context switching and queue insertion/extraction/synchronization overhead (plus the effects of the GIL if you're using CPython). And that queue/synchronization overhead is unlikely to improve if you move the serialization to a separate process.
If you want to do the sending concurrently with other activity -- because you're concerned about tying up your main thread for a long time -- then you should probably just delegate the entire task (serialization plus sending) to another thread or process.
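A hedged sketch of that alternative: the worker thread both serializes and sends, so the main thread stays free and no queue hand-off is needed. The serialize generator here is a stand-in for whatever actually produces your binary chunks:

    import socket
    import struct
    import threading

    def serialize(record):
        # Hypothetical chunk generator (length-prefixed "name=value" pairs).
        for name, value in record.items():
            data = name.encode() + b"=" + str(value).encode()
            yield struct.pack("!I", len(data)) + data

    def serialize_and_send(record, host, port):
        # Do both the CPU-bound and the I/O-bound work in the same thread.
        with socket.create_connection((host, port)) as sock:
            for chunk in serialize(record):
                sock.sendall(chunk)

    def send_in_background(record, host, port):
        # Fire off the whole job; join() the returned thread if you need to
        # know when the transfer has finished.
        t = threading.Thread(target=serialize_and_send, args=(record, host, port))
        t.start()
        return t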
Another thing to understand is that when you first begin to send on a socket, the kernel will copy the initial data into internal buffers and immediately return control to you -- the kernel then breaks the data up into packets and, asynchronously, puts them on the wire as the protocol permits. So the sends will not (at first) appear to be I/O bound. But if you have many megabytes of data to send, eventually the kernel buffer space allowed to your socket will fill, and then your thread will be blocked until enough packets have been sent and acknowledged by the peer to free up some of that space.
In other words, in a single-threaded implementation where G means generating a chunk of data and S means a socket.sendall call, the time spent in each phase starts out as fast G phases interleaved with near-instantaneous S phases; after a while, as the kernel buffer fills, the S phases start to take longer and longer to complete. And if you aren't generating enough data to experience this effect, then it's even less likely you have any need to push the serialization to a separate thread.