I am creating a program which needs to read from a file that is still being written.
The main question is this: If the read and write will be performed using InputStream and OutputStream classes running on a separate thread, what are the catches and edge cases that I will need to be aware of in order to prevent data corruption?
In case anyone is wondering whether I have considered other, non-InputStream-based approaches: yes, I have, but unfortunately that is not possible in this project, since the program uses libraries that only work with InputStream and OutputStream.
Also, several readers have asked why this complication is necessary. Why not perform the reading after the file has been written completely?
The reason is efficiency. The program will perform the following:
- Download a series of byte chunks of 1.5MB each. The program will receive thousands of such chunks that can total up to 30GB. Also, chunks are downloaded concurrently in order to maximize bandwidth, so they may arrive out of order.
- The program will send each chunk for processing as soon as it has arrived. Please note that chunks will be sent for processing in order: if chunk m arrives before chunk m-1 does, it will be buffered on disk until chunk m-1 arrives and is sent for processing.
- Perform processing of these chunks starting from chunk 0 up to chunk n until every chunk has been processed.
- Resend the processed result back.
If we wait for the whole file to be transferred, it will introduce a huge delay into what is supposed to be a real-time system.
So your problem (as you've cleared it up now) is that you can't start processing until chunk #1 has arrived, and you need to buffer every chunk #N (N > 1) until you can process it.
I would write each chunk to its own file and create a custom InputStream that will read every chunk in order. While downloading, the chunk file would be named something like chunk.1.downloading, and when the whole chunk is loaded it will be renamed to chunk.1.
The custom InputStream will check to see if file chunk.N exists (where N = 1...X). If not, it will block. Each time a chunk has been downloaded completely, the InputStream is notified and checks whether the downloaded chunk is the next one to be processed. If yes, it reads as normal; otherwise it blocks again.
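A minimal sketch of that idea might look like the following. The names ChunkedInputStream, publish and lastChunk are hypothetical, chosen only for illustration. It assumes the index of the last chunk is known up front, that chunks are numbered from 1 as above, that the reader and the downloaders share one stream object in the same JVM, and that the file system supports an atomic rename within a directory (otherwise Files.move with ATOMIC_MOVE throws).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Reads chunk.1, chunk.2, ... in order, blocking until each file exists. */
class ChunkedInputStream extends InputStream {
    private final Path dir;
    private final int lastChunk;   // assumption: index of the final chunk is known
    private final Object lock = new Object();
    private int next = 1;          // index of the chunk to open next
    private InputStream current;   // stream over the chunk being read, or null

    ChunkedInputStream(Path dir, int lastChunk) {
        this.dir = dir;
        this.lastChunk = lastChunk;
    }

    /** Downloader side: write under a temporary name, rename, then wake the reader. */
    void publish(int index, byte[] data) throws IOException {
        Path tmp = dir.resolve("chunk." + index + ".downloading");
        Files.write(tmp, data);
        // The reader only ever looks for the final name, so it never sees a partial chunk.
        Files.move(tmp, dir.resolve("chunk." + index), StandardCopyOption.ATOMIC_MOVE);
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    /** Opens the next chunk, blocking until it exists; returns false when none remain. */
    private boolean openNext() throws IOException {
        if (next > lastChunk) {
            return false;
        }
        Path file = dir.resolve("chunk." + next);
        synchronized (lock) {
            while (!Files.exists(file)) {
                try {
                    // The timeout is a safety net in case a chunk appears
                    // without going through publish().
                    lock.wait(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while waiting for " + file, e);
                }
            }
        }
        current = Files.newInputStream(file);
        next++;
        return true;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            if (current == null && !openNext()) {
                return -1;             // every chunk has been consumed
            }
            int n = current.read(buf, off, len);
            if (n != -1) {
                return n;
            }
            current.close();           // this chunk is exhausted, move on to the next
            current = null;
        }
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
        }
    }
}
```

The download threads would call publish(index, bytes) as chunks arrive, in any order, while the processing library is handed the stream and simply reads from it. Deleting each chunk file after it has been consumed is left out of the sketch, but would probably be needed with 30GB of data.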