I have a very basic question. I have several text files with data, each several GB in size. I have a C# WPF application which I'm using to process similar data files, but nowhere close to that size (probably around 200-300 MB right now). How can I efficiently read this data and then write it somewhere else after processing, without everything freezing and crashing? Essentially, what's the best way to read from a very large file? For my low-scale application right now, I use System.IO.File.ReadAllLines to read and a StreamWriter to write. I'm sure those 2 methods are not the best idea for such large files. I don't have much experience with C#, so any help will be appreciated!
Fastest way to read very large text file in C#
3k views · Asked by sparta93 · 2 answers
This sounds like a good fit for an overlapped transformation using memory-mapped files:
https://msdn.microsoft.com/en-us/library/dd997372(v=vs.110).aspx
First, you'll want to allocate your destination file to as close to the expected result size as you can estimate. Overshooting is usually preferable to undershooting: you can always truncate to a given length later, but growing the file may require non-contiguous allocation. If excessive growth is expected, you may be able to allocate the file as a "sparse" file.
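For the preallocation step, a minimal sketch might look like the following (the file name and the size estimate are made up for illustration, and WriteTransformedData stands in for the real work):

```csharp
using System.IO;

class Preallocate
{
    static void Main()
    {
        // Hypothetical estimate of the output size -- overshoot slightly rather than undershoot.
        long estimatedSize = 4L * 1024 * 1024 * 1024; // ~4 GB

        using (var fs = new FileStream("output.dat", FileMode.Create, FileAccess.ReadWrite))
        {
            // Reserve the space up front so the file doesn't have to grow
            // (and possibly fragment) while we write.
            fs.SetLength(estimatedSize);

            long actualBytesWritten = WriteTransformedData(fs); // placeholder for the real work

            // Truncate back down to what was actually written.
            fs.SetLength(actualBytesWritten);
        }
    }

    // Placeholder: write the processed data and return how many bytes were written.
    static long WriteTransformedData(FileStream fs) => 0;
}
```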
Pick a block size of at least 512 bytes, ideally a power of two; test to find the best performance.
Map 2 blocks of the source file. This is your source buffer.
Map 2 blocks of the destination file. This is your destination buffer.
Operate on the lines within a block. Read from your source block, write to your destination block.
Once you cross a block boundary, perform a "buffer swap" to trade the previously completed block for the next block.
There are several ways to accomplish these tasks; one possible shape is sketched below.
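For example, a stripped-down sketch using MemoryMappedFile could look like this. It assumes a 1:1 transformation so source and destination offsets line up; the file names, the 1 MB block size, and ProcessBlockPair are placeholders, and real code would still need to handle lines that straddle a block boundary.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MappedTransform
{
    const long BlockSize = 1 << 20; // 1 MB blocks -- tune this by measurement

    static void Main()
    {
        long sourceLength = new FileInfo("input.dat").Length;
        long destCapacity = sourceLength; // preallocate destination to the estimated size

        using var source = MemoryMappedFile.CreateFromFile("input.dat", FileMode.Open);
        using var dest = MemoryMappedFile.CreateFromFile(
            "output.dat", FileMode.Create, null, destCapacity);

        // Walk the file one block at a time; each iteration maps a two-block
        // window so a line running into the next block stays visible.
        // Advancing the offset to the next block is the "buffer swap".
        for (long offset = 0; offset < sourceLength; offset += BlockSize)
        {
            long window = Math.Min(2 * BlockSize, sourceLength - offset);

            using var src = source.CreateViewAccessor(offset, window, MemoryMappedFileAccess.Read);
            using var dst = dest.CreateViewAccessor(offset, window, MemoryMappedFileAccess.Write);

            ProcessBlockPair(src, dst, window);
        }
    }

    // Placeholder: read lines from the source view, transform them, and write
    // the result into the destination view.
    static void ProcessBlockPair(MemoryMappedViewAccessor src,
                                 MemoryMappedViewAccessor dst,
                                 long length)
    {
        // Identity "transform" over the first block of the window; the second
        // block is mapped only so data running past the boundary stays visible.
        long block = Math.Min(BlockSize, length);
        for (long i = 0; i < block; i++)
            dst.Write(i, src.ReadByte(i));
    }
}
```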
If you wish, you may allocate more blocks at a time for the operation, though you'll need to apply a "triple buffering" strategy of overlapped operations to make use of them. If writes are far slower than reads, you can even implement unbounded memory buffering with the same pattern as triple buffering.
Depending on your data, you may also be able to distribute blocks to separate threads, even if it's a "line based" file.
If every line is dependent on previous data, there may be no way to accelerate the operation. If not, indexing the lines in the file prior to performing operations will allow multiple worker threads, each operating on independent blocks.
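As a rough illustration of that indexing approach (the file name, the partitioning scheme, and ProcessLine are assumptions; it also assumes '\n'-delimited ASCII/UTF-8 text):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class ParallelLineBlocks
{
    static void Main()
    {
        const string path = "input.txt"; // placeholder file name

        // Pass 1: index the byte offset at which each line starts.
        var lineStarts = new List<long> { 0 };
        using (var fs = File.OpenRead(path))
        {
            var buffer = new byte[1 << 20];
            long filePos = 0;
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                    if (buffer[i] == (byte)'\n')
                        lineStarts.Add(filePos + i + 1);
                filePos += read;
            }
        }

        // Pass 2: give each worker its own contiguous range of lines.
        int workers = Environment.ProcessorCount;
        int linesPerWorker = (lineStarts.Count + workers - 1) / workers;

        Parallel.For(0, workers, w =>
        {
            int first = w * linesPerWorker;
            if (first >= lineStarts.Count) return;
            int count = Math.Min(linesPerWorker, lineStarts.Count - first);

            // Each worker opens its own handle and seeks to its block.
            using var stream = File.OpenRead(path);
            stream.Position = lineStarts[first];
            using var reader = new StreamReader(stream);

            for (int i = 0; i < count; i++)
            {
                string line = reader.ReadLine();
                if (line == null) break;   // trailing newline at end of file
                ProcessLine(line);
            }
        });
    }

    // Placeholder: per-line work that does not depend on earlier lines.
    static void ProcessLine(string line) { /* ... */ }
}
```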
If I need to elaborate on anything, just say what part.
If you can do this line by line then the answer is simple:

- Read a line
- Process the line
- Write the line

If you want it to go a bit faster, put those three steps on separate threads connected by BlockingCollections with a specified upper bound of something like 10, so a slower step is never waiting on a faster step. If you can, send the output to a different physical disc (assuming the output goes to a disc at all).

(Note: the OP changed the requirements even after being asked, twice, whether the process was line by line.)
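A hedged sketch of that pipeline, wiring the three stages together with two bounded BlockingCollections (file names and the Transform step are placeholders):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class LinePipeline
{
    static void Main()
    {
        var toProcess = new BlockingCollection<string>(boundedCapacity: 10);
        var toWrite   = new BlockingCollection<string>(boundedCapacity: 10);

        var reader = Task.Run(() =>
        {
            // File.ReadLines streams one line at a time instead of loading the whole file.
            foreach (var line in File.ReadLines("input.txt"))
                toProcess.Add(line);            // blocks when the queue is full
            toProcess.CompleteAdding();
        });

        var processor = Task.Run(() =>
        {
            foreach (var line in toProcess.GetConsumingEnumerable())
                toWrite.Add(Transform(line));
            toWrite.CompleteAdding();
        });

        var writer = Task.Run(() =>
        {
            using var output = new StreamWriter("output.txt");
            foreach (var line in toWrite.GetConsumingEnumerable())
                output.WriteLine(line);
        });

        Task.WaitAll(reader, processor, writer);
    }

    static string Transform(string line) => line; // placeholder processing
}
```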