I want to implement an external merge sort in Scala. It is used to sort huge files that cannot fit in main memory in their entirety.
Details can be found here: How external merge sort algorithm works?
Now I need to read chunks of the file, sort them, write them to disk, and so on.
What's the most idiomatic/functional way of reading and writing a huge file in parts?
- If I use the 'Source.fromFile(filename).getLines' method, I know I get an iterator over the file and can read it in parts. But when I get an iterator, how much of the file is read into main memory? Is it possible to read a fixed number of bytes from it?
- Any other suggestions on how to implement this? Maybe some pointers to an fs2 (scalaz-stream) / Akka Streams / Monix implementation where I can treat the file as a stream and read it in chunks?
Chunked Sorting/Writing
Suppose you want to keep N numbers in memory at a time and further suppose that you are given a function which writes the N sorted numbers to a file:
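A minimal sketch of such a function, under the assumption that the values are plain integers written one per line to a fresh temporary file (only the contract matters here):

```scala
import java.io.{File, PrintWriter}

// Sketch: write one sorted chunk to its own temporary file and return the
// file handle so the merge phase can re-read it later.
def writeToFile(sorted: Seq[Int]): File = {
  val file = File.createTempFile("chunk-", ".txt")
  val out = new PrintWriter(file)
  try sorted.foreach(out.println)
  finally out.close()
  file
}
```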
As indicated in your question, an Iterator can be used to keep only N numbers in RAM at a time, sort them, and write them to the intermediate files:
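A sketch of that loop, assuming the input holds one integer per line (the filename and the value of N are placeholders):

```scala
import java.io.File
import scala.io.Source

val filename = "huge-input.txt" // placeholder path
val N = 100000                  // chunk size; tune to available memory

// getLines is lazy and grouped(N) materializes one chunk at a time,
// so at most N parsed values are held in memory during this phase.
val chunkFiles: List[File] =
  Source.fromFile(filename)
    .getLines()
    .map(_.toInt)
    .grouped(N)
    .map(chunk => writeToFile(chunk.sorted))
    .toList
```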
Now you have each group of N numbers in a different file. The only thing left to do is merge the files together.
Merging
Given some reading function that returns an Iterator over each of the sub-files:
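For example, a simple reader that exposes each intermediate file as a lazy Iterator[Int] (a sketch; the name readFileIterators is assumed here):

```scala
import java.io.File
import scala.io.Source

// One lazy iterator per chunk file; lines are parsed on demand,
// so no file is ever loaded into memory in its entirety.
def readFileIterators(files: Seq[File]): Seq[Iterator[Int]] =
  files.map(f => Source.fromFile(f).getLines().map(_.toInt))
```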
We can write a function that will return a new Iterator that gets the values from each file and sorts them:
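One possible sketch is a pairwise interleave-and-sort rather than a heap-based k-way merge; this simpler scheme is what leads to the imperfect sort discussed at the end:

```scala
// Fold the per-file iterators into one: walk two iterators in lock step
// and emit each pair in sorted order. Lazy throughout, so only one value
// per file is buffered at any moment.
// Caveat: zip stops at the shorter iterator, so this sketch assumes the
// runs are equally sized.
def mergeIterators(iterators: Seq[Iterator[Int]]): Iterator[Int] =
  iterators.reduce { (iterA, iterB) =>
    iterA.zip(iterB).flatMap { case (a, b) => Seq(a, b).sorted }
  }
```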
Note: the above function will keep a single Integer in memory for each file, so the RAM usage depends on how many files writeToFile created.

Now just write the values to a file:
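Draining the merged iterator, continuing with the sketched helpers from above (the output path is a placeholder):

```scala
import java.io.PrintWriter

// Stream the merged values into the final output file, one per line.
val out = new PrintWriter("sorted-output.txt") // placeholder path
try mergeIterators(readFileIterators(chunkFiles)).foreach(out.println)
finally out.close()
```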
Imperfect Sorting
One thing to note: if N is small and the source file is not sufficiently random, then the solution will not produce a perfect sort. Example: suppose N = 2 and the initial list was [10, 11, 0, 1]; then the algorithm, after one pass, would produce [0, 10, 1, 11] as the result. This happens because the pairwise merge only orders values within each pair of runs rather than performing a full k-way merge, so values from different chunks can stay out of order; a true k-way merge of the sorted runs (e.g. via a priority queue over the head of each file) would yield a perfect sort in one pass.