Scala : File reading to Build an external Merge Sort

602 views Asked by At

I want to implement an external Merge Sort in Scala. It is used to sort huge files which can not fit in the main memory in their entirety.

Details can be found here :- How external merge sort algorithm works?

Now, I need to read chunks of the file, sort it and write it to disk etc etc

What's the most idiomatic / functional way of reading/writing a huge file in parts ?

  • If I use the 'Source.fromFile(filename).getLines' method, I know I get an iterator over the file and can read it in parts. But when I get an iterator, how much of the file is read in the main memory ? Is it possible to read a fixed number of bytes from it ?
  • Any other suggestions on how to implement this ? May some pointers to a fs2(scalaz-stream)/Akka Stream / Monix implementation where I can treat the file as a Stream and read in chunks ?
1

There are 1 answers

4
Ramón J Romero y Vigil On

Chunked Sorting/Writing

Suppose you want to keep N numbers in memory at a time and further suppose that you are given a function which writes the N sorted numbers to a file:

val N : Int = ???

val writeToFile : Seq[Int] => Unit = ???

As indicated in your question, an Iterator can be used to only keep N numbers in RAM at a time to sort them and write them to the intermediate files:

val sourceFileName : String = ???

val sortAndWrite : Seq[Int] => Unit = 
  (_.sorted) andThen writeToFile

Source
  .fromFile(sourceFileName)
  .getLines
  .map(_.toInt)
  .grouped(N)
  .foreach(sortAndWrite)

Now you have each group of N numbers in a different file. The only thing left to do is merge the files together.

Merging

Given some reading function that returns the Iterators from each of the sub-files:

val subFiles : Iterable[Iterator[String]] = ???

We can write a function that will return a new Iterator that gets the values from each file and sorts them:

val mergeSort : Iterable[Iterator[String]] => Iterator[Int] = 
  (fileIterators) => {

    val nonEmptyFiles = fileIterators filter (_.hasNext)

    nonEmptyFiles
      .map(_.next)
      .map(_.toInt)
      .sorted
      .toIterator ++ mergeSort(nonEmptyFiles)
  }

Note: the above function will keep a single Integer in memory for each file, therefore the RAM usage is dependent on how many different files that writeToFile created.

Now just write the values to a file:

 val destinationFileName : String = ???

 val writer : Writer = new FileWriter(destinationFileName)

 mergeSort(subFiles) foreach (i => writer write i.toString)

Imperfect Sorting

One thing to note: if N is small and the source file is not sufficiently random then the solution will not produce a perfect sort. Example: suppose N = 2 and the initial list was [10,11,0,1] then the algorithm, after one pass, would produce [0,10,1,11] as the result.