Use HashMap to store file positions and access these randomly using RandomAccessFile


Initial problem:

I am joining two CSVs using Java. I can "stream" the larger one (read, process, and write it line by line), but the smaller one resides in memory (in a HashMap, to be precise), because I need to look up the key of each row of the big CSV while going through it. The problem: if the "small" CSV is too large to keep in memory, I run into OutOfMemoryError.

I know I could avoid this by loading both CSVs into a database and performing the join there, but that is infeasible in my application. Is there a Java wrapper (or some other kind of object) that would let me keep only the HashMap's keys in memory and put all of its values into a temp file on disk (in a self-managed fashion)?


Update:

Following the comments from ThomasKläger and JacobG, I solved the problem in the following way:

Use a HashMap to map each row's key to that row's start and end position in the small CSV, obtained via RandomAccessFile's .getFilePointer().

While going through the large CSV, I now use the HashMap to look up the matching row's position, .seek(pos) to it, and read the row from disk.
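For anyone landing here, a minimal sketch of that approach might look like the code below. It is not my exact code: the class and method names (OffsetIndexJoin, buildIndex, readUtf8Line, join) are made up for illustration, and it assumes the join key is the first column, keys are unique in the small CSV, both files are UTF-8, and no field contains quoted commas or embedded newlines. It also stores only each row's start offset, since readLine() stops at the line terminator anyway.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; names are illustrative only.
class OffsetIndexJoin {

    // RandomAccessFile.readLine() maps each byte to one char (Latin-1 style),
    // so re-decode the raw bytes as UTF-8 to get the real text.
    static String readUtf8Line(RandomAccessFile raf) throws IOException {
        String raw = raf.readLine();
        if (raw == null) {
            return null;
        }
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Pass over the small CSV once: key (first column) -> byte offset of the row start.
    // Only keys and offsets stay in memory; the row contents stay on disk.
    static Map<String, Long> buildIndex(RandomAccessFile raf) throws IOException {
        Map<String, Long> index = new HashMap<>();
        raf.seek(0);
        long rowStart = raf.getFilePointer();
        String line;
        while ((line = readUtf8Line(raf)) != null) {
            if (!line.isEmpty()) {
                // Assumption: unique keys; a duplicate key overwrites the earlier offset.
                index.put(line.split(",", 2)[0], rowStart);
            }
            rowStart = raf.getFilePointer();
        }
        return index;
    }

    // Stream the large CSV; for each row whose key is indexed, seek into the
    // small CSV, read the matching row, and write the joined line (inner join).
    static void join(Path largeCsv, Path smallCsv, Writer out) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(smallCsv.toFile(), "r");
             BufferedReader large = Files.newBufferedReader(largeCsv, StandardCharsets.UTF_8)) {
            Map<String, Long> index = buildIndex(raf);
            String row;
            while ((row = large.readLine()) != null) {
                Long pos = index.get(row.split(",", 2)[0]);
                if (pos == null) {
                    continue; // no match in the small CSV
                }
                raf.seek(pos);
                String[] match = readUtf8Line(raf).split(",", 2);
                // Append the small row's non-key columns, if it has any.
                out.write(match.length > 1 ? row + "," + match[1] : row);
                out.write("\n");
            }
        }
    }
}
```

Calling join(largePath, smallPath, writer) with a buffered Writer streams the joined rows out. Note that RandomAccessFile.readLine() is unbuffered, so building the index over a big file can be slow; wrapping the reads in your own buffer is a possible refinement.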

This is a working solution, thanks a lot.


There is 1 answer

fxrbfg On

From what you describe, you need something like off-heap collections, for example the MapDB library, http://www.mapdb.org/. From its description:

MapDB provides Java Maps, Sets, Lists, Queues and other collections backed by off-heap or on-disk storage. It is a hybrid between java collection framework and embedded database engine.