Initial problem:
I have the following issue: I am joining two CSVs using Java. While I can "stream" one of the CSVs (read in, process, write out line by line), the smaller one resides in memory (a HashMap, to be precise), because I need to look up the key of each row of the big CSV while going through it. The problem: if the "small" CSV is too large to keep in memory, I run into OutOfMemoryError. While I know I could avoid this by reading both CSVs into a database and performing the join there, that is infeasible in my application. Is there a Java wrapper (or some other kind of object) that would allow me to keep only the HashMap's keys in memory and put all of its values into a temporary file on disk (in a self-managed fashion)?
Update:
After the comments of ThomasKläger and JacobG, I solved the problem in the following way:

Use a HashMap to store each row's key together with that row's start and end position, obtained via RandomAccessFile's .getFilePointer().

While going through the large CSV, I now use the HashMap to look up each matching row's position, .seek(pos) to it, and read the row.

This is a working solution, thanks a lot.
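The approach from the update can be sketched roughly as below. This is a minimal, self-contained sketch, not the poster's actual code: the class name, file layout, and column choices are hypothetical, and for simplicity it stores only each row's start offset and relies on readLine() to find the end of the row.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CsvOffsetJoin {

    // Scan the small CSV once, keeping only key -> byte offset in memory.
    static Map<String, Long> buildIndex(File smallCsv, int keyColumn) throws IOException {
        Map<String, Long> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(smallCsv, "r")) {
            long pos = raf.getFilePointer(); // offset of the line about to be read
            String line;
            while ((line = raf.readLine()) != null) {
                String key = line.split(",")[keyColumn];
                index.put(key, pos);
                pos = raf.getFilePointer(); // offset of the next line
            }
        }
        return index;
    }

    // Jump to a recorded offset and re-read that row on demand.
    static String readRow(RandomAccessFile raf, long pos) throws IOException {
        raf.seek(pos);
        return raf.readLine();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data; real code would stream an existing large CSV.
        Path small = Files.createTempFile("small", ".csv");
        Files.write(small, List.of("a,1", "b,2", "c,3"));

        Map<String, Long> index = buildIndex(small.toFile(), 0);
        try (RandomAccessFile raf = new RandomAccessFile(small.toFile(), "r")) {
            for (String bigRow : List.of("a,x", "c,y", "z,q")) {
                String key = bigRow.split(",")[0];
                Long pos = index.get(key);
                if (pos != null) {
                    // Join: append the small row's value column to the big row.
                    System.out.println(bigRow + "," + readRow(raf, pos).split(",")[1]);
                }
            }
        }
        Files.delete(small);
    }
}
```

One caveat worth noting: RandomAccessFile.readLine() decodes bytes as if they were Latin-1, so for multi-byte encodings such as UTF-8 it is safer to store start and end offsets (as the update does) and decode the byte range with an explicit charset instead.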
Based on what you describe, you need something like off-heap collections, for example the MapDb library, http://www.mapdb.org/. From its description: