I have a Java program that is supposed to read independent serialized objects from a file (there are no interdependencies between the objects), process them, and then write them as independent serialized objects to another file. It looks something like this (forgive any typos, this is hand-written, as the real program does other things, like figuring out how each object should be processed):
try {
    ObjectOutputStream fileOut = new ObjectOutputStream(new FileOutputStream("outputFile"));
    ObjectInputStream fileIn = new ObjectInputStream(new FileInputStream("inputFile"));
    for (int i = 0; i < numThingsInFile; i++) {
        MyObject thingToProcess = (MyObject) fileIn.readObject();
        thingToProcess.process();
        fileOut.writeObject(thingToProcess);
        fileOut.flush();
    }
    fileIn.close();
    fileOut.close();
} catch (IOException | ClassNotFoundException e1) {
    e1.printStackTrace();
}
The code does the processing correctly. And, as far as I can tell, I discard thingToProcess on every iteration of the loop, so each object should get garbage collected at the computer's leisure.
However, the memory used by the program keeps increasing as it reads more things until it slows to a crawl. I used a heap dump and an analyzer to look at it and it says the ObjectInputStream fileIn is taking up an absurd amount of memory. Specifically, it says the "entries" array is huge. It is significantly larger than the file it originated from. The file is 400 kB, but this entries array is over 600 MB just from reading that file. I also have other threads reading other files in the same way, so I am running out of memory. I know I could give Java more memory, but that is a band-aid solution that doesn't fix the underlying problem, as I want this process to work with larger files with more objects.
I would prefer not to break up the files more than they already are.
Is there a way to have the ObjectInputStream not store previous entries or clear the previous entries?
I've tried adding a BufferedInputStream and using mark/reset (before I realized the issue was the entries array within ObjectInputStream):
ObjectInputStream fileIn = new ObjectInputStream(new BufferedInputStream(new FileInputStream("inputFile")));
I've tried using readUnshared():
MyObject thingToProcess = (MyObject) fileIn.readUnshared();
This improved things and let me run my program, but the entries array still accumulated hundreds of thousands of objects and kept expanding as time went on, which would cause problems with more objects.
I've tried calling fileOut.reset(), but this did not resolve the issue. On the idea that the file may have been formatted strangely, I also added resets to the ObjectOutputStream that originally wrote inputFile.
You cannot clear back-references on an ObjectInputStream, as the ObjectInputStream must be prepared to handle the back-references of the incoming data, as produced by the writing side. That's why the writing side is responsible for calling reset(), to enforce that no back-references may occur after this point.
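For instance, a producer written along these lines lets readers discard each object shortly after it has been read (this is only an illustrative sketch; thingsToWrite stands in for however the objects are actually produced):

try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("inputFile"))) {
    for (MyObject thing : thingsToWrite) {
        out.writeObject(thing);
        out.reset(); // writes a reset marker; readers drop their back-references when they reach it
    }
}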
Note that this data sharing even applies to the class descriptors of the stored instances, so a hypothetical way of resetting the input stream that did not match the output stream would break as soon as you try to read the next instance of MyObject, as it has a back-reference to the previously written MyObject.class.

This also implies that calling reset() can produce significantly bigger files: even if the MyObject instances weren't shared anyway, there might be more shared data than you were aware of, which will become duplicated after reset().
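For example, a quick sketch along these lines (the Payload class and the count of 1,000 objects are made up purely for illustration) makes the size difference visible by serializing into a byte array with and without reset():

import java.io.*;

public class SizeDemo {
    static class Payload implements Serializable {
        int value;
        Payload(int value) { this.value = value; }
    }

    static int sizeOf(boolean withReset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < 1_000; i++) {
                out.writeObject(new Payload(i));
                if (withReset) {
                    out.reset(); // forgets all back-references, including the one to the class descriptor
                }
            }
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("without reset(): " + sizeOf(false) + " bytes");
        System.out.println("with reset():    " + sizeOf(true) + " bytes");
    }
}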
When you call readUnshared(), you enforce that the stream does not store a back-reference, but when it is not paired with a writeUnshared() on the producing side, there is the risk that the writing side did write the reference again, which will produce an exception on the reading side.

Using either reset() on the writing side or readUnshared() on the reading side has the intended effect of not maintaining references in the ObjectInputStream, which can be demonstrated with a small test program: Demo on tio.run
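A minimal sketch of such a test (this is not the linked demo; it tracks previously read objects through WeakReferences, and since System.gc() is only a hint to the JVM, the exact numbers may vary from run to run):

import java.io.*;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class BackReferenceDemo {
    static class Payload implements Serializable {
        int value;
        Payload(int value) { this.value = value; }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Write five objects, calling reset() after each one
        // (alternatively, use out.writeUnshared(...) and drop the reset() call).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < 5; i++) {
                out.writeObject(new Payload(i));
                out.reset();
            }
        }

        // Read the objects back and count how many of the previously read
        // instances the ObjectInputStream is still keeping reachable.
        List<WeakReference<Payload>> read = new ArrayList<>();
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            for (int i = 0; i < 5; i++) {
                Payload p = (Payload) in.readObject(); // or in.readUnshared()
                read.add(new WeakReference<>(p));
                p = null;    // drop our own strong reference
                System.gc(); // only a hint to the JVM
                long alive = read.stream().filter(r -> r.get() != null).count();
                System.out.println("objects still reachable after reading #" + i + ": " + alive);
            }
        }
    }
}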
One interesting point is that readUnshared() will not maintain a reference in the first place, whereas the reset is only performed on the reading side when it encounters the reset marker on the next read operation, so garbage collection is one object behind compared to the readUnshared() approach. Further, as noted above, the serialized data is much bigger when using reset().

So the best option for your scenario is to use writeUnshared() on the producing side, paired with readUnshared() on the reading side.
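Applied to the loop from the question, that would look roughly like this (assuming the program that originally writes inputFile is changed to call writeUnshared() instead of writeObject(); using writeUnshared() for the processed output as well keeps fileOut from accumulating back-references, too):

try (ObjectInputStream fileIn = new ObjectInputStream(new FileInputStream("inputFile"));
     ObjectOutputStream fileOut = new ObjectOutputStream(new FileOutputStream("outputFile"))) {
    for (int i = 0; i < numThingsInFile; i++) {
        // readUnshared() does not keep a reference to the returned object
        MyObject thingToProcess = (MyObject) fileIn.readUnshared();
        thingToProcess.process();
        // writeUnshared() writes the object without a reusable back-reference
        fileOut.writeUnshared(thingToProcess);
        fileOut.flush();
    }
} catch (IOException | ClassNotFoundException e1) {
    e1.printStackTrace();
}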