As part of a larger project I need to create an NGram model in Java, which is not optimal but also not optional. I am using JDK 20 and run the code from VS Code. When I run it, I get:
` Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.HashMap.resize(HashMap.java:710)
at java.base/java.util.HashMap.putVal(HashMap.java:635)
at java.base/java.util.HashMap.put(HashMap.java:618)
at Ngram.NGramNode.addNGram(NGramNode.java:277)
at Ngram.NGramNode.addNGram(NGramNode.java:280)
at Ngram.NGram.addNGramSentence(NGram.java:157)
at com.glmadu.editdistance.TRspellChecker.getCorpus(TRspellChecker.java:68)
at com.glmadu.editdistance.TRspellChecker.checkFileSpell(TRspellChecker.java:22)
at com.glmadu.App.main(App.java:21)`
I increased the heap space to 8 GB in launch.json, and the corpus file is around 750 MB. Here is the relevant code:

    private static void getCorpus(String output) {
        ArrayList<ArrayList<String>> corpus = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader("path/to/corpus"))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.split(" "); // line 63
                ArrayList<String> sentence = new ArrayList<>();
                for (String token : tokens) {
                    sentence.add(token); // line 68
                }
                corpus.add(sentence);
            }
            NGram<String> nGram = new NGram<>(corpus, 2);
            nGram.saveAsText(output);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

I do not understand how I can still run out of heap space after raising the limit to 8 GB. I also tried 10 GB and 12 GB, but then I get
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
` at java.base/java.lang.String.split(String.java:3138)
at java.base/java.lang.String.split(String.java:3212)
at com.glmadu.editdistance.TRspellChecker.getCorpus(TRspellChecker.java:63)
at com.glmadu.editdistance.TRspellChecker.checkFileSpell(TRspellChecker.java:22)
at com.glmadu.App.main(App.java:21)
`
error instead. I am running this from VS Code. Thanks in advance.
I tried increasing the heap size and I tried reading fewer lines, but I still got the error even when reading only the first 1000 lines. I also tried not saving the NGram model, and from that I conclude that the problem is not the NGram modeling itself but more likely that the arrays take up too much memory. Also, when I check memory usage in Task Manager, it sits at 4-5 GB and never gets close to the 8 GB I allocated.
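Since Task Manager never shows usage near the configured limit, it may be worth confirming that the JVM actually received the larger heap setting. Below is a minimal sketch for that check. `HeapCheck` and `maxHeapMiB` are hypothetical names used only for illustration; the sketch just prints what `Runtime.getRuntime().maxMemory()` reports.

```java
// Hypothetical helper for illustration; not part of the project code.
final class HeapCheck {

    // Maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If this is run with the same launch configuration and the printed value is far below roughly 8192 MiB, the heap setting is probably not reaching the JVM.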
Alright, here is how I "solved" the problem. I set an initial array size following @Sascha's response, but that still ran into problems, so I divided the work into parts and merged them afterwards.
The method takes a String path to the output file. To save each part it calls saveNgram, which is quite basic: it takes the output path, appends partx to it, and saves the NGram there.
At the end, if there are any leftover lines, it saves once more and then calls mergeNGram, which is just a BufferedReader/Writer that writes the parts to the final file.
It is nowhere near perfect, but it solves my current problem and that is all I can do at the moment. Special thanks to @Sascha for the help. I am leaving this here so anyone with a similar problem can find it and adapt it.
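The chunk-and-merge idea described above could be sketched roughly as follows. This is not the original code: `ChunkedCorpus`, `forEachChunk`, and `mergeParts` are hypothetical names, and the handler passed to `forEachChunk` is where building `new NGram<>(chunk, 2)` and saving it to a part file would go. The merge shown is plain concatenation of the part files, on the assumption that this is all the final file needs.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of the NGram library.
final class ChunkedCorpus {

    // Reads the corpus line by line and hands the handler at most
    // chunkSize tokenized sentences at a time, so only one chunk is
    // held in memory. Returns the number of chunks handled.
    static int forEachChunk(BufferedReader reader, int chunkSize,
                            Consumer<ArrayList<ArrayList<String>>> handler) {
        ArrayList<ArrayList<String>> chunk = new ArrayList<>(chunkSize);
        int parts = 0;
        Iterator<String> lines = reader.lines().iterator();
        while (lines.hasNext()) {
            String[] tokens = lines.next().split(" ");
            chunk.add(new ArrayList<>(Arrays.asList(tokens)));
            if (chunk.size() == chunkSize) {
                handler.accept(chunk);
                parts++;
                // Start a fresh list so the old chunk can be garbage collected.
                chunk = new ArrayList<>(chunkSize);
            }
        }
        if (!chunk.isEmpty()) { // leftover lines that did not fill a chunk
            handler.accept(chunk);
            parts++;
        }
        return parts;
    }

    // Concatenates the part files, in order, into the target file.
    static void mergeParts(List<Path> parts, Path target) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target)) {
            for (Path part : parts) {
                try (BufferedReader in = Files.newBufferedReader(part)) {
                    in.transferTo(out);
                }
            }
        }
    }
}
```

The chunk size is a tuning knob: smaller chunks keep peak memory lower at the cost of more part files to merge.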