I am trying to train a Doc2Vec model using the following code:
// training corpus, one document per line
String inputPath = "input_data.csv";
File file = new File(inputPath);
SentenceIterator iter = new BasicLineIterator(file);

AbstractCache<VocabWord> cache = new AbstractCache<>();

TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());

// auto-generated document labels: DOC_0, DOC_1, ...
LabelsSource source = new LabelsSource("DOC_");

ParagraphVectors vec = new ParagraphVectors.Builder()
    .minWordFrequency(1)
    .iterations(5)
    .epochs(1)
    .layerSize(100)
    .learningRate(0.025)
    .labelsSource(source)
    .windowSize(5)
    .iterate(iter)
    .trainWordVectors(false)
    .vocabCache(cache)
    .tokenizerFactory(t)
    .sampling(0)
    .workers(4)
    .build();

vec.fit();

// serialize the trained model to disk
File tempFile = new File("trained_model.zip");
WordVectorSerializer.writeParagraphVectors(vec, tempFile);
This code works fine for small input files.

When I run the same code on a large file (18 GB), I get the following error:
......... o.d.m.s.SequenceVectors - Time spent on training: 5667912 ms
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: Stream Closed
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeParagraphVectors(WordVectorSerializer.java:477)
    at org.deeplearning4j.examples.nlp.paragraphvectors.ParagraphVectorsTextExample.main(ParagraphVectorsTextExample.java:73)
Caused by: java.lang.RuntimeException: java.io.IOException: Stream Closed
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeWordVectors(WordVectorSerializer.java:393)
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeParagraphVectors(WordVectorSerializer.java:687)
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeParagraphVectors(WordVectorSerializer.java:475)
    ... 1 more
Caused by: java.io.IOException: Stream Closed
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeWordVectors(WordVectorSerializer.java:392)
    ... 3 more
I am not sure what I am doing wrong. Is there any way around this?
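Since the trace shows the stream being closed inside the serializer before a final flush, one workaround I am considering is managing the output stream myself with try-with-resources, so it is flushed and closed exactly once under my control. This is only a sketch: it assumes `WordVectorSerializer.writeParagraphVectors(ParagraphVectors, OutputStream)` is available as an overload in my DL4J version (that call is shown in a comment below and replaced by a placeholder write so the stream-handling part runs standalone):

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class SaveModelSketch {
    public static void main(String[] args) throws IOException {
        File tempFile = new File("trained_model.zip");
        // try-with-resources guarantees the stream is flushed and closed exactly once,
        // so no other code path can close it underneath the serializer
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(tempFile))) {
            // With DL4J on the classpath this would be (assumed OutputStream overload):
            // WordVectorSerializer.writeParagraphVectors(vec, out);
            out.write("placeholder".getBytes()); // stands in for the serializer's writes
        }
        System.out.println(tempFile.length() > 0);
    }
}
```

I have not been able to confirm whether the overload exists in the version I am running, so this may not apply.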