Stream closed: writeParagraphVectors for large file deeplearning4j

I am trying to train a Doc2Vec model with deeplearning4j's ParagraphVectors using the following code:

import java.io.File;

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache;
import org.deeplearning4j.text.documentiterator.LabelsSource;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

String inputPath = "input_data.csv";
File file = new File(inputPath);
SentenceIterator iter = new BasicLineIterator(file); // one document per line

AbstractCache<VocabWord> cache = new AbstractCache<>();

TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());

LabelsSource source = new LabelsSource("DOC_"); // documents labelled DOC_0, DOC_1, ...

ParagraphVectors vec = new ParagraphVectors.Builder()
    .minWordFrequency(1)
    .iterations(5)
    .epochs(1)
    .layerSize(100)
    .learningRate(0.025)
    .labelsSource(source)
    .windowSize(5)
    .iterate(iter)
    .trainWordVectors(false)
    .vocabCache(cache)
    .tokenizerFactory(t)
    .sampling(0)
    .workers(4)
    .build();

vec.fit();
File tempFile = new File("trained_model.zip");
WordVectorSerializer.writeParagraphVectors(vec, tempFile);
  • This code works for a small input file.
  • When I run the same code on a large file (18 GB), training completes, but saving the model fails with the following error:

    .........
    o.d.m.s.SequenceVectors - Time spent on training: 5667912 ms
    Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: Stream Closed
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeParagraphVectors(WordVectorSerializer.java:477)
    at org.deeplearning4j.examples.nlp.paragraphvectors.ParagraphVectorsTextExample.main(ParagraphVectorsTextExample.java:73)
    Caused by: java.lang.RuntimeException: java.io.IOException: Stream Closed
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeWordVectors(WordVectorSerializer.java:393)
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeParagraphVectors(WordVectorSerializer.java:687)
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeParagraphVectors(WordVectorSerializer.java:475)
    ... 1 more
    Caused by: java.io.IOException: Stream Closed
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeWordVectors(WordVectorSerializer.java:392)
    ... 3 more
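For context (my reading of the trace, not a confirmed diagnosis): `java.io.IOException: Stream Closed` at the bottom is the JDK's error for writing to or flushing a `FileOutputStream` whose file descriptor has already been closed. The trace shows it thrown from `FilterOutputStream.close()` while flushing a `BufferedOutputStream`, which is exactly what happens when the underlying stream is closed twice, e.g. once inside `writeWordVectors` and again inside `writeParagraphVectors`. A minimal JDK-only sketch that reproduces the same exception:

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class StreamClosedDemo {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("demo", ".bin");
        tmp.deleteOnExit();

        FileOutputStream fos = new FileOutputStream(tmp);
        BufferedOutputStream out = new BufferedOutputStream(fos);
        out.write(new byte[8]);   // stays in the buffer, not yet on disk
        fos.close();              // underlying stream closed behind the buffer's back

        try {
            out.close();          // close() flushes the buffer -> write on a closed stream
        } catch (IOException e) {
            System.out.println(e.getMessage()); // typically "Stream Closed"
        }
    }
}
```

If that is what happens here, the double close is inside the serializer rather than in the calling code, so trying a newer deeplearning4j release is a reasonable first step.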
    

I am not sure what I am doing wrong. Is there any way around this?
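One pattern that may help narrow this down (an assumption on my part, not a verified fix): `WordVectorSerializer` appears to also offer a `writeParagraphVectors(ParagraphVectors, OutputStream)` overload, which lets the caller own the stream's lifecycle with try-with-resources so it is closed exactly once, after all writes finish. A sketch of that pattern; the DL4J call is commented out and replaced with a placeholder write so the snippet stays self-contained, and the file name is hypothetical:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveModelSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder path; in the real code this would be "trained_model.zip".
        Path target = Files.createTempFile("trained_model", ".zip");

        // try-with-resources guarantees exactly one close, after all writes
        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream(target.toFile()))) {
            // WordVectorSerializer.writeParagraphVectors(vec, out);  // DL4J call (assumed overload)
            out.write("placeholder".getBytes());                     // stand-in so the sketch runs
        }

        System.out.println(Files.size(target)); // size of the placeholder payload
        Files.delete(target);
    }
}
```

If the same "Stream Closed" error still appears with a caller-managed stream, that would point at the serializer closing the stream internally.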
