Mallet topic model - inconsistent results with serialized file


I trained a topic model with Mallet and want to serialize it for later use. I ran the trained model on two test documents, then serialized it, deserialized it, and ran the loaded model on the same documents. The results were completely different.

Is there anything wrong with the way I'm saving/loading the model (code attached)?

Thanks!

List<Pipe> pipeList = initPipeList();
// Begin by importing documents from text to feature sequences

InstanceList instances = new InstanceList(new SerialPipes(pipeList));

for (String document : documents) {
    Instance inst = new Instance(document, "","","");
    instances.addThruPipe(inst);
}

ParallelTopicModel model = new ParallelTopicModel(numTopics, alpha_t * numTopics, beta_w);
model.addInstances(instances);
model.setNumThreads(numThreads);
model.setNumIterations(numIterations);
model.estimate();

printProbabilities(model, "doc 1"); // I replaced the contents of the docs due to copyright issues
printProbabilities(model, "doc 2");

model.write(new File("model.bin"));
model = ParallelTopicModel.read(new File("model.bin")); // read() takes a File, not a String

printProbabilities(model, "doc 1");
printProbabilities(model, "doc 2");

Definition of printProbabilities():

public void printProbabilities(ParallelTopicModel model, String doc) {

    List<Pipe> pipeList = initPipeList();

    InstanceList instances = new InstanceList(new SerialPipes(pipeList));
    instances.addThruPipe(new Instance(doc, "", "", ""));

    // Arguments after the instance: numIterations = 10, thinning = 1, burnIn = 5
    double[] probabilities = model.getInferencer().getSampledDistribution(instances.get(0), 10, 1, 5);

    for (int i = 0; i < probabilities.length; i++) {
        double probability = probabilities[i];
        if (probability > 0.01) {
            System.out.println("Topic " + i + ", probability: " + probability);
        }
    }
}

There are 2 answers

Nikola Morena (best answer)

You have to use the same pipe for training and for classification. During training, the pipe's data alphabet is updated with each training instance. Calling new SerialPipes(pipeList) again does not reproduce that pipe, because its data alphabet starts out empty. Save and load the pipe (or the instance list that contains it) along with the model, and use that pipe to add test instances.
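
To see why a fresh pipe breaks things, here is a minimal sketch that uses a toy alphabet rather than Mallet itself. `ToyAlphabet` and `encode` are hypothetical names invented for this illustration, and the sketch assumes (as Mallet's alphabets do, to my understanding) that each unseen token simply gets the next free index. The same text then maps to different feature indices depending on which alphabet encodes it, so a model trained against one alphabet misreads input encoded with another.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AlphabetDemo {
    // Toy stand-in for a growing data alphabet (not Mallet's class):
    // every unseen token is assigned the next free index.
    static class ToyAlphabet {
        private final Map<String, Integer> indexOf = new HashMap<>();

        int lookupIndex(String token) {
            Integer idx = indexOf.get(token);
            if (idx == null) {
                idx = indexOf.size();
                indexOf.put(token, idx);
            }
            return idx;
        }
    }

    // Converts whitespace-separated text into a list of feature indices.
    static List<Integer> encode(ToyAlphabet alphabet, String text) {
        List<Integer> ids = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            ids.add(alphabet.lookupIndex(token));
        }
        return ids;
    }

    public static void main(String[] args) {
        // "Training": the alphabet learns apple=0, banana=1, cherry=2.
        ToyAlphabet trained = new ToyAlphabet();
        encode(trained, "apple banana cherry");

        // Same test text, encoded with the trained alphabet and with a fresh one.
        System.out.println(encode(trained, "cherry apple"));           // [2, 0]
        System.out.println(encode(new ToyAlphabet(), "cherry apple")); // [0, 1]
    }
}
```

In the question's code, the equivalent fix would presumably be to build the test InstanceList from the training pipe, for example with instances.getPipe(), instead of calling initPipeList() again, and to persist the training InstanceList next to model.bin so the same pipe is available after loading.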

Sir Cornflakes

When you don't fix a random seed, every run of Mallet gives you a different topic model (with the numbers of the topics permuted, some topics slightly different, other topics very different).

Fix the random seed to get replicable topics.
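
As a rough illustration of the effect, here is a sketch using plain java.util.Random rather than Mallet. `sampleTopics` is a hypothetical helper standing in for a sampling step: two generators created with the same seed produce the same sequence of draws, while unseeded generators generally will not.

```java
import java.util.Arrays;
import java.util.Random;

public class SeedDemo {
    // Draws n topic indices uniformly at random; a stand-in for one sampling pass.
    static int[] sampleTopics(Random rng, int numTopics, int n) {
        int[] topics = new int[n];
        for (int i = 0; i < n; i++) {
            topics[i] = rng.nextInt(numTopics);
        }
        return topics;
    }

    public static void main(String[] args) {
        // Same seed, same sequence of draws.
        int[] first = sampleTopics(new Random(42), 10, 5);
        int[] second = sampleTopics(new Random(42), 10, 5);
        System.out.println(Arrays.equals(first, second)); // true
    }
}
```

For the code in the question, that would mean calling model.setRandomSeed(...) with a fixed value before model.estimate(). As far as I know the inferencer returned by getInferencer() has its own setRandomSeed as well, which matters here because getSampledDistribution is itself a sampling procedure.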