MALLET Topic Modeling: Inconsistent Estimations


I'm using MALLET to train a ParallelTopicModel. After training, I get a TopicInferencer, take a sentence, run it through the inferencer 15 times, and check the results. I'm finding that for some topics, the estimation is different each time and not consistent at all.

For example, with 20 topics, this is the output I'm getting for the estimated topic probabilities, for the same sentence:

[0.004888044738437717, 0.2961123293878907, 0.0023192114841146965, 0.003828168015645214, 0.3838058036596986, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412127, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.26812948669964976, 0.0023192114841146965, 0.0038281680156452146, 0.35582296097145744, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.09765976509353493, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.052283368409032215, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2681294866996498, 0.0023192114841146965, 0.003828168015645214, 0.3931334178891125, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839043, 0.0019749423390935396, 0.002792447952547967, 0.018537939424381665, 0.09765976509353493, 0.03773855412711243, 0.007213888668919175, 0.0029028156321696105, 0.024300525720791197, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2588018724702361, 0.0023192114841146965, 0.0038281680156452146, 0.3278401182832166, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.06967692240529397, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2681294866996498, 0.0023192114841146965, 0.0038281680156452146, 0.5143924028714901, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412126, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.014972911491377543, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.20283618709375414, 0.0023192114841146965, 0.0038281680156452146, 0.29985727559497544, 0.0023130490636768045, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.11631499355236223, 0.028410939897698752, 0.007213888668919175, 0.002902815632169611, 0.024300525720791197, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437716, 0.43602654282909553, 0.0023192114841146965, 0.0038281680156452146, 0.2998572755949755, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.09765976509353493, 0.03773855412711241, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.07224958788196291, 0.0023192114841146965, 0.0038281680156452146, 0.3278401182832165, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412129, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.04295575417961857, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2588018724702361, 0.0023192114841146965, 0.0038281680156452146, 0.4490991032655942, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412127, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.07093859686785953, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.24014664401140884, 0.0023192114841146965, 0.0038281680156452146, 0.26254681867732077, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.018537939424381665, 0.06967692240529395, 0.05639378258593975, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2588018724702361, 0.0023192114841146965, 0.0038281680156452146, 0.3744781894302849, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.06967692240529398, 0.047066168356526085, 0.007213888668919175, 0.002902815632169611, 0.06161098263844586, 0.0085078656328731, 0.0071022047541209835, 0.012203497697416594]
[0.004888044738437717, 0.2681294866996498, 0.0023192114841146965, 0.0038281680156452146, 0.32784011828321646, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412127, 0.03773855412711243, 0.007213888668919175, 0.002902815632169611, 0.024300525720791197, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.10956004479961755, 0.0023192114841146965, 0.0038281680156452146, 0.3838058036596989, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.018537939424381665, 0.11631499355236223, 0.03773855412711243, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.25880187247023617, 0.0023192114841146965, 0.0038281680156452146, 0.28120204713614816, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.09765976509353493, 0.03773855412711241, 0.007213888668919175, 0.002902815632169611, 0.08959382532668683, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437716, 0.2214914155525815, 0.0023192114841146965, 0.0038281680156452146, 0.37447818943028494, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.07900453663470762, 0.03773855412711243, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.007102204754120983, 0.0028758834680029416]

As you can see, a few columns are very inconsistent. Why is this, and is there a way to prevent it? I'm using the distribution as features for another machine learning model, and these inconsistencies are throwing that model off.

My code:

ldaModel = new ParallelTopicModel(numTopics, alphaSum, beta);
instances = new InstanceList(new SerialPipes(pipeList));

for (int i = 0; i < data.length; i++) {
  String dataPt = data[i];
  Instance dataPtInstance = new Instance(dataPt, null, null, dataPt);
  instances.addThruPipe(dataPtInstance);
}
ldaModel.addInstances(instances);
ldaModel.setNumThreads(numThreads);
ldaModel.setNumIterations(numIterations);

try {
  ldaModel.setRandomSeed(DEFAULT_SEED);
  ldaModel.estimate();
  inferencer = ldaModel.getInferencer();
} catch (IOException e) {
  System.out.println(e);
}

String dataPt = "This is a test sentence.";
Instance dataPtInstance = new Instance(dataPt, null, null, dataPt);
InstanceList testList = new InstanceList(new SerialPipes(pipeList));
testList.addThruPipe(dataPtInstance);
double[] prob = inferencer.getSampledDistribution(testList.get(0), testIterations, thinIterations, burnInIterations);

There are 2 answers

Bernard On

If you want the inference to be consistent over multiple runs, you must also set the inferencer's random seed.

inferencer = ldaModel.getInferencer();
inferencer.setRandomSeed(DEFAULT_SEED);

Also, when training a model, make sure to use a recent version of MALLET, as a bug in the random seed initialization was fixed around a year ago.
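One caveat worth illustrating: a seed fixes the whole sequence of random draws, not each individual call. The toy sketch below does not use MALLET at all. `ReseedDemo` and `sampleProportions` are hypothetical names, and the sampler is a plain stand-in built on `java.util.Random`. It shows that a generator shared across calls keeps advancing, so repeated calls in one process can still differ, while a fresh generator with the same seed per call reproduces the same output. If the inferencer behaves the same way (an assumption, not something verified here), the equivalent would be calling `inferencer.setRandomSeed(DEFAULT_SEED)` before each inference call rather than once.

```java
import java.util.Arrays;
import java.util.Random;

public class ReseedDemo {
    // Toy stand-in for a sampling-based inferencer (not MALLET code):
    // draws numDraws topic assignments from fixed weights and returns
    // the empirical proportion of draws that landed on each topic.
    public static double[] sampleProportions(double[] weights, int numDraws, Random rng) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        int[] counts = new int[weights.length];
        for (int d = 0; d < numDraws; d++) {
            // Inverse-CDF draw from the unnormalized weights.
            double u = rng.nextDouble() * total;
            int k = 0;
            while (k < weights.length - 1 && u >= weights[k]) {
                u -= weights[k];
                k++;
            }
            counts[k]++;
        }
        double[] proportions = new double[weights.length];
        for (int k = 0; k < weights.length; k++) {
            proportions[k] = (double) counts[k] / numDraws;
        }
        return proportions;
    }

    public static void main(String[] args) {
        double[] weights = {0.3, 0.4, 0.1, 0.2}; // hypothetical topic weights
        long seed = 42L;

        // Seeded once, generator shared: the state advances between calls,
        // so the two results are generally not the same.
        Random shared = new Random(seed);
        System.out.println(Arrays.toString(sampleProportions(weights, 20, shared)));
        System.out.println(Arrays.toString(sampleProportions(weights, 20, shared)));

        // Re-seeded before every call: both results are identical.
        System.out.println(Arrays.toString(sampleProportions(weights, 20, new Random(seed))));
        System.out.println(Arrays.toString(sampleProportions(weights, 20, new Random(seed))));
    }
}
```

Seeding once is enough if you only need the whole program to give the same results from run to run. Re-seeding per call matters when the same sentence must map to the same feature vector every time it is seen.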

kk415kk On

I believe I figured out why. Because inference uses Gibbs sampling, the sampled output is not guaranteed to be the same on every run. One way around it is to set the number of sampling iterations to 0.
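To make the trade-off concrete, here is a toy single-document Gibbs sampler. It is not MALLET's implementation. `ToyInference`, `infer`, and the `phi` values are made up for illustration, and it returns the final state instead of averaging saved samples. It assumes a deterministic initialization (each token starts in its most probable topic). Under that assumption, zero sweeps returns the initialization untouched, which is why the result is stable, but it also means no sampling has refined the estimate.

```java
import java.util.Arrays;
import java.util.Random;

public class ToyInference {
    // phi[k][w]: fixed probability of word w under topic k (hypothetical values).
    // Returns smoothed topic proportions for one document after `iterations` sweeps.
    public static double[] infer(int[] doc, double[][] phi, double alpha,
                                 int iterations, Random rng) {
        int numTopics = phi.length;
        int[] z = new int[doc.length];
        int[] topicCounts = new int[numTopics];

        // Deterministic initialization: each token starts in its most probable topic.
        for (int i = 0; i < doc.length; i++) {
            int best = 0;
            for (int k = 1; k < numTopics; k++) {
                if (phi[k][doc[i]] > phi[best][doc[i]]) {
                    best = k;
                }
            }
            z[i] = best;
            topicCounts[best]++;
        }

        // Gibbs sweeps: resample each token's topic given all the others.
        double[] scores = new double[numTopics];
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < doc.length; i++) {
                topicCounts[z[i]]--;
                double total = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    scores[k] = (topicCounts[k] + alpha) * phi[k][doc[i]];
                    total += scores[k];
                }
                double u = rng.nextDouble() * total;
                int k = 0;
                while (k < numTopics - 1 && u >= scores[k]) {
                    u -= scores[k];
                    k++;
                }
                z[i] = k;
                topicCounts[k]++;
            }
        }

        // Smoothed proportions from the final topic counts.
        double[] theta = new double[numTopics];
        for (int k = 0; k < numTopics; k++) {
            theta[k] = (topicCounts[k] + alpha) / (doc.length + numTopics * alpha);
        }
        return theta;
    }

    public static void main(String[] args) {
        double[][] phi = {{0.7, 0.2, 0.1}, {0.1, 0.3, 0.6}};
        int[] doc = {0, 0, 0, 2}; // word indices of a four-token document

        // Zero sweeps: the generator is never consulted, so any seed gives the same result.
        System.out.println(Arrays.toString(infer(doc, phi, 0.5, 0, new Random(1L))));
        System.out.println(Arrays.toString(infer(doc, phi, 0.5, 0, new Random(2L))));

        // With sweeps, the result depends on the random draws.
        System.out.println(Arrays.toString(infer(doc, phi, 0.5, 50, new Random(1L))));
        System.out.println(Arrays.toString(infer(doc, phi, 0.5, 50, new Random(2L))));
    }
}
```

So zero iterations buys consistency at the cost of skipping the sampling altogether. Fixing the seed, as in the other answer, keeps the sampling and still gives repeatable output.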