MALLET Topic Modeling: Inconsistent Estimations

Question

191 views Asked by kk415kk At 25 June 2015 at 01:06

I'm using MALLET to train a ParallelTopicModel. After training, I get a TopicInferencer from the model, run the same sentence through the inferencer 15 times, and compare the results. For some topics, the estimated probability differs from run to run and is not consistent at all.

For example, with 20 topics, this is the output I'm getting for the estimated topic probabilities, for the same sentence:

[0.004888044738437717, 0.2961123293878907, 0.0023192114841146965, 0.003828168015645214, 0.3838058036596986, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412127, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.26812948669964976, 0.0023192114841146965, 0.0038281680156452146, 0.35582296097145744, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.09765976509353493, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.052283368409032215, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2681294866996498, 0.0023192114841146965, 0.003828168015645214, 0.3931334178891125, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839043, 0.0019749423390935396, 0.002792447952547967, 0.018537939424381665, 0.09765976509353493, 0.03773855412711243, 0.007213888668919175, 0.0029028156321696105, 0.024300525720791197, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2588018724702361, 0.0023192114841146965, 0.0038281680156452146, 0.3278401182832166, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.06967692240529397, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2681294866996498, 0.0023192114841146965, 0.0038281680156452146, 0.5143924028714901, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412126, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.014972911491377543, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.20283618709375414, 0.0023192114841146965, 0.0038281680156452146, 0.29985727559497544, 0.0023130490636768045, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.11631499355236223, 0.028410939897698752, 0.007213888668919175, 0.002902815632169611, 0.024300525720791197, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437716, 0.43602654282909553, 0.0023192114841146965, 0.0038281680156452146, 0.2998572755949755, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.09765976509353493, 0.03773855412711241, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.07224958788196291, 0.0023192114841146965, 0.0038281680156452146, 0.3278401182832165, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412129, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.04295575417961857, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2588018724702361, 0.0023192114841146965, 0.0038281680156452146, 0.4490991032655942, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412127, 0.028410939897698755, 0.007213888668919175, 0.002902815632169611, 0.07093859686785953, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.24014664401140884, 0.0023192114841146965, 0.0038281680156452146, 0.26254681867732077, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.018537939424381665, 0.06967692240529395, 0.05639378258593975, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.2588018724702361, 0.0023192114841146965, 0.0038281680156452146, 0.3744781894302849, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.06967692240529398, 0.047066168356526085, 0.007213888668919175, 0.002902815632169611, 0.06161098263844586, 0.0085078656328731, 0.0071022047541209835, 0.012203497697416594]
[0.004888044738437717, 0.2681294866996498, 0.0023192114841146965, 0.0038281680156452146, 0.32784011828321646, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.08833215086412127, 0.03773855412711243, 0.007213888668919175, 0.002902815632169611, 0.024300525720791197, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.10956004479961755, 0.0023192114841146965, 0.0038281680156452146, 0.3838058036596989, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.018537939424381665, 0.11631499355236223, 0.03773855412711243, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437717, 0.25880187247023617, 0.0023192114841146965, 0.0038281680156452146, 0.28120204713614816, 0.002313049063676805, 0.007598391273477824, 0.019202573922429175, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.09765976509353493, 0.03773855412711241, 0.007213888668919175, 0.002902815632169611, 0.08959382532668683, 0.0085078656328731, 0.0071022047541209835, 0.0028758834680029416]
[0.004888044738437716, 0.2214914155525815, 0.0023192114841146965, 0.0038281680156452146, 0.37447818943028494, 0.002313049063676805, 0.007598391273477824, 0.009874959693015517, 0.0030322960351839047, 0.0019749423390935396, 0.002792447952547967, 0.01853793942438167, 0.07900453663470762, 0.03773855412711243, 0.007213888668919175, 0.002902815632169611, 0.03362813995020485, 0.0085078656328731, 0.007102204754120983, 0.0028758834680029416]

As you can see, a few columns (for example the 2nd and the 5th) are very inconsistent. Why is this, and is there a way to prevent it? I'm using the distribution as features for another machine learning model, and these inconsistencies are throwing that model off.

My code:

ldaModel = new ParallelTopicModel(numTopics, alphaSum, beta);
instances = new InstanceList(new SerialPipes(pipeList));

for (int i = 0; i < data.length; i++) {
  String dataPt = data[i];
  Instance dataPtInstance = new Instance(dataPt, null, null, dataPt);
  instances.addThruPipe(dataPtInstance);
}
ldaModel.addInstances(instances);
ldaModel.setNumThreads(numThreads);
ldaModel.setNumIterations(numIterations);

try {
  ldaModel.setRandomSeed(DEFAULT_SEED);
  ldaModel.estimate();
  inferencer = ldaModel.getInferencer();
} catch (IOException e) {
  System.out.println(e);
}

String dataPt = "This is a test sentence.";
Instance dataPtInstance = new Instance(dataPt, null, null, dataPt);
InstanceList testList = new InstanceList(new SerialPipes(pipeList));
testList.addThruPipe(dataPtInstance);
double[] prob = inferencer.getSampledDistribution(testList.get(0), testIterations, thinIterations, burnInIterations);

There are 2 answers

**Bernard** · Answer 1 · 2015-08-27T09:40:41+00:00

If you want the inference to be consistent over multiple runs, you must also set the inferencer's random seed.

inferencer = ldaModel.getInferencer();
inferencer.setRandomSeed(DEFAULT_SEED);

Also, when training a model, make sure to use a recent version of MALLET, as a bug in the random seed initialization was fixed around a year ago.
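To illustrate why the seed matters, here is a minimal sketch that does not use MALLET at all. It is a toy stand-in for a sampling-based estimator, built only on java.util.Random; the class name SeededSamplingDemo, the method sampleDistribution, and the weights are all made up for the illustration. It estimates a distribution from a small number of random draws, roughly as a sampler would for a short sentence, and shows that two runs with the same seed agree exactly.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededSamplingDemo {

    // Toy estimator (hypothetical, not MALLET code): draws `draws` topic indices
    // from the distribution implied by `weights` and returns the observed
    // relative frequency of each topic.
    public static double[] sampleDistribution(double[] weights, int draws, Random rng) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double[] freq = new double[weights.length];
        for (int i = 0; i < draws; i++) {
            // Pick a topic by walking the cumulative weights.
            double u = rng.nextDouble() * total;
            int k = 0;
            while (k < weights.length - 1 && u >= weights[k]) {
                u -= weights[k];
                k++;
            }
            freq[k] += 1.0 / draws;
        }
        return freq;
    }

    public static void main(String[] args) {
        double[] weights = {0.5, 0.3, 0.2};

        // Same seed: both runs consume the same random sequence, so the
        // estimates are identical.
        double[] a = sampleDistribution(weights, 50, new Random(42L));
        double[] b = sampleDistribution(weights, 50, new Random(42L));
        System.out.println("seeded runs equal: " + Arrays.equals(a, b));

        // No explicit seed: each run uses a different random sequence, so with
        // only a few draws the estimates will usually differ.
        double[] c = sampleDistribution(weights, 50, new Random());
        double[] d = sampleDistribution(weights, 50, new Random());
        System.out.println("unseeded run 1: " + Arrays.toString(c));
        System.out.println("unseeded run 2: " + Arrays.toString(d));
    }
}
```

The same reasoning applies to the inferencer: its estimate is built from a limited number of samples, so unless the random sequence is fixed, each call can land on a somewhat different distribution.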

**kk415kk** · Answer 2 · 2015-06-27T00:12:00+00:00

I believe I figured out why. Because the inferencer uses Gibbs sampling, the sampled estimate is not guaranteed to be the same on every run. One way around it is to set the number of sampling iterations to 0.
