How to find the number of documents (and fraction) per topic using LDA?


I am trying to extract topics from about 7 million tweets. I treat each tweet as a document, so I stored all tweets in a single file where each line (i.e. each tweet) is one document. I used this file as the input file for the Mallet API.
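For reference, the CsvIterator pattern used in main() below follows Mallet's usual one-instance-per-line import format (name, then label, then the text), so each input line is expected to look roughly like the following illustrative line (not real data):

    tweet_0000001   X   some cleaned tweet text here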

public static void LDAModel(int numofK,int numbofIteration,int numberofThread,String outputDir,InstanceList instances) throws Exception
{
    // Create a model with numofK topics.
    //  Note that the first hyperparameter (1.0) is the alpha sum over all topics
    //  (so alpha_t = 1.0 / numTopics), while the second (0.01) is beta_w, the
    //  parameter for a single dimension of the Dirichlet prior over words.
    int numTopics = numofK;
    ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);

    model.addInstances(instances);

    // Use numberofThread parallel samplers, which each look at a slice of the corpus
    //  and combine statistics after every iteration.
    model.setNumThreads(numberofThread);

    // Run the model for numbofIteration iterations and stop (50 is for testing only;
    //  for real applications, use 1000 to 2000 iterations).
    model.setNumIterations(numbofIteration);
    model.estimate();
    // Show the words and topics in the first instance

    // The data alphabet maps word IDs to strings
    Alphabet dataAlphabet = instances.getDataAlphabet();

    FeatureSequence tokens = (FeatureSequence) model.getData().get(0).instance.getData();
    LabelSequence topics = model.getData().get(0).topicSequence;

    Formatter out = new Formatter(new StringBuilder(), Locale.US);
    for (int position = 0; position < tokens.getLength(); position++) {
        out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));

    }
    System.out.println(out);

    // Estimate the topic distribution of the first instance, 
    //  given the current Gibbs state.
    double[] topicDistribution = model.getTopicProbabilities(0);

    // Get an array of sorted sets of word ID/count pairs
    ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();

    // Show top 10 words in topics with proportions for the first document
    String topicsoutput="";
    for (int topic = 0; topic < numTopics; topic++) {
        Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();

        out = new Formatter(new StringBuilder(), Locale.US);
        out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
        int rank = 0;
        while (iterator.hasNext() && rank < 10) {
            IDSorter idCountPair = iterator.next();
            out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
            //out.format("%s ", dataAlphabet.lookupObject(idCountPair.getID()));
            rank++;
        }
        System.out.println(out);
        topicsoutput += out.toString() + "\n";   // collect the same line so it can be written to a file below

    }


    // Create a new instance with high probability of topic 0
    StringBuilder topicZeroText = new StringBuilder();
    Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();

    int rank = 0;
    while (iterator.hasNext() && rank < 10) {
        IDSorter idCountPair = iterator.next();
        topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
        rank++;
    }

    // Create a new instance named "test instance" with empty target and source fields.
    InstanceList testing = new InstanceList(instances.getPipe());
    testing.addThruPipe(new Instance(topicZeroText.toString(), null, "test instance", null));

    TopicInferencer inferencer = model.getInferencer();
    double[] testProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);
    System.out.println("0\t" + testProbabilities[0]);


    File pathDir = new File(outputDir + File.separator + "NumofTopics" + numTopics);   // FIXME replace all strings with constants
    pathDir.mkdir();
    String DirPath = pathDir.getPath();
    String stateFile = DirPath+File.separator+"output_state.gz";
    String outputDocTopicsFile = DirPath+File.separator+"output_doc_topics.txt";
    String topicKeysFile = DirPath+File.separator+"output_topic_keys";
    String topicKeysFile_fromProgram = DirPath+File.separator+"output_topic";

    try (PrintWriter writer = new PrintWriter(topicKeysFile_fromProgram, "UTF-8")) {
        writer.print(topicsoutput);
    } catch (Exception e) {
        e.printStackTrace();
    }

    model.printTopWords(new File(topicKeysFile), 11, false);           
    model.printDocumentTopics(new File (outputDocTopicsFile));
    model.printState(new File (stateFile));

}
public static void main(String[] args) throws Exception {

    // Begin by importing documents from text to feature sequences
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

    // Pipes: lowercase, tokenize, remove stopwords, map to features
    pipeList.add( new CharSequenceLowercase() );
    pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
    pipeList.add( new TokenSequenceRemoveStopwords(new File("H:\\Data\\stoplists\\en.txt"), "UTF-8", false, false, false) );
    pipeList.add( new TokenSequence2FeatureSequence() );
    InstanceList instances = new InstanceList (new SerialPipes(pipeList));

    Reader fileReader = new InputStreamReader(new FileInputStream(new File("E:\\Thesis Data\\DataForLDA\\freshnewData\\cleanTweets.txt")), "UTF-8");
    instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                                           3, 2, 1)); // data, label, name fields

    int numberofTopic=5;
    int numberofIteration=50;
    int numberofThread=6;
    String outputDir="J:\\Topics\\";

    LDAModel(numberofTopic, numberofIteration, numberofThread, outputDir, instances);
    TimeUnit.SECONDS.sleep(30);
    numberofTopic = 10;
}

I have got three files from the above program:

1. state file
2. topic proportion file
3. key topic list

I would like to find out the number of documents allocated to each topic. For example, I got the following output in the key topic list file:

  1. 0.004 obama (5471) canada (5283) woman (5152) vote (4879) police(3965)

where the first column is the topic number, the second column is the topic weight, and the remaining columns are the words under this topic, with each word's count in parentheses.

So I have the word counts under each topic, but I would also like to show the number of documents in which each topic occurs, ideally written to a separate file. For example:

Topic 1: doc1(80%) doc2(70%) .......

Could anyone please give me some ideas or sample source code for this? Thanks.


1 Answer

Answered by Sir Cornflakes:

The information you are looking for is contained in the file "2. topic proportion" that you mentioned. Note that every document contains every topic with some percentage (although the percentage may be large for one topic and extremely small for the others). You will have to decide what you want to extract from that file, for example:

- the dominant topic of each document (it is in column 3), or
- the dominant topic, but only when its percentage is at least 50% (sometimes two topics have almost the same percentage), ...
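For example, one way to get both the per-topic document counts and a "Topic 1: doc1(80%) doc2(70%) ..." style listing is to loop over all documents inside LDAModel() after model.estimate() and assign each document to its highest-probability topic. This is only a minimal sketch (it reuses model and numTopics from the code above; the names docsPerTopic and docsForTopic are made up for illustration, and the output format is up to you):

    // Count, for each topic, how many documents have that topic as their
    // dominant topic, and remember each document's proportion for it.
    int numDocs = model.getData().size();
    int[] docsPerTopic = new int[numTopics];
    StringBuilder[] docsForTopic = new StringBuilder[numTopics];
    for (int topic = 0; topic < numTopics; topic++) {
        docsForTopic[topic] = new StringBuilder();
    }

    for (int doc = 0; doc < numDocs; doc++) {
        double[] dist = model.getTopicProbabilities(doc);   // same call used for document 0 above
        int dominantTopic = 0;
        for (int topic = 1; topic < numTopics; topic++) {
            if (dist[topic] > dist[dominantTopic]) {
                dominantTopic = topic;
            }
        }
        docsPerTopic[dominantTopic]++;
        docsForTopic[dominantTopic].append(
                String.format(Locale.US, "doc%d(%.0f%%) ", doc, 100 * dist[dominantTopic]));
    }

    for (int topic = 0; topic < numTopics; topic++) {
        System.out.println("Topic " + topic + ": " + docsPerTopic[topic] + " documents");
        System.out.println("Topic " + topic + ": " + docsForTopic[topic]);
    }

If you only want to count a document when its dominant topic is strong enough, wrap the two lines that update docsPerTopic and docsForTopic in a check such as if (dist[dominantTopic] >= 0.5). You can replace the doc index with the actual document name via model.getData().get(doc).instance.getName(), and write docsForTopic to a file the same way topicsoutput is written above. Alternatively, you can compute the same counts offline by parsing output_doc_topics.txt; just note that its exact column layout differs between MALLET versions, so check the file's header line first.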