I am trying to find the most frequent words in the text field of an indexed document using Solr 4.10. I created a PDF document from a text file containing some sample text and posted it to Solr using post.jar. When I query the document by its id, Solr returns the PDF contents shown below along with all of the document's metadata.
<arr name="text">
<str>sample1</str>
<str/>
<str>application/pdf</str>
<str>
sample1 sample1.txt cook cook1 book1 book1 book2 nook1 nook1 nook2 nook2 two three four Page 1
</str>
</arr>
In summary, I want to detect that we have cook and cook1 with count 1 each, and book1, book2, nook1 and nook2 with count 2 each.
I followed the TermVectorComponent configuration, and my schema.xml defines the text field as:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
and solrconfig.xml has
<searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">text</str>
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>
The field type 'text_general' is defined as:
<fieldType class="solr.TextField" name="text_general" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Finally, I query it from the browser with the following URL, which I believe requests the word counts for the 'text' field of the document with the given id:
http://localhost:8983/solr/select/?q=id:7e75017b-066d-4257-af10-b770726c7cf4&start=0&rows=100&indent=on&qt=tvrh&tv=true&tv.fl=text&f.text.tv.tf=true&tv.fl=text
It returns all of the document's information except the word counts. I only want to see the word counts for the 'text' field, just like the response we get when we use rows=0 for faceting, i.e. a string array of word vs. count.
Any help will be greatly appreciated.
NOTE: I am trying to get the word frequency of the 'text' field of one document, not of the 'text' field across all indexed documents. In other words, how can I ask Solr not to throw away duplicate tokens or duplicate stemmed tokens, so that we can find the most frequent words in a field?
You don't need to use the terms component for this. If you are tokenizing the text field, you should be able to facet on that field directly by adding facet=true and facet.field=text to the query.
This will give you a list of tokens (words) in that text field sorted by frequency of occurrence. You can also tune the output to suit your needs with other facet parameters such as facet.limit.
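As a rough illustration of the facet request described above, here is a minimal Python sketch that only builds the query URL. The host, the /select path and the document id are taken from the question. The helper name build_facet_url and the chosen values for facet.limit, facet.mincount and facet.sort are assumptions for illustration, not something the original answer specified.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, doc_id, field="text", limit=20):
    """Build a Solr select URL that facets on `field` for one document.

    Hypothetical helper: it only assembles standard Solr facet parameters
    and does not contact a server.
    """
    params = [
        ("q", "id:" + doc_id),     # restrict the result set to one document
        ("rows", 0),               # suppress the document body, keep facets
        ("facet", "true"),         # enable faceting
        ("facet.field", field),    # facet on the tokenized text field
        ("facet.limit", limit),    # assumed value: return the top N terms
        ("facet.mincount", 1),     # assumed value: skip terms with no hits
        ("facet.sort", "count"),   # order terms by count, highest first
    ]
    return base_url + "?" + urlencode(params)


url = build_facet_url(
    "http://localhost:8983/solr/select",
    "7e75017b-066d-4257-af10-b770726c7cf4",
)
print(url)
```

One thing worth checking against your own index: Solr's field facet counts are the number of matching documents that contain each term, so verify that the numbers you get back for a single document are the ones you expect.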
Keep in mind that this counts the tokens in that field, so review the field's analyzers and filters to confirm you are getting the correct results, since different filters generate tokens differently.
For an exact word count, tokenization on whitespace plus basic stemming will probably get you where you need to be.
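To get a feel for which tokens a whitespace-based analysis would produce, the idea can be approximated outside Solr. The sketch below is plain Python, not Solr's analyzer chain: it lowercases and splits on whitespace, leaves stemming out entirely, and uses the sample text from the question. The function name count_words is hypothetical.

```python
from collections import Counter


def count_words(text):
    """Count tokens after lowercasing and splitting on whitespace.

    A local approximation only: no stop words, no stemming.
    """
    tokens = text.lower().split()
    return Counter(tokens)


# Sample text copied from the indexed document in the question.
sample = ("sample1 sample1.txt cook cook1 book1 book1 book2 "
          "nook1 nook1 nook2 nook2 two three four Page 1")

counts = count_words(sample)
for word, freq in counts.most_common(5):
    print(word, freq)
```

Comparing output like this with what Solr returns is a quick way to spot where a tokenizer or filter is splitting or merging tokens differently from what you intended.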