Get stemmed word in Lucene

In Lucene I use the SnowballAnalyzer for indexing and searching.

Once the index is built, I run queries against it. For example, I query the field 'body' for 'specialized'. IndexSearcher returns documents containing 'specialize', 'specialized', etc., because of the stemming done by the SnowballAnalyzer.

Now, having the top documents, I want to get a text snippet from the body field. This snippet should contain the stemmed version of the query word.
For example, one of the returned documents has the body field: "Unfortunately, in some states, blind people only have access to general rehabilitation agencies, which serve people with a variety of disabilities. In these cases, specialized services for visually impaired people are not always available." I then want to get the part 'In these cases, specialized services for visually' as the snippet. Additionally, I want to get the terms from this snippet. The code that should do this is below; the spot marked with '?' is where I have a question:

How I want to do it:

IndexReader ir = IndexReader.open(fsDir);
TermPositionVector tv = (TermPositionVector) ir.getTermFreqVector(hits.scoreDocs[i].doc, "body");

? Here is my question: query has to be the stemmed term. So if the real query was 'specialized', then query should be 'specialize', which is what the Snowball analyzer normally produces. How can I get the term as analyzed by the analyzer, for a single word or for a phrase, since the query can contain a phrase such as "specialized machines"?

int idx = tv.indexOf(query);
int[] idxs = tv.getTermPositions(idx);
for (String t : tv.getTerms()) {
    int iidx = tv.indexOf(t);
    int[] iidxs = tv.getTermPositions(iidx);
    for (int ni : idxs) {
        tmpValue = 0.0f;
        for (int nni : iidxs) {
            if (Math.abs(nni - ni) <= Settings.termWindowSize) {
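The loop above is cut off, so what it does inside the if can only be guessed. As a rough sketch of the position-window idea on its own, here is a plain Java version with no Lucene dependency. It assumes the term positions have already been copied out of the term vector into a map; the class name TermWindowSketch, the method termsNear, and the stems and positions in main are all made up for illustration. Unlike the loop in the question, it skips the query term itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: works on positions already
// extracted from a term vector (term -> positions in the field).
final class TermWindowSketch {

    /** Returns the terms occurring within `window` positions of any occurrence of `queryTerm`. */
    static List<String> termsNear(Map<String, int[]> positions, String queryTerm, int window) {
        List<String> result = new ArrayList<>();
        int[] queryPositions = positions.get(queryTerm);
        if (queryPositions == null) {
            return result; // the (stemmed) query term does not occur in this document
        }
        for (Map.Entry<String, int[]> entry : positions.entrySet()) {
            if (entry.getKey().equals(queryTerm)) {
                continue; // skip the query term itself (a choice made for this sketch)
            }
            if (anyWithin(queryPositions, entry.getValue(), window)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** True if some position in `a` is at most `window` away from some position in `b`. */
    private static boolean anyWithin(int[] a, int[] b, int window) {
        for (int x : a) {
            for (int y : b) {
                if (Math.abs(x - y) <= window) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hand-written stems and positions, for illustration only.
        Map<String, int[]> positions = new LinkedHashMap<>();
        positions.put("case", new int[] {2});
        positions.put("special", new int[] {3});
        positions.put("servic", new int[] {4});
        positions.put("avail", new int[] {11});
        System.out.println(termsNear(positions, "special", 2)); // prints [case, servic]
    }
}
```

The key passed as queryTerm has to be the term exactly as it is stored in the index, which is why the stemmed form of the query is needed in the first place.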

Edit:
I found a way to get the stemmed term:
Query q = queryParser.parse("some text to be parsed");
String parsedQuery = q.toString();
The Query object also has a toString(String fieldName) method, which leaves out the field prefix for the given field.

There is 1 answer

Yuval F On BEST ANSWER

I believe you are mixing several questions. First, to see the stemmed version of your query, and other useful information, you can use the IndexSearcher's explain() method. Please see my answer to this question.

The Lucene solution for getting snippets is the Highlighter. Another option is the FastVectorHighlighter. I believe you can customize both to get the stemmed term rather than the full one.
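To make the snippet idea concrete without relying on any particular Highlighter version, here is a minimal plain Java sketch of a word window around the first matching word. This is not how the Lucene Highlighter is implemented; the class SnippetSketch and the method snippet are hypothetical, and matching by prefix is only a crude stand-in for real stemming.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
final class SnippetSketch {

    /**
     * Returns up to `radius` whitespace-separated words on each side of the first
     * word that starts with `stem` (case-insensitive), or null if no word matches.
     */
    static String snippet(String text, String stem, int radius) {
        String[] words = text.trim().split("\\s+");
        String needle = stem.toLowerCase(Locale.ROOT);
        for (int i = 0; i < words.length; i++) {
            // Prefix match as a rough approximation of "same stem" (an assumption).
            if (words[i].toLowerCase(Locale.ROOT).startsWith(needle)) {
                int from = Math.max(0, i - radius);
                int to = Math.min(words.length, i + radius + 1);
                return String.join(" ", Arrays.copyOfRange(words, from, to));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String body = "Unfortunately, in some states, blind people only have access to general "
                + "rehabilitation agencies, which serve people with a variety of disabilities. "
                + "In these cases, specialized services for visually impaired people are not "
                + "always available.";
        // prints: In these cases, specialized services for visually
        System.out.println(snippet(body, "special", 3));
    }
}
```

With a radius of 3 this reproduces the snippet asked for in the question. In a real index the Highlighter is the better choice, since it uses the same analyzer as the query instead of a prefix match.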