We are building a bool query out of search term strings to search our Lucene indexes. I would like these strings to be analyzed with the Standard Analyzer, the analyzer we are using for our indexes. For example, foo-bar 1-2-3 should be broken up as foo, bar, 1-2-3 since the Lucene doc states that hyphens cause numbers to stay together but words to be tokenized. What is the best way to do this?
Currently I am running my search term strings through a QueryParser.
QueryParser parser = new QueryParser("", new StandardAnalyzer());
Query query = parser.parse(aSearchTermString);
The problem with this is that quotes are inserted. For example, foo-bar 1-2-3 becomes "foo bar", 1-2-3, which does not return anything because Lucene would have tokenized foo-bar into foo and bar.
I definitely don't want to hack this situation by removing the quotes with replace because I feel that I am probably missing something or doing something incorrectly.
I am actually getting different results for StandardAnalyzer. Consider this code (using Lucene v4):
The above prints:
So the above code proves that StandardAnalyzer, unlike, for example, ClassicAnalyzer, should be splitting 1-2-3 into different tokens - exactly as you want.
For queries, you need to escape every keyword, including spaces; otherwise QueryParser (QP) thinks they have a different meaning.
If you don't want to escape your query string, you can always tokenize it manually (as in the printTokens method above), then wrap each token in a TermQuery and stack all the TermQueries into a BooleanQuery.
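The manual approach described above could be sketched roughly as follows. This is not the answerer's original code: the class name AnalyzedTermQueries, the method name build, and the choice of Occur.SHOULD are assumptions made for illustration, and the sketch assumes the Lucene 4.x API, where BooleanQuery has a public no-argument constructor. The analyzer is passed in so that the same instance used for indexing can be reused.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper: analyzes raw text and builds one TermQuery per token.
public final class AnalyzedTermQueries {

    private AnalyzedTermQueries() {}

    public static BooleanQuery build(Analyzer analyzer, String field, String text)
            throws IOException {
        // Assumption: Lucene 4.x, where BooleanQuery is mutable.
        BooleanQuery query = new BooleanQuery();
        TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
        try {
            CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // required before the first incrementToken()
            while (stream.incrementToken()) {
                // Each analyzed token becomes its own clause, with no phrase
                // query and no quotes. Occur.SHOULD is an assumption; use
                // Occur.MUST if every term has to match.
                Term term = new Term(field, termAttr.toString());
                query.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
            }
            stream.end();
        } finally {
            stream.close();
        }
        return query;
    }
}
```

Because the tokens come straight from the analyzer, the query terms match whatever the same analyzer wrote to the index, and no QueryParser escaping is involved. On Lucene 5.3 and later, the clauses would be added through BooleanQuery.Builder instead of the mutable constructor.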