The following program:
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;
public class LuceneTest {
    // Characters that the query parser syntax treats as special.
    static final List<Character> SPECIAL_CHARS =
        Arrays.asList('\\', '+', '-', '!', '(', ')', ':', '^', '[', ']', '"', '{', '}', '~', '*', '?', '|', '&');

    public static void main(String[] args) throws ParseException {
        QueryParser query =
            new QueryParser(Version.LUCENE_31, "", new StandardAnalyzer(Version.LUCENE_31));
        for (char c : SPECIAL_CHARS) {
            // Escape each special character with a single backslash, as QueryParser.escape does.
            System.out.println(c + " -> " + query.parse("__catch_all:foo\\" + c + "bar").toString());
        }
    }
}
Gives this output:
\ -> __catch_all:foo __catch_all:bar
+ -> __catch_all:foo __catch_all:bar
- -> __catch_all:foo __catch_all:bar
! -> __catch_all:foo __catch_all:bar
( -> __catch_all:foo __catch_all:bar
) -> __catch_all:foo __catch_all:bar
: -> __catch_all:foo:bar
^ -> __catch_all:foo __catch_all:bar
[ -> __catch_all:foo __catch_all:bar
] -> __catch_all:foo __catch_all:bar
" -> __catch_all:foo __catch_all:bar
{ -> __catch_all:foo __catch_all:bar
} -> __catch_all:foo __catch_all:bar
~ -> __catch_all:foo __catch_all:bar
* -> __catch_all:foo __catch_all:bar
? -> __catch_all:foo __catch_all:bar
| -> __catch_all:foo __catch_all:bar
& -> __catch_all:foo __catch_all:bar
Note the apparent inconsistency with ':'. Note also that I am escaping the special character, exactly as QueryParser.escape does. I expect StandardAnalyzer to strip special punctuation from query terms, and it does in every case except the colon.
The reason this seems particularly inconsistent is that writing a document with a StandardAnalyzer and a field text of "foo:bar" will give me a two term field, foo and bar!
A second round of escaping gives the correct result, i.e. effectively "foo\\:bar"; but why is this necessary for colons only? Why should I need to do QueryParser.escape(QueryParser.escape(mystring)) to avoid this behaviour?
The different handling of ':' comes from StandardAnalyzer, not from QueryParser. Of the characters in your list, ':' is the only one that StandardAnalyzer does not treat as a separator when it sits between two letters. As a consequence, analyzing "a:b" yields the single token "a:b", whereas analyzing "a+b" yields two tokens, "a" and "b".
Here is what happens:
Original string -> unescaped string -> tokens           -> query
"foo\:bar"      -> "foo:bar"        -> [ "foo:bar" ]    -> TermQuery(__catch_all, "foo:bar")
"foo\+bar"      -> "foo+bar"        -> [ "foo", "bar" ] -> TermQuery(__catch_all, "foo") OR TermQuery(__catch_all, "bar")
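To make the three stages concrete, here is a minimal stand-alone sketch of that pipeline. It does not use Lucene at all: the class EscapePipelineSketch and its methods unescape, tokenize, toQuery and parse are hypothetical names made up for illustration, and the toy tokenizer only imitates the one behaviour discussed here (a colon between two letters or digits stays inside the token, every other non-alphanumeric character separates tokens). The field name is passed in separately rather than parsed out of the query string.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapePipelineSketch {

    // Stage 1: drop each backslash and keep the character it escapes.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                out.append(s.charAt(++i));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Stage 2: toy tokenizer (an assumption, NOT Lucene's real rules).
    // Letters and digits build a token; a ':' directly between two
    // letters/digits is kept; anything else ends the current token.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean midWordColon = c == ':'
                    && i > 0 && Character.isLetterOrDigit(s.charAt(i - 1))
                    && i + 1 < s.length() && Character.isLetterOrDigit(s.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || midWordColon) {
                cur.append(c);
            } else if (cur.length() > 0) {
                tokens.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            tokens.add(cur.toString());
        }
        return tokens;
    }

    // Stage 3: one term clause per token, joined the way the output above prints them.
    static String toQuery(String field, List<String> tokens) {
        List<String> clauses = new ArrayList<>();
        for (String t : tokens) {
            clauses.add(field + ":" + t);
        }
        return String.join(" ", clauses);
    }

    // Runs the three stages in order on an already-escaped term.
    static String parse(String field, String escapedTerm) {
        return toQuery(field, tokenize(unescape(escapedTerm)));
    }

    public static void main(String[] args) {
        System.out.println(parse("__catch_all", "foo\\:bar")); // __catch_all:foo:bar
        System.out.println(parse("__catch_all", "foo\\+bar")); // __catch_all:foo __catch_all:bar
    }
}
```

The point of the sketch is that the escape is consumed in stage 1, before the analyzer ever sees the text, so what survives depends only on how the tokenizer in stage 2 treats the bare character.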