Searching for alphanumeric space-separated string in a field with multiple strings (Lucene)


Background

I have a Lucene 3.6.0 index with these two fields (sample data below each):

company
-------
Tesla Car Works
Family Auto Body

codes
-----
CHP-13724 CHP-194561
RPS-204978 RPS-204979 CHP-194567

The codes field is made up of multiple code strings, e.g. "CHP-13724", or "RPS-204979".

The problem: I can't search for an individual code string in the codes field. (See "Details" below for more info.)

Question

Is there a way to search for one of these codes successfully, ideally using the standard Lucene packages and not a Contrib package? (If it has to be a Contrib package, please point me to a download link.)

Details

If I search the field in Luke with the query analyzer set to StandardAnalyzer (or WhitespaceAnalyzer, or any of the many others I've tried), I can't find an individual code string; the result set is always empty. For example, the query 'codes:"CHP-194561"' returns nothing, whereas 'company:"Car"' returns the expected result without any problem.

The exception: if I search for the first code in a record's space-separated list of codes with a wildcard, e.g. codes:RPS-204978*, it will give me the expected row. But using the second code, e.g. codes:RPS-204979*, returns nothing.

So: in the codes field, a space-delimited string can't be found unless it's the first string in the value and the query uses a wildcard, whereas in the company field a string is found wherever it appears, with no wildcard needed.

EDIT: The codes field is indexed using NOT_ANALYZED. (So the field contains a single term, a string, which is made up of a whitespace-delimited series of codes.)

There are 2 answers

Answer by M Pickles (best answer)

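Basically, I've found that I can't search for space-delimited substrings within a single long non-tokenized string (aside from searching for the first part of the string). Because the field is NOT_ANALYZED, the whole value is indexed as one term, and a query has to match a whole term. A trailing-wildcard query matches only terms that begin with the given text, which is why only the first code in each list could be found.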
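
To illustrate why this happens, here is a minimal plain-Java sketch that models the behavior rather than calling Lucene. It assumes that a NOT_ANALYZED field contributes its whole value as one term, that a plain query needs an exact term match, and that a trailing-wildcard query matches terms by prefix. The class and method names are hypothetical and are not part of any Lucene package.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of term matching; it does not use the Lucene API.
class CodeTermModel {

    // Assumption: a NOT_ANALYZED field yields exactly one term, the raw value.
    static List<String> notAnalyzedTerms(String fieldValue) {
        List<String> terms = new ArrayList<String>();
        terms.add(fieldValue);
        return terms;
    }

    // Models a plain term query: the query text must equal a whole term.
    static boolean termMatch(List<String> terms, String query) {
        return terms.contains(query);
    }

    // Models a trailing-wildcard query such as codes:RPS-204978*
    static boolean prefixMatch(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = notAnalyzedTerms("RPS-204978 RPS-204979 CHP-194567");

        System.out.println(termMatch(terms, "RPS-204979"));   // false: no term equals this code
        System.out.println(prefixMatch(terms, "RPS-204978")); // true: the single term starts with it
        System.out.println(prefixMatch(terms, "RPS-204979")); // false: the code sits mid-term
    }
}
```

Under these assumptions, the second and third codes can never match, because no indexed term starts with or equals them.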

P.S. Eventually, I gave up and indexed the field with the WhitespaceAnalyzer (which breaks up the long string into multiple terms). I was hoping to avoid this as it means a rather long index rebuild, but it's the only way forward that I can see.
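
As a rough sketch of what re-indexing with whitespace tokenization changes, the plain-Java model below splits the field value on whitespace so that each code becomes its own term. It only approximates what WhitespaceAnalyzer does and does not call Lucene; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of whitespace tokenization; it does not use the Lucene API.
class WhitespaceTermModel {

    // Assumption: splitting on runs of whitespace approximates WhitespaceAnalyzer,
    // which also leaves the case of each token unchanged.
    static List<String> whitespaceTerms(String fieldValue) {
        return Arrays.asList(fieldValue.trim().split("\\s+"));
    }

    // Models a plain term query: the query text must equal a whole term.
    static boolean termMatch(List<String> terms, String query) {
        return terms.contains(query);
    }

    public static void main(String[] args) {
        List<String> terms = whitespaceTerms("CHP-13724 CHP-194561");

        System.out.println(terms.size());                    // 2
        System.out.println(termMatch(terms, "CHP-194561"));  // true: the code is now its own term
        System.out.println(termMatch(terms, "chp-194561"));  // false: matching stays case-sensitive
    }
}
```

In this model every code is findable without a wildcard, regardless of its position in the list. The query side would need to keep the codes intact as well, for example by using the same whitespace-based analyzer at search time.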

Answer by mindas

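Have you tried using WhitespaceTokenizer? It splits on whitespace, which is what you need. Note that the tokenizer only takes effect if the field is indexed as analyzed; a NOT_ANALYZED field bypasses it entirely.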