Luke reveals unknown term values for numeric fields in index

We use Lucene.NET for indexing. One of the fields we index is a numeric field with the values 1 to 6, plus 9999 for "not set".

When using Luke to explore the index, we see terms that we do not recognize. The index contains a total of 38673 documents, and Luke shows the following top ranked terms for this field:

Term | Rank  | Field | Text | Text (decoded as numeric-int)
 1   | 38673 | Axis  | x    | 0  
 2   | 38673 | Axis  | p    | 0  
 3   | 38673 | Axis  | t    | 0  
 4   | 38673 | Axis  | |    | 0  
 5   | 19421 | Axis  | l    | 0  
 6   | 19421 | Axis  | h    | 0  
 7   | 19421 | Axis  | d@   | 0  
 8   | 19252 | Axis  | `  N | 9999  
 9   | 19252 | Axis  | l    | 8192
10   | 19252 | Axis  | h  ' | 9984
11   | 19252 | Axis  | d@ p | 9984 
12   | 18209 | Axis  | `    | 4  
13   |   950 | Axis  | `    | 1  
14   |   116 | Axis  | `    | 5  
15   |   102 | Axis  | `    | 6  
16   |    26 | Axis  | `    | 3  
17   |    18 | Axis  | `    | 2  

We find the same pattern for other numeric fields.

Where do the unknown values come from?

There is 1 answer

Jf Beaulac (best answer)

NumericFields are indexed using a trie structure. The terms you see are part of it, but will not return results if you query for them.

Try indexing your NumericField with a precision step of Int32.MaxValue and the values will go away.

NumericField documentation:

... Within Lucene, each numeric value is indexed as a trie structure, where each term is logically assigned to larger and larger pre-defined brackets (which are simply lower-precision representations of the value). The step size between each successive bracket is called the precisionStep, measured in bits. Smaller precisionStep values result in larger number of brackets, which consumes more disk space in the index but may result in faster range search performance. The default value, 4, was selected for a reasonable tradeoff of disk space consumption versus performance. You can use the expert constructor NumericField(String,int,Field.Store,boolean) if you'd like to change the value. Note that you must also specify a congruent value when creating NumericRangeQuery or NumericRangeFilter. For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use Integer.MAX_VALUE, which produces one term per value. ...
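To make the brackets concrete, here is a rough sketch in plain arithmetic (not Lucene code) of the lower-precision values a single int produces for a given precisionStep. The class and method names are made up for illustration. The prefix character assumes the int shift marker starts at 0x60, which is my reading of NumericUtils.SHIFT_START_INT. The real terms are stored in a prefix-coded binary form that this sketch ignores.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
static class TrieBrackets
{
    // One bracket per shift 0, step, 2*step, ... below 32 bits.
    // Each bracket is the value with its lowest `shift` bits cleared.
    public static List<int> Brackets(int value, int precisionStep)
    {
        if (precisionStep < 1)
            throw new ArgumentOutOfRangeException("precisionStep");

        var result = new List<int>();
        // long counter so a huge step such as Int32.MaxValue cannot overflow
        for (long shift = 0; shift < 32; shift += precisionStep)
        {
            int mask = ~((1 << (int)shift) - 1);
            result.Add(value & mask);
        }
        return result;
    }

    // Assumption: first character of an int term is 0x60 + shift.
    public static char ShiftPrefix(int shift)
    {
        return (char)(0x60 + shift);
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Brackets(9999, 4)));
        // prints: 9999, 9984, 9984, 8192, 0, 0, 0, 0
        Console.WriteLine(string.Join(", ", Brackets(4, 4)));
        // prints: 4, 0, 0, 0, 0, 0, 0, 0
        Console.WriteLine(ShiftPrefix(16));
        // prints: p
    }
}
```

This lines up with the Luke listing in the question: 9999 yields 9984 (at shifts 4 and 8) and 8192 (at shift 12), while 1 to 6 collapse to 0 at every shift above 0. If the prefix assumption holds, the leading characters in Luke's Text column (`, d, h, l, p, t, x, |) are simply the shifts 0, 4, ..., 28.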

More details on the precision step available in the NumericRangeQuery documentation:

Good values for precisionStep are depending on usage and data type:

• The default for all data types is 4, which is used, when no precisionStep is given.

• Ideal value in most cases for 64 bit data types (long, double) is 6 or 8.

• Ideal value in most cases for 32 bit data types (int, float) is 4.

• For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use Integer.MAX_VALUE (see below).

• Steps ≥64 for long/double and ≥32 for int/float produces one token per value in the index and querying is as slow as a conventional TermRangeQuery. But it can be used to produce fields, that are solely used for sorting (in this case simply use Integer.MAX_VALUE as precisionStep). Using NumericFields for sorting is ideal, because building the field cache is much faster than with text-only numbers. These fields have one term per value and therefore also work with term enumeration for building distinct lists (e.g. facets / preselected values to search for). Sorting is also possible with range query optimized fields using one of the above precisionSteps.
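As a rough illustration of the disk-space side of that tradeoff, the sketch below counts how many distinct terms the question's value set (1 to 6 and 9999) would produce for a few precision steps. It is plain arithmetic with made-up names, not a call into Lucene.NET, and it identifies a term simply by its shift and bracket value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
static class TermCountSketch
{
    // Counts distinct (shift, bracket) pairs over all values.
    public static int DistinctTerms(IEnumerable<int> values, int precisionStep)
    {
        if (precisionStep < 1)
            throw new ArgumentOutOfRangeException("precisionStep");

        var terms = new HashSet<string>();
        foreach (int value in values)
        {
            // long counter so a huge step such as Int32.MaxValue cannot overflow
            for (long shift = 0; shift < 32; shift += precisionStep)
            {
                int mask = ~((1 << (int)shift) - 1);
                terms.Add(shift + ":" + (value & mask));
            }
        }
        return terms.Count;
    }

    static void Main()
    {
        int[] axisValues = { 1, 2, 3, 4, 5, 6, 9999 };
        Console.WriteLine(DistinctTerms(axisValues, 4));            // prints 17
        Console.WriteLine(DistinctTerms(axisValues, 8));            // prints 11
        Console.WriteLine(DistinctTerms(axisValues, int.MaxValue)); // prints 7
    }
}
```

The 17 terms for a step of 4 match the 17 rows Luke lists for the Axis field, and a step of 32 or more leaves only the 7 real values. In Lucene.NET the step would be passed through the expert NumericField constructor quoted above, with the same step given to the range query.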

EDIT

A small sample: the index it produces shows terms with values such as 8192, 9984 and 1792 in Luke, but a range query that would include those values does not return any results:

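// This snippet appears to target the Lucene.Net 2.9-era API (Hits, FSDirectory.GetDirectory).
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// No precisionStep is given, so the default of 4 applies.
NumericField number = new NumericField("number", Field.Store.YES, true);
Field regular = new Field("normal", "", Field.Store.YES, Field.Index.ANALYZED);

IndexWriter iw = new IndexWriter(FSDirectory.GetDirectory("C:\\temp\\testnum"), new StandardAnalyzer(), true);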

Document doc = new Document();
doc.Add(number);
doc.Add(regular);

number.SetIntValue(1);
regular.SetValue("one");
iw.AddDocument(doc);

number.SetIntValue(2);
regular.SetValue("one");
iw.AddDocument(doc);

number.SetIntValue(13);
regular.SetValue("one");
iw.AddDocument(doc);

number.SetIntValue(2000);
regular.SetValue("one");
iw.AddDocument(doc);

number.SetIntValue(9999);
regular.SetValue("one");
iw.AddDocument(doc);

iw.Commit();

IndexSearcher searcher = new IndexSearcher(iw.GetReader());

NumericRangeQuery rangeQ = NumericRangeQuery.NewIntRange("number", 1, 2, true, true);
var docs = searcher.Search(rangeQ);
Console.WriteLine(docs.Length().ToString()); // prints 2

rangeQ = NumericRangeQuery.NewIntRange("number", 13, 13, true, true);
docs = searcher.Search(rangeQ);
Console.WriteLine(docs.Length().ToString()); // prints 1

// The trie term 9984 lies inside 9000..9998, but no document has a value in that range.
rangeQ = NumericRangeQuery.NewIntRange("number", 9000, 9998, true, true);
docs = searcher.Search(rangeQ);
Console.WriteLine(docs.Length().ToString()); // prints 0

Console.ReadLine();
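One way to read the last result, as far as I understand the range splitting: a lower-precision term stands for a whole bracket of values, and the range query can only use it when that entire bracket lies inside the requested range. Otherwise it falls back to the full-precision terms at the edges. The sketch below checks just that condition. It is a simplification with made-up names, not the actual NumericRangeQuery logic.

```csharp
using System;

// Hypothetical helper, not part of Lucene.NET.
static class BracketCoverage
{
    // True if every value in [bracketStart, bracketStart + 2^shift - 1]
    // lies inside the inclusive range [min, max].
    public static bool FullyInside(int bracketStart, int shift, int min, int max)
    {
        long bracketEnd = (long)bracketStart + (1L << shift) - 1;
        return bracketStart >= min && bracketEnd <= max;
    }

    static void Main()
    {
        // The term 9984 at shift 4 stands for the values 9984..9999.
        Console.WriteLine(FullyInside(9984, 4, 9000, 9998)); // prints False
        Console.WriteLine(FullyInside(9984, 4, 9000, 9999)); // prints True
    }
}
```

For the range 9000 to 9998, the bracket 9984..9999 sticks out by one value, so only exact values up to 9998 are considered and the document with 9999 is correctly left out.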