Lucene: get all non-deleted documents from an index


I am trying to get all documents from a Lucene index that have not been deleted.

I have heard that when I delete something from a Lucene index, Lucene does not remove it from the index files immediately.

So I want to retrieve only those documents in the index files that are not deleted.

2

There are 2 answers

2
knutwalker On BEST ANSWER

Lucene provides a bitset of all non-deleted documents, called liveDocs. You can get it by iterating over all LeafReaders (or using the SlowCompositeReaderWrapper) and calling LeafReader#getLiveDocs, or by using the MultiFields class.

Once you have this bitset, you can iterate from 0 to IndexReader#maxDoc (exclusive) and consult the bitset to find out whether a doc ID represents a deleted document or a live one. You can access all stored fields of a deleted document just as you would those of a live one.

However, once a segment gets merged, its deleted documents are permanently deleted and thus removed from the index.
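
The iteration described above can be sketched roughly as follows. To keep the example runnable without Lucene on the classpath, it uses two hypothetical stand-ins that are not Lucene classes: a `Bits` interface modelled on org.apache.lucene.util.Bits, and a `Segment` class that plays the role of a LeafReader together with its docBase. In real code you would presumably loop over `reader.leaves()` and call `getLiveDocs()` and `maxDoc()` on each leaf's reader. One detail worth handling: Lucene returns null for liveDocs when a segment has no deletions, which means every document in it is live.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LiveDocsSketch {

    /** Hypothetical stand-in modelled on org.apache.lucene.util.Bits. */
    interface Bits {
        boolean get(int index);
        int length();
    }

    /** Hypothetical stand-in for one segment (a LeafReader plus its docBase). */
    static final class Segment {
        final int docBase;   // offset of this segment's first doc in the whole index
        final int maxDoc;    // one greater than the largest doc ID in this segment
        final Bits liveDocs; // null means the segment has no deletions

        Segment(int docBase, int maxDoc, Bits liveDocs) {
            this.docBase = docBase;
            this.maxDoc = maxDoc;
            this.liveDocs = liveDocs;
        }
    }

    /** Collects the index-wide doc IDs of all documents that are still live. */
    static List<Integer> liveDocIds(List<Segment> segments) {
        List<Integer> ids = new ArrayList<>();
        for (Segment segment : segments) {
            for (int doc = 0; doc < segment.maxDoc; doc++) {
                // A null bitset means nothing was deleted in this segment.
                if (segment.liveDocs == null || segment.liveDocs.get(doc)) {
                    ids.add(segment.docBase + doc);
                }
            }
        }
        return ids;
    }

    /** Helper for the demo: wraps a boolean array as a Bits instance. */
    static Bits bitsOf(boolean... flags) {
        return new Bits() {
            @Override public boolean get(int index) { return flags[index]; }
            @Override public int length() { return flags.length; }
        };
    }

    public static void main(String[] args) {
        // Segment a: doc 1 is deleted. Segment b: no deletions at all.
        Segment a = new Segment(0, 3, bitsOf(true, false, true));
        Segment b = new Segment(3, 2, null);
        System.out.println(liveDocIds(Arrays.asList(a, b))); // prints [0, 2, 3, 4]
    }
}
```

With a real index, each collected ID could then be passed to the reader to load the stored fields of that document.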

1
Bruno dos Santos On

It's not possible. When you delete a document from a Lucene index, it is not removed immediately, because rebuilding the whole index would be too expensive. Instead, the old document is flagged so that it is permanently removed the next time the index is optimized. It is no longer visible to you, though; it is only visible internally to Lucene. Once you remove a document and commit, it can no longer be retrieved.