How often should I re-warm my Lucene index?


I was wondering if anyone else has run into the same situation with Lucene (not Solr).

When I open a Lucene index, I warm it with a typical query and then keep the searcher cached for a period of time so that many queries can use it. I then re-open it and repeat. Because I am running Lucene 3.6 on Linux, as I understand it, most of my open index data resides in the filesystem cache rather than the JVM heap. What I find is that the response time for queries increases over time unless I keep re-warming the searcher by re-running my typical query. Has anyone else had this issue? If so, is re-warming the only way to keep the queries responsive? How often works best?

Some background

  • the machine is always very busy doing other non-Lucene file processing, which makes me suspect the F/S cache pages are being replaced over time
  • my indexer does not run in the same JVM as my query server, so NRT etc. isn't relevant
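
The re-warming loop described above could be scheduled roughly as follows. This is only a sketch using the JDK, not Lucene API: `PeriodicWarmer` and the `warmQuery` callback are hypothetical names, and the callback stands in for running the typical query against the cached searcher.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Re-runs a warm-up task at a fixed delay on a background daemon thread.
 * Hypothetical helper: the task is a placeholder for "run the typical query".
 */
final class PeriodicWarmer implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "index-warmer");
                t.setDaemon(true); // do not keep the JVM alive just for warming
                return t;
            });
    private final AtomicLong runs = new AtomicLong();

    PeriodicWarmer(Runnable warmQuery, long interval, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                warmQuery.run();
                runs.incrementAndGet();
            } catch (RuntimeException e) {
                // Swallowed on purpose: an uncaught exception would cancel
                // all future runs of a scheduled task.
            }
        }, 0, interval, unit);
    }

    /** Number of warm-up runs that completed without throwing. */
    long completedRuns() {
        return runs.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The interval passed to the constructor is the tuning knob the question is about; the sketch does not imply any particular value.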

Thanks!

Chris


There are 2 answers

mindas On

Which directory are you using?

You can try playing with swappiness, as explained at http://wiki.apache.org/lucene-java/ImproveSearchingSpeed.

Another option would be using mlockall as explained in http://jprante.github.io/applications/2012/07/26/Mmap-with-Lucene.html.

Salah On

I don't think this issue is related to Lucene itself; I think it is an OS issue. As you know, Lucene uses the Java I/O libraries, which in turn use the OS's native I/O methods.

What I think is happening is this: each time you warm your searcher with a query, the OS caches the files read by that query. If you re-warm the searcher with the same query, it returns quickly, but if you warm it with a different query, the OS has to cache files again because different files are being read. That is a real overhead on your OS resources.
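
If the OS file cache really is the cause, one crude way to check is to read the index files sequentially and see whether the next query gets faster. The sketch below uses only the JDK and assumes the index is a flat directory of files, as Lucene file-system indexes are. `PageCacheToucher` and `touch` are hypothetical names, not part of Lucene, and whether the pages stay cached depends on memory pressure from the other processes on the machine.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that reads index files so the OS pulls them into its file cache. */
final class PageCacheToucher {

    /**
     * Reads every regular file directly under indexDir and discards the bytes.
     * Returns the total number of bytes read.
     */
    static long touch(Path indexDir) {
        byte[] buffer = new byte[1 << 16]; // 64 KiB scratch buffer, contents ignored
        long total = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and special files
                }
                try (InputStream in = Files.newInputStream(file)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        total += n;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

Calling `PageCacheToucher.touch(indexDir)` before a timed query and comparing against an untouched run would show how much of the slowdown comes from evicted pages.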

But I am really wondering why you want to keep your reader open for a period of time. What I am trying to say is that if the search queries come from users, the chance of the same query being repeated is very low, and creating a new IndexSearcher object is not that expensive.

So my suggestion is to create an IndexSearcher for each query and release its resources once the job is finished, if your business case can work with that.