Lucene searching

I am stuck using Apache Lucene in a Java Swing application, and I am not sure how best to phrase the problem, so please bear with me.

I have a set of HTML files that the client needs to browse inside the Swing application. To provide search, I decided to index them with Apache Lucene. Searching works, but now I want to display the HTML file that matched the search criteria. The control I have to use for displaying the contents of the HTML file is a JEditorPane.

How should I index the HTML files, and how do I get the content of an HTML file back from the Lucene index? The HTML files contain not only text but also links, images, etc.

Thanks in advance for any help.

There is 1 answer

Vikdor On

In one of our projects where we employed Lucene for full text indexing & search, we handled HTML files as follows:

  • Stored the HTML document as is on disk (you can store it in the DB as well).
  • Using the Jericho HTML Parser's HTML-to-text converter, we extracted the text, links, etc. from the HTML documents.
  • Each Lucene document had fields that stored metadata about the HTML file, in addition to the text content of the HTML in tokenized form.
  • Used StandardAnalyzer so that certain tokens, such as email addresses and website links, were kept as is during tokenization before indexing.
  • Upon searching the index, the hits returned contained the metadata of the HTML files that matched the criteria, so we could identify which stored HTML file to display for a given search result.
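
The text-extraction step above can be sketched roughly as follows. This is only an illustration, not the code from our project: it swaps Jericho for the HTML parser that ships with the JDK (`ParserDelegator`) so that it runs without extra libraries, and the class and method names (`HtmlTextExtractor`, `Extracted`, `extract`) are made up for the example. The Lucene indexing itself is not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls the visible text and the link targets out of an
// HTML string, using the JDK's built-in parser as a stand-in for Jericho.
final class HtmlTextExtractor {

    /** Plain text and <a href> targets extracted from one HTML document. */
    static final class Extracted {
        final String text;
        final List<String> links;

        Extracted(String text, List<String> links) {
            this.text = text;
            this.links = links;
        }
    }

    static Extracted extract(String html) {
        StringBuilder text = new StringBuilder();
        List<String> links = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Collect each text run, separated by a space.
                text.append(data).append(' ');
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Remember the target of every <a href="..."> element.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // 'true' tells the parser to ignore charset declarations in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Extracted(text.toString().trim(), links);
    }
}
```

The extracted text is what would go into the tokenized content field of the Lucene document, while the path of the original file would be kept as stored metadata. When a hit comes back, that path can be turned into a URL (for example with `File.toURI().toURL()`) and passed to `JEditorPane.setPage(URL)`, so the pane renders the original HTML from disk rather than anything reconstructed from the index.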

HTH.