Preserving emails while tokenizing on "." with Lucene


I would like to tokenize strings on punctuation such as ".", "," and ";", but preserve email addresses, IP addresses and the like as single tokens. How do I use an analyzer with Lucene to do this? The following code, which I found on Stack Overflow, does not preserve emails. Any pointers to documentation on how to use the pattern specification feature of Lucene's StandardAnalyzer would also be helpful. Thanks very much.

      String text
         = "Lucene is simple yet powerful java based search library. [email protected]";
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

      TokenStream tokenStream = analyzer.tokenStream(
         LuceneConstants.CONTENTS, new StringReader(text));

      TermAttribute term = tokenStream.addAttribute(TermAttribute.class);

      // Print each token produced by the analyzer in brackets
      while (tokenStream.incrementToken()) {
         System.out.print("[" + term.term() + "] ");
      }

There is 1 answer

femtoRgon

ClassicAnalyzer, which was the StandardAnalyzer before version 3.1, handles email addresses and IP addresses in the way you are looking for.

It's less refined on text segmentation in general than StandardAnalyzer (especially for non-European languages), but works well for your test case.
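
As a rough illustration, here is a minimal sketch of swapping in ClassicAnalyzer. It assumes Lucene 3.1 through 4.x, where the constructor takes a Version argument and CharTermAttribute replaces the older TermAttribute; later releases changed the constructor, so adjust for your version. The class name ClassicAnalyzerDemo, the field name "contents" and the sample addresses are placeholders for illustration, not anything from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical demo class; assumes Lucene 3.1 - 4.x on the classpath.
public class ClassicAnalyzerDemo {

    // Runs the text through ClassicAnalyzer and collects the emitted terms.
    public static List<String> tokenize(String text) throws IOException {
        List<String> tokens = new ArrayList<String>();
        Analyzer analyzer = new ClassicAnalyzer(Version.LUCENE_31);
        // "contents" is a placeholder field name.
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        try {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                  // required before the first incrementToken() in 4.x
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        } finally {
            stream.close();
            analyzer.close();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Sample input with a made-up email address and IP address.
        String text = "Lucene is a search library. Contact admin@example.com or 192.168.0.1.";
        for (String token : tokenize(text)) {
            System.out.print("[" + token + "] ");
        }
        System.out.println();
    }
}
```

Note that ClassicAnalyzer also lowercases terms and drops English stop words, so the printed tokens will not match the input word for word.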