Preserving emails while tokenizing based on . with lucene

Question

146 views Asked by STEMExchanger At 24 June 2016 at 07:43

I would like to tokenize strings on characters such as . , ; and so on, but preserve email addresses, IP addresses, and the like as single tokens. How do I use an analyzer with Lucene to do this? The following code, which I found on Stack Overflow, does not preserve emails. Any pointers to documentation on how to use the pattern specification feature of Lucene's StandardAnalyzer would also be helpful. Thanks much.

      String text
         = "Lucene is simple yet powerful java based search library. [email protected]";
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

      TokenStream tokenStream = analyzer.tokenStream(
         LuceneConstants.CONTENTS, new StringReader(text));

      // TermAttribute exposes the text of the current token.
      TermAttribute term = tokenStream.addAttribute(TermAttribute.class);

      // Print each token in brackets.
      while (tokenStream.incrementToken()) {
         System.out.print("[" + term.term() + "] ");
      }

Original Q&A

There is 1 answer

**femtoRgon** · Answer 1 · 2016-06-24T14:42:12+00:00

ClassicAnalyzer, which was the StandardAnalyzer before version 3.1, handles email addresses and IP addresses in the way you are looking for.

It's less refined on text segmentation in general than StandardAnalyzer (especially for non-European languages), but works well for your test case.
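
To make the target behavior concrete, here is a minimal sketch in plain java.util.regex. It is not Lucene code and not the grammar ClassicAnalyzer actually uses; the class name EmailPreservingTokenizer, the simplified email and IPv4 patterns, and the sample address are all assumptions made for illustration. It only shows the kind of output to expect when emails and IP addresses are kept whole while other text is split on punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain java.util.regex, not Lucene's tokenizer grammar.
class EmailPreservingTokenizer {

    // Alternatives are tried left to right, so emails and IPv4 addresses
    // are matched as whole tokens before the generic word rule applies.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+" // simplified email
        + "|\\d{1,3}(?:\\.\\d{1,3}){3}"                         // simplified IPv4
        + "|[A-Za-z0-9]+");                                     // ordinary word

    // Returns lowercased tokens; punctuation between tokens is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; the address is a placeholder.
        String text = "Mail user@example.com, or ping 192.168.0.1; thanks.";
        System.out.println(tokenize(text));
    }
}
```

With Lucene itself, the change suggested above amounts to constructing a ClassicAnalyzer in place of the StandardAnalyzer in the question's code; the rest of the token loop stays the same.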
