I would like to tokenize strings on delimiters such as . , ; while preserving email addresses, IP addresses, and the like. How do I use an analyzer with Lucene to do this? The following code, which I found on Stack Overflow, does not preserve email addresses. Any pointers to documentation on the pattern specification feature of Lucene's StandardAnalyzer would also be helpful. Thanks.
String text =
    "Lucene is simple yet powerful java based search library. [email protected]";
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
TokenStream tokenStream = analyzer.tokenStream(
    LuceneConstants.CONTENTS, new StringReader(text));
// TermAttribute exposes the text of the current token
TermAttribute term = tokenStream.addAttribute(TermAttribute.class);
while (tokenStream.incrementToken()) {
    System.out.print("[" + term.term() + "] ");
}
ClassicAnalyzer, which keeps the behavior StandardAnalyzer had before Lucene 3.1, handles email addresses and IP addresses the way you are looking for.
Its general text segmentation is less refined than StandardAnalyzer's (especially for non-European languages), but it works well for your test case.
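To make the target behavior concrete, here is a minimal plain-Java sketch of a tokenizer that splits on punctuation but keeps email addresses and IPv4 addresses whole. It uses only java.util.regex and is not Lucene's grammar or ClassicAnalyzer's implementation. The class name PreservingTokenizer, the tokenize method, and the simplified email and IPv4 patterns are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: a regex-based illustration only.
public class PreservingTokenizer {

    // Alternatives are tried left to right, so the more specific shapes come first.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+" // simplified email shape (assumption)
        + "|\\d{1,3}(?:\\.\\d{1,3}){3}"                         // simplified IPv4 shape (assumption)
        + "|[A-Za-z0-9]+");                                     // plain word or number

    // Returns the tokens in order of appearance; everything that matches
    // none of the alternatives (spaces, '.', ',', ';', ...) acts as a delimiter.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Ping 192.168.0.1; mail user@example.com, thanks."));
    }
}
```

With the sample input in main, this prints [Ping, 192.168.0.1, mail, user@example.com, thanks]. The patterns are deliberately loose (they do not validate octet ranges or full email syntax), so treat this as a way to check what output you expect from an analyzer rather than as a replacement for one.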