I'm developing a system that feeds text files to a StandardAnalyzer and then replaces each file's contents with the analyzer's output (the text tokenized, with all stop words removed). The code I've written so far is:
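File f = new File(path);
// tokenStream() takes a field name and a Reader over the text to analyze
TokenStream stream = analyzer.tokenStream("contents",
        new StringReader(readFileToString(f)));
CharTermAttribute charTermAttribute = stream.getAttribute(CharTermAttribute.class);
// The TokenStream contract calls for reset() before the first incrementToken()
stream.reset();
while (stream.incrementToken()) {
    String term = charTermAttribute.toString();
    System.out.print(term); // terms are printed back to back, with no separator
}
stream.end();
stream.close();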
// Following is the readFileToString(File f) function
String readFileToString(File f) throws FileNotFoundException {
    StringBuilder textBuilder = new StringBuilder();
    String ls = System.getProperty("line.separator");
    // try-with-resources closes the Scanner even if reading fails
    try (Scanner scanner = new Scanner(new FileInputStream(f))) {
        while (scanner.hasNextLine()) {
            textBuilder.append(scanner.nextLine()).append(ls);
        }
    }
    return textBuilder.toString();
}
readFileToString(f) is a simple function that returns the file contents as a string. The output I'm getting is the words run together, with the spaces and newlines between them removed. Is there a way to preserve the original spaces and newline characters in the analyzer output, so that I can replace the original file contents with the StandardAnalyzer's filtered output and present it in a readable form?
Tokenizers record where each term sits in the input (its start and end character offsets, exposed through OffsetAttribute), so in theory you could compare consecutive offsets to work out how many characters lie between tokens. They don't save the text that was between the tokens, though. So from the token stream alone you could get back spaces, but not newlines.
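If the original string is still in memory (it is here, since readFileToString returns it), the offsets can also be used to index back into that string and copy the whitespace from each gap. Below is a rough sketch of that idea in plain Java, not a tested Lucene integration. The class name, the rebuild method, and the hard-coded terms and offsets are all hypothetical stand-ins; in real code the terms would come from CharTermAttribute and the offsets from OffsetAttribute's startOffset() and endOffset() inside the incrementToken() loop.

```java
class WhitespacePreservingRebuild {

    // Copies only the whitespace characters found in original[from, to).
    // Removed stop words and punctuation in the gap are dropped.
    static void appendWhitespace(StringBuilder out, String original, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = original.charAt(i);
            if (Character.isWhitespace(c)) {
                out.append(c);
            }
        }
    }

    // terms[i] is the analyzed term; starts[i]/ends[i] are its character
    // offsets in the original text (assumed to be in increasing order).
    static String rebuild(String original, String[] terms, int[] starts, int[] ends) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int i = 0; i < terms.length; i++) {
            appendWhitespace(out, original, last, starts[i]);
            out.append(terms[i]);
            last = ends[i];
        }
        // Keep any trailing whitespace, such as a final newline.
        appendWhitespace(out, original, last, original.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical analyzer result for this input: "the" was removed as a stop word.
        String original = "the quick\nbrown fox";
        String[] terms = {"quick", "brown", "fox"};
        int[] starts = {4, 10, 16};
        int[] ends = {9, 15, 19};
        System.out.print(rebuild(original, terms, starts, ends));
    }
}
```

Note that the space which followed a removed stop word is kept, so the output can contain leading or doubled spaces; collapsing those would be a separate step.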
If you're comfortable with JFlex, you could modify the tokenizer's grammar to treat newlines as tokens. That's probably more work than the gain would justify, though.