I am building a search index with CLucene and I want to make sure docs containing any offensive terms never get added to the index. Using a StandardAnalyzer with a stop list is not good enough, since the stop list only drops the offensive words themselves: the doc still gets added and would be returned for non-offensive searches.
Instead I am hoping to build up a document, check whether it contains any offensive words, and add it only if it doesn't.
Cheers!
You can't really access that type of data in a Document.
What you can do is run the analysis chain manually on the text and check each token individually, before the document is added. You can do this in a simple loop over the tokens, or by adding another analyzer to the chain that just raises a flag you check later.
This introduces some more work, but it's the best way to achieve that IMO.
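To make the simple-loop variant concrete, here is a minimal sketch in plain C++. It deliberately does not call CLucene: `tokenize` is a hypothetical stand-in for running your real analyzer over the text, and `containsBlockedTerm` and the blocklist are names made up for illustration. In practice you'd want the tokens to come from the same analyzer you index with, so the check sees exactly what the index would see.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the analysis chain: lowercases the text and
// splits it on anything that is not a letter or digit. A real setup would
// pull tokens from the analyzer used for indexing instead.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Returns true if any whole token of the text is in the blocklist.
// The blocklist entries are assumed to be lowercase already.
bool containsBlockedTerm(const std::string& text,
                         const std::unordered_set<std::string>& blocked) {
    for (const std::string& token : tokenize(text)) {
        if (blocked.count(token) > 0) {
            return true;
        }
    }
    return false;
}
```

You would call `containsBlockedTerm` on the text of each field you're about to index, and only hand the Document to the IndexWriter if it returns false. Note this matches whole tokens only, so stemmed or inflected variants need to be in the blocklist (or handled by the analyzer) to be caught.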