I am working on a project where I need to get the root of a given word (stemming). As you know, stemming algorithms that don't use a dictionary are not accurate. I also tried WordNet, but it is not a good fit for my project. I found the phpMorphy project, but it doesn't provide a Java API.
So I am now looking for a database or a text file of English words with their different forms. For example:
run running ran ...
include including included ...
...
Thank you for your help or advice.
You could download LanguageTool (disclaimer: I'm the maintainer), which comes with a binary file, english.dict. The LanguageTool Wiki describes how to dump that file as a text file. In the dumped file, each entry (for example, the entries for run) has three columns: the first column is the inflected form, the second is the base form, and the third is the part-of-speech tag according to the (slightly extended) Penn Treebank tagset.
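Since the question asks for something usable from Java, here is a minimal sketch of how such a three-column dump could be loaded into lookup tables. It assumes the columns are separated by tabs, which you should check against your actual dump. The class name `LemmaLookup`, both method names, and the sample lines in `main` are made up for illustration and are not part of LanguageTool.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of LanguageTool.
final class LemmaLookup {

    // Maps each inflected form (column 1) to the base forms (column 2) it can have.
    static Map<String, Set<String>> baseFormsByInflection(List<String> lines) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String line : lines) {
            // Assumption: columns are tab-separated; adjust if your dump differs.
            String[] cols = line.split("\t");
            if (cols.length < 3) {
                continue; // skip blank or malformed lines
            }
            result.computeIfAbsent(cols[0], k -> new LinkedHashSet<>()).add(cols[1]);
        }
        return result;
    }

    // Inverse view: maps each base form to all inflected forms seen for it.
    static Map<String, Set<String>> inflectionsByBaseForm(List<String> lines) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length < 3) {
                continue;
            }
            result.computeIfAbsent(cols[1], k -> new LinkedHashSet<>()).add(cols[0]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed format, not real dump output.
        List<String> sample = List.of(
                "running\trun\tVBG",
                "ran\trun\tVBD",
                "included\tinclude\tVBD");
        System.out.println(baseFormsByInflection(sample).get("ran"));
        System.out.println(inflectionsByBaseForm(sample).get("run"));
    }
}
```

For a real dump you would pass in the result of `Files.readAllLines(Path.of(...))` instead of the sample list. A word can map to more than one base form, which is why the values are sets rather than single strings.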