I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit.
import org.apache.tika.language.LanguageIdentifier;
// Guess the language of the given text.
LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
String language = identifier.getLanguage();
I have downloaded the Apache Tika jar files and added them to the classpath. This code works for English, but it gives an error for Farsi. How can I add Farsi to Tika's LanguageIdentifier?
Tika doesn't ship with a language profile for Farsi yet. As of version 1.0, 27 languages are supported out of the box.
In your example the input is misdetected as lt (Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022. See the source code for more information on the inner workings of LanguageIdentifier.
The Farsi language (Persian, ISO 639-1 two-letter code fa) is not recognized by default. If you want Tika to recognize another language, you have to create a language profile first. This takes the following steps:
1. Find a text corpus for your language. I found the Hamshahri Collection, which should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.
2. Create an ngram file for the language identifier. This can be done using TikaCLI:
java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt
This will create a file called fa.ngp which contains the n-grams.
3. Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles() or put a property file with the name tika.language.override.properties into the classpath. Make sure the ngram file is in the classpath as well.
If you now run Tika, it should correctly detect your language.
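For the first step, turning the XML corpus into a plain text file can be done with the XML parser that ships with the JDK. The following is only a rough sketch: the element name TEXT is an assumption about the corpus layout (check the files you actually downloaded), and the names CorpusToText and extractText are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class CorpusToText {

    // Collects the character data found inside every <elementName> element,
    // including text of nested child elements, and ends each element with a newline.
    static String extractText(InputStream xml, String elementName) {
        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            int depth = 0; // greater than zero while inside a matching element
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    depth++;
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    depth--;
                    if (depth == 0) {
                        out.append('\n');
                    }
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) && depth > 0) {
                    out.append(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse corpus XML", e);
        }
        return out.toString();
    }

    // Usage: java CorpusToText corpus.xml fa-corpus.txt
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // "TEXT" is an assumed element name, not taken from the corpus documentation.
            String text = extractText(in, "TEXT");
            Files.write(Paths.get(args[1]), text.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

The sketch keeps the whole extracted text in memory, so for a large corpus it is probably easier to run it on a few files at a time. The output is written as UTF-8, which matches the -eUTF-8 option passed to TikaCLI in step 2.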
Update: Detailed the steps necessary to create a language profile.
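As background for the distance and threshold figures mentioned above, here is a simplified sketch of how an n-gram based identifier can compare an input against stored language profiles. It is not Tika's code: the class NgramDistanceSketch, its method names, and the sample sentences are made up for illustration, and it assumes character trigrams and a Euclidean distance over relative frequencies. The threshold value is the one quoted in this answer.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class NgramDistanceSketch {

    // Threshold quoted in the answer; a match is only trusted below this distance.
    static final double CERTAINTY_THRESHOLD = 0.022;

    // Counts the overlapping character trigrams of the text.
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean distance between the relative trigram frequencies of two profiles.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = Math.max(1, a.values().stream().mapToInt(Integer::intValue).sum());
        double totalB = Math.max(1, b.values().stream().mapToInt(Integer::intValue).sum());
        Set<String> ngrams = new HashSet<>(a.keySet());
        ngrams.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String ngram : ngrams) {
            double diff = a.getOrDefault(ngram, 0) / totalA - b.getOrDefault(ngram, 0) / totalB;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Tiny stand-in "profile"; a real one is built from a large corpus.
        Map<String, Integer> english = trigramCounts("the quick brown fox jumps over the lazy dog");
        Map<String, Integer> input = trigramCounts("فارسی");

        double d = distance(input, english);
        System.out.println("distance to English profile: " + d);
        System.out.println("reasonably certain: " + (d < CERTAINTY_THRESHOLD));
    }
}
```

With no Farsi profile available, every stored profile is far away from Farsi input, so the identifier can only report whichever wrong language happens to be closest, and the large distance is what marks the result as uncertain.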