How can I detect Farsi web pages with Tika?

I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit.

    import org.apache.tika.language.LanguageIdentifier;

    LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
    String language = identifier.getLanguage();

I have downloaded the Apache Tika jar files and added them to the classpath. This code works for English but gives an error for Farsi. How can I add Farsi to Tika's LanguageIdentifier?

There is 1 answer

Kai Sternad On BEST ANSWER

Tika doesn't ship with a language profile for Farsi yet. As of version 1.0, 27 languages are supported out of the box:

languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk

In your example the input is misdetected as lt (Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022, so the result is not considered reliable. See the source code for more information on the inner workings of LanguageIdentifier.
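
To see why a distance of 0.41 counts as a failed detection, it can help to look at a toy version of the idea. The sketch below assumes a profile is just a table of relative trigram frequencies and that two profiles are compared by Euclidean distance. `TrigramProfile` and `isCertain` are hypothetical names for illustration, not Tika classes, and Tika's real implementation may differ in detail.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy trigram profile; an illustration only, not Tika's implementation. */
final class TrigramProfile {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    TrigramProfile(String text) {
        // Slide a 3-character window over the text and count each trigram.
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
            total++;
        }
    }

    /** Euclidean distance between the relative trigram frequencies of two profiles. */
    double distance(TrigramProfile other) {
        Set<String> ngrams = new HashSet<>(counts.keySet());
        ngrams.addAll(other.counts.keySet());
        // Guard against division by zero for texts shorter than three characters.
        double thisTotal = Math.max(total, 1);
        double otherTotal = Math.max(other.total, 1);
        double sumOfSquares = 0.0;
        for (String ngram : ngrams) {
            double a = counts.getOrDefault(ngram, 0) / thisTotal;
            double b = other.counts.getOrDefault(ngram, 0) / otherTotal;
            sumOfSquares += (a - b) * (a - b);
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Hypothetical helper: a match only counts when the distance is below the threshold. */
    static boolean isCertain(double distance, double threshold) {
        return distance < threshold;
    }
}
```

Under this model, Farsi text shares no trigrams with any Latin-script or Cyrillic profile, so even the nearest profile is far away. Some language is still returned as the closest match, but `isCertain(0.41, 0.022)` is false.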

The Farsi language (Persian, ISO 639-1 2-letter code fa) is not recognized by default. If you want Tika to recognize another language, you have to create a language profile first.

For this the following steps are necessary:

  1. Find a text corpus for your language. I found the Hamshahri Collection. This should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.

  2. Create an ngram file for the language identifier. This can be done using TikaCLI:

    java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt

     This will create a file called fa.ngp which contains the n-grams.

  3. Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles() or put a property file with the name tika.language.override.properties into the classpath. Make sure the ngram file is in the classpath as well.
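
For step 1, turning the XML corpus into plain text could look roughly like the sketch below. It uses only the JDK's DOM parser. The class name `CorpusText` is hypothetical, and the element name holding the article body is passed in as a parameter because the actual tag used by the corpus files is an assumption you need to check.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper for step 1: pull article text out of a corpus XML file. */
final class CorpusText {

    /**
     * Returns the trimmed text content of every element named tagName, one per line.
     * The tag name depends on the corpus format and must be verified against the files.
     */
    static String extract(InputStream xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml);
            NodeList nodes = doc.getElementsByTagName(tagName);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.append(nodes.item(i).getTextContent().trim()).append('\n');
            }
            return out.toString();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers get a single unchecked exception.
            throw new IllegalArgumentException("could not parse corpus XML", e);
        }
    }
}
```

Write the returned string to fa-corpus.txt as UTF-8 so that it matches the -eUTF-8 option in step 2. The DOM parser loads the whole document into memory, so for a large corpus it may be better to process it file by file or switch to a streaming parser.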

If you now run Tika, it should correctly detect your language.
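
If detection still fails, the override file from step 3 is the first thing to check. A minimal version might look like the fragment below. It assumes the override file replaces the default language list rather than extending it, which is why all 27 built-in codes are repeated, and it assumes display names use a name.<code> key. Verify both assumptions against the LanguageIdentifier source for your Tika version.

```properties
# tika.language.override.properties (sketch, not taken from the Tika distribution)
# Built-in language codes from the list above, plus the new profile code "fa".
languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk,fa
# Display name for the new language; the key format is an assumption.
name.fa=Farsi
```

The generated fa.ngp file has to be loadable from the classpath alongside this file. If Tika reports that it cannot find the profile, check where LanguageIdentifier looks up its resources.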

Update: Detailed the steps necessary to create a language profile.