I use the WordNet Similarity Java API (WS4J) to measure the similarity between two synsets, like this:
import edu.cmu.lti.lexical_db.ILexicalDatabase;
import edu.cmu.lti.lexical_db.NictWordNet;
import edu.cmu.lti.ws4j.RelatednessCalculator;
import edu.cmu.lti.ws4j.impl.*;
import edu.cmu.lti.ws4j.util.WS4JConfiguration;

public class WordNetSimalarity {

    private static ILexicalDatabase db = new NictWordNet();
    private static RelatednessCalculator[] rcs = {
            new HirstStOnge(db), new LeacockChodorow(db), new Lesk(db), new WuPalmer(db),
            new Resnik(db), new JiangConrath(db), new Lin(db), new Path(db)
    };

    public static double computeSimilarity(String word1, String word2) {
        WS4JConfiguration.getInstance().setMFS(true); // only consider the most frequent sense
        double s = 0;
        for (RelatednessCalculator rc : rcs) {
            // note: s is overwritten each iteration, so only the last
            // calculator's (Path) score is actually returned
            s = rc.calcRelatednessOfWords(word1, word2);
            // System.out.println(rc.getClass().getName() + "\t" + s);
        }
        return s;
    }
}
Main class
import java.io.File;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Scanner;

public class Main {

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();
        File source = new File("TagsFiltered.txt");
        File target = new File("fich4.txt");
        ArrayList<String> sList = new ArrayList<>();
        try {
            if (!target.exists()) target.createNewFile();
            Scanner scanner = new Scanner(source);
            PrintStream psStream = new PrintStream(target);
            while (scanner.hasNextLine()) {
                sList.add(scanner.nextLine());
            }
            scanner.close();
            // score every unordered pair of synsets
            for (int i = 0; i < sList.size(); i++) {
                for (int j = i + 1; j < sList.size(); j++) {
                    psStream.println(sList.get(i) + " " + sList.get(j) + " "
                            + WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j)));
                }
            }
            psStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        long t1 = System.currentTimeMillis();
        System.out.println("Done in " + (t1 - t0) + " msec.");
    }
}
My database contains 595 synsets, which means computeSimilarity will be called 595*594/2 = 176,715 times. Computing the similarity between two words takes more than 5000 ms, so the whole run would need roughly 176,715 * 5 s, which is about ten days. To finish my task I need at least one week!

My question is: how can I reduce this time? How can I improve performance?
I don't think the language is your issue.

You can help yourself with parallelism: every pair is scored independently of the others, so the work splits cleanly. I think this would be a good candidate for MapReduce and Hadoop.
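Before setting up a cluster, a cheaper first step is to use the cores you already have. Here is a minimal sketch of the same pairwise loop on top of Java's parallel streams; ParallelSimilarity is a hypothetical name, and it assumes your computeSimilarity (and the WS4J calculators underneath) tolerate concurrent calls. If they don't, give each thread its own RelatednessCalculator instances.

import java.io.File;
import java.io.PrintStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.IntStream;

public class ParallelSimilarity {

    public static void main(String[] args) throws Exception {
        List<String> sList = Files.readAllLines(Paths.get("TagsFiltered.txt"));
        try (PrintStream out = new PrintStream(new File("fich4.txt"))) {
            // Each unordered pair (i, j) is scored independently, so the
            // outer loop can be fanned out over all cores via the
            // common fork/join pool.
            IntStream.range(0, sList.size()).parallel().forEach(i -> {
                for (int j = i + 1; j < sList.size(); j++) {
                    double s = WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j));
                    synchronized (out) { // keep output lines from interleaving
                        out.println(sList.get(i) + " " + sList.get(j) + " " + s);
                    }
                }
            });
        }
    }
}

On an 8-core machine this should cut the wall-clock time by roughly a factor of eight, and the same decomposition carries over to Hadoop if one machine is still not enough: have each mapper score a slice of the pairs and let the reducer simply concatenate the result lines.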