WordNetSimalarity on a large dataset of synsets


I use the WordNet Similarity Java API (WS4J) to measure the similarity between two synsets, like this:

    import edu.cmu.lti.lexical_db.ILexicalDatabase;
    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.*;
    import edu.cmu.lti.ws4j.util.WS4JConfiguration;

    public class WordNetSimalarity {
        private static ILexicalDatabase db = new NictWordNet();
        private static RelatednessCalculator[] rcs = {
                new HirstStOnge(db), new LeacockChodorow(db), new Lesk(db), new WuPalmer(db),
                new Resnik(db), new JiangConrath(db), new Lin(db), new Path(db)
        };

        public static double computeSimilarity(String word1, String word2) {
            // Note: this only needs to be set once, not on every call.
            WS4JConfiguration.getInstance().setMFS(true);
            double s = 0;
            // Each iteration overwrites s, so only the last calculator (Path)
            // determines the returned value.
            for (RelatednessCalculator rc : rcs) {
                s = rc.calcRelatednessOfWords(word1, word2);
                // System.out.println(rc.getClass().getName() + "\t" + s);
            }
            return s;
        }
    }

Main class

    import java.io.File;
    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.Scanner;

    public class Main {
        public static void main(String[] args) {
            long t0 = System.currentTimeMillis();

            File source = new File("TagsFiltered.txt");
            File target = new File("fich4.txt");
            ArrayList<String> sList = new ArrayList<>();

            try {
                if (!target.exists()) target.createNewFile();
                Scanner scanner = new Scanner(source);
                PrintStream psStream = new PrintStream(target);
                while (scanner.hasNextLine()) {
                    sList.add(scanner.nextLine());
                }
                scanner.close();
                // Compare every unordered pair of synsets once: i < j.
                for (int i = 0; i < sList.size(); i++) {
                    for (int j = i + 1; j < sList.size(); j++) {
                        psStream.println(sList.get(i) + " " + sList.get(j) + " "
                                + WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j)));
                    }
                }
                psStream.close();
            } catch (Exception e) {
                e.printStackTrace();
            }

            long t1 = System.currentTimeMillis();
            System.out.println("Done in " + (t1 - t0) + " msec.");
        }
    }

My database contains 595 synsets, which means computeSimilarity will be called 595*594/2 = 176,715 times. Computing the similarity between two words takes more than 5000 ms, so finishing the whole task would take at least a week!

My question is: how can I reduce this running time? How can I improve performance?


There are 4 answers

Answer by duffymo

I don't think the language is your issue.

You can help yourself with parallelism. I think this would be a good candidate for MapReduce and Hadoop.
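
As a concrete illustration of the parallelism idea (short of a full Hadoop setup), here is a minimal sketch that parallelizes the question's pairwise loop with Java parallel streams. Only computeSimilarity and the file names come from the question; the rest, including the assumption that the WS4J calculators can safely be called from multiple threads, is an assumption worth verifying:

    import java.io.IOException;
    import java.io.PrintStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.IntStream;

    public class ParallelSimilarity {
        public static void main(String[] args) throws IOException {
            List<String> sList = Files.readAllLines(Paths.get("TagsFiltered.txt"));
            try (PrintStream out = new PrintStream("fich4.txt")) {
                // Fan the outer index out across all cores; each (i, j) pair is independent.
                IntStream.range(0, sList.size()).parallel().forEach(i -> {
                    StringBuilder sb = new StringBuilder();
                    for (int j = i + 1; j < sList.size(); j++) {
                        double s = WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j));
                        sb.append(sList.get(i)).append(' ').append(sList.get(j))
                          .append(' ').append(s).append('\n');
                    }
                    synchronized (out) { // write one row's results atomically
                        out.print(sb);
                    }
                });
            }
        }
    }

Since every pair is independent, an N-core machine divides the wall-clock time by roughly N.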

Answer by Steve P.

Perl is different from a lot of other languages when it comes to threading/forking.

One of the key things that makes Perl threads different from other threads is that data is not shared by default. This makes threads much easier and safer to work with: you don't have to worry about the thread safety of libraries or of most of your code, only of the threaded parts. However, it can be a performance drag and memory hungry, because Perl must put a copy of the interpreter and all loaded modules into each thread.

When it comes to forking, I will only talk about Unix. Perl emulates fork on Windows using threads; it works, but it can be slow and buggy.

Forking Advantages

  • Very fast to create a fork
  • Very robust

Forking Disadvantages

  • Communication between processes can be slow and awkward

Thread Advantages

  • Thread coordination and data interchange is fairly easy
  • Threads are fairly easy to use

Thread Disadvantages

  • Each thread takes a lot of memory
  • Threads can be slow to start
  • Threads can be buggy (the more recent your Perl, the better)
  • Database connections are not shared across threads

In general, to get good performance out of Perl threads it's best to start a pool of threads and reuse them. Forks can more easily be created, used and discarded.

For either case, you're likely going to want something to manage your pool of workers. For forking you're going to want to use Parallel::ForkManager or Child. Child is particularly nice as it has built in inter-process communication.

For threads you're going to want to use threads::shared and Thread::Queue, and to read perlthrtut. Also, the number of threads should depend on the number of cores your machine has. If you have four cores, creating more than three worker threads isn't really going to help (three workers + one for your main program).

To be honest, though, threads/forking may not be the way to go. In fact, in many situations they can even slow things down because of the overhead. If you really need the speed, the best way to get it is through distributed computing. I would suggest looking into some sort of distributed computing platform to improve your runtime. If you can reduce the dimensionality of your search/compareTo space to less than n^2, then MapReduce or Hadoop may be a good option; otherwise, you'll just have a bunch of overhead and no use of the real scalability that Hadoop offers (@Thomas Jungblut).
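
Back in the asker's Java setting, the same pool-of-workers advice might look like the sketch below: a fixed-size ExecutorService sized to the core count, with one task per outer-loop row. computeSimilarity is the method from the question; the class name and the output handling are assumptions:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SimilarityPool {
        public static void run(List<String> sList) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores); // reusable worker pool
            for (int i = 0; i < sList.size(); i++) {
                final int row = i;
                pool.submit(() -> {
                    for (int j = row + 1; j < sList.size(); j++) {
                        double s = WordNetSimalarity.computeSimilarity(sList.get(row), sList.get(j));
                        System.out.println(sList.get(row) + " " + sList.get(j) + " " + s);
                    }
                });
            }
            pool.shutdown();                        // stop accepting tasks; workers drain the queue
            pool.awaitTermination(7, TimeUnit.DAYS);
        }
    }

Sizing the pool to availableProcessors() follows the cores rule of thumb above, and reusing the pool avoids the thread start-up cost mentioned in the disadvantages list.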

Answer by Marc Shapiro

Have you tried the MatrixCalculator?
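
If this refers to WS4J's MatrixCalculator utility (edu.cmu.lti.ws4j.util.MatrixCalculator), the idea is to compute a whole similarity matrix in one call instead of pair by pair. The sketch below assumes a static getSimilarityMatrix(String[], String[], RelatednessCalculator) method; the exact signature varies between WS4J versions, so treat it as an assumption:

    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.WuPalmer;
    import edu.cmu.lti.ws4j.util.MatrixCalculator;

    public class MatrixDemo {
        public static void main(String[] args) {
            String[] words = {"cat", "dog", "bird"}; // stand-ins for the 595 synsets
            RelatednessCalculator rc = new WuPalmer(new NictWordNet());
            // Assumed API: one call fills the whole words x words matrix.
            double[][] sim = MatrixCalculator.getSimilarityMatrix(words, words, rc);
            for (int i = 0; i < words.length; i++)
                for (int j = i + 1; j < words.length; j++)
                    System.out.println(words[i] + " " + words[j] + " " + sim[i][j]);
        }
    }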

Answer by Sergey Zyuzin

I don't know if it is possible to optimize this algorithm-wise.

But you can definitely run this much faster. On my machine this operation takes half as long, so with eight i7 cores you'd need about 15 hours to process everything (if you process the loop in parallel).

You can get virtual machines from Amazon Web Services. If you get several machines and run multithreaded processing on a different chunk of the data on each machine, you will finish in several hours.
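
One minimal way to do such chunking is to give each machine a slice of the outer loop. In this sketch the slice boundaries come from command-line arguments; the argument convention, class name, and output file naming are assumptions:

    import java.io.PrintStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class ChunkWorker {
        // Assumed usage: java ChunkWorker <firstRow> <lastRowExclusive>
        public static void main(String[] args) throws Exception {
            int from = Integer.parseInt(args[0]);
            int to = Integer.parseInt(args[1]);
            List<String> sList = Files.readAllLines(Paths.get("TagsFiltered.txt"));
            try (PrintStream out = new PrintStream("chunk_" + from + "_" + to + ".txt")) {
                // This machine handles rows [from, to); concatenate the chunk files afterwards.
                for (int i = from; i < to && i < sList.size(); i++) {
                    for (int j = i + 1; j < sList.size(); j++) {
                        out.println(sList.get(i) + " " + sList.get(j) + " "
                                + WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j)));
                    }
                }
            }
        }
    }

Note that early rows contain more pairs than late ones, so for an even load the slices near the top of the file should span fewer rows.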

Technically it is possible to use Hadoop for this, but if you need to run it only once, parallelizing the computation and launching it on several machines will be simpler, in my opinion.