WordNetSimalarity on a large dataset of synsets


I use the WordNet Similarity Java API (WS4J) to measure the similarity between two synsets, like this:

    import edu.cmu.lti.lexical_db.ILexicalDatabase;
    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.*;
    import edu.cmu.lti.ws4j.util.WS4JConfiguration;

    public class WordNetSimalarity {
        private static ILexicalDatabase db = new NictWordNet();
        private static RelatednessCalculator[] rcs = {
                new HirstStOnge(db), new LeacockChodorow(db), new Lesk(db), new WuPalmer(db),
                new Resnik(db), new JiangConrath(db), new Lin(db), new Path(db)
        };

        public static double computeSimilarity(String word1, String word2) {
            // Note: this only needs to be set once, not on every call.
            WS4JConfiguration.getInstance().setMFS(true);
            double s = 0;
            // Each iteration overwrites s, so only the last calculator (Path)
            // determines the returned value.
            for (RelatednessCalculator rc : rcs) {
                s = rc.calcRelatednessOfWords(word1, word2);
                // System.out.println(rc.getClass().getName() + "\t" + s);
            }
            return s;
        }
    }

Main class

    import java.io.File;
    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.Scanner;

    public class Main {
        public static void main(String[] args) {
            long t0 = System.currentTimeMillis();

            File source = new File("TagsFiltered.txt");
            File target = new File("fich4.txt");
            ArrayList<String> sList = new ArrayList<>();

            try {
                if (!target.exists()) target.createNewFile();
                Scanner scanner = new Scanner(source);
                PrintStream psStream = new PrintStream(target);
                while (scanner.hasNextLine()) {
                    sList.add(scanner.nextLine());
                }
                scanner.close();
                // Compare every unordered pair of synsets once: i < j.
                for (int i = 0; i < sList.size(); i++) {
                    for (int j = i + 1; j < sList.size(); j++) {
                        psStream.println(sList.get(i) + " " + sList.get(j) + " "
                                + WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j)));
                    }
                }
                psStream.close();
            } catch (Exception e) {
                e.printStackTrace();
            }

            long t1 = System.currentTimeMillis();
            System.out.println("Done in " + (t1 - t0) + " msec.");
        }
    }

My database contains 595 synsets, which means computeSimilarity will be called 595*594/2 = 176,715 times. Computing the similarity between two words takes more than 5000 ms, so finishing the whole task would take at least a week!

My question is: how can I reduce this running time? How can I improve performance?


There are 4 answers

Answer by duffymo

I don't think the language is your issue.

You can help yourself with parallelism. I think this would be a good candidate for MapReduce and Hadoop.
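
As a concrete illustration of the parallelism idea (short of a full Hadoop setup), here is a minimal sketch that parallelizes the question's pairwise loop with Java parallel streams. Only computeSimilarity and the file names come from the question; the rest, including the assumption that the WS4J calculators can safely be called from multiple threads, is an assumption worth verifying:

    import java.io.IOException;
    import java.io.PrintStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.IntStream;

    public class ParallelSimilarity {
        public static void main(String[] args) throws IOException {
            List<String> sList = Files.readAllLines(Paths.get("TagsFiltered.txt"));
            try (PrintStream out = new PrintStream("fich4.txt")) {
                // Fan the outer index out across all cores; each (i, j) pair is independent.
                IntStream.range(0, sList.size()).parallel().forEach(i -> {
                    StringBuilder sb = new StringBuilder();
                    for (int j = i + 1; j < sList.size(); j++) {
                        double s = WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j));
                        sb.append(sList.get(i)).append(' ').append(sList.get(j))
                          .append(' ').append(s).append('\n');
                    }
                    synchronized (out) { // write one row's results atomically
                        out.print(sb);
                    }
                });
            }
        }
    }

Since every pair is independent, an N-core machine divides the wall-clock time by roughly N.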

Answer by Steve P.

Perl is different from a lot of other languages when it comes to threading/forking.

One of the key things that makes Perl threads different from other threads is that data is not shared by default. This makes threads much easier and safer to work with: you don't have to worry about the thread safety of libraries or of most of your code, only of the threaded parts. However, it can be a performance drag and memory hungry, because Perl must put a copy of the interpreter and all loaded modules into each thread.

When it comes to forking, I will only talk about Unix. Perl emulates fork on Windows using threads; it works, but it can be slow and buggy.

Forking Advantages

  • Very fast to create a fork
  • Very robust

Forking Disadvantages

  • Communication between processes can be slow and awkward

Thread Advantages

  • Thread coordination and data interchange is fairly easy
  • Threads are fairly easy to use

Thread Disadvantages

  • Each thread takes a lot of memory
  • Threads can be slow to start
  • Threads can be buggy (the more recent your Perl, the better)
  • Database connections are not shared across threads

In general, to get good performance out of Perl threads it's best to start a pool of threads and reuse them. Forks can more easily be created, used and discarded.

For either case, you're likely going to want something to manage your pool of workers. For forking you're going to want to use Parallel::ForkManager or Child. Child is particularly nice as it has built in inter-process communication.

For threads you're going to want to use threads::shared and Thread::Queue, and to read perlthrtut. Also, the number of threads should depend on the number of cores your machine has. If you have four cores, creating more than three worker threads isn't really going to help (three workers + one for your main program).

To be honest, though, threads/forking may not be the way to go. In fact, in many situations they can even slow things down because of the overhead. If you really need the speed, the best way to get it is through distributed computing. I would suggest looking into some sort of distributed computing platform to improve your runtime. If you can reduce the dimensionality of your search/compareTo space to less than n^2, then MapReduce or Hadoop may be a good option; otherwise, you'll just have a bunch of overhead and no use of the real scalability that Hadoop offers (@Thomas Jungblut).
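
Back in the asker's Java setting, the same pool-of-workers advice might look like the sketch below: a fixed-size ExecutorService sized to the core count, with one task per outer-loop row. computeSimilarity is the method from the question; the class name and the output handling are assumptions:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SimilarityPool {
        public static void run(List<String> sList) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores); // reusable worker pool
            for (int i = 0; i < sList.size(); i++) {
                final int row = i;
                pool.submit(() -> {
                    for (int j = row + 1; j < sList.size(); j++) {
                        double s = WordNetSimalarity.computeSimilarity(sList.get(row), sList.get(j));
                        System.out.println(sList.get(row) + " " + sList.get(j) + " " + s);
                    }
                });
            }
            pool.shutdown();                        // stop accepting tasks; workers drain the queue
            pool.awaitTermination(7, TimeUnit.DAYS);
        }
    }

Sizing the pool to availableProcessors() follows the cores rule of thumb above, and reusing the pool avoids the thread start-up cost mentioned in the disadvantages list.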

Answer by Marc Shapiro

Have you tried the MatrixCalculator?
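
If this refers to WS4J's MatrixCalculator utility (edu.cmu.lti.ws4j.util.MatrixCalculator), the idea is to compute a whole similarity matrix in one call instead of pair by pair. The sketch below assumes a static getSimilarityMatrix(String[], String[], RelatednessCalculator) method; the exact signature varies between WS4J versions, so treat it as an assumption:

    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.WuPalmer;
    import edu.cmu.lti.ws4j.util.MatrixCalculator;

    public class MatrixDemo {
        public static void main(String[] args) {
            String[] words = {"cat", "dog", "bird"}; // stand-ins for the 595 synsets
            RelatednessCalculator rc = new WuPalmer(new NictWordNet());
            // Assumed API: one call fills the whole words x words matrix.
            double[][] sim = MatrixCalculator.getSimilarityMatrix(words, words, rc);
            for (int i = 0; i < words.length; i++)
                for (int j = i + 1; j < words.length; j++)
                    System.out.println(words[i] + " " + words[j] + " " + sim[i][j]);
        }
    }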

Answer by Sergey Zyuzin

I don't know if it is possible to optimize this algorithm-wise.

But you can definitely run this much faster. On my machine this operation takes half as long, so with eight i7 cores you'd need about 15 hours to process everything (if you process the loop in parallel).

You can get virtual machines from Amazon Web Services. If you get several machines and run multithreaded processing on a different chunk of the data on each machine, you will finish in several hours.
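
One minimal way to do such chunking is to give each machine a slice of the outer loop. In this sketch the slice boundaries come from command-line arguments; the argument convention, class name, and output file naming are assumptions:

    import java.io.PrintStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class ChunkWorker {
        // Assumed usage: java ChunkWorker <firstRow> <lastRowExclusive>
        public static void main(String[] args) throws Exception {
            int from = Integer.parseInt(args[0]);
            int to = Integer.parseInt(args[1]);
            List<String> sList = Files.readAllLines(Paths.get("TagsFiltered.txt"));
            try (PrintStream out = new PrintStream("chunk_" + from + "_" + to + ".txt")) {
                // This machine handles rows [from, to); concatenate the chunk files afterwards.
                for (int i = from; i < to && i < sList.size(); i++) {
                    for (int j = i + 1; j < sList.size(); j++) {
                        out.println(sList.get(i) + " " + sList.get(j) + " "
                                + WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j)));
                    }
                }
            }
        }
    }

Note that early rows contain more pairs than late ones, so for an even load the slices near the top of the file should span fewer rows.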

Technically it is possible to use Hadoop for this, but if you need to run it only once, parallelizing the computation and launching it on several machines will be simpler, in my opinion.