Optimization for Comparing HashMap Elements while Iterating

Question

Optimization for Comparing HashMap Elements while Iterating

247 views Asked by Yilmazerhakan At 06 January 2017 at 22:24

I have a csv file with names nearly 845k line.

I want to compare fuzzy name string matching. I used Java fuzzy string matching implementation of the well known Python's fuzzywuzzy algorithm.

Implemented below code it works perfect for me. The Problem is process time to much. Every line compare time is nearly 15 sec with other lines. This is 240 line for an hour and whole process will be nearly 6000 row. And all process will be finish in months. This is unacceptable working time.

I need an optimization technique or method. I need some suggestion rather than solution.

What you suggest for below code?

BufferedReader br = new BufferedReader(new FileReader("data/names.csv"));
BufferedWriter bw = new BufferedWriter(new FileWriter("data/similars.csv"));
ConcurrentHashMap<Integer,String> map = new ConcurrentHashMap<Integer,String>();

String lines;
while( (lines = br.readLine()) != null ){
    String[] line = lines.split("\\t",-1);
    Integer nameId = Integer.parseInt(line[0]);
    String name = line[1];
    map.put(nameId, name);
}

for (Map.Entry<Integer, String> entry1 : map.entrySet()) {
    Integer nameId1 = entry1.getKey();
    String name1 = entry1.getValue();

    for (Map.Entry<Integer, String> entry2 : map.entrySet()) {
        Integer nameId2 = entry2.getKey();
        if (nameId1 == nameId2) {
            continue;
        }
        String name2 = entry2.getValue();
        int ratio = FuzzySearch.ratio(name1,name2);
        if(ratio > 95){
            bw.write(nameId1 + "," + nameId2 + "\n");
        }
    }
    // For to prevent matching same pairs again 
    map.remove(nameId1);
}

Original Q&A

There are 1 answers

**Anton** · Answer 1 · 2017-01-06T22:59:16+00:00

You can try instead Levenshtein distance algorithm, maybe it will give you better performance. Or try any other algorithm. Provide a link to algoritm implementation
It's better not to compare Integer with ==, use nameId1.intValue() == nameId2
Create N threads, where N is the number of cores. Put all your lines in the ConcurrentLinkedQueue. Let you threads poll the queue, take a word, make compassion, once finished - write to file under synchronized section. It will allow you to use all your cores on your PC, not only 1
Why it takes so much time, maybe you have some memory restriction, which force GC to eat your CPU cycles and affect performance.
You can apply some minor optimization, let's say if different between words length is more then 50%, you will never get 95% match
Take a look at this implementation they are using threshold value, I believe it will give you max boost, I think algoritm will return earlier if distance is greater than threshold. Also check this question

TechQA.

Optimization for Comparing HashMap Elements while Iterating

There are 1 answers

Related Questions in JAVA

Related Questions in HASHMAP

Related Questions in ITERATION

Related Questions in SIMILARITY

Related Questions in CONCURRENTHASHMAP

Popular Questions

Popular Tags

Trending Questions