I have a CSV file containing nearly 845k lines of names.
I want to find similar names using fuzzy string matching. I used the Java implementation of Python's well-known fuzzywuzzy algorithm.
I implemented the code below and it works perfectly for me. The problem is that it takes far too long: comparing one line against all the other lines takes nearly 15 seconds. That is 240 lines per hour, or roughly 6,000 rows per day, so the whole process would take months to finish. This is an unacceptable running time.
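To make those numbers concrete, here is a rough back-of-the-envelope calculation. It is only a sketch: the class and method names are made up for illustration, and it assumes every row costs the same 15 seconds, which overstates the total because rows get cheaper as entries are removed from the map.

```java
import java.time.Duration;

// Back-of-the-envelope estimate using the figures quoted in the question.
// RuntimeEstimate, pairCount and totalTime are hypothetical names.
class RuntimeEstimate {

    // Number of unordered pairs among n names: n * (n - 1) / 2.
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    // Total time if every outer-loop row costs the same number of seconds.
    static Duration totalTime(long rows, long secondsPerRow) {
        return Duration.ofSeconds(rows * secondsPerRow);
    }

    public static void main(String[] args) {
        long rows = 845_000L;      // approximate line count from the question
        long secondsPerRow = 15L;  // observed time to compare one row with the rest

        System.out.println("Rows per hour: " + 3600 / secondsPerRow);
        System.out.println("Rows per day: " + 24 * 3600 / secondsPerRow);
        System.out.println("Unordered pairs: " + pairCount(rows));
        System.out.println("Upper-bound runtime in days: "
                + totalTime(rows, secondsPerRow).toDays());
    }
}
```

Under that constant-cost assumption the upper bound is about 146 days. Since each processed name is removed from the map, the real figure is probably closer to half of that, which is still months.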
I need an optimization technique or method. I am looking for suggestions rather than a complete solution.
What would you suggest for the code below?
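    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    // Package name assumed from the library described above (the Java fuzzywuzzy port)
    import me.xdrop.fuzzywuzzy.FuzzySearch;

    ConcurrentHashMap<Integer, String> map = new ConcurrentHashMap<>();

    // try-with-resources closes both files and flushes the writer at the end
    try (BufferedReader br = new BufferedReader(new FileReader("data/names.csv"));
         BufferedWriter bw = new BufferedWriter(new FileWriter("data/similars.csv"))) {

        // Load every tab-separated "id, name" row into the map
        String lines;
        while ((lines = br.readLine()) != null) {
            String[] line = lines.split("\\t", -1);
            Integer nameId = Integer.parseInt(line[0]);
            String name = line[1];
            map.put(nameId, name);
        }

        // Compare every name with every other name still in the map
        for (Map.Entry<Integer, String> entry1 : map.entrySet()) {
            Integer nameId1 = entry1.getKey();
            String name1 = entry1.getValue();
            for (Map.Entry<Integer, String> entry2 : map.entrySet()) {
                Integer nameId2 = entry2.getKey();
                // Compare values; == on two Integer objects compares references
                if (nameId1.intValue() == nameId2) {
                    continue;
                }
                String name2 = entry2.getValue();
                int ratio = FuzzySearch.ratio(name1, name2);
                if (ratio > 95) {
                    bw.write(nameId1 + "," + nameId2 + "\n");
                }
            }
            // Remove the processed name so the same pair is not matched again
            map.remove(nameId1);
        }
    }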
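Note on the ID check: it needs to be nameId1.intValue() == nameId2, because == on two Integer objects compares references rather than values.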
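The map.remove(nameId1) step exists only so that each pair is compared once. A possible alternative, sketched below under assumptions, is to keep the IDs and names in two parallel lists and start the inner loop at i + 1. PairScan, similarPairs and the stand-in scorer are hypothetical names made up for this illustration, not part of the fuzzywuzzy library. This still performs the same quadratic number of comparisons, so on its own it is not expected to fix the running time; it only removes the need to mutate the map while iterating over it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntBiFunction;

// Visits each unordered pair exactly once by starting the inner loop at i + 1.
// All names here are hypothetical and exist only for this sketch.
class PairScan {

    // Returns "id1,id2" for every pair whose score is above the threshold.
    static List<String> similarPairs(List<Integer> ids, List<String> names,
                                     ToIntBiFunction<String, String> scorer,
                                     int threshold) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            // j starts after i, so (i, j) and (j, i) are never both visited
            for (int j = i + 1; j < names.size(); j++) {
                if (scorer.applyAsInt(names.get(i), names.get(j)) > threshold) {
                    result.add(ids.get(i) + "," + ids.get(j));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3);
        List<String> names = List.of("anna", "anna", "bob");
        // Stand-in scorer: 100 for identical strings, 0 otherwise
        ToIntBiFunction<String, String> exact = (a, b) -> a.equals(b) ? 100 : 0;
        System.out.println(similarPairs(ids, names, exact, 95));
    }
}
```

In the real program the scorer would presumably be the library call already used in the question, passed in place of the stand-in, with the threshold kept at 95.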