Processing an array while it's being used across multiple threads in Groovy


I have ~13k strings in a CSV file. My program reads this file, computes the Levenshtein distance between those strings, and writes the result to another file.

Currently this is all done in a single-threaded function, which is very slow. I need to make it faster, and my best guess is to start by running it in parallel across multiple threads.

I don't know much about parallelism and have a hard time understanding it, in particular how I need to modify my code to work that way.

I am trying to do it in Groovy, but if you know a better language for this, please tell me. This is the function (still single-threaded) that I am trying to convert to a parallel one:

long startTime = new Date().getTime()
long calcTime = 0
def outputList = []
for (int i = 0; i < records.size(); i++) {
    long currTime = new Date().getTime()
    int matchIndex = -1
    int matchDistance = -1
    if (i % 50 == 0) {
        println("Status: ${df.format(i / (records.size() - 1) * 100)} % done. [${i}/${records.size() - 1}] (Last calcTime: ${df.format(calcTime / 1000.0)}s // ${calcTime}ms) (outputList.size(): ${outputList.size()})")
    }
    for (int j = 0; j < outputList.size(); j++) {
        String s1 = "" + records[i][2]
        String s2 = "" + outputList[j][2]
        int distance = StringUtils.getLevenshteinDistance(s1, s2)
        if (distance <= 10 && distance > matchDistance) {
            matchIndex = j
            matchDistance = distance
        }
    }
    calcTime = new Date().getTime() - currTime
    outputList += [records[i] + [matchIndex, matchDistance]]
}

StringUtils is from the Apache Commons Lang library. records is a two-dimensional array extracted from a CSV file. (Index [x][2] is the string I want to compare; the same goes for outputList.)

I've read about the GPars library and am trying to do it that way, but as I said, I have a really hard time understanding how it works.

I would really appreciate it if you could explain how you would solve this problem, or link me to resources that would help me understand it.
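One observation that may help with the conversion: every outputList entry is just the record at the same index plus two extra columns, so outputList[j][2] is always records[j][2]. Row i therefore only reads records[0..i-1], and the rows could be computed independently of each other. Below is a minimal sketch of that idea, assuming plain Java parallel streams instead of GPars. The levenshtein helper is only a stand-in for StringUtils.getLevenshteinDistance so the snippet runs without Apache Commons, and the sample records are made up.

```groovy
import java.util.stream.Collectors
import java.util.stream.IntStream

// Stand-in for StringUtils.getLevenshteinDistance (assumption: same result, two-row DP).
int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1]
    int[] curr = new int[b.length() + 1]
    for (int j = 0; j <= b.length(); j++) prev[j] = j
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1
            curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost)
        }
        int[] tmp = prev; prev = curr; curr = tmp
    }
    return prev[b.length()]
}

// Hypothetical sample data; index 2 holds the string to compare, as in the question.
def records = [
    ['f', '1', 'kitten'],
    ['f', '2', 'sitting'],
    ['f', '3', 'kitten'],
    ['f', '4', 'a completely different and much longer line']
]

// Each index i is processed on its own; nothing shared is mutated inside the closure.
def outputList = IntStream.range(0, records.size()).parallel().mapToObj { int i ->
    String s1 = "" + records[i][2]
    int matchIndex = -1
    int matchDistance = -1
    // Compare only against earlier records, which is what the growing outputList held.
    for (int j = 0; j < i; j++) {
        int distance = levenshtein(s1, "" + records[j][2])
        // Same condition as the single-threaded loop above.
        if (distance <= 10 && distance > matchDistance) {
            matchIndex = j
            matchDistance = distance
        }
    }
    records[i] + [matchIndex, matchDistance]
}.collect(Collectors.toList())

outputList.each { println it }
```

Collecting from an ordered parallel stream keeps the rows in their original order, so no locking around outputList is needed. Separately from threading, Apache Commons Lang also offers a three-argument getLevenshteinDistance(s, t, threshold) overload that returns -1 once the distance exceeds the threshold; with a cutoff of 10 that may reduce the work per comparison, though I have not measured it on this data.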

EDIT: Here are 5 lines of the input CSV file:

server.log.2021.10.29|139712|2021-10-29 15:23:34,672 WARN  [   xxx.groovy] [xxx:nimh:pipe--] NIMH HTML Tags in Mail with MIME Type text/plain detected. Skipping HTML Link creation⦀
server.log.2021.10.29|139713|2021-10-29 15:23:49,546 WARN  [   xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Admintool Template xxx.csv contains wrong line: 16⦀2021-10-29 15:23:49,546 WARN  [   xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Non Standard Pattern Template lines must contain ###number### pattern and at least 3 more non blank signs.⦀2021-10-29 15:23:49,546 WARN  [   xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Admintool Template xxx.csv contains wrong line: 17⦀2021-10-29 15:23:49,546 WARN  [   xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Non Standard Pattern Template lines must contain ###number### pattern and at least 3 more non blank signs.⦀2021-10-29 15:23:49,546 WARN  [   xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Admintool Template xxx.csv contains wrong line: 25⦀2021-10-29 15:23:49,546 WARN  [   xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Non Standard Pattern Template lines must contain ###number### pattern and at least 3 more non blank signs.⦀
server.log.2021.10.29|139841|2021-10-29 15:23:50,018 WARN  [   xxx.groovy] [xxx:nimh:pipe--] NIMH HTML Tags in Mail with MIME Type text/plain detected. Skipping HTML Link creation⦀
server.log.2021.10.29|139855|2021-10-29 15:24:04,701 WARN  [   xxx.groovy] [xxx:nimh:pipe--] NIMH HTML Tags in Mail with MIME Type text/plain detected. Skipping HTML Link creation⦀
server.log.2021.10.29|140031|2021-10-29 15:24:08,435 WARN  [ice.aspect.ScriptMetricsAspect] [xxx:nimh:pipe--] Execution of script: xxx.groovy took 3 seconds⦀
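For reference, a minimal sketch of how one such line could be turned into a record whose index 2 is the string being compared. It assumes the fields are separated by '|' and that only the first two separators matter; the sample line is a shortened, hypothetical version of the ones above.

```groovy
// Hypothetical, shortened input line in the same shape as the sample rows.
String line = 'server.log.2021.10.29|139712|2021-10-29 15:23:34,672 WARN  NIMH HTML Tags detected'

// Limit the split to 3 parts so any later '|' stays inside the message column.
List<String> record = line.split('\\|', 3) as List<String>

println record[2]
```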