I am working on finding matches between two large CSV files. I use the function below to compute the similarity between two strings. If the resulting ratio is greater than a predefined threshold, I accept the pair as a match.
from difflib import SequenceMatcher
def similar(a, b): return SequenceMatcher(None, a, b).ratio()
Because I have to compare every line of one file against every line of the other, the time complexity is O(n^2). I've considered hashing to bring this down to O(n), but that would restrict me to exact matches, with no flexibility. The pairwise approach, however, would take several days to run on my local machine's CPU. Therefore, I am wondering whether there is a way to use cuDF to speed up the operation on a GPU.
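For concreteness, the two approaches contrasted above can be sketched roughly as follows. This is a minimal illustration, assuming each file has already been read into a list of strings; the names `fuzzy_matches`, `exact_matches`, and `THRESHOLD`, and the value 0.8, are hypothetical placeholders rather than the actual code or threshold.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical cutoff; the real value is not stated above


def similar(a, b):
    # Similarity ratio between 0.0 and 1.0
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_matches(rows_a, rows_b, threshold=THRESHOLD):
    """Brute force: compare every row of A with every row of B (quadratic)."""
    return [
        (a, b)
        for a in rows_a
        for b in rows_b
        if similar(a, b) > threshold
    ]


def exact_matches(rows_a, rows_b):
    """Hash-based: roughly linear time, but only identical strings match."""
    seen = set(rows_b)
    return [a for a in rows_a if a in seen]
```

The first function accepts near-duplicates such as "apple pie" and "apple pies" but does one comparison per pair of rows; the second avoids the pairwise loop but would miss that pair entirely.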
Also, when I tried cuDF's applymap function, it reported that it does not support the string dtype. Is there any other way I can use cuDF to implement this? Thank you!