Comparing elements of nested lists in Python's record linkage library using BaseCompareFeature


I am working with Python's record linkage library to identify matching entities in two different dataframes (let us call them dfA and dfB for simplicity).

One of the features that I would like to use for comparison contains nested lists, i.e., in each of the two dataframes there is a column whose elements are lists. Let us call this column col_X, and let us call the elements of those lists e_i. (In my application the e_i are identifiers of type int.)

Specifically, for each comparison pair I would like to count how many elements e_i of the list in col_X of dfA also appear in the list in col_X of dfB.

My basic idea was to create a Series that holds, for each comparison pair, the number of elements common to both lists, using set operators. Something like this:

(dfA.col_X.apply(set) & dfB.col_X.apply(set)).apply(len)

After reading the documentation (https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.base.BaseCompareFeature), I understood that, in principle, I have to subclass BaseCompareFeature and write my own comparison function.

I then tried the following:

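import recordlinkage as rl
from recordlinkage.base import BaseCompareFeature


class CompareElements(BaseCompareFeature):

    def __init__(self, left_on, right_on, *args, **kwargs):
        super(CompareElements, self).__init__(left_on, right_on, *args, **kwargs)

    def _compute_vectorized(self, s1, s2):
        # Convert each list to a set
        sets_s1 = s1.apply(set)
        sets_s2 = s2.apply(set)
        # Intended: pairwise set intersection (this is the step that does not work)
        sim = (sets_s1 & sets_s2).apply(len)
        return sim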

The problem with this approach seems to be that I cannot apply the set operators to the Series objects sets_s1 and sets_s2.

At the same time, iterating within the function does not seem to be an option either, since the comparison itself is only done later, with the compare and compute functions:

comparer = rl.Compare([
    CompareElements('col_X', 'col_X', label='label_X')
])

comparison_vectors = comparer.compute(pairs, dfA, dfB)

Unfortunately the examples in the documentation do not cover my use case.

Any ideas on how I can work around this are highly appreciated.

Note that I am using multiple other compare features, which is why I would like to stick to the record linkage library to accomplish this task if possible.

There is 1 answer

Answered by mbgzoo

I have found the solution to my question.

The trick was to use list comprehensions and convert the results to pandas Series, so that the CompareElements class can return them as its output.

To compute the share of elements that the two lists in a comparison pair have in common, I did the following:

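import pandas as pd
from recordlinkage.base import BaseCompareFeature


class CompareElements(BaseCompareFeature):

    def __init__(self, left_on, right_on, *args, **kwargs):
        super(CompareElements, self).__init__(left_on, right_on, *args, **kwargs)

    def _compute_vectorized(self, s1, s2):
        # s1 and s2 hold the col_X lists of each comparison pair, in the same order.
        # L1: number of distinct elements in the union of the two lists
        L1 = pd.Series([len(set(x) | set(y)) for x, y in zip(s1, s2)])
        # L2: number of distinct elements common to both lists
        L2 = pd.Series([len(set(x) & set(y)) for x, y in zip(s1, s2)])

        # Share of common elements (NaN when both lists are empty)
        sim = L2 / L1

        return sim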
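
The question asked for the number of common elements rather than their share, so here is a minimal sketch of both computations in plain Python, outside the library. The helper names count_common and share_common are hypothetical, and the sketch assumes every entry is a list of hashable identifiers (such as ints). Duplicates within a list are counted once because the lists are converted to sets, and a pair of empty lists is given a share of 0.0 here, which is a choice made for this sketch.

```python
def count_common(left_lists, right_lists):
    """Number of distinct elements shared by each aligned pair of lists."""
    return [len(set(x) & set(y)) for x, y in zip(left_lists, right_lists)]


def share_common(left_lists, right_lists):
    """Intersection size divided by union size for each aligned pair."""
    shares = []
    for x, y in zip(left_lists, right_lists):
        union = set(x) | set(y)
        # Two empty lists have an empty union; report 0.0 instead of dividing by zero.
        shares.append(len(set(x) & set(y)) / len(union) if union else 0.0)
    return shares


# Toy data: each position is one comparison pair.
left = [[1, 2, 3], [4, 5], []]
right = [[2, 3, 4], [6], []]

print(count_common(left, right))  # [2, 0, 0]
print(share_common(left, right))  # [0.5, 0.0, 0.0]
```

Inside _compute_vectorized, the count variant could then be returned as pd.Series(count_common(s1, s2)) in place of sim.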