Retrieving matched record ids in the recordlinkage library

1.2k views Asked by At

I was trying out this example: https://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html

Following is the code snippet:

import recordlinkage
from recordlinkage.datasets import load_febrl1

dfA = load_febrl1()

# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='given_name')
candidate_links = indexer.index(dfA)

compare_cl = recordlinkage.Compare()

compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, 
label='address_1')

features = compare_cl.compute(candidate_links, dfA)
matches = features[features.sum(axis=1) > 3]
print(len(matches))

I would now like to separately print the record_ids that have been matched.I tried listing down the column names of 'matches', but record_id isn't a part of it, and I cannot seem to figure out a way to get it done(I just want the record_ids separately)

Is there a way to retrieve the record_ids, and maybe either print it separately or store it as a list or an array?

2

There are 2 answers

1
shadowtalker On BEST ANSWER

Don't forget that a Pandas data frame has an "index" in addition to its data columns. Usually this is a single "extra" column of integers or strings, but more complex indices are possible, e.g. a "multi-index" consisting of more than one column.

You can see this if you print(matches.head()). The first two columns have names that are slightly offset, because they aren't data columns; they are columns in the index itself. This data frame index is in fact a multi-index containing two columns: rec_id_1 and rec_id_2.

The result from load_febrl encodes record ID as the index of dfA. Compare.compute preserves the indices of the input data: you can always expect the indices from the original data to be preserved as a multi-index.

The index of a data frame by itself can be accessed with the DataFrame.index attribute. This returns an Index object (of which MultiIndex is a subclass) that can in turn be converted as follows:

  • Index.tolist(): convert to a list of its elements; MultiIndex becomes a list of tuples
  • Index.to_series(): convert to a Series of its elements; MultiIndex becomes a Series of tuples
  • Index.values: access underlying data as NumPy ndarray; MultiIndex becomes a ndarray of tuples.
  • Index.to_frame(): convert to a DataFrame, with index columns as data frame columns

So you can quickly access the record id's with matches.index, or export them to a list with matches.tolist().

You can also use matches.reset_index() to turn Index columns back into regular data columns.

0
usct01 On

Here is the code to complete the answer using pandas merge on index and reset_index property

This will convert the multi index to columns named as level_0, level_1

matches = matches.reset_index()

We can see the column level_0 is same as index in dfA

matches.columns 
dfA.index

Now merge this with dfA by index and level_0

import pandas as pd
matched_dfA=pd.merge(matches,dfA,left_on='rec_id_1',right_index=True)

Check the results

matched_dfA.head()