Labeling a large set of paired training data


I'm training a model to determine whether two records describe the same person. The model should take in two people, each represented as a dataframe row.

I'm trying to label paired data of the form

Id  | age    | gender| occupation  | region | height | weight(kg)
100 | 16     | 0     | "plumber"   | na     | 169    | 20
300 | 50     | 1     | na          | africa | 12     | 90
Id  | age    | gender| occupation  | region | height | weight(kg)
100 | 16     | 0     | "plumber"   | na     | 169    | 20
700 | 100    | 0     | na          | africa | 12     | 90

Each of these pairs is written to a separate CSV file for labeling, since I want to train a classifier that takes in a pair of person rows and labels it as a duplicate or not.

As you can see, this quickly gets out of hand: even with only 10 people there are already 10 C 2 = 45 pairs, and the count grows quadratically with the number of people. Any ideas on how to make labeling the data easier?

I've thought about doing this in Excel, but I feel like opening that many files is sure to create issues.


There are 2 answers

coderboi On BEST ANSWER

So I figured it out: I just need to put each pair on a single row in Excel, i.e. row 1 features, row 2 features, label. That way all pairs live in one sheet instead of one file per pair. It is pretty annoying to read the features horizontally, but if I use an external monitor or two it isn't terrible.
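The single-sheet layout described above can be sketched with pandas and `itertools.combinations`. This is only a minimal sketch: the toy values are adapted from the question, and the names `build_pair_table`, `weight_kg`, the `_1`/`_2` column suffixes, and the empty `label` column are illustrative assumptions, not anything the original post specifies.

```python
from itertools import combinations

import pandas as pd

# Toy data adapted from the rows in the question; missing values are None.
people = pd.DataFrame(
    {
        "Id": [100, 300, 700],
        "age": [16, 50, 100],
        "gender": [0, 1, 0],
        "occupation": ["plumber", None, None],
        "region": [None, "africa", "africa"],
        "height": [169, 12, 12],
        "weight_kg": [20, 90, 90],
    }
)


def build_pair_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per unordered pair: first row's features,
    second row's features, then an empty label column."""
    records = df.to_dict("records")
    rows = []
    for left, right in combinations(records, 2):
        row = {f"{col}_1": val for col, val in left.items()}
        row.update({f"{col}_2": val for col, val in right.items()})
        row["label"] = ""  # filled in by hand, e.g. 1 = same person, 0 = different
        rows.append(row)
    return pd.DataFrame(rows)


pairs = build_pair_table(people)
print(len(pairs))  # 3 people give 3 C 2 = 3 pairs
```

Calling `pairs.to_csv("pairs_to_label.csv", index=False)` (the file name is made up) would then give one file that opens in Excel with a single `label` column to fill in. The table still has N C 2 rows, so for large N it may be worth filtering candidate pairs before labeling.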

Prune On
  • Sort the data frame: O(N*log(N))
  • Check whether adjacent rows are equal: O(N)

To compare adjacent rows, simply shift the columns by one position and compare each shifted row with the original. After sorting, identical rows sit next to each other, so a single pass finds them.
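A minimal sketch of the sort-then-shift idea in pandas follows. The toy data, the choice of `feature_cols`, and the output column names `Id_1`/`Id_2` are assumptions made for illustration. Note that this only flags rows that match exactly on the compared columns, so it would not catch near-duplicates.

```python
import pandas as pd

# Toy data (made up): Ids 300 and 700 share identical feature values.
people = pd.DataFrame(
    {
        "Id": [100, 300, 700, 900],
        "age": [16, 50, 50, 16],
        "gender": [0, 1, 1, 0],
        "height": [169, 180, 180, 170],
    }
)

feature_cols = ["age", "gender", "height"]  # compare features, not the Id

# Step 1: sort so that identical rows end up adjacent, O(N log N).
sorted_people = people.sort_values(feature_cols).reset_index(drop=True)
features = sorted_people[feature_cols]

# Step 2: shift(1) moves every row down one position, so row i is
# compared with row i-1, O(N). NaN never equals NaN here, so rows with
# missing values in a compared column are never flagged.
same_as_previous = (features == features.shift(1)).all(axis=1)

# Collect the Ids of each adjacent matching pair.
duplicate_pairs = pd.DataFrame(
    {
        "Id_1": sorted_people["Id"].shift(1)[same_as_previous],
        "Id_2": sorted_people["Id"][same_as_previous],
    }
).astype(int)

print(duplicate_pairs)
```

For this toy input the only flagged pair is Ids 300 and 700. In a run of three or more identical rows, each row is paired only with its immediate predecessor, which is enough to chain the whole run together.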