I'd like to 'anonymize' or 'recode' a column in a pandas DataFrame. What's the most efficient way to do so? I wrote the following, but it seems likely there's a built-in function or better way.
dataset = dataset.sample(frac=1).reset_index(drop=False) # reorders dataframe randomly (helps anonymization, since order could have some meaning)
# make dictionary of old and new values
value_replacer = 1
values_dict = {}
for unique_val in dataset[var].unique():
    values_dict[unique_val] = value_replacer
    value_replacer += 1
# replace old values with new
for k, v in values_dict.items():
    dataset[var].replace(to_replace=k, value=v, inplace=True)
IIUC you want to factorize your values:
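A minimal sketch of what that could look like, assuming `pd.factorize` is the intended tool. The helper name `recode_column`, its `start` parameter, and the sample data are made up for illustration. `pd.factorize` returns integer codes in order of first appearance, plus the unique values those codes refer to:

```python
import pandas as pd


def recode_column(df, col, start=1):
    """Return a copy of df with `col` replaced by integer codes.

    Hypothetical helper: codes follow order of first appearance and
    begin at `start`. Missing values get code -1 from pd.factorize,
    so they end up as `start - 1`.
    """
    codes, uniques = pd.factorize(df[col])
    out = df.copy()
    out[col] = codes + start
    return out, uniques


# Illustrative data, not from the question
df = pd.DataFrame({"name": ["bob", "alice", "bob", "carol"],
                   "score": [1, 2, 3, 4]})
recoded, uniques = recode_column(df, "name")
print(recoded)
```

The returned `uniques` is the lookup table back to the original values, so it should be dropped or stored separately if the goal is anonymization.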
Demo:
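The original demo output is not reproduced here. The following is a small stand-in sketch that mirrors the question's shuffle-then-recode steps. The column name `var`, the sample values, and `random_state=0` are assumptions chosen for illustration. It also uses `reset_index(drop=True)` rather than the question's `drop=False`, on the assumption that the old index should not be kept as a column, since it would reveal the original row order:

```python
import pandas as pd

# Illustrative data, not from the question
dataset = pd.DataFrame({"var": ["x", "y", "x", "z", "y"],
                        "value": [10, 20, 30, 40, 50]})

# Shuffle the rows and discard the old index
shuffled = dataset.sample(frac=1, random_state=0).reset_index(drop=True)

# Replace each distinct value with an integer code starting at 1
codes, uniques = pd.factorize(shuffled["var"])
shuffled["var"] = codes + 1

print(shuffled)
```

Because the codes are assigned in order of first appearance after the shuffle, the first row always gets code 1, and the numbering carries no information about the original ordering.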