Here is my dataframe:

    RIGHT_SHORTNAME     Item_Name
0   S/BAG PKT SEMBAKO   S/BAG PKT SEMBAKO
1   ORAL B 123 SOFT2S   ORAL B 123 SOFT2S
2   ORAL B 123 SOFT2S   ORAL B 123 SOFT2S
3   CINDERELLA COTBUD   CINDERELLA COTBUD
4   PROCHIZ 10S 170GR   PROCHIZ 10S 170GR
... ... ...
97163   TT MAX CHO 12X17GR  TT MAX CHO 12X17GR
97164   ICELAND VOD 350ML   ICELAND VOD 350ML
97165   SUNKIST GUAVA 1 LT  SUNKIST GUAVA 1 LT
97166   COSM FAN 12DAR  COSM FAN 12DAR
97167   BATHSALT MINERAL C  BATHSALT MINERAL C

I want to add a column named 'distance' using this code:

def distance(a, b):
    _, z, _=process.extractOne(str(a),[str(b)])
    return z
df['distance']=distance(df['RIGHT_SHORTNAME'],df['Item_Name'])

It yields this:

    RIGHT_SHORTNAME     Item_Name           distance
0   S/BAG PKT SEMBAKO   S/BAG PKT SEMBAKO   98.595506
1   ORAL B 123 SOFT2S   ORAL B 123 SOFT2S   98.595506
2   ORAL B 123 SOFT2S   ORAL B 123 SOFT2S   98.595506
3   CINDERELLA COTBUD   CINDERELLA COTBUD   98.595506
4   PROCHIZ 10S 170GR   PROCHIZ 10S 170GR   98.595506
... ... ... ...
97163   TT MAX CHO 12X17GR  TT MAX CHO 12X17GR  98.595506
97164   ICELAND VOD 350ML   ICELAND VOD 350ML   98.595506
97165   SUNKIST GUAVA 1 LT  SUNKIST GUAVA 1 LT  98.595506
97166   COSM FAN 12DAR  COSM FAN 12DAR  98.595506
97167   BATHSALT MINERAL C  BATHSALT MINERAL C  98.595506

When I check with df['distance'].describe(), it turns out that every value in df['distance'] is the same. Can anybody help me?

There are 2 answers

Ynjxsjmh On

This happens because your distance function is called once on the whole columns, not once per row. str(a) turns the entire Series into a single string, so process.extractOne returns a single score, and assigning that one value to df['distance'] fills every row with it.
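To make the cause visible, here is a minimal sketch on a made-up two-row dataframe (the rows and the score 98.6 are illustrative only, not taken from the real data). It shows that str() on a column produces one string, and that assigning one value to a column repeats it in every row:

```python
import pandas as pd

# Tiny stand-in for the real dataframe (values are illustrative only)
df = pd.DataFrame({
    "RIGHT_SHORTNAME": ["ORAL B 123 SOFT2S", "GLOVE HG44"],
    "Item_Name": ["ORAL B 123 SOFT2S", "GLOVE HG45"],
})

# str() on a Series renders the whole column as one block of text,
# so a scorer only ever sees a single query string.
as_text = str(df["RIGHT_SHORTNAME"])
print(type(as_text).__name__)  # str
print(as_text)

# Assigning a single value to a column broadcasts it to every row.
df["distance"] = 98.6
print(df["distance"].nunique())  # 1
```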

process.extractOne(query, choices) accepts a string and a list of choices. I guess you want something like the following:

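def distance(x):
    # Index [1] is the score. The question's code unpacks three values
    # from extractOne, so a two-value unpack would raise an error there.
    return process.extractOne(x, df['Item_Name'].tolist())[1]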

df['distance'] = df['RIGHT_SHORTNAME'].apply(distance)

Or, as a one-liner:

df['distance'] = df['RIGHT_SHORTNAME'].apply(lambda x: process.extractOne(x, df['Item_Name'].tolist())[1])
WeisSchwarz On

I've found the answer, inspired by http://jonathansoma.com/lede/foundations/classes/pandas%20columns%20and%20functions/apply-a-function-to-every-row-in-a-pandas-dataframe/

Here is my code:

def distance(row):
    return process.extractOne(str(row['RIGHT_SHORTNAME']),[str(row['Item_Name'])])[1]

Then apply it to each row:

df['distance']=df.apply(distance, axis=1)

It works in an instant.

Result (pardon that it's not exactly the same dataframe, and I'm using normalized_levenshtein as the scorer this time):

    RSHORTNAME          Item_Name           distance
24  SNSODYNE MLT ACT1   +SNSODYNE MLT ACT2  94.117647
60  CO B J POPCORN      CO B J POPCORN      93.333333
78  LRT PYGR CAN LYCH   LRT PYGR CAN LYCH   94.444444
79  LRT PYGR CAN APL    LRT PYGR CAN APL    94.117647
80  LRT PYGR CAN STRW   LRT PYGR CAN STRW   94.444444
113 GLOVE HG44          GLOVE HG44          90.909091
169 SQ CRISPY57         SQ MIDI ALMOND 30G  22.222222
170 SQ CRISPY57         SQ MIDI ALMOND 30G  22.222222
202 LISTERINE FR250ML   +LISTERINE ZERO250  70.588235
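For readers who want to try the row-by-row idea without installing a fuzzy-matching package, here is a rough sketch. It is not the code from this answer: pair_scores is a hypothetical helper, and difflib.SequenceMatcher from the standard library stands in for the scorer, so the numbers will not match the table above.

```python
from difflib import SequenceMatcher


def pair_scores(left, right):
    """Score each left[i] against right[i] on a 0-100 scale.

    Hypothetical helper: difflib's ratio is a stand-in scorer,
    not the one used in the answer above.
    """
    return [
        100.0 * SequenceMatcher(None, str(a), str(b)).ratio()
        for a, b in zip(left, right)
    ]


# Illustrative pairs; any two equal-length iterables work.
left = ["ORAL B 123 SOFT2S", "SQ CRISPY57"]
right = ["ORAL B 123 SOFT2S", "SQ MIDI ALMOND 30G"]

scores = pair_scores(left, right)
print(scores)  # one score per pair, identical strings score 100.0
```

Because the helper only iterates over its inputs, it should also accept two dataframe columns, for example df['distance'] = pair_scores(df['RIGHT_SHORTNAME'], df['Item_Name']).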