I have a pandas dataframe as follows, consisting of string values.

    0   1   2
0   o   jj  ovg
1   j   jj  jjy
2   y   yk  yku
3   v   vf  vfs
4   i   iw  iwd

I have a function which encodes each column with integer values, and counts the number of unique elements in each column. I used cat.codes and nunique functions of pandas. See below the timing results and the code snippets.

As is evident, these operations take a lot of time. How can I speed them up?

Line #      Hits         Time  Per Hit   % Time         Line Contents
=====================================================================================================================
25           1    7529434.0   7529434.0     79.9      df = df.apply(lambda x: x.astype('category').cat.codes)
26                                               
27                                                    # calculate the number of unique keys for each row
28           1    1825214.0   1825214.0     19.4      len_arr = df.nunique(axis=0).values

Edit Timing results from the answer

df.apply(lambda x: pd.factorize(x)[0])
#100 loops, best of 3: 6.24 ms per loop
%timeit df.apply(lambda x: pd.factorize(x)[0])
#100 loops, best of 3: 4.93 ms per loop

%timeit df1.nunique(axis=0).values
#100 loops, best of 3: 2.34 ms per loop
%timeit df1.apply(lambda x: len(pd.factorize(x)[1]))
#100 loops, best of 3: 2.64 ms per loop

Edit 2 More timing results for fun:

# results with 100 rows
%timeit original()
#100 loops, best of 3: 7 ms per loop
%timeit WeNYoBen()
#100 loops, best of 3: 2.4 ms per loop
%timeit jezrael()
#100 loops, best of 3: 4.03 ms per loop
%timeit piRSquared()
#100 loops, best of 3: 2.29 ms per loop

# results with 10000 rows
%timeit original()
#100 loops, best of 3: 16.6 ms per loop
%timeit WeNYoBen()
#10 loops, best of 3: 23 ms per loop
%timeit jezrael()
#100 loops, best of 3: 6.14 ms per loop
%timeit piRSquared()
#100 loops, best of 3: 19.1 ms per loop

3 Answers

3
jezrael On Best Solutions

Use factorize with length of second array:

a = df.apply(lambda x: len(pd.factorize(x)[1]))
print (a)
0    5
1    4
2    5
dtype: int64

For integers:

b = df.apply(lambda x: pd.factorize(x)[0])
print (b)
   0  1  2
0  0  0  0
1  1  0  1
2  2  1  2
3  3  2  3
4  4  3  4

All together for avoid call function twice:

out = {}
def f(x):
    a, b = pd.factorize(x)
    out[x.name] = len(b)
    return a

b = df.apply(f)
print (b)
   0  1  2
0  0  0  0
1  1  0  1
2  2  1  2
3  3  2  3
4  4  3  4

a = pd.Series(out)
print (a)
0    5
1    4
2    5
dtype: int64
3
piRSquared On

with pd.factorize

The point of this is to capture the both out puts of factorize and use them in the integer encoding as well as the nunique calculation without having to factorize twice.

Run this to get encoding and unique values

e, u = zip(*map(pd.factorize, map(df.get, df)))

Turn encoding into dataframe

pd.DataFrame([*zip(*e)], df.index, df.columns)

   0  1  2
0  0  0  0
1  1  0  1
2  2  1  2
3  3  2  3
4  4  3  4

Turn length of unique values into a series

pd.Series([*map(len, u)], df.columns)

0    5
1    4
2    5
dtype: int64

All together, the assignment of the two objects is

e, u = zip(*map(pd.factorize, map(df.get, df)))    
df_ = pd.DataFrame([*zip(*e)], df.index, df.columns)
c = pd.Series([*map(len, u)], df.columns)

For those stuck with legacy Python, without the [*it] syntax

e, u = zip(*map(pd.factorize, map(df.get, df)))
df_ = pd.DataFrame(list(zip(*e)), df.index, df.columns)
c = pd.Series(list(map(len, u)), df.columns)
3
Wen-Ben On

I think using list map is good enough

l=list(map(set,df.values.T))
l
Out[71]: 
[{'i', 'j', 'o', 'v', 'y'},
 {'iw', 'jj', 'vf', 'yk'},
 {'iwd', 'jjy', 'ovg', 'vfs', 'yku'}]

list(map(len,l))
Out[74]: [5, 4, 5]

Usage of np.unique

def yourfunc(x):
    _,indices = np.unique(x, return_inverse=True)
    return indices
df.apply(yourfunc)
Out[102]: 
   0  1  2
0  2  1  2
1  1  1  1
2  4  3  4
3  3  2  3
4  0  0  0