When creating a new column in a pandas DataFrame based on some condition, numpy's where function outperforms the apply method in execution time. Why is that so?

For example:

df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)

df["log2FC"] = np.where(df["C1Mean"]==0,
                        np.log2(df["C2Mean"]), 
                        np.log2(df["C2Mean"]/df["C1Mean"]))

1 Answer

EdChum On Best Solutions

This call to apply is row-wise iteration:

df["log2FC"] = df.apply(lambda x: np.log2(x["C2Mean"]/x["C1Mean"]) if x["C1Mean"]> 0 else np.log2(x["C2Mean"]), axis=1)

apply is just syntactic sugar for a Python-level loop. Because you passed axis=1, the lambda is called once per row.

Your other snippet

df["log2FC"] = np.where(df["C1Mean"]==0,
                        np.log2(df["C2Mean"]), 
                        np.log2(df["C2Mean"]/df["C1Mean"]))

is acting on the entire columns, so it's vectorised.
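To see the difference on your own machine, here is a minimal sketch that runs both versions on the same data and times them. The sample data, the helper names with_apply and with_where, and the row count are assumptions made up for illustration; the values are kept strictly positive so neither log2 call ever sees a zero.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data (not from the question); strictly positive values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "C1Mean": rng.uniform(1.0, 10.0, size=2_000),
    "C2Mean": rng.uniform(1.0, 10.0, size=2_000),
})

def with_apply(frame):
    # One Python-level call of the lambda per row.
    return frame.apply(
        lambda x: np.log2(x["C2Mean"] / x["C1Mean"]) if x["C1Mean"] > 0 else np.log2(x["C2Mean"]),
        axis=1,
    )

def with_where(frame):
    # Each operation works on whole columns at once.
    return np.where(frame["C1Mean"] == 0,
                    np.log2(frame["C2Mean"]),
                    np.log2(frame["C2Mean"] / frame["C1Mean"]))

# Check that both approaches agree on this data, then time a few runs of each.
same_result = np.allclose(with_apply(df).to_numpy(), with_where(df))
t_apply = timeit.timeit(lambda: with_apply(df), number=3)
t_where = timeit.timeit(lambda: with_where(df), number=3)
print(f"same result: {same_result}")
print(f"apply: {t_apply:.4f}s  where: {t_where:.4f}s")
```

The absolute timings depend on your hardware and library versions. Also note that np.where computes both branch arrays in full before selecting from them, so on data that does contain zeros in C1Mean the division still runs for those rows.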

The other thing is that pandas performs more checking, index alignment, etc. than numpy does.
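As a small illustration of index alignment, the sketch below adds two Series whose labels are in a different order. The Series a and b and their values are invented for this example.

```python
import pandas as pd

# Two hypothetical Series with the same labels in a different order.
a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

# pandas matches values by label before adding:
# label 0 -> 1 + 30, label 1 -> 2 + 20, label 2 -> 3 + 10
aligned = a + b

# The underlying numpy arrays are added purely by position:
# 1 + 10, 2 + 20, 3 + 30
positional = a.to_numpy() + b.to_numpy()

print(aligned.loc[0], aligned.loc[1], aligned.loc[2])
print(positional)
```

The label matching is extra work that pandas does on each such operation, which plain numpy arrays skip.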

Your calls to np.log2 gain nothing from vectorisation in this context, as you pass scalar values:

 np.log2(x["C2Mean"]/x["C1Mean"])

Performance-wise, it would be no better than calling math.log2.
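If you want to check this yourself, here is a rough sketch that times both functions on a single scalar. The value 8.0 and the repetition count are arbitrary choices for illustration.

```python
import math
import timeit

import numpy as np

value = 8.0  # arbitrary scalar chosen for illustration

# Both functions compute the same number for a scalar input.
result_math = math.log2(value)
result_numpy = float(np.log2(value))
print(result_math, result_numpy)

# Time many scalar calls; absolute numbers depend on the machine.
t_math = timeit.timeit(lambda: math.log2(value), number=100_000)
t_numpy = timeit.timeit(lambda: np.log2(value), number=100_000)
print(f"math.log2: {t_math:.4f}s  np.log2: {t_numpy:.4f}s")
```

Either way, the call happens once per row inside a Python loop, so the per-call overhead is paid for every row.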

Explaining why numpy is significantly faster, or what vectorisation is, is beyond the scope of this question. See the related question "What is vectorization?".

The essential thing here is that numpy runs its loops in compiled code (its own C routines and, for some operations, external libraries written in C or Fortran), which is much faster than looping in the Python interpreter.