# Speed up turn probabilities into binary features

I have a dataframe with 3 columns; each row gives the probabilities that, for that row, the feature T takes the value 1, 2 or 3:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({"T1": [0.8, 0.5, 0.01], "T2": [0.1, 0.2, 0.89], "T3": [0.1, 0.3, 0.1]})
```

For row 0, T is 1 with an 80% chance, 2 with 10% and 3 with 10%.
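In other words, the draw for a single row is just `np.random.choice` with that row's probabilities (a minimal sketch):

```python
import numpy as np

np.random.seed(42)
# Row 0: T is 1 with p=0.8, 2 with p=0.1, 3 with p=0.1
t = np.random.choice([1, 2, 3], p=[0.8, 0.1, 0.1])
print(t)  # a single sampled value, most often 1
```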

I want to simulate the value of T for each row and replace the columns T1, T2, T3 with binary features. I have a solution, but it loops over the rows of the dataframe, so it is really slow (my real dataframe has over 1 million rows):

```python
possib = df.columns
for i in range(df.shape[0]):
    probas = df.iloc[i][possib].tolist()
    choix_transp = np.random.choice(possib, 1, p=probas)[0]
    for pos in possib:
        if pos == choix_transp:
            df.iloc[i][pos] = 1
        else:
            df.iloc[i][pos] = 0
```

Is there a way to vectorize this code?

Thank you!

---

We can use `numpy` for this:

```python
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
```

This generates a single column of random values and compares it to the cumulative sum along each row of the dataframe, producing a boolean `DataFrame` in which the first `False` in each row marks the "bucket" the random value falls into. With `idxmin`, we get the column label of that bucket, which we can then convert back to binary columns with `pd.get_dummies`.
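To make the mechanics concrete, here is a small sketch (using the 3-row frame from the question) that exposes each intermediate step:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({"T1": [0.8, 0.5, 0.01], "T2": [0.1, 0.2, 0.89], "T3": [0.1, 0.3, 0.1]})

cum = df.cumsum(axis=1)          # running totals per row; the last column is 1.0
r = np.random.rand(len(df), 1)   # one uniform draw per row, broadcast over columns
mask = r > cum                   # True while the draw exceeds the running total
labels = mask.idxmin(axis=1)     # label of the first False column = sampled category
onehot = pd.get_dummies(labels)  # back to binary indicator columns

print(labels.tolist())
print(onehot)
```

Note that `pd.get_dummies` only creates columns for labels that actually occur, so on a tiny frame the output may have fewer than three columns.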

Example:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
data = np.random.rand(10, 3)
normalised = data / data.sum(axis=1)[:, np.newaxis]

df = pd.DataFrame(normalised)
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))

print(result)
```

Output:

```
   0  1  2
0  1  0  0
1  0  0  1
2  0  1  0
3  0  1  0
4  1  0  0
5  0  0  1
6  0  1  0
7  0  1  0
8  0  0  1
9  0  1  0
```

A note: most of the slowdown comes from `pd.get_dummies`; if you instead build the output frame with Divakar's method, `pd.DataFrame(result.view('i1'), index=df.index, columns=df.columns)`, it gets a lot faster.
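As a sketch of that swap (reusing the normalised 10x3 frame from the example above), the `pd.get_dummies` call can be replaced by building the one-hot array in numpy and viewing it as `int8`:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
data = np.random.rand(10, 3)
df = pd.DataFrame(data / data.sum(axis=1)[:, np.newaxis])

# Same comparison as before, but kept as a raw numpy array
mask = np.random.rand(len(df), 1) > df.values.cumsum(axis=1)
idx = mask.argmin(axis=1)  # position of the first False per row

onehot = np.zeros(df.shape, dtype=bool)
onehot[np.arange(len(df)), idx] = True

# Reinterpret the booleans as int8 (no copy) and restore the original labels
result = pd.DataFrame(onehot.view('i1'), index=df.index, columns=df.columns)
print(result)
```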

---

Here's one based on vectorized `random.choice` with a given matrix of probabilities -

```python
def matrixprob_to_onehot(ar):
    # Get one-hot encoded boolean array based on a matrix of probabilities
    c = ar.cumsum(axis=1)
    idx = (np.random.rand(len(c), 1) < c).argmax(axis=1)
    ar_out = np.zeros(ar.shape, dtype=bool)
    ar_out[np.arange(len(idx)), idx] = 1
    return ar_out

ar_out = matrixprob_to_onehot(df.values)
df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
```

Verify the empirical frequencies against the given probabilities over a large number of simulations -

```python
In [139]: df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})

In [140]: df
Out[140]:
     T1    T2   T3
0  0.80  0.10  0.1
1  0.50  0.20  0.3
2  0.01  0.89  0.1

In [141]: p = np.array([matrixprob_to_onehot(df.values) for i in range(100000)]).argmax(2)

In [142]: np.array([np.bincount(p[:,i])/100000.0 for i in range(len(df))])
Out[142]:
array([[0.80064, 0.0995 , 0.09986],
       [0.50051, 0.20113, 0.29836],
       [0.01015, 0.89045, 0.0994 ]])

In [145]: np.round(_,2)
Out[145]:
array([[0.8 , 0.1 , 0.1 ],
       [0.5 , 0.2 , 0.3 ],
       [0.01, 0.89, 0.1 ]])
```

### Timings on `1,000,000` rows -

```python
# Setup input
In [169]: N = 1000000
     ...: a = np.random.rand(N,3)
     ...: df = pd.DataFrame(a/a.sum(1,keepdims=1), columns=['T1','T2','T3'])

# @gmds's soln
In [171]: %timeit pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
1 loop, best of 3: 4.82 s per loop

# Soln from this post
In [172]: %%timeit
     ...: ar_out = matrixprob_to_onehot(df.values)
     ...: df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
10 loops, best of 3: 43.1 ms per loop
```