bin data depending on values of a separate column

Question

Asked by NoIdeaHowToFixThis on 13 November 2014 at 12:02

I have a dataset that looks something like this toy example:

import numpy as np
import pandas as pd

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5) * 10)
cat1 = pd.Series(['s1'] * 5)
cat2 = pd.Series(['s2'] * 5)
# Series.append was removed in pandas 2.0; pd.concat does the same job
s = pd.concat([s1, s2], ignore_index=True)
c = pd.concat([cat1, cat2], ignore_index=True)
data = pd.DataFrame({'cat': c, 's': s})
print(data)

  cat                    s
0  s1                 0.68
1  s1                 0.61
2  s1                 0.43
3  s1                 0.68
4  s1                 0.11
5  s2                 4.82
6  s2                 8.19
7  s2                 3.88
8  s2                 5.51
9  s2                 1.20

I would like to bin the data, using a different binning range depending on the values in the column cat. This is what I tried:

# s1_buckets and s2_buckets are arrays of bin edges, one per category (not shown here)
def bucketing_fun(x, cat):
    if cat == 's1':
        return np.digitize([x], s1_buckets)[0]
    else:
        return np.digitize([x], s2_buckets)[0]

data['Buckets'] = data[['s', 'cat']].apply(lambda row: bucketing_fun(row['s'], row['cat']), axis=1)
print(data)

This works, but it is slow on the real dataset, which has about 500,000 rows.

Original Q&A

There is 1 answer

**user1827356** · Answer 1 · 2014-11-13T17:05:57+00:00

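You're probably losing the vectorization speedup: apply with axis=1 calls a Python function once per row, so np.digitize only ever sees one value at a time.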

Try this:

buckets = dict(s1=s1_buckets, s2=s2_buckets)
# transform keeps the result aligned with data's index; g.name is the group's 'cat' value
data['Buckets'] = data.groupby('cat')['s'].transform(lambda g: np.digitize(g, buckets[g.name]))
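For a version that runs end to end, here is a minimal sketch of the same idea: one vectorized np.digitize call per category rather than one per row. The bin edges and the helper name bin_by_category are assumptions made up for illustration, since the question does not show s1_buckets or s2_buckets. The data matches the printed toy example.

```python
import numpy as np
import pandas as pd

# Hypothetical bin edges; the question does not show s1_buckets / s2_buckets.
s1_buckets = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
s2_buckets = np.array([0.0, 2.5, 5.0, 7.5, 10.0])
buckets = {'s1': s1_buckets, 's2': s2_buckets}

# Same values as the printed toy data in the question.
data = pd.DataFrame({
    'cat': ['s1'] * 5 + ['s2'] * 5,
    's': [0.68, 0.61, 0.43, 0.68, 0.11, 4.82, 8.19, 3.88, 5.51, 1.20],
})

def bin_by_category(df, edges_by_cat):
    """Return bin indices for df['s'], using the edges registered for each 'cat'."""
    values = df['s'].to_numpy()
    out = np.empty(len(df), dtype=np.intp)
    # GroupBy.indices maps each category to the integer positions of its rows,
    # so np.digitize runs once per category on a whole array.
    for cat, positions in df.groupby('cat').indices.items():
        out[positions] = np.digitize(values[positions], edges_by_cat[cat])
    return pd.Series(out, index=df.index, name='Buckets')

data['Buckets'] = bin_by_category(data, buckets)
print(data)
```

Writing into a preallocated array by position keeps the result in the original row order, so it can be assigned straight back as a column. The number of Python-level calls grows with the number of categories, not the number of rows.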
