Subsample pandas dataframe

Question

Subsample pandas dataframe

28.3k views Asked by Nishant At 10 September 2013 at 08:22

I have a DataFrame loaded from a .tsv file. I wanted to generate some exploratory plots. The problem is that the data set is large (~1 million rows), so there are too many points on the plot to see a trend. Plus, it is taking a while to plot.

I wanted to sub-sample 10000 randomly distributed rows. This should be reproducible so the same sequence of random numbers is generated in each run.

This: Sample two pandas dataframes the same way seems to be on the right track, but I cannot guarantee the subsample size.

Original Q&A

There are 3 answers

Andy Hayden On 10 September 2013 at 09:11

Unfortunately np.random.choice appears to be quite slow for small samples (less than 10% of all rows), you may be better off using plain ol' sample:

from random import sample
df.loc[sample(df.index, 1000)]

For large DataFrame (a million rows), we see small samples:

In [11]: %timeit df.loc[sample(df.index, 10)]
1000 loops, best of 3: 1.19 ms per loop

In [12]: %timeit df.loc[np.random.choice(df.index, 10, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [13]: %timeit df.loc[np.random.permutation(df.index)[:10]]
1 loops, best of 3: 1.38 s per loop

In [21]: %timeit df.loc[sample(df.index, 1000)]
10 loops, best of 3: 14.5 ms per loop

In [22]: %timeit df.loc[np.random.choice(df.index, 1000, replace=False)]
1 loops, best of 3: 1.28 s per loop    

In [23]: %timeit df.loc[np.random.permutation(df.index)[:1000]]
1 loops, best of 3: 1.3  s per loop

But around 10% it gets about the same:

In [31]: %timeit df.loc[sample(df.index, 100000)]
1 loops, best of 3: 1.63 s per loop

In [32]: %timeit df.loc[np.random.choice(df.index, 100000, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [33]: %timeit df.loc[np.random.permutation(df.index)[:100000]]
1 loops, best of 3: 1.4 s per loop

and if you are sampling everything (don't use sample!):

In [41]: %timeit df.loc[sample(df.index, 1000000)]
1 loops, best of 3: 10 s per loop

Note: both numpy.random and random accept a seed, to reproduce randomly generated output.

As @joris points out in the comments, choice (without replacement) is actually sugar for permutation so it's no suprise it's constant time and slower for smaller samples...

Alex Coventry On 09 June 2016 at 20:22

These days, one can simply use the sample method on a DataFrame:

>>> help(df.sample)
Help on method sample in module pandas.core.generic:

sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None) method of pandas.core.frame.DataFrame instance
    Returns a random sample of items from an axis of object.

Replicability can be achieved by using the random_state keyword:

>>> len(set(df.sample(n=1, random_state=np.random.RandomState(0)).iterations.values[0] for _ in xrange(1000)))
1
>>> len(set(df.sample(n=1).iterations.values[0] for _ in xrange(1000)))
40

**joris** · Accepted Answer · 2013-09-10T08:54:05+00:00

You can select random elements from the index with np.random.choice. Eg to select 5 random rows:

df = pd.DataFrame(np.random.rand(10))

df.loc[np.random.choice(df.index, 5, replace=False)]

This function is new in 1.7. If you want a solution with an older numpy, you can shuffle the data and taken the first elements of that:

df.loc[np.random.permutation(df.index)[:5]]

In this way your DataFrame is not sorted anymore, but if this is needed for plotting (for example, a line plot), you can simply do .sort() afterwards.

TechQA.

Subsample pandas dataframe

There are 3 answers

Related Questions in PYTHON

Related Questions in NUMPY

Related Questions in PANDAS

Related Questions in SUBSAMPLING

Popular Questions

Popular Tags

Trending Questions