Pandas for Large Data Sets: Millions of records


I have a dataset in Stata that is about 5.8 million rows (records).

I've been learning pandas the past few months and really enjoy its capabilities. Would pandas still work in this scenario?

I am having trouble reading the dataset into a dataframe. I'm currently looking at chunking:

chunks = pd.read_stata('data.dta', chunksize=100000, columns=['year', 'race', 'app'])

Is there a better way to go about this? I am hoping to do something like:

import pandas as pd

df = pd.read_stata('data.dta')  # read the whole file in one go
data = df.groupby(['year', 'race']).sum()  # total per (year, race) group
data.to_csv('data.csv')

but that does not work because (I think) the dataset is too large. The error is: OverflowError: Python int too large to convert to C long
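
For reference, here is a minimal sketch of how the chunked read mentioned above could feed the same group-by. Each chunk is reduced to per-group sums, and the partial sums are then summed again. The helper names `aggregate_chunks` and `summarize_stata` are hypothetical. The sketch assumes `app` is a numeric column, and `convert_categoricals=False` is an assumption made to keep `race` as plain codes. Whether chunking avoids the OverflowError is not confirmed, since that error may come from a value in the file rather than from its size.

```python
import pandas as pd


def aggregate_chunks(chunks, keys=("year", "race"), value="app"):
    """Sum `value` per group across an iterable of DataFrame chunks."""
    partials = []
    for chunk in chunks:
        # Reduce each chunk to one row per group before keeping it in memory.
        partials.append(chunk.groupby(list(keys))[value].sum())
    # The same group can appear in several chunks, so sum the partial sums.
    combined = pd.concat(partials)
    return combined.groupby(level=list(range(len(keys)))).sum()


def summarize_stata(path="data.dta", out_csv="data.csv", chunksize=100_000):
    """Hypothetical wrapper: stream the .dta file and write grouped sums to CSV."""
    reader = pd.read_stata(
        path,
        chunksize=chunksize,               # yields DataFrames of this many rows
        columns=["year", "race", "app"],   # only the columns the group-by needs
        convert_categoricals=False,        # assumption: keep raw codes, not labels
    )
    aggregate_chunks(reader).to_csv(out_csv)
```

Calling `summarize_stata()` with the defaults would read `data.dta` in 100,000-row pieces and write `data.csv`. This only works unchanged for aggregations that can be combined from partial results, such as sums or counts; a mean would need the sum and the count tracked separately.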

Thanks. Cheers
