How to load very big timeseries file(s) in Python to do analysis?


I have some .gz files containing time-series data. Naturally, I would like to do some time-series analysis on this.

I tried this:

import gzip
f = gzip.open('data.csv.gz', 'r')
file_content = f.read()  # reads the entire decompressed file into memory at once
print(file_content)

But it kept loading for 20 minutes and I stopped it manually.

My question is: how should I read this? I have considered using Dask or Spark, or should I just yield the lines?

I tried searching the internet for industry-standard approaches.

1 Answer

Answer by Narges Ghanbari:
  1. You can use Dask as follows:

    import dask.dataframe as dd

    # A single gzip file cannot be split into blocks, so load it as one partition
    df = dd.read_csv('data.csv.gz', compression='gzip', blocksize=None)
    
  2. Apache Spark also supports reading .gz files (it might be overkill for small datasets); see the PySpark sketch after this list.

  3. Yielding lines: If you're writing a function to process the file, you can use a generator to yield lines one by one. This is memory-efficient, as only one line is loaded into memory at a time; a sketch follows below.
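
A minimal PySpark sketch of option 2, assuming a local Spark installation and the same data.csv.gz file; Spark decompresses .gz files transparently, though a single gzip file is not splittable and is read by one task:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session
    spark = SparkSession.builder.appName("timeseries").getOrCreate()

    # Spark handles gzip decompression transparently; a single .gz file
    # cannot be split, so it is read by one task
    df = spark.read.csv("data.csv.gz", header=True, inferSchema=True)
    df.show(5)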
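
And a minimal sketch of option 3, assuming the file is line-oriented text; the processing inside the loop is a placeholder:

    import gzip

    def read_lines(path):
        # Open in text mode ("rt") so gzip decompresses and decodes on the fly
        with gzip.open(path, "rt") as f:
            for line in f:
                # Only one decoded line is held in memory at a time
                yield line.rstrip("\n")

    for line in read_lines("data.csv.gz"):
        pass  # parse/aggregate each line here instead of reading the whole file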