How to load very big timeseries file(s) in Python to do analysis?


I have some .gz files containing time-series data. Naturally, I would like to do some time-series analysis on this.

I tried this:

import gzip
f = gzip.open('data.csv.gz', 'r')
file_content = f.read()  # reads the entire decompressed file into memory at once
print(file_content)

But it kept loading for 20 minutes and I stopped it manually.

My question is: how should I read this? I have considered using Dask or Spark, or should I just yield the lines?

I tried searching the internet for industry-standard approaches.

1 Answer

Answer by Narges Ghanbari:
  1. You can use Dask as follows:

    import dask.dataframe as dd

    # A single gzip file cannot be split into blocks, so load it as one partition
    df = dd.read_csv('data.csv.gz', compression='gzip', blocksize=None)
    
  2. Apache Spark also supports reading .gz files (it might be overkill for small datasets); see the PySpark sketch after this list.

  3. Yielding lines: If you're writing a function to process the file, you can use a generator to yield lines one by one. This is memory-efficient, as only one line is loaded into memory at a time; a sketch follows below.
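
A minimal PySpark sketch of option 2, assuming a local Spark installation and the same data.csv.gz file; Spark decompresses .gz files transparently, though a single gzip file is not splittable and is read by one task:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session
    spark = SparkSession.builder.appName("timeseries").getOrCreate()

    # Spark handles gzip decompression transparently; a single .gz file
    # cannot be split, so it is read by one task
    df = spark.read.csv("data.csv.gz", header=True, inferSchema=True)
    df.show(5)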
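
And a minimal sketch of option 3, assuming the file is line-oriented text; the processing inside the loop is a placeholder:

    import gzip

    def read_lines(path):
        # Open in text mode ("rt") so gzip decompresses and decodes on the fly
        with gzip.open(path, "rt") as f:
            for line in f:
                # Only one decoded line is held in memory at a time
                yield line.rstrip("\n")

    for line in read_lines("data.csv.gz"):
        pass  # parse/aggregate each line here instead of reading the whole file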