How to Optimize Memory Usage When Processing Large CSV Files in Python?


I am working on a Python script to process large CSV files (ranging from 2GB to 10GB) and am encountering significant memory usage issues. The script reads a CSV file, performs various transformations on the data, and then writes the transformed data to a new CSV file. The transformations include filtering rows based on certain criteria, mapping values to new formats, and aggregating data from multiple rows.

To handle the CSV files, I initially used the pandas library due to its powerful data manipulation features. Here's a simplified version of my code:

import pandas as pd

some_value = 100  # placeholder threshold for the filter condition

def process_csv(input_file, output_file):
    df = pd.read_csv(input_file)  # loads the entire file into memory at once
    filtered_df = df[df['column_name'] > some_value]
    filtered_df.to_csv(output_file, index=False)

process_csv('large_input.csv', 'transformed_output.csv')

Although this approach works well for smaller files, it results in excessive memory usage for larger files, causing my script to crash on machines with limited memory.

Expected vs. Actual Results: I expected pandas to be able to efficiently handle large files through its various optimization options (like chunking). However, even after trying to process the file in chunks, I'm still facing out-of-memory errors.

Question: How can I optimize my Python script to reduce memory usage when processing large CSV files? Are there any best practices for using pandas with large datasets, or should I consider alternative libraries or techniques specifically suited for this scale of data processing?

Specific Challenges: how to efficiently filter and transform the data without loading the entire file into memory, and best practices for writing the transformed data to a new CSV file in a memory-efficient way.

1 Answer

Accepted answer by canemirhan:

If your dataset has categorical variables, you can take advantage of the category dtype that pandas provides. Pass the dtype parameter to read_csv() to cut memory usage somewhat:

# Without categorical variables
df = pd.read_csv("dataframe.csv")
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1615940 entries, 0 to 1615939
Data columns (total 23 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   Job Id            1615940 non-null  int64  
 1   Experience        1615940 non-null  object 
 2   Qualifications    1615940 non-null  object 
 3   Salary Range      1615940 non-null  object 
 4   location          1615940 non-null  object 
 5   Country           1615940 non-null  object 
 6   latitude          1615940 non-null  float64
 7   longitude         1615940 non-null  float64
 8   Work Type         1615940 non-null  object 
 9   Company Size      1615940 non-null  int64  
 10  Job Posting Date  1615940 non-null  object 
 11  Preference        1615940 non-null  object 
 12  Contact Person    1615940 non-null  object 
 13  Contact           1615940 non-null  object 
 14  Job Title         1615940 non-null  object 
 15  Role              1615940 non-null  object 
 16  Job Portal        1615940 non-null  object 
 17  Job Description   1615940 non-null  object 
 18  Benefits          1615940 non-null  object 
 19  skills            1615940 non-null  object 
 20  Responsibilities  1615940 non-null  object 
 21  Company           1615940 non-null  object 
 22  Company Profile   1610462 non-null  object 
dtypes: float64(2), int64(2), object(19)
memory usage: 3.2 GB
# With categorical variables
cats = {"Work Type": "category", "Job Portal": "category"}
df = pd.read_csv("dataframe.csv", dtype=cats)
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1615940 entries, 0 to 1615939
Data columns (total 23 columns):
 #   Column            Non-Null Count    Dtype   
---  ------            --------------    -----   
 0   Job Id            1615940 non-null  int64   
 1   Experience        1615940 non-null  object  
 2   Qualifications    1615940 non-null  object  
 3   Salary Range      1615940 non-null  object  
 4   location          1615940 non-null  object  
 5   Country           1615940 non-null  object  
 6   latitude          1615940 non-null  float64 
 7   longitude         1615940 non-null  float64 
 8   Work Type         1615940 non-null  category
 9   Company Size      1615940 non-null  int64   
 10  Job Posting Date  1615940 non-null  object  
 11  Preference        1615940 non-null  object  
 12  Contact Person    1615940 non-null  object  
 13  Contact           1615940 non-null  object  
 14  Job Title         1615940 non-null  object  
 15  Role              1615940 non-null  object  
 16  Job Portal        1615940 non-null  category
 17  Job Description   1615940 non-null  object  
 18  Benefits          1615940 non-null  object  
 19  skills            1615940 non-null  object  
 20  Responsibilities  1615940 non-null  object  
 21  Company           1615940 non-null  object  
 22  Company Profile   1610462 non-null  object  
dtypes: category(2), float64(2), int64(2), object(17)
memory usage: 3.0 GB
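
The saving above is modest because only two of the nineteen object columns were converted. As a rough sketch (suggest_category_columns and the 5% uniqueness cutoff are my own assumptions, not a pandas API), you can scan a sample of the file to find other low-cardinality object columns worth converting:

import pandas as pd

def suggest_category_columns(path, sample_rows=100_000, max_unique_ratio=0.05):
    # Hypothetical helper: read a sample of the file and flag object
    # columns with few distinct values, which benefit most from "category"
    sample = pd.read_csv(path, nrows=sample_rows)
    return {col: "category"
            for col in sample.select_dtypes(include="object").columns
            if sample[col].nunique() / len(sample) <= max_unique_ratio}

cats = suggest_category_columns("dataframe.csv")
df = pd.read_csv("dataframe.csv", dtype=cats)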

In addition, you can pass the chunksize parameter to read_csv() in the same way as the dtype parameter above, so the file is read as an iterator of smaller DataFrames instead of all at once. You can find more information in the pandas documentation for read_csv: pd.read_csv
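
For the filter-and-write workload in the question, a minimal sketch of that pattern might look like this (column_name and the threshold are placeholders taken from the question, and chunksize=100_000 is just a starting point to tune against your available memory):

import pandas as pd

def process_csv(input_file, output_file, threshold, chunksize=100_000):
    # chunksize makes read_csv return an iterator of DataFrames,
    # so only one chunk is held in memory at a time
    with pd.read_csv(input_file, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            filtered = chunk[chunk['column_name'] > threshold]
            # write the header once, then append each filtered chunk
            filtered.to_csv(output_file,
                            mode='w' if i == 0 else 'a',
                            header=(i == 0),
                            index=False)

Because each filtered chunk is written out immediately, peak memory is bounded by the chunk size rather than the file size. A common mistake is collecting the chunks and concatenating them back into one DataFrame, which reintroduces the original problem. You can also pass the same dtype mapping to this read_csv() call to combine both techniques.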