How to deal with Python processes being killed while processing large files?

Question

77 views Asked by andrellima At 21 January 2024 at 16:56

I have this little script:

import pandas as pd

import os
import glob

novas_colunas = [
    'UF', 'Municipios', 'Área de Ponderação', 'Controle', 'Peso Amostral', 'Região Geográfica', 'Mesorregião', 'Microrregião',
    'Código da Região Metropolitana', 'Situação do Domicilio', 'Espécie de Unidade Visitada','Tipo de Espécie',
    'Condição de Ocupação', 'Valor do Aluguel', 'Aluguel em número de salários', 'Material Predominante', 'Nº de Cômodos',
    'Densidade de Morador', 'Cômodos  com dormitórios', 'Densidade de morador dormitório','Nº de Banheiros', 'Sanitários', 'Tipo de Esgotamento Sanitário', 'Forma de Abastecimento de Água',
    'Canalização', 'Destino do Lixo', 'Existênia de Energia Elétrica',
    'Existência de Medidor de Energia', 'Rádio', 'Televisão', 'Máquina de Lavar',
    'Geladeira', 'Celular', 'Telefone Fixo', 'Microcomputador', 'Microcomputador com internet', 'Motocicleta', 'Automóvel',
    'ALGUMA PESSOA QUE MORAVA COM VOCÊ(S) ESTAVA MORANDO EM OUTRO PAÍS EM 31 DE JULHO DE 2010',
    'QUANTAS PESSOAS MORAVAM NESTE DOMICÍLIO EM 31 DE JULHO DE 2010', 'A RESPONSABILIDADE PELO DOMICÍLIO É DE',
    'DE AGOSTO DE 2009 A JULHO DE 2010, FALECEU ALGUMA PESSOA QUE MORAVA COM VOCÊ(S)',
    'Rendimento Mensal pelo domicilio em Julho de 2010',
    'RENDIMENTO DOMICILIAR, SALÁRIOS MÍNIMOS, EM JULHO DE 2010',
    'RENDIMENTO DOMICILIAR PER CAPITA EM JULHO DE 2010',
    'RENDIMENTO DOMICILIAR PER CAPITA, EM Nº DE SALÁRIOS    MÍNIMOS, EM JULHO DE 2010',
    'Espécie da Unidade Doméstica', 'ADEQUAÇÃO DA MORADIA', 'MARCA DE IMPUTAÇÃO NA V0201:',
    'MARCA DE IMPUTAÇÃO NA V2011:', 'MARCA DE IMPUTAÇÃO NA V0202:', 'MARCA DE IMPUTAÇÃO NA V0203',
    'MARCA DE IMPUTAÇÃO NA V0204', 'MARCA DE IMPUTAÇÃO NA V0205','MARCA DE IMPUTAÇÃO NA V0206',
    'MARCA DE IMPUTAÇÃO NA V0207', 'MARCA DE IMPUTAÇÃO NA V0208', 'MARCA DE IMPUTAÇÃO NA V0209',
    'MARCA DE IMPUTAÇÃO NA V0210', 'MARCA DE IMPUTAÇÃO NA V0211', 'MARCA DE IMPUTAÇÃO NA V0212',
    'MARCA DE IMPUTAÇÃO NA V0213', 'MARCA DE IMPUTAÇÃO NA V0214', 'MARCA DE IMPUTAÇÃO NA V0215',
    'MARCA DE IMPUTAÇÃO NA V0216', 'MARCA DE IMPUTAÇÃO NA V0217', 'MARCA DE IMPUTAÇÃO NA V0218',
    'MARCA DE IMPUTAÇÃO NA V0219', 'MARCA DE IMPUTAÇÃO NA V0220', 'MARCA DE IMPUTAÇÃO NA V0221',
    'MARCA DE IMPUTAÇÃO NA V0222', 'MARCA DE IMPUTAÇÃO NA V0301', 'MARCA DE IMPUTAÇÃO NA V0401',
    'MARCA DE IMPUTAÇÃO NA V0402', 'MARCA DE IMPUTAÇÃO NA V0701', 'SITUAÇÃO DO SETOR']

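pasta = 'saida-microdados/*.csv'
arquivos = glob.glob(pasta)  # glob.glob already returns a list

for arquivo in arquivos:
  # read_csv loads the whole file into memory at once
  dataf = pd.read_csv(arquivo)
  velhas_colunas = dataf.columns
  # inplace must be the boolean True, not the string 'True'
  dataf.rename(columns=dict(zip(velhas_colunas, novas_colunas)), inplace=True)
  # without index=False, to_csv also writes the DataFrame index as a first column
  dataf.to_csv(arquivo)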

I think the code itself is fine. The problem is that there are many CSVs in this folder, and many of them are quite large, with many rows. Since the script produces no output, I don't know which files had their columns changed, but when I opened them I saw that only a few had been changed and most had not.

What would be the best approach to deal with this? I tried PyPy, expecting it to solve the problem, but it made no difference. I also plan to use list comprehensions and avoid dotted method calls, and I think I read somewhere that len(x) is expensive as well. But will this be enough? I think not. I'm considering dividing the CSV files into categories and writing a script for each.

There is another, bigger script (more than 150 lines) that reads txt files and generates these CSV files, and it has the same problem: the process is killed before it finishes.

Original Q&A

There is 1 answer

**tripleee** · Answer 1 · 2024-01-21T17:31:55+00:00

Here is an even simpler replacement for your script.

import fileinput
from glob import glob

novas_colunas = [...]

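for filename in glob('saida-microdados/*.csv'):
    # inplace=True redirects print() output into the file being read
    for line in fileinput.input(filename, inplace=True):
        if fileinput.filelineno() == 1:
            # replace the old header with the new column names
            print(','.join(novas_colunas))
        else:
            # line already ends with a newline, so suppress print's own
            print(line, end='')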

If you want to add line numbers in front of each row, that's easy enough:

import fileinput
from glob import glob

novas_colunas = [...]

for filename in glob('saida-microdados/*.csv'):
    for line in fileinput.input(filename, inplace=True):
        lineno = fileinput.filelineno()
        if lineno == 1:
            # empty first cell is the header of the line-number column
            print(','.join([''] + novas_colunas))
        else:
            # line already ends with a newline, so suppress print's own
            print(str(lineno-1) + ',' + line, end='')

Or, if you mean you want an empty column in front of the "UF" on the first line, that's a combination of the two above:

import fileinput
from glob import glob

novas_colunas = [...]

for filename in glob('saida-microdados/*.csv'):
    for line in fileinput.input(filename, inplace=True):
        lineno = fileinput.filelineno()
        if lineno == 1:
            # empty first cell in front of "UF"
            print(','.join([''] + novas_colunas))
        else:
            # data lines pass through unchanged; they already end with a newline
            print(line, end='')

This presupposes that the new headers don't contain literal commas. You might want to use the csv module for any nontrivial CSV processing.
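As a minimal sketch of that suggestion, the csv module can build the header line so that any column name containing a comma or a double quote is quoted correctly. The helper name header_line is hypothetical and not part of the original answer:

```python
import csv
import io


def header_line(columns):
    """Return one CSV-encoded line (ending in a newline) for the given column names."""
    buffer = io.StringIO()
    # csv.writer quotes any field that contains a comma, quote, or newline
    csv.writer(buffer, lineterminator='\n').writerow(columns)
    return buffer.getvalue()
```

In the loops above, print(header_line(novas_colunas), end='') would then take the place of print(','.join(novas_colunas)).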

This simply reads and forgets one line of data at a time, so you should not run out of memory (unless your data contains some really long lines).

The fileinput module handles the rewriting behind the scenes by way of its inplace=True mechanism: it moves the original file to a backup, redirects standard output to a new file under the original name, and deletes the backup once it is done with that file.
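For comparison, here is a hand-rolled sketch of the same kind of in-place rewrite without fileinput. It streams the file line by line into a temporary file in the same directory and then swaps it over the original with os.replace. The function name rewrite_first_line and the UTF-8 default are assumptions made for illustration; adjust the encoding to whatever your files actually use.

```python
import os
import tempfile


def rewrite_first_line(path, new_first_line, encoding='utf-8'):
    """Replace the first line of a text file, keeping one line in memory at a time."""
    directory = os.path.dirname(os.path.abspath(path))
    # create the temporary file next to the original so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with open(path, encoding=encoding, newline='') as src, \
                os.fdopen(fd, 'w', encoding=encoding, newline='') as dst:
            src.readline()                    # skip the old first line
            dst.write(new_first_line + '\n')
            for line in src:                  # lines keep their original endings
                dst.write(line)
        os.replace(tmp_path, path)            # swap the new file into place
    except BaseException:
        # clean up the partial temporary file and leave the original untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the original is only replaced after the copy completes, a process that is killed midway leaves the original file intact, with at most a stray .tmp file beside it.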

Going forward, please review the help center, in particular How to Ask, as well as the guidance for providing a minimal reproducible example. A description of what the code should do, ideally with a small, simple CSV example to test with, would reduce the amount of guessing we have to perform.
