xlsxwriter corrupts file after applying data validation on more than 71215 rows

193 views Asked by At

I've recently jumped from VBA to python trying to improve the reporting on my company so I am a bit new in python.

What my program tries to do is after consolidating various CSV files in one dataframe, applying datavalidation in form of a dropdown list with xlsxwriter.

I was getting a corrupted file everytime after the process was done and so, the data validation was gone.

I've been playing with the limits to find out it can only apply data validation for 71215 rows of data, after that the file becomes corrupt.

Because this only happens with large data, I can't provide an example of dataframe but it would be like:

df = pd.read_csv(fichero, sep=";", dtype={
                                            "Dirección": "string",
                                            "Territorio": "string", 
                                            "DNI": "string", 
                                            "Agente": "string",
                                            "CentroACC": "string",
                                            "ModoACC": "string",
                                            "Fecha": "string",
                                            "Presentes": "string",
                                            "Contratados": "string",
                                            "Modo Planificación": "string",
                                            }
                )

Here is the code which is implementing the data validation. Works fine, won't raise any error, but the result is a corrupted xlsx file.

import pandas as pd
import xlsxwriter as xl
import numpy as np
import time
from diccionarios import diccionarioHabilidadesAgente as dh
def cargaListaHabilidades(df, Fecha):
    df = df.copy();
    df['Habilidades Vigentes'] = np.where((df['INICIO'] <= Fecha) & (Fecha <=  df['FIN']), 1, 0)   
    df = df.loc[df['Habilidades Vigentes']==1]
    return df['Campaña'].tolist()
def main(df, rutaficheros):
    inicio = time.perf_counter()
    dfHabilidades = pd.read_csv(rutaficheros + 'Habilidades.csv', sep=";")
    diccionarioHabilidades = dh(dfHabilidades)
    fin1 = time.perf_counter()
    print('diccionario habilidades generado en: ', fin1 - inicio, ' segundos.')
    del(dfHabilidades)
    fichero = rutaficheros + 'Consolidado_ACC.xlsx'    
    writer = pd.ExcelWriter(fichero, engine='xlsxwriter') 
    df.to_excel(writer, sheet_name='Descarga ACC')
    workbook  = writer.book
    worksheet = writer.sheets['Descarga ACC']
    fin2 = time.perf_counter()
    print('Excel generado en: ', fin2 - fin1, ' segundos.')
    for fila in range(0, 71214):#len(df)):
        Agente = df.iloc[fila]['DNI']
        Fecha = df.iloc[fila]['Fecha']
        if Agente in diccionarioHabilidades:
            if Fecha in diccionarioHabilidades[Agente]:
                lista = diccionarioHabilidades[Agente][Fecha]
                worksheet.data_validation('K' + str(fila + 2), {'validate': 'list',
                                            'source': lista})  
    fin3 = time.perf_counter()
    print('validación aplicada en: ', fin3 - fin2)                          
    workbook.close()
    print('guardado el fichero en: ', time.perf_counter() - fin3, ' segundos.')
    return

Like this the code works because I manually input 71214 on the loop, but the total number of rows is like 100k.

I've wanted to ask if someone would know why is the reason of this before doing the less pretty process of spliting the df in 2 and generate 2 separated files.

Edit:

  • Excel can handle way more data validations (it was done by VBA).
  • Each data validation needs to be applied on it's own because people have different set of skills for each day of the month.
0

There are 0 answers