Problems appending to Excel

38 views Asked by At

This is probably a common question, but I have a Python 3.11.5 project using os and pandas packages.

I need to read from and write to Excel files.

The goal is to read from 9 source files. For each source file, I want to get the data beneath the source file's column headers. Then, I want to migrate that data to the destination file's corresponding columns. The first row of the destination file is just information for the analyst. The info, btw is just describing the headers in the next row. So, the second row has headers. The data starts at the third row of the destination excel file. The destination excel file is initially without data.

For each pair of source column headers 'File NameX' and 'File CategoryX', I want to get its data.

Here, X is a number between 1 and 26. There happen to be 26 pairs of headers with 'File Name' and 'File Category' in the header name, so that's why 1 <= X <= 26. After the python script gets data from the Xth pair, I want to populate the Xth destination file's corresponding columns with that data. Destination files are named so that the Xth destination file is called 'dest (X).xlsx'. So, X is the number between 1 and 26 and X from 'dest (X).xlsx' that also equals X from 'File NameX' and 'File CategoryX'. 'File NameX' from source corresponds to 'ITEM_DOCUMENT' in destination and 'File CategoryX' from source corresponds to 'ITEM_DOCUMENT_TYPE' in destination.

The problem: my program is appears to overwrite the destination file instead of simply appending the data from source to below the destination file's header row. I say "appears to overwrite" because while destination file's headers are preserved, the formatting is different (black bold font instead of red bold font in some places), the first row no longer has header info for the analyst, and the widths are different, too.

How do I simply append? A part of my code's main function:

for i, (file_name_col, file_category_col) in \
                enumerate(zip(file_name_cols, file_category_cols), start=1):
                dest_file = os.path.join(dest_folder, f"dest ({i}).xlsx")

                # Check Column Existence:
                file_name_col = f'File Name{i}'
                file_category_col = f'File Category{i}'

                if file_name_col in source_data.columns and \
                    file_category_col in source_data.columns:
                    # Create destination DataFrame with specified headers if the file doesn't exist
                    if not os.path.isfile(dest_file):
                        dest_columns = ['PART_NUMBER', 'LANGUAGE_CODE', 'MANUFACTURER_NAME',
                                        'BRAND_NAME', 'ITEM_DOCUMENT', 'ITEM_DOCUMENT_TYPE']
                        dest_data = pd.DataFrame(columns=dest_columns)
                        dest_data.to_excel(dest_file, index=False)

                    # Read the existing destination data or
                    # create an empty DataFrame if the file doesn't exist
                    dest_data = pd.read_excel(dest_file, header=1) \
                        if os.path.isfile(dest_file) else pd.DataFrame()

                    dest_columns = ['PART_NUMBER', 'LANGUAGE_CODE', 'MANUFACTURER_NAME',
                                    'BRAND_NAME', 'ITEM_DOCUMENT', 'ITEM_DOCUMENT_TYPE']

                    # Ensure that the destination file has the required columns
                    for col in dest_columns:
                        if col not in dest_data.columns:
                            dest_data[col] = ''

                    new_data = source_data[['PART_NUMBER', 'LANGUAGE_CODE', \
                        'MANUFACTURER_NAME', 'BRAND_NAME']].copy()
                    new_data['ITEM_DOCUMENT'] = source_data[file_name_col].copy()
                    new_data['ITEM_DOCUMENT_TYPE'] = \
                        new_data['ITEM_DOCUMENT'].apply(determine_document_type)

                    # Append new data to the existing destination file
                    dest_data = pd.concat([dest_data, new_data], ignore_index=True)

                    # Write the combined data to the destination file
                    dest_data.to_excel(dest_file, index=False, sheet_name='Sheet1', engine='openpyxl')
                else:
                    # Handle the case where the columns don't exist
                    raise ValueError(f"Columns '{file_name_col}' \
                                    and/or '{file_category_col}' do not exist in source_data.")

I've tried running this by ChatGPT but it's running out of ideas and keeps forgetting things that I've already tried. Please help me out. If more info is needed for this I'll happily provide it. I'll monitor this post over the weekend--it's a project for work. Task automation. Trying to do a proof-of-concept. Thanks and take care.

1

There are 1 answers

0
Michael Bujard On

This is really moken's answer but -- replaced

dest_data = pd.concat([dest_data, new_data], ignore_index=True)

# Write the combined data to the destination file
dest_data.to_excel(dest_file, index=False, sheet_name='Sheet1', engine='openpyxl')

with

# Use ExcelWriter to append data to an existing file
with pd.ExcelWriter(dest_file, engine='openpyxl', mode='a', if_sheet_exists='overlay') as writer:
    # Write the new data to the destination file
    new_data.to_excel(writer, index=False, sheet_name='Sheet1', startrow=2, header=None)

and had the desired result of Excel worksheet being appended to, without headers overwritten.