Create a directory of filenames from zip folders with Python


I have a folder filled with lots of zip folders. Each zip contains between 1 and 4 CSV files. I'd like to create a directory (preferably output to Excel) that has, in column 1, the number within the name of the zip folder and, in column 2, part of the name of at least one of the CSV files contained in that zip folder. So reading the path of at least one CSV file per zip and saving two pieces of it would work.

More details:

  • Zip Folders: The zip folders have a naming scheme like: "CSV_2023_Q2_XXXXXXX_XXXXXXX.ZIP". There are 3,000+ of them.

  • CSV files within each Zip: Within each zip folder there are 1 to 4 CSV files. These all follow a consistent naming scheme for the beginning of the filename, but with (at most) 4 different suffixes. The root goes like this: "202306_COMPANY_NAME_LLC..." (note that for companies with more or fewer than two words in their name it will have a different structure, like "202306_COMPANY..." or "202306_COMPANY_WITH_LONG_NAME_LLC..."). The suffixes for these CSV files are "_ident.CSV", "_contracts.CSV", "_indexPub.CSV" and "_transactions.CSV". So, for example, a full CSV file name would be: "202306_COMPANY_NAME_LLC_ident.csv". All zip folders have at least the "ident.csv" file, but not all have the rest.

  • End goal: I want to traverse all of these zip folders and create a directory linking the name of each zip folder to the company name inside it. As long as this directory is searchable I don't care how it's built, but I'd prefer it to be output to an Excel file.
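
The naming schemes in the list above can be split into their parts with two regular expressions. This is only a minimal sketch: the helper names `zip_number` and `company_name` are hypothetical, and it assumes that the "number" wanted from the zip name is everything after the "CSV_<year>_Q<quarter>_" prefix and that every CSV name starts with a six-digit period. Adjust the patterns if the real names differ.

```python
import re

# Assumption: the wanted "number" is everything after "CSV_<year>_Q<quarter>_".
ZIP_NAME = re.compile(r"^CSV_\d{4}_Q\d_(?P<number>.+)\.zip$", re.IGNORECASE)

# Assumption: CSV names look like "<6-digit period>_<COMPANY>_<suffix>.csv".
CSV_NAME = re.compile(
    r"^\d{6}_(?P<company>.+)_(?:ident|contracts|indexPub|transactions)\.csv$",
    re.IGNORECASE,
)


def zip_number(zip_name):
    """Return the identifying part of a zip file name, or None if it does not match."""
    match = ZIP_NAME.match(zip_name)
    return match.group("number") if match else None


def company_name(csv_name):
    """Return the company part of a CSV file name, or None if it does not match."""
    # Entries inside a zip may carry a folder prefix, so keep only the last part.
    base = csv_name.rsplit("/", 1)[-1]
    match = CSV_NAME.match(base)
    return match.group("company") if match else None
```

Both patterns ignore case, so ".ZIP"/".zip" and ".CSV"/".csv" are treated the same.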

Thank you!!

I've tried the following, but all I get is a final csv file that just has the titles of the column headings:

import csv
import os
import zipfile

# I first imported the path of the folder as main_folder

folder_file_info = []

# Traverse the main folder
for root, dirs, files in os.walk(main_folder):
    for folder_name in files:
        if folder_name.startswith("CSV_"):
            zip_folder_path = os.path.join(root, folder_name)

            # Get the list of files in the zip folder
            with zipfile.ZipFile(zip_folder_path, "r") as zip_ref:
                zip_files = zip_ref.namelist()

            # Find the ident file
            ident_file = next((file for file in zip_files if file.endswith("_ident.csv")), None)

            if ident_file:
                # Extract company name
                zip_folder_name = folder_name
                file_beginning = ident_file.rsplit("_ident.csv", 1)[0]
                folder_file_info.append((zip_folder_name, file_beginning))

# Write folder and file info into a CSV
csv_file_path = "folder_contents_6.csv"
with open(csv_file_path, "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Zip Folder", "Company Name"])
    csv_writer.writerows(folder_file_info)
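
One possible cause of the header-only output, which the question does not confirm: the suffixes are listed as "_ident.CSV" (upper-case extension), while the code tests for "_ident.csv", and `str.endswith` is case-sensitive. If the real files use the upper-case extension, no ident file is ever found and nothing is appended. A minimal sketch of a case-insensitive check follows; the helper names `find_ident` and `strip_ident_suffix` are hypothetical.

```python
SUFFIX = "_ident.csv"


def find_ident(names):
    """Return the first name ending in '_ident.csv', ignoring case, or None."""
    return next((n for n in names if n.lower().endswith(SUFFIX)), None)


def strip_ident_suffix(name):
    """Drop the trailing '_ident.csv' (in any case) from a matching file name."""
    # Slice by length instead of rsplit so '.CSV' and '.csv' behave the same.
    return name[: -len(SUFFIX)]
```

Printing `zip_ref.namelist()` for a single zip is a quick way to check which spelling the files actually use.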

There is 1 answer

richard
from multiprocessing.dummy import Pool
from pathlib import Path
import csv, zipfile


# function for worker threads
def worker(filename):
    with zipfile.ZipFile(filename) as file:
        ident = (i for i in file.namelist() if i.endswith('_ident.csv'))
        files = ', '.join(i.rsplit('_ident.csv', 1)[0] for i in ident)
    return (filename, files or 'no "_ident.csv" found')

if __name__ == '__main__':
    # use pathlib to make a list of filenames
    # adjust path as needed
    csv_files = (file.resolve() for file in Path('.').glob('CSV_*'))

    # might get a slight bump from multiple threads, might not
    with Pool() as pool:
        # if order matters, use imap
        # imap_unordered returns results "as completed"
        results = list(pool.imap_unordered(worker, csv_files))

    with open('folder_contents_6.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['zip_folder', 'company_name'])
        writer.writerows(results)
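
For comparison, here is a single-threaded sketch of the same idea that matches the suffix case-insensitively and records only the zip's file name instead of its full path. The function names `build_index` and `write_index` are hypothetical, and the code assumes all zips sit directly in one folder. The output is a CSV written with the `utf-8-sig` encoding, which Excel can open; producing a real .xlsx file would need a third-party library and is not shown.

```python
import csv
import zipfile
from pathlib import Path

SUFFIX = "_ident.csv"


def build_index(folder):
    """Return (zip_name, ident_stem) rows for every CSV_* zip directly in folder."""
    rows = []
    for path in sorted(Path(folder).glob("CSV_*")):
        if not zipfile.is_zipfile(path):
            continue  # skip anything that is not a readable zip archive
        with zipfile.ZipFile(path) as archive:
            # Keep only the base name of each entry, dropping any folder prefix.
            names = [n.rsplit("/", 1)[-1] for n in archive.namelist()]
        # Compare in lower case so both '.CSV' and '.csv' are matched.
        stems = [n[: -len(SUFFIX)] for n in names if n.lower().endswith(SUFFIX)]
        rows.append((path.name, ", ".join(stems) or "no ident file found"))
    return rows


def write_index(rows, out_path):
    """Write rows to a CSV file; utf-8-sig adds a BOM so Excel detects the encoding."""
    with open(out_path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(["zip_folder", "company_name"])
        writer.writerows(rows)
```

A possible call would be `write_index(build_index("."), "folder_contents_6.csv")`, with the folder path adjusted as needed.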