Copy multiple csv files from multiple subfolders in the same parent directory, then join to create a combined final .csv


This seems like an easy thing to do, but without fail I can't seem to get it right.

I have a parent directory "Sessions" with many subfolders "day*" -> "weather", each containing a similar csv file, "weather*.csv". I would like to loop through each subfolder in the directory structure and COPY that specific "weather*.csv" file, in order to create one master csv file with the destination in the parent directory.

Directory/file/source paths being:

Sessions/day1/weather/weather1.csv,
Sessions/day2/weather/weather2.csv,
Sessions/day3/weather/weather3.csv,
Sessions/day4/weather/weather4.csv.... and so forth through 150.

Destination path:

Sessions/Weather/weather_All.csv

At this point I want to populate the Weather directory with the 150+ individual weather*.csv files.
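For the copy step itself, something like the sketch below could work, using shutil and pathlib from the standard library. Note the `Sessions_demo` tree is built here only so the snippet is runnable as-is; with the real data, you would point `base` at the actual Sessions directory and drop the setup loop.

```python
import shutil
from pathlib import Path

# Demo setup: build a tiny stand-in for the Sessions/day*/weather/ layout.
# With the real data these directories already exist -- skip this part.
base = Path("Sessions_demo")
for day in range(1, 4):
    src_dir = base / f"day{day}" / "weather"
    src_dir.mkdir(parents=True, exist_ok=True)
    (src_dir / f"weather{day}.csv").write_text("temp,humidity\n20,55\n")

# Destination folder for the copied files.
dest = base / "Weather"
dest.mkdir(exist_ok=True)

# Copy every weather*.csv from each day*/weather subfolder into Weather.
for src in base.glob("day*/weather/weather*.csv"):
    shutil.copy(src, dest / src.name)

copied = sorted(p.name for p in dest.glob("weather*.csv"))
print(copied)
```

The `day*/weather/weather*.csv` pattern keeps the search anchored to the known layout, so stray csv files elsewhere under Sessions won't be swept up.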

Example:


import os
import glob
import pandas as pd

#connect to the directory where the list of all weather csv now live.
os.chdir("/Sessions/Weather/")

#define the extension of the files we want to concatenate
#(no leading dot here -- the glob pattern below adds it)
extension = 'csv'

#search for all the files that match the csv extension
all_filenames = glob.glob('*.{}'.format(extension))

#combine all the files in the directory
combined_Weather = pd.concat([pd.read_csv(f) for f in all_filenames])

#export the newly concatenated, combined csv
combined_Weather.to_csv('/Sessions/Weather/weather_All.csv', index= False, encoding='utf-8-sig')


I've attempted to use os.walk, shutil, and pandas concat, but nothing I have tried works properly. I've looked through multiple questions and answers on here and tried to piece together code chunks to get the output I'm looking for, but haven't been successful.


There is 1 answer

scotscotmcc

What about something like this?

This does not create duplicates of each individual csv. I know you said you are trying to copy them, but it seems like maybe you don't actually want copies; you were just doing that as a step towards getting the final file. My apologies if that is incorrect.

In this code, we create an empty main_df that is just a skeleton of what we want. Of course, you'll need to put the actual column names in that code (probably by looking manually through one of the individual csvs; hopefully that works).

Then we use pathlib.Path.rglob() to iterate over the Sessions folder and pull out anything that ends in .csv. We read each file into a dataframe and append it to main_df.

Once we've gone through all the files, we save main_df as a csv.

from pathlib import Path
import pandas as pd

main_df = pd.DataFrame(columns=['the','columns','that','exist','in','each','csv'])

for file in Path('Sessions').rglob('*.csv'):
    df = pd.read_csv(file)
    main_df = pd.concat([main_df, df])
    # if you really want copies of each csv, you can do a df.to_csv(filepath) here to copy the data, but not technically the file

main_df.to_csv(r'Sessions/weather_all.csv')
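A variation on the same idea: rather than growing the dataframe with concat inside the loop, collect one dataframe per file in a list and concatenate once at the end, which is cheaper for 150+ files. The sketch below builds a tiny stand-in tree so it runs on its own; the `Sessions_concat_demo` path and its contents are assumptions for the demo, not part of your data. Matching `day*/weather/weather*.csv` instead of `*.csv` also means a rerun won't accidentally re-read the combined output file.

```python
from pathlib import Path
import pandas as pd

# Demo setup: a minimal stand-in for the assumed Sessions layout.
base = Path("Sessions_concat_demo")
for day in range(1, 3):
    d = base / f"day{day}" / "weather"
    d.mkdir(parents=True, exist_ok=True)
    (d / f"weather{day}.csv").write_text(f"temp,humidity\n{20 + day},55\n")

# Read one DataFrame per file, then concatenate once at the end.
frames = [pd.read_csv(f) for f in sorted(base.glob("day*/weather/weather*.csv"))]
main_df = pd.concat(frames, ignore_index=True)

main_df.to_csv(base / "weather_All.csv", index=False)
print(len(main_df))
```

This also sidesteps the need to type out the column names up front: the columns come from the csv headers themselves.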