Iterating through a directory tree in Python


I have a set of data that is stored in a directory tree. Data for a given day is in a folder that bears the name of the day (like 01 for the first day, 14 for the fourteenth day in a month, etc.). All folders for the days of a given month are stored in a monthly folder (numbered 01 to 12), and all of the month folders are inside a yearly folder (2014, 2015 and 2016 in my case). All of these are in another folder.

I have defined 3 functions that each iterate through one folder using a for loop over os.listdir() (but perform somewhat different operations):

for e in os.listdir(os.curdir):
    if e.endswith(refer[-4:]):
        #Performs operations (GDAL related)

Earlier in the program I have the user input certain parameters, including the top folder (which reaches the function via os.curdir) and a reference file (stored in the refer variable). The functions return a variable arr that is stored in out_arr. In case the user wants to go through all the data, I have this code invoke the correct function and go deep enough through the directory tree:

elif param_vrsta == "dr" and param_obdobje == "v":
    for dd in os.listdir(os.curdir):
        for aa in os.listdir(dd):
            for kk in os.listdir(aa):
                out_arr = krog1()

Several other elif branches, before and after this one, go one, two or three levels deep and invoke one of the three functions.

But I keep getting an error:

FileNotFoundError: [WinError 3] Path could not be found: '01'

on the

for kk in os.listdir(aa):

line. The end goal is to be able to choose whether to perform the operations on one folder (day), one month (up to 31 folders of data), one year (12 folders with up to 31 folders of data each) or just all the data, and have the function work every time. Have I missed some syntax error (or added one too many lines), or am I going about this the wrong way from the get-go?


There are 2 answers

Bill Bell On BEST ANSWER

I would not say that you're doing anything wrong. I think there's an easier way.

Consider this folder structure that is truncated in comparison to what you have.

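[Image: folder tree. temp/ contains month folders 01, 02 and 03; each month folder contains day folders 01 and 02.]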

Each of the lowest level folders contains just one file called data.txt.

Once you set path to the top-level path name, you can get an iterator over various subsets of the files under that path.

>>> from pathlib import Path

>>> path = Path('temp/')

All of the files:

>>> list(path.glob('*/*/*'))
[WindowsPath('temp/01/01/data.txt'), WindowsPath('temp/01/02/data.txt'), WindowsPath('temp/02/01/data.txt'), WindowsPath('temp/02/02/data.txt'), WindowsPath('temp/03/01/data.txt'), WindowsPath('temp/03/02/data.txt')]

All of the February files:

>>> list(path.glob('02/*/*'))
[WindowsPath('temp/02/01/data.txt'), WindowsPath('temp/02/02/data.txt')]

All of the files for the second day of the month:

>>> list(path.glob('*/02/*'))
[WindowsPath('temp/01/02/data.txt'), WindowsPath('temp/02/02/data.txt'), WindowsPath('temp/03/02/data.txt')]

All of the files for the second day of February:

>>> list(path.glob('02/02/*'))
[WindowsPath('temp/02/02/data.txt')]
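To try the patterns above without touching real data, here is a minimal sketch that builds the same kind of tree in a temporary directory and runs the four globs. The helper names build_sample_tree and select are made up for this illustration, and the results are converted to root-relative strings with forward slashes so they look the same on every platform.

```python
import tempfile
from pathlib import Path


def build_sample_tree(root):
    """Create root/<month>/<day>/data.txt for months 01-03 and days 01-02."""
    for month in ("01", "02", "03"):
        for day in ("01", "02"):
            day_dir = root / month / day
            day_dir.mkdir(parents=True)
            (day_dir / "data.txt").write_text("sample")


def select(root, pattern):
    """Return the glob matches as sorted, root-relative POSIX-style strings."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_sample_tree(root)
    all_files = select(root, "*/*/*")      # every file
    february = select(root, "02/*/*")      # month 02 only
    second_days = select(root, "*/02/*")   # day 02 of every month
    feb_second = select(root, "02/02/*")   # day 02 of month 02

print(len(all_files))
print(february)
print(second_days)
print(feb_second)
```

Note that glob does not promise any particular order, which is why the sketch sorts the matches.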

Edit: Supplementary:

from pathlib import Path

top_path = input('Please input the full name of the top path:')
path = Path(top_path)
year = input('What year do you want to consider? (Enter * for all years)')
month = input('What month do you want to consider? (Enter * for all months)')
day = input('What day do you want to consider? (Enter * for all days)')

for item in path.glob('{}/{}/{}'.format(year, month, day)):
    print(item)  # placeholder: do something with item here
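The same idea can be wrapped in a function that covers the four cases from the question (one day, one month, one year, everything). This is only a sketch: day_folders is a made-up name, it assumes the top/year/month/day layout described in the question, and the demo runs on a throwaway tree rather than real data.

```python
import tempfile
from pathlib import Path


def day_folders(top, year="*", month="*", day="*"):
    """Return the day folders under top that match year/month/day, sorted.

    Each argument is either a folder name such as "2015" or "03",
    or "*" to match every folder at that level.
    """
    pattern = "{}/{}/{}".format(year, month, day)
    # glob also matches plain files, so keep directories only.
    return sorted(p for p in Path(top).glob(pattern) if p.is_dir())


# Demo on a throwaway tree: 2 years x 2 months x 2 days.
with tempfile.TemporaryDirectory() as tmp:
    for y in ("2014", "2015"):
        for m in ("01", "02"):
            for d in ("01", "02"):
                (Path(tmp) / y / m / d).mkdir(parents=True)
    everything = len(day_folders(tmp))
    one_year = len(day_folders(tmp, year="2015"))
    one_month = len(day_folders(tmp, year="2015", month="02"))
    one_day = len(day_folders(tmp, "2015", "02", "01"))

print(everything, one_year, one_month, one_day)
```

On this sample tree the four counts are 8, 4, 2 and 1. Each returned item is a full path, so it can be handed straight to whatever processes a single day's folder.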
Roel Schroeven On

I think you should seriously consider the approach of Bill Bell's answer, but for completeness here's what (I think) is the cause of the error you get: os.listdir returns unqualified file names, so when you dig deeper into the directory hierarchy you need to specify each directory along the way.

Something like this:

for year_dir in os.listdir(os.curdir):
    year_dir = os.path.join(os.curdir, year_dir)
    for month_dir in os.listdir(year_dir):
        month_dir = os.path.join(year_dir, month_dir)
        for day_dir in os.listdir(month_dir):
            ...

Alternatives:

  • Use glob.glob() instead of os.listdir(): glob does qualify the file names
  • Use recursive glob, possibly in combination with pathlib.Path (as in Bill Bell's answer)
  • Use os.walk()
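For the os.walk() route, a minimal sketch could look like the following. It collects every file under the top folder whose name ends with a given suffix, similar to the endswith check in the question. The function name files_with_suffix, the ".tif" suffix and the sample file names are assumptions made for this illustration only.

```python
import os
import tempfile


def files_with_suffix(top, suffix):
    """Walk top recursively and return full paths of files ending in suffix."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # in-place sort makes the traversal order deterministic
        for name in sorted(filenames):
            if name.endswith(suffix):
                # os.walk yields bare names, so join them with dirpath.
                matches.append(os.path.join(dirpath, name))
    return matches


# Demo on a throwaway tree: one year, two months, two days per month.
with tempfile.TemporaryDirectory() as tmp:
    for month in ("01", "02"):
        for day in ("01", "02"):
            day_dir = os.path.join(tmp, "2015", month, day)
            os.makedirs(day_dir)
            for name in ("scene.tif", "notes.txt"):  # hypothetical file names
                open(os.path.join(day_dir, name), "w").close()
    found = [
        os.path.relpath(p, tmp).replace(os.sep, "/")
        for p in files_with_suffix(tmp, ".tif")
    ]

print(found)
```

Like os.listdir, os.walk hands back bare names, but it also supplies dirpath for each level, so the join that caused the original error is always one call away.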