How can I recursively iterate through a directory in Python while ignoring some subdirectories?


I have a directory structure on my filesystem, like this:

folder_to_scan/
    important_file_a
    important_file_b
    important_folder_a/
        important_file_c
    important_folder_b/
        important_file_d
    useless_folder/
        ...

I want to recursively scan through folder_to_scan/, and get all the file names. At the same time, I want to ignore useless_folder/, and anything under it.

If I do something like this:

from pathlib import Path

path_to_search = Path("folder_to_scan")
[pth for pth in path_to_search.rglob("*") if pth.is_file() and 'useless_folder' not in [parent.name for parent in pth.parents]]

It will work (probably - I didn't bother trying), but the problem is, useless_folder/ contains millions of files, and rglob will still traverse all of them, take ages, and only apply the filter when constructing the final list.

Is there a way to tell Python not to waste time traversing useless folders (useless_folder/ in my case)?

There are 3 answers

Carlos Horn (best answer)

You can easily write your own file iterator using recursion.

def useless(path):
    # your logic to discard folders goes here
    ...

def my_files_iter(path):
    if path.is_file():
        yield path
    elif path.is_dir():
        if useless(path):
            return
        for child_path in path.iterdir():
            yield from my_files_iter(child_path)
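
One way to fill in useless, as a minimal sketch. The EXCLUDED_NAMES set and the name-based check are assumptions for illustration, not part of the answer above. Because the check runs before iterdir() is called, the contents of an excluded folder are never listed.

```python
from pathlib import Path
from typing import Iterator

# Hypothetical set of folder names to skip; adjust to your own layout.
EXCLUDED_NAMES = {"useless_folder"}


def useless(path: Path) -> bool:
    """Return True for directories that should not be entered."""
    return path.name in EXCLUDED_NAMES


def my_files_iter(path: Path) -> Iterator[Path]:
    """Yield every file under path, skipping excluded directories."""
    if path.is_file():
        yield path
    elif path.is_dir():
        if useless(path):
            return  # stop here, so the folder's contents are never listed
        for child_path in path.iterdir():
            yield from my_files_iter(child_path)
```

It could then be used as `for f in my_files_iter(Path("folder_to_scan")): print(f)`.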
Thomas

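If you just want to exclude one folder, you could try a glob string like "./[!useless_folder]*". Be aware that [!...] is a character class rather than a name match: it rejects any name whose first character appears in useless_folder, and rglob still walks the whole tree: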

from pathlib import Path

PATH = Path.cwd() / "folder_to_scan"

for file in PATH.rglob("./[!useless_folder]*"):
    print(file)

However, entries inside useless_folder or its subfolders whose names start with a character outside that class, such as a .something file (e.g. .ipynb_checkpoints), won't be filtered out. The glob string may need more tweaking.
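
To see how a bracket expression like this behaves, here is a small sketch using fnmatch, which, as far as I know, follows the same character-class rules as pathlib's globbing. The file names are made up for illustration. It shows that the pattern tests only the first character of each name, so it also rejects unrelated names that start with one of those letters.

```python
from fnmatch import fnmatchcase

PATTERN = "[!useless_folder]*"

# Hypothetical names, chosen to show which first characters are rejected.
names = [
    "useless_folder",
    "important_file_a",
    "setup.py",
    "data.csv",
    ".ipynb_checkpoints",
]

# Map each name to whether the pattern accepts it.
matches = {name: fnmatchcase(name, PATTERN) for name in names}

for name, matched in matches.items():
    print(f"{name!r}: {matched}")
```

Here setup.py and data.csv are rejected along with useless_folder, because "s" and "d" are both in the class, while .ipynb_checkpoints is accepted.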


Or, you could use os.walk, as shown in unutbu's answer to a related question, combined with a generator as suggested in the other answers:

import os
from pathlib import Path
from typing import Generator, Union, List

PATH = Path.cwd() / "folder_to_scan"

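def traverse_files(
    path: Union[str, Path],
    exclude_folders: Union[List[str], str, None] = None,
) -> Generator[Path, None, None]:
    # A bare string would turn "d not in exclude_folders" into a substring
    # test, so wrap it in a list first.
    if isinstance(exclude_folders, str):
        exclude_folders = [exclude_folders]

    for root, directories, files in os.walk(str(path), topdown=True):
        if exclude_folders:
            # Prune in place so os.walk never descends into excluded folders.
            directories[:] = [d for d in directories if d not in exclude_folders]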

        for file in files:
            yield Path(root) / file

        # and if folders are of interest as well...
        # for directory in directories:
        #     yield Path(root) / directory

for file in traverse_files(PATH, "useless_folder"):
    print(file)
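
The slice assignment directories[:] = ... is what makes the pruning work: with topdown=True, os.walk consults that same list object to decide where to descend next. The sketch below compares in-place pruning with plain rebinding on a throwaway tree. The helper name visited_dirs and the temporary layout are made up for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def visited_dirs(top: Path, prune_in_place: bool) -> List[str]:
    """Return the sorted names of the directories os.walk actually enters."""
    visited = []
    for root, directories, _files in os.walk(top, topdown=True):
        visited.append(Path(root).name)
        kept = [d for d in directories if d != "useless_folder"]
        if prune_in_place:
            directories[:] = kept  # mutates the list os.walk holds
        else:
            directories = kept  # rebinds a local name; os.walk is unaffected
    return sorted(visited)


with tempfile.TemporaryDirectory() as tmp:
    top = Path(tmp) / "folder_to_scan"
    (top / "important_folder_a").mkdir(parents=True)
    (top / "useless_folder" / "nested").mkdir(parents=True)

    pruned = visited_dirs(top, prune_in_place=True)
    unpruned = visited_dirs(top, prune_in_place=False)
    print(pruned)    # useless_folder and nested are absent
    print(unpruned)  # every directory was entered
```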
lucas-escobar

This should work. It defines a generator function that you can iterate over, so the results are not all loaded into memory at once. Because the name check happens before recursing, useless_folder is never entered.

def traverse_directory(path):
    for item in path.iterdir():
        if item.is_dir() and item.name != 'useless_folder':
            yield from traverse_directory(item)
        elif item.is_file():
            yield item

You can then use this function like so (making a list of all .txt files for example):

from pathlib import Path

path_to_search = Path("folder_to_scan")
txt_file_list = [f for f in traverse_directory(path_to_search) if f.suffix == '.txt']
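
For very large trees, the same idea can be written with os.scandir instead of Path.iterdir. This is a swapped-in variant, not what the answer above uses: scandir entries usually carry the file type from the directory listing, which tends to save a stat call per entry. The function name scan_files and the excluded parameter are assumptions for this sketch.

```python
import os
from pathlib import Path
from typing import Iterator, Set, Union


def scan_files(path: Union[str, Path], excluded: Set[str]) -> Iterator[Path]:
    """Yield files under path, never entering directories named in excluded."""
    with os.scandir(path) as entries:
        for entry in entries:
            # follow_symlinks=False: symlinked directories are not descended into.
            if entry.is_dir(follow_symlinks=False):
                if entry.name not in excluded:
                    yield from scan_files(entry.path, excluded)
            elif entry.is_file():
                yield Path(entry.path)
```

A call such as `list(scan_files("folder_to_scan", {"useless_folder"}))` would collect the remaining files.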