using os.walk method to get directory paths containing 2 types of files


I want to list all folders containing docx files using the os.walk method in Python 2.7. I managed to do that with the code below, but I want to know whether it is possible to limit the list to folders containing exactly two specific file types (for example "docx" and "pdf").

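import os

# Write the full path of every .docx file found under the Desktop folder.
with open("output.txt", "w") as a:
    for path, subdirs, files in os.walk(r'C:\Users\Stephen\Desktop'):
        for filename in files:
            if filename.endswith('.docx'):
                f = os.path.join(path, filename)
                # In text mode '\n' is translated to the platform line ending;
                # os.linesep would produce '\r\r\n' on Windows.
                a.write(f + '\n')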

There are 2 answers

2
Martijn Pieters On BEST ANSWER

Just skip directories that don't contain at least those two extensions; per-directory file lists are usually short, so it's cheap to use any() to test for each extension:

for path, subdirs, files in os.walk(r'C:\Users\Stephen\Desktop'):
    if not (any(f.endswith('.pdf') for f in files) and
            any(f.endswith('.docx') for f in files)):
        # at least one of the two file types is missing here, skip
        continue
    # directory contains *both* PDF and Word documents
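
As a rough sketch, the same any() test could be wrapped in a function that collects the matching directories. The name dirs_with_both and its parameters are hypothetical and not part of the answer above; matching is case-sensitive, as with str.endswith in general.

```python
import os


def dirs_with_both(root, ext_a='.pdf', ext_b='.docx'):
    """Return directories under root holding at least one file of each type.

    Hypothetical helper for illustration; extension matching is case-sensitive.
    """
    matches = []
    for path, subdirs, files in os.walk(root):
        has_a = any(f.endswith(ext_a) for f in files)
        has_b = any(f.endswith(ext_b) for f in files)
        if has_a and has_b:
            matches.append(path)
    return matches
```

Calling dirs_with_both(r'C:\Users\Stephen\Desktop') would then give a list of folder paths that can be written to output.txt one per line.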

When the list of extensions to test for gets a bit longer, you may want to just create a set of all available extensions:

for path, subdirs, files in os.walk(r'C:\Users\Stephen\Desktop'):
    extensions = {os.path.splitext(f)[-1] for f in files}
    if not extensions >= {'.pdf', '.docx', '.odt', '.wpf'}:
        # directory doesn't contain *all* required file types 
        continue

>= tests whether the right-hand set is a subset of the left (that is, whether extensions is a superset of the right-hand set), so extensions must contain at least all four extensions named on the right:

>>> {'.foo', '.docx', '.pdf', '.odt'} >= {'.pdf', '.docx', '.odt', '.wpf'}  # missing .wpf
False
>>> {'.foo', '.wpf', '.docx', '.pdf', '.odt'} >= {'.pdf', '.docx', '.odt', '.wpf'} # complete
True
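
The set comparison may also help with the "exactly two types" part of the question: >= accepts directories that hold other file types as well, while == accepts only directories whose extensions are precisely the required ones. A small sketch follows; the helper names extensions_of, has_all and has_only are made up for illustration, and lowercasing the extensions (so that REPORT.PDF counts as .pdf) is an assumption that can be dropped for case-sensitive matching.

```python
import os

REQUIRED = {'.pdf', '.docx'}


def extensions_of(files):
    """Set of lowercased extensions found in a list of file names."""
    return {os.path.splitext(f)[1].lower() for f in files}


def has_all(files, required=REQUIRED):
    """True if every required extension occurs; other types are allowed."""
    return extensions_of(files) >= set(required)


def has_only(files, required=REQUIRED):
    """True if the extensions present are exactly the required ones."""
    return extensions_of(files) == set(required)
```

Inside the os.walk loop, `if not has_only(files): continue` would keep only the strict matches. Note that a file without any extension contributes an empty string to the set, so it also makes has_only fail.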
1
Organis On

This?

import os

with open("output.txt", "w") as a:
    for path, subdirs, files in os.walk(r'C:\Users\Stephen\Desktop'):
        docx = False
        pdf = False
        rest = True  # stays True only while every file is .docx or .pdf
        for filename in files:
            if filename.endswith('.docx'):
                docx = True
            elif filename.endswith('.pdf'):
                pdf = True
            else:
                rest = False
                break
        if docx and pdf and rest:
            # Record the folder itself, not the last file visited in it.
            a.write(path + '\n')