Python os.walk topdown true with regular expression


I am confused as to why the following only works with topdown=False and returns nothing when topdown is set to True.

The reason I want to use topdown=True is that traversing the directories takes a very long time. I believe that going top-down will reduce the time taken to produce the list.

for root, dirs, files in os.walk(mypath, topdown=False):  # Why doesn't this work with True?
    dirs[:] = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    for dir in dirs:
        print(dir)

There are 3 answers

Sandman On BEST ANSWER

Your code keeps only the matching names ([dmp]\d{8}) in dirs, so os.walk descends only into matching directories. It should be the other way around: descend into the non-matching directories, and add the matching ones to a list kept outside the loop.

I modified your code and this works:

import os
import re

all_dirs = []
for root, dirs, files in os.walk("root", topdown=True):
    subset = []
    for d in dirs:
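        if not re.match(r'[dmp]\d{8}$', d):
            # no match: keep it so os.walk steps inside
            subset.append(d)
        else:
            # match: record the full path and do not descend into it
            all_dirs.append(os.path.join(root, d))
    # in-place assignment tells os.walk which directories to visit next
    dirs[:] = subset

print(all_dirs)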

This returns:

['root/temp1/myfiles/d12345678',
'root/temp1/myfiles/m11111111',
'root/temp2/mydirs/moredirs/m22222222',
'root/temp2/mydirs/moredirs/p00000001']

NeoWang On

That is because none of the directories directly under your root match the regex, so after the first iteration dirs is empty and os.walk has nothing left to descend into.

If what you want is to find all subdirectories which match the pattern, you should either:

  1. use topdown=False, or
  2. do not prune the directories (leave dirs unmodified)
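
The second option can be sketched as follows. This is only a minimal illustration: the function name find_matching_dirs and the throwaway directory layout are made up for the example, and the pattern is the one from the question. The point is that dirs is read but never assigned to, so os.walk still descends into every subdirectory.

```python
import os
import re
import tempfile

# Pattern taken from the question: D, M or P followed by exactly 8 digits.
PATTERN = re.compile(r'[DMP]\d{8}$')

def find_matching_dirs(top):
    """Return the paths of all directories under top whose name matches PATTERN."""
    matches = []
    for root, dirs, files in os.walk(top, topdown=True):
        # Only read dirs here; it is never modified, so nothing is pruned.
        for d in dirs:
            if PATTERN.match(d):
                matches.append(os.path.join(root, d))
    return matches

# Hypothetical throwaway tree, created only for this demonstration.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "Temp", "MyFiles", "D12345678"))
os.makedirs(os.path.join(base, "Temp", "Other"))

found = find_matching_dirs(base)
print(found)
```

This visits every directory, so it is no faster than topdown=False; it only avoids the empty result.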
Raniz On

The problem is that you're modifying the contents of dirs while traversing. With topdown=True this affects which directories are traversed next.

Look at this code that shows you what is happening:

import os, re

for root, dirs, files in os.walk("./", topdown=False):
    print("Walked down {}, dirs={}".format(root, dirs))
    dirs[:] = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    print("After filtering dirs is now: " + str(dirs))
    for dir in dirs:
        print(dir)

I've just got one directory to traverse - Temp/MyFiles/D12345678 (I'm on Linux). With topdown=False the above produces this output:

Walked down ./Temp/MyFiles/D12345678, dirs=[]
After filtering dirs is now: []
Walked down ./Temp/MyFiles, dirs=['D12345678']
After filtering dirs is now: ['D12345678']
D12345678
Walked down ./Temp, dirs=['MyFiles']
After filtering dirs is now: []
Walked down ./, dirs=['Temp']
After filtering dirs is now: []

But with topdown=True we get this:

Walked down ./, dirs=['Temp']
After filtering dirs is now: []

Since you're removing every subdirectory from dirs, you're telling os.walk that you don't want to traverse any further, so iteration stops. With topdown=False the subdirectories have already been visited by the time dirs is yielded, so modifying dirs has no effect on the traversal and the code works.

To fix it, replace dirs[:] = with dirs =. This rebinds the local name to a new list and leaves the list that os.walk uses untouched:

import os, re

for root, dirs, files in os.walk("./", topdown=True):
    dirs = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    for dir in dirs:
        print(dir)

This gives us:

D12345678
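
To make the difference between dirs[:] = and dirs = visible, here is a small sketch that records which directories os.walk visits in each case. The helper name visited_roots and the temporary tree are assumptions made for the example; the tree mirrors the Temp/MyFiles/D12345678 layout used above.

```python
import os
import tempfile

# Hypothetical throwaway tree: <base>/Temp/MyFiles/D12345678
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "Temp", "MyFiles", "D12345678"))

def visited_roots(mutate_in_place):
    """Walk base top-down and return the visited roots, relative to base."""
    seen = []
    for root, dirs, files in os.walk(base, topdown=True):
        seen.append(os.path.relpath(root, base))
        if mutate_in_place:
            # Slice assignment empties the very list os.walk holds: traversal is pruned.
            dirs[:] = []
        else:
            # Plain assignment only rebinds the local name: os.walk is unaffected.
            dirs = []
    return seen

pruned = visited_roots(True)
full = visited_roots(False)
print(pruned)
print(full)
```

With slice assignment only the top directory is visited; with plain assignment all four directories are.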

Update:

If you're absolutely certain that a directory will not contain any subdirectories of interest, you can remove its subdirectories from dirs before traversing any further. If, for example, you know that "./Temp/MyDirs2" will never contain any subdirectories of interest, you can empty dirs when you get there to speed things up:

import os, re

uninteresting_roots = { "./Temp/MyDirs2" }

for root, dirs, files in os.walk("./", topdown=True):
    if root in uninteresting_roots:
        # Empty dirs and end this iteration
        del dirs[:]
        continue
    dirs = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    for dir in dirs:
        print(dir)

Other than that, there is no way to know in advance which directories you can skip, because to find out whether they contain interesting subdirectories you have to traverse them.