Subsetting a list of files within a folder to apply a Python function

I have a function that requires a path to a folder containing only images and nothing else. However, I have 3000+ folders that contain images along with other files (random .txt files, RTSTRUCT files, etc.).

For each folder, I was using shutil.copy() to copy the images into a tmp folder and then pointing my function at that, but the runtimes were long. I tried shutil.copyfile() instead and found it slow as well. I then tried parallelizing with the multiprocessing library, but every subprocess used the same temporary folder, so images from different folders got mixed together!

Any ideas?

for file in glob.glob(dicom_dir + "/*IMG*"):
    shutil.copy(file,tmp_dicom_folder)

gives prohibitive runtimes.

Is there a way to parallelize this without the processes colliding in a shared temporary folder? Making a temp folder for each process sounds... messy.

There is 1 answer

Answer by tdelaney

Use symlinks instead of copies. Create a temporary directory for each process you want to run in parallel. If this is CPU-intensive image processing, a little less than one process per CPU is a reasonable starting point, though the right number can change for all sorts of reasons, such as GPU usage.
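As a rough illustration of "a little less than one process per CPU", here is a minimal sketch. The helper name pick_worker_count and the 0.8 fraction are assumptions made for this example, not a recommendation from any library. On Linux, os.sched_getaffinity(0) reports the CPUs the current process is allowed to use, which can be lower than the machine total (for example inside a container); elsewhere the sketch falls back to os.cpu_count().

```python
import os


def pick_worker_count(fraction=0.8):
    """Return a worker count a little below the number of usable CPUs.

    `fraction` is an illustrative default, not a tuned value.
    """
    try:
        # Linux only: CPUs this process is actually allowed to run on.
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total logical CPU count.
        usable = os.cpu_count() or 1
    # Always run at least one worker, even on a single-CPU machine.
    return max(1, int(usable * fraction))
```

If the image function mostly waits on a GPU or on disk, a different count may work better, so treat the result as a starting point to measure from.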

List the files you want processed and scatter symlinks to them into the directories. Run a copy of your program on each of these symlinked directories and each will get a subset of the workload.

import itertools
import multiprocessing as mp
import tempfile
from pathlib import Path
import subprocess as subp

dicom_dir = Path("tmp/test")

# assuming each image process takes 1 cpu and you don't want
# to commit all of them... Note: There may be better ways to
# get an accurate count.

cpus = int(mp.cpu_count() * .80) or 1

# build tmp directory for each cpu, populate with symlinks to
# image files and run one process per directory.

with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)
    cpudirs = [root/f"sym_{x}" for x in range(cpus)]
    for cpudir in cpudirs:
        cpudir.mkdir()
    # count the linked files from 1 so that, after the loop, count is the
    # number of files (0 if the glob matched nothing)
    count = 0
    for count, (target, cpudir) in enumerate(
            zip(dicom_dir.glob("*IMG*"), itertools.cycle(cpudirs)),
            start=1):
        (cpudir/target.name).symlink_to(target.absolute())
    if count < len(cpudirs):
        # there were fewer IMG than cpus, remove what we didn't use
        del cpudirs[count:]
    # TODO: This is a naive way to run the commands and would be slow
    #       if there is a lot of stdout/err to consume.
    processes = [subp.Popen(["the command", cpudir.absolute()])
        for cpudir in cpudirs]
    for proc in processes:
        proc.communicate()
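If the work is a Python function rather than an external command, and the goal is to process many folders rather than split one, a variation on the same idea is to give every task its own private temporary directory of symlinks. The sketch below is only an illustration under those assumptions: process_images is a hypothetical stand-in for the real function (here it just counts directory entries), and run_on_folder and run_all are names made up for this example. Because each call creates its own directory, parallel workers cannot mix files, and tempfile removes each directory (the links, not the originals) when the task finishes.

```python
import tempfile
from multiprocessing import Pool
from pathlib import Path


def process_images(image_dir):
    """Hypothetical stand-in for the real function; it only counts entries."""
    return len(list(Path(image_dir).iterdir()))


def run_on_folder(dicom_dir):
    """Link one folder's IMG files into a private temp dir and process it."""
    dicom_dir = Path(dicom_dir)
    # Each call gets its own directory, so parallel workers never share one.
    with tempfile.TemporaryDirectory() as tmpdir:
        for target in dicom_dir.glob("*IMG*"):
            (Path(tmpdir) / target.name).symlink_to(target.absolute())
        return process_images(tmpdir)


def run_all(folders, workers=2):
    """Process many folders in parallel, one folder per task."""
    with Pool(workers) as pool:
        return pool.map(run_on_folder, folders)
```

Call run_all from under an if __name__ == "__main__": guard so that worker processes can import the module safely. Creating symlinks may need extra privileges on Windows.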