C++ Boost: Get files from disk in parallel


I'm looking for a fast way to get a list of files with certain attributes, in parallel, from disk.

Attributes: file size, absolute file path

Currently I'm using Boost.Filesystem and a recursive call with directory iterators. It's fine for small data sets, but for a million files in, say, 50,000 folders it's not great.
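For context, this is roughly the kind of code I mean (a simplified sketch, not my exact code; FileEntry is just an illustrative struct):

    #include <boost/filesystem.hpp>
    #include <cstdint>
    #include <string>
    #include <vector>

    namespace fs = boost::filesystem;

    struct FileEntry {
        std::string path;     // absolute file path
        std::uintmax_t size;  // file size in bytes
    };

    // Single-threaded baseline: recurse over the whole tree with one iterator.
    std::vector<FileEntry> scan(const fs::path& root) {
        std::vector<FileEntry> out;
        for (fs::recursive_directory_iterator it(root), end; it != end; ++it) {
            if (fs::is_regular_file(it->status())) {
                out.push_back({fs::absolute(it->path()).string(),
                               fs::file_size(it->path())});
            }
        }
        return out;
    }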

Usage environment:

  * OS: FreeBSD, Linux, Windows
  * Filesystems: ZFS, ext4, NTFS

Basic idea (a sketch of this design follows the list):

  1. Thread Pool
  2. SubTreeWalker Object
  3. Partition root folder among threads
  4. For each new directory it finds in its subtree, a SubTreeWalker asks the thread pool whether there are idle threads
  5. If 4 == true, assign that directory to a SubTreeWalker object in an idle thread.
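A rough sketch of how steps 1–5 could fit together, assuming Boost.Filesystem and a shared work queue guarded by a mutex (ParallelWalker, FileEntry, and the condition-variable scheme are illustrative assumptions, not existing code):

    #include <boost/filesystem.hpp>
    #include <condition_variable>
    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <vector>

    namespace fs = boost::filesystem;

    struct FileEntry {
        std::string path;     // absolute file path
        std::uintmax_t size;  // file size in bytes
    };

    class ParallelWalker {
    public:
        explicit ParallelWalker(unsigned nthreads) : pending_(0), nthreads_(nthreads) {}

        std::vector<FileEntry> run(const fs::path& root) {
            push(root);
            std::vector<std::vector<FileEntry>> partial(nthreads_);
            std::vector<std::thread> workers;
            for (unsigned i = 0; i < nthreads_; ++i)
                workers.emplace_back(&ParallelWalker::worker, this, std::ref(partial[i]));
            for (auto& t : workers) t.join();

            std::vector<FileEntry> all;                 // merge per-thread results
            for (auto& p : partial) all.insert(all.end(), p.begin(), p.end());
            return all;
        }

    private:
        void push(fs::path dir) {                       // hand a directory to the pool
            {
                std::lock_guard<std::mutex> lock(mtx_);
                queue_.push_back(std::move(dir));
                ++pending_;
            }
            cv_.notify_one();
        }

        void worker(std::vector<FileEntry>& out) {
            for (;;) {
                fs::path dir;
                {
                    std::unique_lock<std::mutex> lock(mtx_);
                    cv_.wait(lock, [this] { return !queue_.empty() || pending_ == 0; });
                    if (queue_.empty()) return;         // whole tree finished
                    dir = std::move(queue_.front());
                    queue_.pop_front();
                }
                scanOne(dir, out);
                {
                    std::lock_guard<std::mutex> lock(mtx_);
                    --pending_;                         // this directory is done
                }
                cv_.notify_all();                       // let idle workers re-check
            }
        }

        void scanOne(const fs::path& dir, std::vector<FileEntry>& out) {
            boost::system::error_code ec;
            for (fs::directory_iterator it(dir, ec), end; !ec && it != end; it.increment(ec)) {
                if (fs::is_directory(it->status())) {
                    push(it->path());                   // subdirectory: offer it to an idle thread
                } else if (fs::is_regular_file(it->status())) {
                    boost::system::error_code size_ec;  // (symlinks/permissions not handled here)
                    out.push_back({fs::absolute(it->path()).string(),
                                   fs::file_size(it->path(), size_ec)});
                }
            }
        }

        std::mutex mtx_;
        std::condition_variable cv_;
        std::deque<fs::path> queue_;
        std::size_t pending_;   // directories queued or currently being scanned
        unsigned nthreads_;
    };

Each worker pops a directory, scans it with a plain directory_iterator, pushes any subdirectories back onto the shared queue so an idle thread can pick them up, and keeps regular files in a per-thread vector that is merged at the end; pending_ counts directories that are queued or still being scanned, so the workers know when the whole tree is done.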

What do you think of the basic idea; is it sound? Are there any implications of parallel access to the filesystem's B+ tree?


There are 2 answers

Answer by JvO:

This may perform worse than a linear search, since you will create more random-access reads on the disk by hammering it with several threads. If you want to speed up processing of the files (I assume you want to do something with them, not just look at them), I suggest creating one thread that scans the directories and queues the found files in a list, plus one or more secondary threads that pop files off the queue one by one and process them. That way processing doesn't have to wait until the scan has finished.
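A minimal sketch of that scanner/worker split, assuming Boost.Filesystem and standard threads (process() is a hypothetical placeholder for whatever you actually do with each file):

    #include <boost/filesystem.hpp>
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    namespace fs = boost::filesystem;

    std::queue<fs::path> work;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void process(const fs::path& p) {
        (void)p;  // hypothetical per-file work (hashing, parsing, ...)
    }

    // Producer: a single thread walks the tree linearly and queues files.
    void scanner(const fs::path& root) {
        for (fs::recursive_directory_iterator it(root), end; it != end; ++it) {
            if (fs::is_regular_file(it->status())) {
                std::lock_guard<std::mutex> lock(m);
                work.push(it->path());
                cv.notify_one();
            }
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_all();
    }

    // Consumers: pop files and process them while the scan is still running.
    void worker() {
        for (;;) {
            fs::path p;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [] { return !work.empty() || done; });
                if (work.empty()) return;   // scan finished and queue drained
                p = work.front();
                work.pop();
            }
            process(p);
        }
    }

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::thread producer(scanner, fs::path(argv[1]));
        std::vector<std::thread> consumers;
        for (int i = 0; i < 4; ++i) consumers.emplace_back(worker);
        producer.join();
        for (auto& t : consumers) t.join();
    }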

Re-arranging the file structure may also speed up the process, if possible. One million files spread over 50,000 directories sounds extremely inefficient.

Answer by MSalters:

There are two common mistakes in directory scanning. You propose the second (threading), probably because you're suffering from the first: doing a depth-first search. Directories are fairly contiguous on disk, and a breadth-first search takes advantage of this. A depth-first search leads to much slower scans because parent directories are revisited repeatedly as the walk descends into and returns from their subdirectories.
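For illustration, a breadth-first scan just keeps an explicit queue of directories instead of recursing, so each directory is read in one go before any of its children (a sketch assuming Boost.Filesystem; bfs_scan is only an illustrative name):

    #include <boost/filesystem.hpp>
    #include <deque>
    #include <vector>

    namespace fs = boost::filesystem;

    // Breadth-first scan: read a whole directory before descending, so that
    // sibling entries (which tend to sit close together on disk) are read
    // together instead of being interleaved with deep descents.
    std::vector<fs::path> bfs_scan(const fs::path& root) {
        std::vector<fs::path> files;
        std::deque<fs::path> dirs;
        dirs.push_back(root);
        while (!dirs.empty()) {
            fs::path dir = dirs.front();
            dirs.pop_front();
            boost::system::error_code ec;
            for (fs::directory_iterator it(dir, ec), end; !ec && it != end; it.increment(ec)) {
                if (fs::is_directory(it->status()))
                    dirs.push_back(it->path());   // descend later, after this directory
                else if (fs::is_regular_file(it->status()))
                    files.push_back(it->path());
            }
        }
        return files;
    }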