bash: limiting subshells in a for loop with file list


I've been trying to get a for loop to run a bunch of commands more or less simultaneously, and was attempting to do it via subshells. I've managed to cobble together the script below to test, and it seems to work OK.

#!/bin/bash
for i in {1..255}; do
  (
    #commands
  )&

done
wait

The only problem is that my actual loop is going to be for i in files*, and then it just crashes, I assume because it's started too many subshells to handle. So I added

#!/bin/bash
for i in files*; do
  (
    #commands
  )&
if (( $i % 10 == 0 )); then wait; fi
done
wait

which now fails. Does anyone know a way around this? Either by using a different command to limit the number of subshells, or by providing a number for $i?

Cheers


There are 4 answers

whoan (BEST ANSWER)

You may find it useful to count the number of running jobs with jobs, e.g.:

wc -w <<<$(jobs -p)

So, your code would look like this:

#!/bin/bash
for i in files*; do
  (
    #commands
  )&
  if (( $(wc -w <<<$(jobs -p)) % 10 == 0 )); then wait; fi
done
wait

As @chepner suggested:

In bash 4.3, you can use wait -n to proceed as soon as any job completes, rather than waiting for all of them
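
Combining that suggestion with the jobs-counting approach above, a minimal sketch of the loop (assuming Bash 4.3+; max_jobs is just an illustrative name) might be:

#!/bin/bash
max_jobs=10
for i in files*; do
  (
    #commands
  )&
  # once the limit is reached, wait for any one job to finish before starting another
  if (( $(wc -w <<<$(jobs -p)) >= max_jobs )); then wait -n; fi
done
wait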

chepner

Define the counter explicitly:

#!/bin/bash
for f in files*; do
  (
    #commands
  )&
  (( i++ % 10 == 0 )) && wait
done
wait

There's no need to initialize i, as it will default to 0 the first time you use it in an arithmetic context. There's also no need to reset the value, as i % 10 will be 0 for i=10, 20, 30, etc.
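
A quick way to see that default-to-zero behavior (just an illustration):

unset i; (( i++ )); echo "$i"    # prints 1: the unset i is treated as 0, then incremented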

kojiro

xargs/parallel

Another solution would be to use tools designed for concurrency:

printf '%s\0' files* | xargs -0 -P6 -n1 yourScript

-P6 sets the maximum number of processes that xargs will run concurrently. Make it 10 if you like.

I suggest xargs because it is likely already on your system. If you want a really robust solution, look at GNU Parallel.
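
If GNU Parallel is available, a roughly equivalent invocation (a sketch, reusing the hypothetical yourScript from above) would be:

parallel -j10 yourScript ::: files*

Here -j10 caps the number of simultaneous jobs, and Parallel also keeps each job's output grouped instead of interleaved.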

Filenames in array

For another answer more explicit to your question: get the counter from the array index.

files=( files* )
for i in "${!files[@]}"; do
    commands "${files[i]}" &
    (( i % 10 )) || wait
done

(The parentheses around the compound command aren't important, because backgrounding the job has the same effect as using a subshell anyway.)
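
You can check that claim by comparing $BASHPID inside and outside a backgrounded group (purely illustrative):

echo "parent shell: $BASHPID"
{ echo "background:   $BASHPID"; } &   # the backgrounded group runs in its own child process
wait

The two PIDs differ, because the backgrounded group has already been forked off as a subshell.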

Function

Just different semantics:

simultaneous() {
    # Run the arguments in batches of at most 10 concurrent jobs
    while [[ $1 ]]; do
        for i in {1..10}; do
            [[ ${@:i:1} ]] || break
            commands "${@:i:1}" &
        done
        shift 10 || shift "$#"
        wait
    done
}
simultaneous files*
gniourf_gniourf

If you have Bash ≥ 4.3, you can use wait -n:

#!/bin/bash

max_nb_jobs=10

for i in file*; do
    # Wait until there are less than max_nb_jobs jobs running
    while mapfile -t < <(jobs -pr) && ((${#MAPFILE[@]}>=max_nb_jobs)); do
        wait -n
    done
    {
        # Your commands here: no useless subshells! use grouping instead
    } &
done
wait

If you don't have wait -n available, you can use something like this:

#!/bin/bash

set -m

max_nb_jobs=10

sleep_jobs() {
   # This function sleeps until there are less than $1 jobs running
   local n=$1
   while mapfile -t < <(jobs -pr) && ((${#MAPFILE[@]}>=n)); do
      # Start a coprocess that blocks on read until something is written to its stdin
      coproc read
      # When a child terminates (SIGCHLD), wake the coprocess up, then clear the trap
      trap "echo >&${COPROC[1]}; trap '' SIGCHLD" SIGCHLD
      # Block until the coprocess exits, i.e. until some job has finished
      [[ $COPROC_PID ]] && wait $COPROC_PID
   done
}

for i in files*; do
    # Wait until there are less than 10 jobs running
    sleep_jobs "$max_nb_jobs"
    {
        # Your commands here: no useless subshells! use grouping instead
    } &
done
wait

The advantage of proceeding like this is that we make no assumptions about how long the jobs take to finish. A new job is launched as soon as there's room for it. Moreover, it's all pure Bash, so it doesn't rely on external tools, and (maybe more importantly) you may use your Bash environment (variables, functions, etc.) without exporting it (arrays can't easily be exported, so that can be a huge pro).
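
As a small sketch of what that buys you (the meta array and the process function are made up for illustration): a backgrounded brace group inherits the whole shell state, so arrays and functions are visible with no export at all:

#!/bin/bash
declare -A meta=( [files1]="first batch" [files2]="second batch" )   # associative arrays can't be exported

process() {
    printf 'handling %s (%s)\n' "$1" "${meta[$1]:-no label}"
}

for i in files*; do
    { process "$i"; } &   # the background job sees meta and process() without any export
done
wait

With xargs or parallel calling an external script, that state would have to be exported or rebuilt in the child instead.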