Can GNU parallel be used with the cat command and large arrays?


I am processing a long list of data from a file using bash. The file has over 300,000 lines, so GNU parallel could cut the processing time significantly.

In addition to the main data file, I am using a second, smaller file that contains data that will be used by each iteration of my code. This file contains approximately 60,000 lines, with each line containing two columns. My current strategy is to read each line of the smaller file and copy the data from the columns into two separate arrays. These two arrays are then used in each iteration of the code.

However, I cannot get GNU parallel to read my arrays as actual arrays, despite following the code illustrated in "bash how to pass array as an argument to a function" and trying numerous permutations of it.

A simplified version of my code is below. So far, it only returns a bunch of blank lines. I would very much appreciate it if someone could explain exactly how to pass arrays into parallel.

SCAFF_LENGTH_FILE="${HOME}/ReferenceSequences/P.miniata.Scaffold.lengths.txt"
INPUT_VCF="${HOME}/data/HaplotypeCalling/variants_allOvarySamples.filtered.vcf"
declare -a array_scaffName
declare -a array_scaffLength
z=0
while read -a data LINE; do
    array_scaffName[$z]=${data[0]}
    array_scaffLength[$z]=${data[1]}
    z=$(( $z + 1 ))
done < ${SCAFF_LENGTH_FILE}

WORKING_DIR="${HOME}/*filepath*/codeTest"
TEMP_FILE_DIR="${WORKING_DIR}/TEMP_FILES"
cd $WORKING_DIR
function exon_parse {
    FILE_NUMBER=$1
    TEMP_FILE_DIR=$2
    INPUT_VCF=$3

    scaffName=$4[@]
    scaffName_array=("${!scaffName}")

    scaffLength=$5[@]
    scaffLength_array=("${!scaffLength}")

    echo ${scaffName_array[4]}
    echo ${scaffLength_array[4]}

    }
export -f exon_parse

seq 5 | parallel exon_parse {} $TEMP_FILE_DIR ${INPUT_VCF} array_scaffName array_scaffLength

NB: I use seq 5 because my main data file has been split into smaller sub-files to aid processing. Ultimately, I would like to develop a nested GNU parallel script that selects each sub-file in parallel, and then uses code like:

cat fileName | parallel 'processes' {} other_inputs

to process the lines of data within each sub-file in parallel.

There is 1 answer

Ole Tange

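Bash cannot export arrays, so the shells that parallel starts see the exported function but not array_scaffName or array_scaffLength. The most obvious solution is to build the arrays inside the function: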

INPUT_VCF="${HOME}/data/HaplotypeCalling/variants_allOvarySamples.filtered.vcf"

WORKING_DIR="${HOME}/*filepath*/codeTest"
TEMP_FILE_DIR="${WORKING_DIR}/TEMP_FILES"
cd $WORKING_DIR
function exon_parse {
    FILE_NUMBER=$1
    TEMP_FILE_DIR=$2
    INPUT_VCF=$3
    SCAFF_LENGTH_FILE="${HOME}/ReferenceSequences/P.miniata.Scaffold.lengths.txt"
    # Build the arrays inside the function, so every job started by parallel has them.
    declare -a array_scaffName
    declare -a array_scaffLength
    z=0
    while read -r -a data; do
        array_scaffName[$z]=${data[0]}
        array_scaffLength[$z]=${data[1]}
        z=$(( z + 1 ))
    done < "${SCAFF_LENGTH_FILE}"

    # The arrays are in scope here, so no indirect expansion is needed.
    echo "${array_scaffName[4]}"
    echo "${array_scaffLength[4]}"

    }
export -f exon_parse

# The array names are no longer passed as arguments.
seq 5 | parallel exon_parse {} "$TEMP_FILE_DIR" "${INPUT_VCF}"
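
The reason the arrays have to be built inside the function can be checked without parallel at all. The following is a minimal sketch, not part of the original code: demo_array and show_second are hypothetical names, and bash -c stands in for the child shell that a parallel job runs in.

```bash
#!/usr/bin/env bash
# Sketch: an exported function reaches a child shell, an array does not.
declare -a demo_array=(alpha beta gamma)   # hypothetical example data
export demo_array                          # accepted, but bash does not pass arrays to child processes

show_second() {
    # Print the second element, or "<empty>" when the array is not defined in this shell.
    echo "${demo_array[1]:-<empty>}"
}
export -f show_second

show_second              # parent shell: the array is visible
bash -c 'show_second'    # child shell: the function exists, the array does not
```

In the parent shell the function prints beta; in the child shell it prints <empty>, which matches the blank lines seen in the question.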

But you can import the array, too:

import_array () {
  local func=$1; shift;
  export $func='() {
    '"$(for arr in $@; do
          declare -p $arr|sed '1s/declare -./&g/'
        done)"'
  }'
}

declare -a indexed='([0]="one" [1]="two")'

import_array my_importer indexed

parallel --env my_importer \
  'my_importer; echo "{}" "${indexed[{}]}"' ::: "${!indexed[@]}"
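
A variation on the same idea, shown here only as a sketch: the output of declare -p is stored in an ordinary exported string and evaluated inside an exported function. ARRAY_DEFS and lookup are hypothetical names introduced for this example, and bash -c again stands in for the shell of a parallel job.

```bash
#!/usr/bin/env bash
# Sketch: arrays cannot be exported, but the text printed by declare -p can.
declare -a indexed=([0]="one" [1]="two")

# Serialise the array definition into a plain string and export that string.
# ARRAY_DEFS is a hypothetical variable name chosen for this example.
ARRAY_DEFS="$(declare -p indexed)"
export ARRAY_DEFS

lookup() {
    # Rebuild the array inside the child shell (local to this function),
    # then print the index followed by the element stored there.
    eval "$ARRAY_DEFS"
    echo "$1 ${indexed[$1]}"
}
export -f lookup

# Each bash -c call is a fresh child process, as a parallel job would be.
for i in "${!indexed[@]}"; do
    bash -c 'lookup "$1"' _ "$i"
done
```

The loop prints "0 one" and "1 two". With GNU parallel, the loop would presumably become parallel lookup ::: "${!indexed[@]}", since both the function and the string are exported. Note that eval runs whatever ARRAY_DEFS contains, so this is only appropriate when the array comes from your own script.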