How to run piped command with Python subprocess without deadlocking?


I have a Python script that runs:

for cmd in chr_commands:
    info("Run: " + cmd)
    subprocess.run(cmd, shell=True)  # each cmd is a full shell pipeline string

Where each cmd is equivalent to

samtools view -F 0x0004 input.bam chr3:0-500000 2>> log.txt |awk 'BEGIN {OFS="\t"} {bpstart=$4;  bpend=bpstart; split ($6,a,"[MIDNSHP]"); n=0;\
  for (i=1; i in a; i++){\
    n+=1+length(a[i]);\
      if (substr($6,n,1)=="S"){\
        if (bpend==$4)\
          bpstart-=a[i];\
        else \
          bpend+=a[i]; \
        }\
        else if( (substr($6,n,1)!="I")  && (substr($6,n,1)!="H") )\
          bpend+=a[i];\
        }\
        if (($2 % 32)>=16)\
          print $3,bpstart,bpend,"-",$1,$10,$11;\
        else\
          print $3,bpstart,bpend,"+",$1,$10,$11;}' |   sort -k1,1 -k2,2n  | awk \
            'BEGIN{chr_id="NA";bpstart=-1;bpend=-1; fastq_filename="NA";num_records=0;fastq_records="";fastq_record_sep="";record_log_str = ""}\
            { if ( (chr_id!=$1) || (bpstart!=$2) || (bpend!=$3) )\
              {\
              if (fastq_filename!="NA") {if (num_records < 1000){\
                record_log_str = record_log_str chr_id"\t"bpstart"\t"bpend"\t"num_records"\tNA\n"} \
              else{print(fastq_records)>fastq_filename;close(fastq_filename); system("gzip -f "fastq_filename); record_log_str = record_log_str chr_id"\t"bpstart"\t"bpend"\t"num_records"\t"fastq_filename".gz\n"} \
              }\
            chr_id=$1; bpstart=$2; bpend=$3;\
            fastq_filename=sprintf("REGION_%s_%s_%s.fastq",$1,$2,$3);\
              num_records = 0;\
              fastq_records="";\
              fastq_record_sep="";\
            }\
            fastq_records=fastq_records fastq_record_sep "@"$5"\n"$6"\n+\n"$7; \
              fastq_record_sep="\n"; \
              num_records++; \
            } \
            END{ \
            if (fastq_filename!="NA") {if (num_records < 1000){\
              record_log_str = record_log_str chr_id"\t"bpstart"\t"bpend"\t"num_records"\tNA\n"} \
            else{printf("%s",fastq_records)>fastq_filename;close(fastq_filename); system("gzip -f "fastq_filename); record_log_str = record_log_str chr_id"\t"bpstart"\t"bpend"\t"num_records"\t"fastq_filename".gz\n"} \
            }\
          print record_log_str > "chr3.info" \
  }'
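Note that the pipeline handles all of its own output: samtools appends stderr to log.txt, and the final awk writes the fastq files and chr3.info itself, so the Python side has nothing to read. One thing I can do to at least make the hang visible is add a timeout to each run (a sketch; 60 s is an arbitrary guess):

import subprocess

for cmd in chr_commands:
    info("Run: " + cmd)
    try:
        # Nothing is captured in Python, so no pipe should fill; the timeout
        # just turns a silent hang into an exception after 60 seconds.
        subprocess.run(cmd, shell=True, timeout=60)
    except subprocess.TimeoutExpired:
        # Caveat: with shell=True the timeout kills only the shell process
        # itself; pipeline children it spawned may keep running.
        info("Timed out: " + cmd)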

There are between 22 and 1000 of these commands to run, and each should take only a few seconds. Initially the loop runs as expected, but after 2-5 iterations one of the commands hangs: htop shows 100% CPU, but it never finishes.

If instead I write all of chr_commands to example.sh and run it from a terminal with bash example.sh, everything completes in under a minute. I suspect something deadlocks, but I don't know how to prevent it.
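As a stopgap, I can generate that script from Python and hand everything to a single bash process, which mirrors the run that works (a minimal sketch):

import subprocess

# Write every pipeline into one script and let one bash process run them
# all; this is the variant that finishes in under a minute from a terminal.
with open("example.sh", "w") as fh:
    fh.write("\n".join(chr_commands) + "\n")
subprocess.run(["bash", "example.sh"], check=True)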

Originally the code used map_async, which ran into the same issue: some jobs finished quickly while others never stopped. I have also tried shell=True, subprocess.call, and subprocess.Popen with wait().
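For the Popen attempt, I split a pipeline into explicit stages along the lines of the pipeline pattern in the subprocess docs. A simplified sketch of what I mean, reduced to the first two stages (the awk body here is a stand-in, not the real program above):

import subprocess

with open("log.txt", "ab") as log:
    samtools = subprocess.Popen(
        ["samtools", "view", "-F", "0x0004", "input.bam", "chr3:0-500000"],
        stdout=subprocess.PIPE,
        stderr=log,  # matches the 2>> log.txt in the shell version
    )
    awk = subprocess.Popen(
        ["awk", "{print $3}"],  # stand-in for the real awk program above
        stdin=samtools.stdout,
    )
    # Drop our reference to the pipe so samtools gets SIGPIPE if awk exits
    # early, instead of blocking forever on a full pipe buffer.
    samtools.stdout.close()
    awk.wait()
    samtools.wait()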

Not sure whether it makes a difference, but the Python code runs inside a Singularity image that I launch with Snakemake.
