Bio::DB::Sam - Get number of mappings for all reads in bam file

1k views Asked by At

I want to calculate transcript expressions and therefore I need to get the number of mappings for all reads in the bam file. My current procedure is to go overall transcripts and get the reads that mapped on it using Bio::DB::Sam. The results are stored in a hash with read_name as key (10 letters) and number_of_mappings as value (integer).

Here is the code that I am using:

use Bio::DB:Sam;
use strict;

my %global_read_occurrences;


sub getGlobalReadOccurrences {

 my ($ids, $bam_file) = @_;

 $sam = Bio::DB::Sam -> new (-bam => $bam_file);

 foreach my $id (@{$ids}){
   my $alignments = $sam -> get_features_by_location(-seq_id => $transcript_id, -iterator => 1);


  while (my $alignment = $alignments -> next_seq){

   my $read_name = $alignment -> query -> name;

   if (exists($global_read_occurrences{$read_name})){
    $global_read_occurrences{$read_name}++;
   }
   else {
    $global_read_occurrences{$read_name} = 1;
   }
  }
 }
}

My questions: Is there any other possibility where I can get the number of global mappings per read directly and where I don't have to go over all transcripts? I couldn't find any subs in Bio::DB::Sam like $sam -> getNumberOfMappings($read_name);

I am using bam files with more than 50 million mapped reads, so the hash is going to need huge memory resources (sometimes about 40 GB) Is this actually possible or does this come from somewhere else? And is there any other possibility to store the data with less mem?

Thanks a lot!

1

There are 1 answers

0
Lincoln Stein On

BAM files are usually sorted by chromosomal location and not by the name of the read, so the read mappings can be located anywhere in the file. The easiest thing for you to do is to go to the SAM file and run a simple shell command:

 cut -f1,1 myfile.sam | sort | uniq -c

This will produce a file like this:

  2 HWI-EAS299_4_30M2BAAXX:2:99:965:826
  2 HWI-EAS299_4_30M2BAAXX:2:99:966:1932
  2 HWI-EAS299_4_30M2BAAXX:2:99:971:146
  2 HWI-EAS299_4_30M2BAAXX:2:9:997:1263
  2 HWI-EAS299_4_30M2BAAXX:2:99:972:281
  2 HWI-EAS299_4_30M2BAAXX:2:99:973:1904
  1 HWI-EAS299_4_30M2BAAXX:2:99:976:186
  2 HWI-EAS299_4_30M2BAAXX:2:99:986:687
  6 HWI-EAS299_4_30M2BAAXX:2:99:987:165
  2 HWI-EAS299_4_30M2BAAXX:2:99:99:1582
  2 HWI-EAS299_4_30M2BAAXX:2:99:99:160
  2 HWI-EAS299_4_30M2BAAXX:2:99:998:1139

The first column is the count of mappings. The second is the read name.