Use Htslib for VCF files extracting alternative allele information

603 views Asked by At

I am working with c++ to handle VCF files, to do this I use a vcf library from htslib (https://github.com/samtools/htslib/blob/develop/htslib/vcf.h). I know there might be some better libraries, but I'm also working with other file formats for which htslib also have libraries, so I would like to stick to htslib.

I have found some examples of code to open read in the file and create the correct structure. header and using some of the information from the VCF file here: https://gist.github.com/gatoravi/cad922bdf2b625a91126 and http://wresch.github.io/2014/11/18/process-vcf-file-with-htslib.html

But if we stick to the first example i have "decoded" the code to the following with my comments to the code:

int main(int argc,char **argv){
  std::cerr << "Usage:subset.vcf " << std::endl;
  
  // htslib internally represents VCF as bcf1_t data structures

  htsFile *test_vcf = NULL;

  // creates header
  bcf_hdr_t *test_header = NULL;
  
  // initialize and allocate bcf1_t object
  bcf1_t *test_record = bcf_init();

  test_vcf = vcf_open("subset.vcf", "r");

  // returning a bcf_hdr_t struct 
  test_header = bcf_hdr_read(test_vcf);
  if(test_header == NULL) {throw std::runtime_error("Unable to read header.");}
  
  while(bcf_read(test_vcf, test_header, test_record) == 0){
    // std::cout << "pos " << test_record->pos << std::endl; //column 2 in VCF with the coordinate looks like its -1
    // std::cout << "length " << test_record->rlen << std::endl; // I assume its the length of the ALT
    // std::cout << "chrom " << test_record->rid; (-1) format or bcf_hdr_id2name(test_header, test_record->rid)
    // std::cout << "qual " << test_record->qual; //column 6
    // std::cout << "allele " << test_record->n_allele << std::endl; // number of alleles
    // std::cout << "info " << test_record->n_info << std::endl; // I dont know what this is
    // std::cout << "nfmt " << test_record->n_fmt << std::endl;
    // "sample " << test_record->n_sample // i dont know what this is
    std::cout << "chr" << bcf_hdr_id2name(test_header, test_record->rid) << ":" <<test_record->pos+1 << std::endl;

    std::cout << "------------------" << std::endl;
  }
  bcf_hdr_destroy(test_header);
  bcf_destroy(test_record); 
  bcf_close(test_vcf);
  return 0;
}

In this code above, I have in the while loop out-commented multiple std::cout to make it clear with my comments what some of the functionalities are - i.e "rid" is the chromosome. As far as i can read for the vcf library the names "rid" or "nfmt" are all pre-defined. Running this code i am able to print multiple things such as the chromosome names, the position amongst other things. But I have a few issues:

My VCF file has the general structure of #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT, with an small example of a few of the lines only showing the first 6 columns:

14  19058352    rs144287685 A   G   100
14  19066089    rs374285188 C   A,T 100
14  19075627    .   G   A,T 100
14  19075912    .   A   C,T 100
14  19237205    .   T   TATGTTATG   100

My problem is when using the library i wish to print out both the reference (column4) and alternative (column 5), so for line 1 : REF = A & ALT = G, for line 5: REF = T & ALT = TATGTTATG.

Could anyone help me possibly to understand exactly what I need to do to extract these two fields? I cannot see in the library description how to use "test_record->" to extract these?

I hope my question makes somewhat sense. Thanks for your time and help.

1

There are 1 answers

0
Alberto Casas Ortiz On

I know it is a little late since you posted this 2 months ago, but maybe this can help other people. I have been struggling with htslib too lately and I managed to get the ALT and REF values.

The values for ALT and REF are stored in the bcf1_t structure, in the field called d:

typedef struct bcf1_t {
    hts_pos_t pos;
    hts_pos_t rlen;
    int32_t rid;
    float qual;
    uint32_t n_info:16, n_allele:16;
    uint32_t n_fmt:8, n_sample:24;
    kstring_t shared, indiv;
    bcf_dec_t d;   //<----- HERE
    int max_unpack;
    int unpacked;
    int unpack_size[3];
    int errcode;
} bcf1_t;

That field is not filled by default when you initialize a bcf1_t object, so first you have to call to the function bcf_unpack. The first parameter of the function is a pointer to the record, and the second one depends on the values you want it to unpack. In your case, for ALT and REF I think the first parameter should be test_record, and the second parameter BCF_UN_STR. In the source code of htslib you have all the available values commented.

int bcf_unpack(bcf1_t *b, int which);

Now you can take a look to the d field. The type of this field is another structure called bcf_dec_t. Here you have to take a look to the field als.

typedef struct bcf_dec_t {
    int m_fmt, m_info, m_id, m_als, m_allele, m_flt;
    int n_flt;
    int *flt;
    char *id, *als;     // ID and REF+ALT block (\0-separated)
    char **allele;
    bcf_info_t *info;
    bcf_fmt_t *fmt;
    bcf_variant_t *var;called
    int n_var, var_type;
    int shared_dirty;
    int indiv_dirty;
} bcf_dec_t;

As it says in the documentation, als contains the values REF and ALT separated by '\0'. So, if your values are: REF = T and ALT = TATGTTATG, als contains the following array of characters: "T\0TATGTTATG\0".

You can parse that array of chars for obtaining REF and ALT. I do it using this function I coded that takes als as input, and returns a vector containing ALT and REF separated. I know this may not be the most optimised function and that there must be a way of doing it using htslib, but it works:

std::vector<std::string> extractAltRef(char *als) {
    std::vector<std::string> res;
    std::string str = "";
    int i = 0;
    int j = 0;
    while(j != 2) {
        if(als[i] == '\0') {
            res.push_back(str);
            str.clear();
            j++;
        } else {
            str += als[i];
        }
        i++;
    }
    return res;
}

So in your code, for obtaining the REF and ALT values, if you use my function, you should do something like this (not tested):

// Pointer initializations...
bcf_unpack(test_record, BCF_UN_STR);
while(bcf_read(test_vcf, test_header, test_record) == 0){
    std::vector<std::string> altRef = extractAltRef(test_record->d.als);
    std::cout << "REF " << altRef[0] << std::endl;
    std::cout << "ALT " << altRef[1] << std::endl;
}
// Free memory...

If you don't use my code, you'll have to find a way of separating REF and ALT.

I hope this helps,

Alberto.