Chromosomal position to gene position conversion in a sample PLINK tped file using reference GTF file

119 views Asked by At

At first, this thread might look related to genetics, but the problem is actually shell scripting and programming based. I am new to coding, so I was suggested to find a help in SO.

I try to intersect NCBI GTF files with PLINK tped files with purpose to switch chromosomal location to gene location in tped files (files with identificator, positions and nucleotides)

So, I did following steps:

1) Got gtf from 

 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz

(as the file is large and complex, the best way to look at file structure is by downloading it)

2) unzipped and extracted only genes annotation with 

    awk -F"\t" '$3=="gene"' GCF_000001405.39_GRCh38.p13_genomic.gtf > aaa.gtf

3) converted NC chromosome symbols from NC_xxx to X with 

    sed -r 's/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' aaa.gtf > bbb.gtf

4) converted to bed format (file with positions)

gtf2bed < bbb.gtf > ccc.bed
5) intersected with sample bed file (obtained from Plink tped file) 
format of sample bed file:

chr1    183189  183189
chr1    609407  609407

bedtools intersect -wb -a ccc.bed -b sample.bed >  ddd.bed

5) removed unwanted symbols (" and ;) from final intersected file with

    cat ddd.bed | sed 's/"//g' | sed 's/;//g' > eee.bed  


6) printed out only the needed columns from intersected file

    cat eee.bed  | awk '{print $4,$33,$35,$34,$36,$37}'  > fff.bed   

In the end I have file with different content of columns:

KLHL17 0 C C  
KLHL17 0 A A   
PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976215
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976669

However the desired structure should be like this:

PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PLEKHN1 966272 0 966272 G G

Is it because of inconsistency in NCBI GTF files? It should be succesfully intersected with my sample file, so I can switch NC position to gene name and position in sample file and save tped file structure.

Thank you!

0

There are 0 answers