Chromosomal position to gene position conversion in a sample PLINK tped file using reference GTF file

113 views Asked by user15480777 At 18 May 2021 at 17:02

At first, this thread might look related to genetics, but the problem is actually shell scripting and programming based. I am new to coding, so I was suggested to find a help in SO.

I try to intersect NCBI GTF files with PLINK tped files with purpose to switch chromosomal location to gene location in tped files (files with identificator, positions and nucleotides)

So, I did following steps:

1) Got gtf from 

 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz

(as the file is large and complex, the best way to look at file structure is by downloading it)

2) unzipped and extracted only genes annotation with 

    awk -F"\t" '$3=="gene"' GCF_000001405.39_GRCh38.p13_genomic.gtf > aaa.gtf

3) converted NC chromosome symbols from NC_xxx to X with 

    sed -r 's/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' aaa.gtf > bbb.gtf

4) converted to bed format (file with positions)

gtf2bed < bbb.gtf > ccc.bed
5) intersected with sample bed file (obtained from Plink tped file) 
format of sample bed file:

chr1    183189  183189
chr1    609407  609407

bedtools intersect -wb -a ccc.bed -b sample.bed >  ddd.bed

5) removed unwanted symbols (" and ;) from final intersected file with

    cat ddd.bed | sed 's/"//g' | sed 's/;//g' > eee.bed  


6) printed out only the needed columns from intersected file

    cat eee.bed  | awk '{print $4,$33,$35,$34,$36,$37}'  > fff.bed

In the end I have file with different content of columns:

KLHL17 0 C C  
KLHL17 0 A A   
PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976215
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976669

However the desired structure should be like this:

PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PLEKHN1 966272 0 966272 G G

Is it because of inconsistency in NCBI GTF files? It should be succesfully intersected with my sample file, so I can switch NC position to gene name and position in sample file and save tped file structure.

Thank you!

Original Q&A

TechQA.

Chromosomal position to gene position conversion in a sample PLINK tped file using reference GTF file

There are 0 answers

Related Questions in SHELL

Related Questions in BIOINFORMATICS

Related Questions in NCBI

Related Questions in BED

Popular Questions

Popular Tags

Trending Questions