Is it possible to annotate ALL my IDs from Ensembl to gene symbol using R (biomaRt)?

I have human datasets keyed by Ensembl gene IDs, and I want to annotate them with gene symbols instead. One of these datasets has exactly 20176 genes. I tried two methods, but both returned NA for some genes.

  • First method:

library(biomaRt)

library(org.Hs.eg.db)

keytypes(org.Hs.eg.db)

Data <- read.csv("Data.csv", header = TRUE, row.names = 1)
Data$SYMBOL <- mapIds(org.Hs.eg.db, keys = row.names(Data), keytype = "ENSEMBL", column = "SYMBOL")

This left exactly 3845 NAs:

sum(is.na(Data))
  • Second method:

library(EnsDb.Hsapiens.v86)

keytypes(EnsDb.Hsapiens.v86)
# store the result under a new name so the mapIds() function is not masked
symbols <- mapIds(EnsDb.Hsapiens.v86, keys = genes$'row.names(Data)', keytype = "GENEID", column = "SYMBOL")

This still left 761 NAs.
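
Note that `sum(is.na(Data))` counts NAs across every column of the data frame, not only the new SYMBOL column. A minimal sketch for counting and listing the unmapped IDs directly follows. The named vector stands in for the output of `mapIds()`, and the two unmapped IDs are made-up placeholders:

```r
# Toy stand-in for the named character vector that mapIds() returns:
# names are Ensembl IDs, values are symbols (NA where no mapping exists).
# The two unmapped IDs are placeholders, not real lookups.
symbols <- c(ENSG00000039537 = "C6",
             ENSG00000042832 = "TG",
             ENSG00000000001 = NA,
             ENSG00000000002 = NA)

# Count NAs in the symbol vector only
n_missing <- sum(is.na(symbols))

# List the IDs that did not map, so they can be looked up individually
unmapped_ids <- names(symbols)[is.na(symbols)]

print(n_missing)
print(unmapped_ids)
```

Inspecting the unmapped IDs themselves usually shows whether they are retired identifiers or genes that simply have no assigned symbol, in which case some NAs may remain with any annotation source.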

Is there a newer version of EnsDb.Hsapiens, or another package, that I could use to get all gene symbols without any NAs?

My gene IDs: https://docs.google.com/document/d/1VVtveHXbOXt8m02ttcAmjHxF59YTFFgOEvyBhyqw13w/edit?usp=sharing

There is 1 answer

Chris

After downloading your shared data, I took these steps:

ensem <- read.csv('~/Downloads/Ensembl.txt', header=TRUE, sep ='\n')
# Shortcut: I pasted the IDs (without the header) into
# https://www.biotools.fr/human/ensembl_symbol_converter
# and downloaded the result
ensem_symbol <- read.csv('ens_symbol_nohead.txt', header = FALSE, sep ='\n')
# each line comes back as Ensembl\Symbol
# split pattern from [Wiktor](https://stackoverflow.com/questions/33210280/r-strsplit-on-backslash)
ensem_symb_split <- strsplit(x = ensem_symbol$V1, split ='\\\\|[^[:print:]]', perl = FALSE)
en_sy_tst_rbind <- do.call(rbind, ensem_symb_split)
en_sy_df <- as.data.frame(en_sy_tst_rbind)

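The site above does not say explicitly what it returns when no match is found. One would expect NA, so look for two-character entries in the symbol column (every hit turns out to be a genuine two-letter gene symbol, none is the string NA):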

not_defined_sym <- nchar(en_sy_df[, 2])
en_sy_df[which(not_defined_sym == 2), ]
                   V1 V2
498   ENSG00000039537 C6
523   ENSG00000042832 TG
551   ENSG00000047457 CP
554   ENSG00000047597 XK
749   ENSG00000062485 CS
1717  ENSG00000091483 FH
1719  ENSG00000091513 TF
2886  ENSG00000106804 C5
3533  ENSG00000112936 C7
3552  ENSG00000113141 IK
3604  ENSG00000113600 C9
4062  ENSG00000117525 F3
4835  ENSG00000125730 C3
9759  ENSG00000166278 C2
10169 ENSG00000168453 HR
10275 ENSG00000169083 AR
11001 ENSG00000173599 PC
11122 ENSG00000174611 KY
12380 ENSG00000185010 F8
13462 ENSG00000198125 MB
13605 ENSG00000198734 F5
13635 ENSG00000198814 GK
18737 ENSG00000257017 HP

# final test for all annotated
 en_sy_df[which(not_defined_sym == ''), ]
[1] V1 V2
<0 rows> (or 0-length row.names)
# all key:values complete
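
One caveat about the final test above: `not_defined_sym` holds character counts, so comparing it with `''` can never match and the result is always empty. A stricter check, sketched here on a made-up three-row data frame (the last two rows are hypothetical), flags symbols that are NA, empty, or the literal string "NA":

```r
# Toy data frame with the same V1/V2 layout as en_sy_df above.
# Only the first row comes from the listing; the other two are invented
# to show what a missing symbol could look like.
en_sy_df <- data.frame(
  V1 = c("ENSG00000039537", "ENSG00000000001", "ENSG00000000002"),
  V2 = c("C6", "", NA),
  stringsAsFactors = FALSE
)

# A symbol counts as missing if it is NA, empty, or the text "NA"
is_missing <- is.na(en_sy_df$V2) | en_sy_df$V2 == "" | en_sy_df$V2 == "NA"
missing_rows <- en_sy_df[is_missing, ]

print(missing_rows)
```

Running the same three-way test on the real `en_sy_df` would show whether every ID actually received a symbol.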

So the recommendation appears to be to update to the annotation version that the site above is running.
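
Since the question asks about biomaRt, here is a minimal sketch of querying the current Ensembl release from R. It assumes the biomaRt package is installed and the Ensembl server is reachable. `fetch_symbols()` and `fill_symbols()` are hypothetical helper names, not part of biomaRt, and the lookup table in the offline demonstration is made up. Genes without an HGNC symbol typically come back as empty strings, and IDs absent from the queried release are dropped from the result, so some NAs may still remain.

```r
# Query Ensembl BioMart for HGNC symbols (needs biomaRt and network access).
# fetch_symbols() is a hypothetical helper name.
fetch_symbols <- function(ensembl_ids) {
  mart <- biomaRt::useEnsembl(biomart = "genes",
                              dataset = "hsapiens_gene_ensembl")
  biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                 filters = "ensembl_gene_id",
                 values = ensembl_ids,
                 mart = mart)
}

# Align a lookup table to the original ID order.
# IDs that are unmatched or have an empty symbol become NA.
fill_symbols <- function(ensembl_ids, lookup) {
  sym <- lookup$hgnc_symbol[match(ensembl_ids, lookup$ensembl_gene_id)]
  sym[!is.na(sym) & sym == ""] <- NA
  sym
}

# Offline demonstration with a made-up table shaped like getBM() output
lookup <- data.frame(
  ensembl_gene_id = c("ENSG00000042832", "ENSG00000039537", "ENSG00000000001"),
  hgnc_symbol = c("TG", "C6", ""),
  stringsAsFactors = FALSE
)
ids <- c("ENSG00000039537", "ENSG00000000002",
         "ENSG00000042832", "ENSG00000000001")
result <- fill_symbols(ids, lookup)
print(result)

# With network access, the real call would look like:
# lookup <- fetch_symbols(row.names(Data))
# Data$SYMBOL <- fill_symbols(row.names(Data), lookup)
```

Matching on the ID rather than binding columns side by side matters here, because `getBM()` does not guarantee the input order and can return more than one row per ID; `match()` keeps the first hit.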