Make them unique based on the initial segment of names in a FASTA file using R

55 views Asked by At

I aim to exclude sequences based on the uniqueness of the initial part of the names in the FASTA file. However, I encountered an error when I created and used the code below.

The FASTA file typically contains 4,333 sequences, and my objective is to retain only one sequence per organism. For instance, if there are three sequences for Homo sapiens, I intend to select just one. I utilized the code below to make the names unique based on the initial part, but it failed to achieve the desired uniqueness. Could you please guide me on how to enhance the code? Many thanks for your help

>Homo_sapiens_NP_004698.3_protein ATG12 isoform 1
MAEEPQSVLQLPTSIAAGGEGLTDVSPETTTPEPPSSAAVSPGTEEPAGDTKKKIDILLKAVGDTPIMKTKKWAVERTRT
IQGLIDFIKKFLKLVASEQLFIYVNQSFAPSPDQEVGTLYECFGSDGKLVLHYCKSQAWG
>Homo_sapiens_AAH12266.2_autophagy related 12 homolog (S. cerevisiae)
MTSREHQVSLCNCVPLLRRLLCDAPWRKARPLHALSRYFRSRVSPSKMAEEPQSVLQLPTSIAAGGEGLTDVSPETTTPE
PPSSAAVSPGTEEPAGDTKKKIDILLKAVGDTPIMKTKKWAVERTRTIQGLIDFIKKFLKLVASEQLFIYVNQSFAPSPD
QEVGTLYECFGSDGKLVLHYCKSQAWG
>Homo_sapiens_BAG36125.1_protein product
MAEEPQSVLQLPTSIAAGGEGLTDVSPETTTPEPPSSAAVSPGTEEPAGDTKKKIDILLKAVGVTPIMKTKKWAVERTRT
IQGLIDFIKKFLKLVASEQLFIYVNQSFAPSPDQEVGTLYECFGSDGKLVLHYCKSQAWG
>Pan_troglodytes_PNI35014.1_protein CK820_G0038523
MAEEPQSVLQLPPSSAAGGEGLTDVSPETTTPEPPSSAAVSPGTEEPAGDTKKKIDILLKAVGDTPIMKTKKWAVERTRT
IQGLIDFIKKFLKLVASEQLFIYVNQSFAPSPDQEVGTLYECFGSDGKLVLHYCKSQAWG
>Pan_paniscus_XP_034816260.1_protein ATG12 isoform X1
MTSREHQVSLCNCVPLLRRLLCDAPWRKARPLHALSRYFRSRVFPSKMAEEPQSVLQLPPSSAAGGEGLTDVSPETTTPE
PPSSAAVSPGTEEPAGDTKKKIDILLKAVGDTPIMKTKKWAVERTRTIQGLIDFIKKFLKLVASEQLFIYVNQSFAPSPD
QEVGTLYECFGSDGKLVLHYCKSQAWG

>Pan_paniscus_XP_000000H_protein ATG12 isoform X1
MTSREHQVSLCNCVPLLRRLLCDAPWRKARPLHALSRYFRSRVFPSKMAEEPQSVLQLPPSSAAGGEGLTDVSPETTTPE
PPSSAAVSPGTEEPAGDTKKKIDILLKAVGDTPIMKTKKWAVERTRTIQGLIDFIKKFLKLVASEQLFIYVNQSFAPSPD
QEVGTLYECFGSDGKLVLHYCKSQAWG

#Code used
library(Biostrings)
library(stringr)

modified_sequences <- readAAStringSet( "path/modified_sequences.fasta")
modified_sequences

first_part <- sub("^>([^_]+)_.*$", "\\1", names(modified_sequences))
unique_first_part <- make.unique(first_part)
name_mapping <- setNames(unique_first_part, first_part)
keep_indices <- !duplicated(unique_first_part)
unique_sequences <- modified_sequences[keep_indices]
writeXStringSet(unique_sequences, file = "path/unique_sequences.fasta")
unique_sequences

1

There are 1 answers

1
Julian Selke On

Using the example data you provided I could not reproduce an actual error but the code indeed failed to return unique sequences, i.e., one sequence per genus (first part of the ids).

I noticed that the sequence ids do not start with ">" in the AAStringSet object in R, so the sub function call has no effect since it does not match the specified pattern. To make sure that this is not an effect of using different package versions, these are the ones I used:

library(Biostrings)
library(stringr)

> packageVersion("Biostrings")
[1] ‘2.68.1’
> packageVersion("stringr")
[1] ‘1.5.0’

Still, your code does not produce unique names, because you are using make.unique and subsequently !duplicated. Since you first made the names unique you will not have any duplicated names, hence select all of the sequences.

Here is a modified version of your code that does select the first sequence associated with a genus ("Homo" and "Pan").

library(Biostrings)
library(stringr)

modified_sequences <- readAAStringSet( "path/modified_sequences.fasta")
modified_sequences

first_part <- sub("^([^_]+)_.*$", "\\1", names(modified_sequences))

keep_indices <- !duplicated(first_part)

unique_sequences <- modified_sequences[keep_indices]
unique_sequences

writeXStringSet(unique_sequences, file = "path/unique_sequences.fasta")

The unique sequences retained are:

AAStringSet object of length 2:
    width seq                                                  names               
[1]   140 MAEEPQSVLQLPTSIAAGGEGLTDV...GTLYECFGSDGKLVLHYCKSQAWG Homo_sapiens_NP_0...
[2]   140 MAEEPQSVLQLPPSSAAGGEGLTDV...GTLYECFGSDGKLVLHYCKSQAWG Pan_troglodytes_P...

EDIT:

If you want to keep one sequence per species and assuming the naming is consitent, i.e., always follows the pattern <Genus><underscore><species><underscore><...>, you can modify the sub call like so:

first_part <- sub("^([^_]+_[^_]+)_.*$", "\\1", names(modified_sequences))

and retain these sequences:

AAStringSet object of length 3:
    width seq                                                  names               
[1]   140 MAEEPQSVLQLPTSIAAGGEGLTDV...GTLYECFGSDGKLVLHYCKSQAWG Homo_sapiens_NP_0...
[2]   140 MAEEPQSVLQLPPSSAAGGEGLTDV...GTLYECFGSDGKLVLHYCKSQAWG Pan_troglodytes_P...
[3]   187 MTSREHQVSLCNCVPLLRRLLCDAP...GTLYECFGSDGKLVLHYCKSQAWG Pan_paniscus_XP_0...