Suggestions on Analyzing Protein Sequence Similarity


I want to write code to analyze short protein sequences and determine how similar they are. I have no reference sequence. Instead, I want to write some sort of for loop that compares every sequence against every other one, so I can see how many duplicate sequences I have and which regions they share.

I currently have all of the sequences in a CSV file.
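The all-against-all comparison described above can be sketched in Python using only the standard library. This is a rough illustration, not a definitive method: the column name "sequence" and the demo peptides are assumptions made for the example, and difflib.SequenceMatcher is a generic string matcher, not a biological aligner (it knows nothing about substitution matrices or gap penalties).

```python
import csv
import io
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def load_sequences(handle, column="sequence"):
    """Read one sequence per row; `column` is a hypothetical header name."""
    reader = csv.DictReader(handle)
    return [row[column].strip().upper() for row in reader if row[column].strip()]


def duplicate_counts(seqs):
    """Return {sequence: count} for sequences that occur more than once."""
    return {s: n for s, n in Counter(seqs).items() if n > 1}


def pairwise_similarity(seqs):
    """Compare every pair of unique sequences.

    Yields (a, b, ratio, longest_shared_block), where ratio is difflib's
    0..1 similarity score, not a percent identity from a real alignment.
    """
    unique = sorted(set(seqs))
    for a, b in combinations(unique, 2):
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        yield a, b, matcher.ratio(), a[m.a:m.a + m.size]


if __name__ == "__main__":
    # Made-up demo data standing in for the real CSV file.
    demo = io.StringIO("id,sequence\np1,MKTAYIAK\np2,MKTAYIAK\np3,MKTAHIAK\np4,GGSGGS\n")
    seqs = load_sequences(demo)
    print(duplicate_counts(seqs))
    for a, b, ratio, shared in pairwise_similarity(seqs):
        print(a, b, round(ratio, 2), shared)
```

For a real file, the StringIO object would be replaced by something like open("peptides.csv", newline=""), where the file name is again hypothetical. The number of pairs grows quadratically with the number of unique sequences, so this brute-force loop is only practical for small to moderate collections.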

I have taken a bioinformatics course and have done something similar with Illumina sequencing data, but in that case I started from an SRA table and had FASTA files.
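Since most sequence tools expect FASTA input, one possible first step is converting the CSV file to FASTA. The following is a minimal sketch under stated assumptions: the column names "id" and "sequence" and the example records are hypothetical, and the 60-character line width is just a common convention.

```python
import csv
import io


def csv_to_fasta(csv_handle, fasta_handle, id_column="id", seq_column="sequence", width=60):
    """Write each CSV row as a FASTA record; return the number of records written.

    `id_column` and `seq_column` are assumed header names, not known ones.
    """
    count = 0
    for row in csv.DictReader(csv_handle):
        seq = row[seq_column].strip().upper()
        if not seq:
            continue  # skip rows with an empty sequence
        fasta_handle.write(f">{row[id_column].strip()}\n")
        # Wrap the sequence into lines of at most `width` characters.
        for start in range(0, len(seq), width):
            fasta_handle.write(seq[start:start + width] + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Made-up demo data standing in for the real CSV file.
    source = io.StringIO("id,sequence\npep1,MKTAYIAK\npep2,GGSGGS\n")
    target = io.StringIO()
    print(csv_to_fasta(source, target))
    print(target.getvalue(), end="")
```

With real files, the two StringIO objects would be replaced by open file handles, for example a CSV opened for reading with newline="" and a FASTA file opened for writing.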

I am also trying to use CD-HIT, but I am running into problems with the makefile and the compatibility of my compiler. I installed Homebrew to get around the issue, but the problem persists and the make CXX=g++-9 CC=gcc-9 command won't work.

I was wondering whether there is a more up-to-date method than CD-HIT, because I have noticed that hardly anyone seems to have used CD-HIT since 2020.

Also, the only programming languages I know are R and shell, but I am currently learning Python.


There is 1 answer

player777 On

You could try YASS: https://bioinfo.lifl.fr/yass/index.php. I have used it for SARS-CoV-2 and found similarity to many viruses.