Difficulty Extracting a GenBank Accession Number by Species and Strain Name with Web Scraping (BeautifulSoup or Selenium)

Question

56 views · Asked by Umar on 18 March 2024 at 07:45

I need to extract the information for a particular organism and strain from the NCBI Assembly search results page using BeautifulSoup and/or Selenium, but my attempt does not return the section I am after.

I tried this:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Find elements containing the text "JCM 5058"
elements = driver.find_elements(By.XPATH, "//*[contains(text(), 'JCM 5058')]")

if elements:
    print("Text 'JCM 5058' found on the webpage.")
    # Loop through elements and collect their text, one element per line
    text_to_print = ""
    for element in elements:
        text_to_print += element.text + "\n"
    # Print the extracted text
    print(text_to_print)
else:
    print("Text 'JCM 5058' not found on the webpage.")

and got this output:

Text 'JCM 5058' found on the webpage.

JCM 5058
("Streptomyces anthocyanicus"[Organism] AND ("Streptomyces anthocyanicus"[Organism] OR JCM 5058[All Fields])) AND (latest[filter] AND all[filter] NOT anomalous[filter])
Streptomyces anthocyanicus JCM 5058 AND (latest[filter] AND all[f... (6)

but the matching section looks like this on the web page:

ASM1465115v1

Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from type material
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]

I want to extract or print all of this information, either as is or as a table.

There is 1 answer

**Umar** · Answer 1 · 2024-03-18T10:24:51+00:00

I found an answer while working around the problem, but I don't know whether it is the correct approach:

from selenium import webdriver
from bs4 import BeautifulSoup

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Get the page source (driver.get() returns once the page's load event has fired; no explicit wait is used here)
page_source = driver.page_source

# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(page_source, 'html.parser')

# Find all div elements containing assembly information
assembly_divs = soup.find_all("div", class_="rprt")

# Loop through each div and check if it contains the desired information
for div in assembly_divs:
    if "JCM 5058" in div.get_text():
        # Print the assembly information
        print(div.get_text().strip())
        break
else:
    print("No matched section found on the webpage.")

# Close the browser
driver.quit()

This prints:

Select item 81211415.ASM1465115v1Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)Infraspecific name: Strain: JCM 5058Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)Date: 2020/09/12Assembly level: ScaffoldGenome representation: fullRelation to type material: assembly from type materialGenBank assembly accession: GCA_014651155.1 (latest) RefSeq assembly accession: GCF_014651155.1 (latest) IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
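The fields run together because get_text() joins the text nodes without a separator by default. If a table is the goal, one option is to split that string on the field labels. The following is only a minimal sketch, not part of the original answer: `LABELS`, `SAMPLE` and `parse_assembly_text` are hypothetical names, and the label list is taken from the single record printed above, so other records may need more entries.

```python
import re

# Field labels as they appear in the printed record above (an assumption:
# other assembly records may carry labels that are not listed here).
LABELS = [
    "Organism",
    "Infraspecific name",
    "Submitter",
    "Date",
    "Assembly level",
    "Genome representation",
    "Relation to type material",
    "GenBank assembly accession",
    "RefSeq assembly accession",
    "IDs",
]

# Matches any known label followed by a colon; the capturing group makes
# re.split() keep the labels in its result.
LABEL_RE = re.compile("(" + "|".join(re.escape(label) for label in LABELS) + "):")


def parse_assembly_text(text):
    """Split a flattened assembly record into a {label: value} dict."""
    parts = LABEL_RE.split(text)
    # parts looks like [prefix, label1, value1, label2, value2, ...]
    return {
        label: value.strip()
        for label, value in zip(parts[1::2], parts[2::2])
    }


# The text printed by the code above, copied verbatim.
SAMPLE = (
    "Select item 81211415.ASM1465115v1"
    "Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)"
    "Infraspecific name: Strain: JCM 5058"
    "Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)"
    "Date: 2020/09/12"
    "Assembly level: Scaffold"
    "Genome representation: full"
    "Relation to type material: assembly from type material"
    "GenBank assembly accession: GCA_014651155.1 (latest) "
    "RefSeq assembly accession: GCF_014651155.1 (latest) "
    "IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]"
)

record = parse_assembly_text(SAMPLE)

# Print the record as a simple two-column table.
for label, value in record.items():
    print(f"{label:<28}{value}")
```

In the scraping script, `div.get_text().strip()` would be passed to `parse_assembly_text` in place of `SAMPLE`. Matching is case-sensitive, which keeps the lowercase "assembly" inside "GenBank assembly accession" from being mistaken for the "Assembly level" label.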

Another, simpler way is:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a Chrome browser
driver = webdriver.Chrome()

# Load the webpage
driver.get("https://www.ncbi.nlm.nih.gov/assembly/?term=Streptomyces+anthocyanicus+JCM+5058")

# Find the element containing the GenBank assembly accession using XPath
genbank_element = driver.find_element(By.XPATH, "//dl[contains(., 'JCM 5058')]/following-sibling::dl[6]")

# Extract the GenBank assembly accession text
genbank_accession = genbank_element.text.split(": ")[1]

# Print the GenBank assembly accession
print(genbank_accession)

# Close the browser
driver.quit()

This prints:

GCA_014651155.1 (latest)
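The printed value still carries the " (latest)" suffix. If only the accession itself is needed, a regular expression can pick it out of the element text. This is a small sketch, assuming assembly accessions follow the pattern seen in the output above (a `GCA_` or `GCF_` prefix, nine digits, a dot and a version number); `ACCESSION_RE` and `extract_accessions` are hypothetical names.

```python
import re

# Assumed accession shape: GCA_ or GCF_, nine digits, ".", version number.
ACCESSION_RE = re.compile(r"\bGC[AF]_\d{9}\.\d+\b")


def extract_accessions(text):
    """Return every assembly accession found in text, in order of appearance."""
    return ACCESSION_RE.findall(text)


# The value printed by the code above.
genbank_text = "GCA_014651155.1 (latest)"

accessions = extract_accessions(genbank_text)
print(accessions[0] if accessions else "No accession found")
```

The same helper also works on the full text of a result block, where it returns the GenBank and RefSeq accessions together, so it does not depend on the position of a particular `dl` element.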
