Difficulty extracting a GenBank accession number from a species and strain name by web scraping (BeautifulSoup or Selenium)

I am trying to extract the information for a particular organism from an NCBI Assembly search results page using BeautifulSoup and/or Selenium, but I am running into difficulties.

I tried this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Find elements containing the text "JCM 5058"
elements = driver.find_elements(By.XPATH, "//*[contains(text(), 'JCM 5058')]")

if elements:
    print("Text 'JCM 5058' found on the webpage.")
    # Loop through elements and extract text
    text_to_print = ""
    for element in elements:
        text_to_print += element.text + "\n"  # Add newline for readability
    # Print the extracted text
    print(text_to_print)
else:
    print("Text 'JCM 5058' not found on the webpage.")

and got this output:

Text 'JCM 5058' found on the webpage.

JCM 5058
("Streptomyces anthocyanicus"[Organism] AND ("Streptomyces anthocyanicus"[Organism] OR JCM 5058[All Fields])) AND (latest[filter] AND all[filter] NOT anomalous[filter])
Streptomyces anthocyanicus JCM 5058 AND (latest[filter] AND all[f... (6)

but the matched section looks like this on the web page:

ASM1465115v1

Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from type material
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]

I want to extract and print all of this information, either as it appears or as a table.

There is 1 answer.

Answered by Umar:

I found a solution while experimenting, although I do not know whether it is the correct approach:

from selenium import webdriver
from bs4 import BeautifulSoup

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Grab the rendered HTML; by default driver.get() returns once the page's load event has fired
page_source = driver.page_source

# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(page_source, 'html.parser')

# Find all div elements containing assembly information
assembly_divs = soup.find_all("div", class_="rprt")

# Loop through each div and check if it contains the desired information
for div in assembly_divs:
    if "JCM 5058" in div.get_text():
        # Print the assembly information
        print(div.get_text().strip())
        break
else:
    print("No matched section found on the webpage.")

# Close the browser
driver.quit()

The BeautifulSoup script prints:

Select item 81211415.ASM1465115v1Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)Infraspecific name: Strain: JCM 5058Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)Date: 2020/09/12Assembly level: ScaffoldGenome representation: fullRelation to type material: assembly from type materialGenBank assembly accession: GCA_014651155.1 (latest) RefSeq assembly accession: GCF_014651155.1 (latest) IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
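
Because `get_text()` joins the text of every child element with no separator by default, the fields run together in the output above. One possible way to turn that string into the table asked for in the question is sketched below. It assumes the field labels are exactly those shown in the question; the `LABELS` list and the `parse_assembly_text` helper are hypothetical names for illustration, not part of any library.

```python
import re

# Field labels copied from the matched section in the question; this list is
# an assumption and would need updating if the page layout changes.
LABELS = [
    "Organism",
    "Infraspecific name",
    "Submitter",
    "Date",
    "Assembly level",
    "Genome representation",
    "Relation to type material",
    "GenBank assembly accession",
    "RefSeq assembly accession",
    "IDs",
]

# Flattened text as printed by the BeautifulSoup script.
SAMPLE = (
    "Select item 81211415.ASM1465115v1"
    "Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)"
    "Infraspecific name: Strain: JCM 5058"
    "Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)"
    "Date: 2020/09/12"
    "Assembly level: Scaffold"
    "Genome representation: full"
    "Relation to type material: assembly from type material"
    "GenBank assembly accession: GCA_014651155.1 (latest) "
    "RefSeq assembly accession: GCF_014651155.1 (latest) "
    "IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]"
)


def parse_assembly_text(text):
    """Split flattened record text into a {label: value} dict (hypothetical helper)."""
    label_re = re.compile(
        "(" + "|".join(re.escape(label) for label in LABELS) + r"):\s*"
    )
    matches = list(label_re.finditer(text))
    record = {}
    for i, match in enumerate(matches):
        # A value runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        record[match.group(1)] = text[match.end():end].strip()
    return record


record = parse_assembly_text(SAMPLE)

# Print a simple two-column table: label on the left, value on the right.
for label, value in record.items():
    print(f"{label:<28}{value}")
```

In the Selenium script, `div.get_text()` could be passed to the helper in place of `SAMPLE`. Splitting on known labels is only one option; it keeps working whether the fields arrive on separate lines or fused together, but it silently skips any field whose label is not in the list.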

Another, simpler way is to locate the GenBank accession directly with an XPath expression:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a Chrome browser
driver = webdriver.Chrome()

# Load the webpage
driver.get("https://www.ncbi.nlm.nih.gov/assembly/?term=Streptomyces+anthocyanicus+JCM+5058")

# Find the <dl> holding the GenBank assembly accession: the 6th <dl> sibling
# after the <dl> that mentions the strain (this depends on the field order)
genbank_element = driver.find_element(By.XPATH, "//dl[contains(., 'JCM 5058')]/following-sibling::dl[6]")

# Extract the GenBank assembly accession text
genbank_accession = genbank_element.text.split(": ")[1]

# Print the GenBank assembly accession
print(genbank_accession)

# Close the browser
driver.quit()

This prints:

GCA_014651155.1 (latest)
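
The `following-sibling::dl[6]` step depends on the exact order of the fields, so it may break for a record with more or fewer entries. A less position-dependent sketch is to search the record text for the accession pattern itself. This assumes assembly accessions have the form `GCA_` (GenBank) or `GCF_` (RefSeq) followed by nine digits and a version suffix, as in the output shown in this thread; `find_assembly_accessions` is a hypothetical helper name.

```python
import re

# Assumed accession shape: "GCA_" or "GCF_", nine digits, then ".<version>".
ACCESSION_RE = re.compile(r"GC[AF]_\d{9}\.\d+")


def find_assembly_accessions(text):
    """Return the first GenBank (GCA_) and RefSeq (GCF_) accession found in text."""
    found = {}
    for match in ACCESSION_RE.finditer(text):
        accession = match.group(0)
        kind = "GenBank" if accession.startswith("GCA_") else "RefSeq"
        # Keep only the first accession of each kind.
        found.setdefault(kind, accession)
    return found


# Text taken from the record shown in the question.
text = (
    "GenBank assembly accession: GCA_014651155.1 (latest) "
    "RefSeq assembly accession: GCF_014651155.1 (latest)"
)
accessions = find_assembly_accessions(text)
print(accessions["GenBank"])
print(accessions["RefSeq"])
```

With Selenium, the input could be the `.text` of the result block that mentions the strain, so that accessions from other search hits are not picked up by mistake.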