How to scrape the different content with the same html attributes and values?

Asked by Tom on 23 October 2020 at 05:21

I'm able to scrape a bunch of data from a webpage, but I'm struggling to extract specific content from subsections whose elements have exactly the same attributes and values. Here is the HTML:

<li class="highlight">
    Relationship Issues
</li>
<li class="highlight">
    Depression
</li>
<li class="highlight">
    Spirituality
</li>

<li class="">
    ADHD
</li>
<li class="">
    Alcohol Use
</li>
<li class="">
    Anger Management
</li>

Using that HTML as a reference, I have the following:

import requests
from bs4 import BeautifulSoup
import html5lib
import re

headers = {'User-Agent': 'Mozilla/5.0'}
URL = "website.com"


page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, 'html5lib')

specialties = soup.find_all('div', {'class': 'spec-list attributes-top'})

for x in specialties:
   Specialty_1 = x.find('li', {'class': 'highlight'}).text
   Specialty_2 = x.find('li', {'class': 'highlight'}).text
   Specialty_3 = x.find('li', {'class': 'highlight'}).text

So the ideal outcome is to have: Specialty_1 = Relationship Issues; Specialty_2 = Depression; Specialty_3 = Spirituality

AND

Issue_1 = ADHD; Issue_2 = Alcohol Use; Issue_3 = Anger Management

Would appreciate any and all help!

There are 3 answers

Dan Weber On 23 October 2020 at 05:23

You can just use XPath if you know the data will always sit in the same place in the element tree. Most of the time you can right-click an element in Chrome DevTools to copy both a CSS selector and an XPath string.
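
As a rough illustration of the selector route: BeautifulSoup does not evaluate XPath itself, so this sketch uses a positional CSS selector of the kind DevTools can copy. The wrapping `<div>` and `<ul>` are assumptions made for the example; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Assumed markup: the wrapper elements are hypothetical, not taken from the real page.
html_doc = """
<div class="spec-list attributes-top">
  <ul>
    <li class="highlight"> Relationship Issues </li>
    <li class="highlight"> Depression </li>
    <li class="highlight"> Spirituality </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A positional selector picks one item even though all three share the same class.
second = soup.select_one("div.spec-list li:nth-of-type(2)")
specialty_2 = second.get_text(strip=True)
print(specialty_2)
```

This only holds up while the item order on the page stays fixed, which is the caveat mentioned above.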

Andrej Kesely On 23 October 2020 at 07:27

If you want a variable number of variables, use a dictionary. For example:

from bs4 import BeautifulSoup


html_doc = '''   <li class="highlight">
     Relationship Issues
      </li>
   <li class="highlight">
     Depression
      </li>
   <li class="highlight">
     Spirituality
      </li>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

out = {
    'Specialty_{}'.format(i): specialty.get_text(strip=True)
    for i, specialty in enumerate(soup.select("li.highlight"), 1)
}

print(out)

Prints:

{'Specialty_1': 'Relationship Issues', 
 'Specialty_2': 'Depression', 
 'Specialty_3': 'Spirituality'}
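
For context on why the code in the question repeats one value: `find()` always returns the first match, so calling it three times yields the same element each time, whereas `find_all()` returns every match in document order. A minimal sketch, assuming all six items sit under a single `<ul>` (a hypothetical wrapper, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Assumed markup: a single <ul> wrapper is used here only for illustration.
html_doc = """
<ul>
  <li class="highlight"> Relationship Issues </li>
  <li class="highlight"> Depression </li>
  <li class="highlight"> Spirituality </li>
  <li class=""> ADHD </li>
  <li class=""> Alcohol Use </li>
  <li class=""> Anger Management </li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() stops at the first matching element, so repeated calls give the same text.
first = soup.find("li", class_="highlight").get_text(strip=True)
again = soup.find("li", class_="highlight").get_text(strip=True)

# find_all() returns every matching element; a list keeps them in page order.
specialties = [li.get_text(strip=True)
               for li in soup.find_all("li", class_="highlight")]

# The remaining items are the <li> elements without the "highlight" class.
issues = [li.get_text(strip=True)
          for li in soup.find_all("li")
          if "highlight" not in li.get("class", [])]

print(specialties)
print(issues)
```

Indexing the lists (`specialties[0]`, `issues[1]`, and so on) then replaces the numbered variables.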

QHarr On 23 October 2020 at 06:37 (Accepted Answer)

You could develop Andrej's dictionary idea: use an if/else on whether the class attribute has a value to choose the prefix, and extend the selector to include the additional section. The numbering needs to restart for the new section, e.g. with a flag:

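results = {}
flag = False  # becomes True once the first "Issue" item has been seen
counter = 1

for j in soup.select(".specialties-list li, .attributes-issues li"):
    # A non-empty class (e.g. "highlight") marks a specialty; class="" marks an issue.
    if j.get('class'):
        results[f'Specialty_{counter}'] = j.text.strip()
    else:
        if not flag:
            # Restart the numbering when the issues section begins.
            counter = 1
            flag = True
        results[f'Issue_{counter}'] = j.text.strip()
    counter += 1

print(results)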
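
A possible variation that avoids the flag is to number each section separately and merge the two dictionaries. This is only a sketch: the two `<ul>` wrappers reuse the class names from the selector above, but their exact shape is an assumption, and `numbered` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Assumed markup: the wrapper elements and their layout are guesses for illustration.
html_doc = """
<ul class="specialties-list">
  <li class="highlight"> Relationship Issues </li>
  <li class="highlight"> Depression </li>
  <li class="highlight"> Spirituality </li>
</ul>
<ul class="attributes-issues">
  <li class=""> ADHD </li>
  <li class=""> Alcohol Use </li>
  <li class=""> Anger Management </li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def numbered(prefix, selector):
    """Hypothetical helper: map prefix_1, prefix_2, ... to the text of each match."""
    return {
        f"{prefix}_{i}": li.get_text(strip=True)
        for i, li in enumerate(soup.select(selector), 1)
    }


# Each section gets its own enumerate(), so the numbering restarts without a flag.
results = {
    **numbered("Specialty", ".specialties-list li"),
    **numbered("Issue", ".attributes-issues li"),
}

print(results)
```

Selecting per section relies on the two lists having distinguishable containers; when they do not, the class-based check in the loop above is the fallback.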
