How to scrape different content that shares the same HTML attributes and values?


I can scrape most of the data from a webpage, but I'm struggling to extract specific items from subsections whose elements have exactly the same attributes and values. Here is the HTML:

   <li class="highlight">
     Relationship Issues
   </li>
   <li class="highlight">
     Depression
   </li>
   <li class="highlight">
     Spirituality
   </li>

   <li class="">
     ADHD
   </li>
   <li class="">
     Alcohol Use
   </li>
   <li class="">
     Anger Management
   </li>

Using that HTML as a reference, I have the following:

import requests
from bs4 import BeautifulSoup
import html5lib
import re

headers = {'User-Agent': 'Mozilla/5.0'}
URL = "website.com"


page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, 'html5lib')

specialties = soup.find_all('div', {'class': 'spec-list attributes-top'})

for x in specialties:
   Specialty_1 = x.find('li', {'class': 'highlight'}).text
   Specialty_2 = x.find('li', {'class': 'highlight'}).text
   Specialty_3 = x.find('li', {'class': 'highlight'}).text

So the ideal outcome is to have: Specialty_1 = Relationship Issues; Specialty_2 = Depression; Specialty_3 = Spirituality

AND

Issue_1 = ADHD; Issue_2 = Alcohol Use; Issue_3 = Anger Management

Would appreciate any and all help!

There are 3 answers

QHarr On BEST ANSWER

You could build on Andrej's dictionary idea: use an if/else on whether the class attribute has a value to choose the key prefix, and extend the selector so it also covers the additional section. The numbering has to restart for the new section, for example with a flag:

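results = {}
flag = False  # becomes True once the first Issue item has been seen
counter = 1

# A single select covers both sections, in document order
for j in soup.select(".specialties-list li, .attributes-issues li"):
    if j['class']:
        # a truthy class value (e.g. ['highlight']) marks a specialty
        results[f'Specialty_{counter}'] = j.text.strip()
    else:
        if not flag:
            # first issue encountered: restart the numbering at 1
            counter = 1
            flag = True
        results[f'Issue_{counter}'] = j.text.strip()
    counter += 1

print(results)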
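
As a possible variation on the same idea, each group can be selected separately so that `enumerate` restarts the numbering and no flag is needed. This is only a sketch against the HTML snippet from the question: the `numbered` helper is a hypothetical name, and the `li:not(.highlight)` selector assumes every non-specialty item lacks the `highlight` class, which may not hold on the real page.

```python
from bs4 import BeautifulSoup

# HTML snippet taken from the question
html_doc = """
<li class="highlight"> Relationship Issues </li>
<li class="highlight"> Depression </li>
<li class="highlight"> Spirituality </li>
<li class=""> ADHD </li>
<li class=""> Alcohol Use </li>
<li class=""> Anger Management </li>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def numbered(prefix, tags):
    """Hypothetical helper: build {'<prefix>_1': text, '<prefix>_2': text, ...}."""
    return {f"{prefix}_{i}": tag.get_text(strip=True) for i, tag in enumerate(tags, 1)}


# Each select returns one group, so each enumerate starts again at 1
results = {
    **numbered("Specialty", soup.select("li.highlight")),
    **numbered("Issue", soup.select("li:not(.highlight)")),
}

print(results)
```

On the real page the two selectors would probably need to be scoped to the containers used above (`.specialties-list` and `.attributes-issues`) rather than matching every `li` in the document.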
Dan Weber On

You can use XPath if you know the elements will always sit at the same place in the tree. In Chrome DevTools you can usually right-click an element and copy both a CSS selector and an XPath string for it.
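
BeautifulSoup itself does not evaluate XPath, so trying this suggestion means switching parsers. The following is a rough sketch that assumes the `lxml` package is installed and uses the HTML from the question wrapped in a `<ul>` (the wrapper is an assumption, added only to give the snippet a single root). The XPath expressions are illustrative, not ones copied from the real page.

```python
from lxml import html

# HTML from the question, wrapped in a <ul> for a single root (assumption)
html_doc = """
<ul>
  <li class="highlight"> Relationship Issues </li>
  <li class="highlight"> Depression </li>
  <li class="highlight"> Spirituality </li>
  <li class=""> ADHD </li>
  <li class=""> Alcohol Use </li>
  <li class=""> Anger Management </li>
</ul>
"""

tree = html.fromstring(html_doc)

# Items whose class attribute is exactly "highlight"
specialties = [
    li.text_content().strip()
    for li in tree.xpath('//li[@class="highlight"]')
]

# Every other list item (empty or missing class attribute)
issues = [
    li.text_content().strip()
    for li in tree.xpath('//li[not(@class="highlight")]')
]

print(specialties)
print(issues)
```

An XPath copied from DevTools is usually an absolute, position-based path, so it tends to break when the page layout changes. Attribute-based expressions like the ones above are generally less brittle.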

Andrej Kesely On

If you want a variable number of variables, use a dictionary. For example:

from bs4 import BeautifulSoup


html_doc = '''   <li class="highlight">
     Relationship Issues
      </li>
   <li class="highlight">
     Depression
      </li>
   <li class="highlight">
     Spirituality
      </li>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

out = {'Specialty_{}'.format(i): specialty.get_text(strip=True) for i, specialty in enumerate(soup.select("li.highlight"), 1)}

print(out)

Prints:

{'Specialty_1': 'Relationship Issues', 
 'Specialty_2': 'Depression', 
 'Specialty_3': 'Spirituality'}
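
The question loops over several `div` containers, so the same comprehension could also be applied once per container, giving one dictionary per block. This is a minimal sketch: the markup below is hypothetical and only borrows the `spec-list attributes-top` class names from the question's `find_all` call.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two containers using the class names from the question
html_doc = """
<div class="spec-list attributes-top">
  <ul>
    <li class="highlight"> Relationship Issues </li>
    <li class="highlight"> Depression </li>
  </ul>
</div>
<div class="spec-list attributes-top">
  <ul>
    <li class="highlight"> Spirituality </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One dictionary per container; numbering restarts inside each one
per_container = [
    {
        f"Specialty_{i}": li.get_text(strip=True)
        for i, li in enumerate(div.select("li.highlight"), 1)
    }
    for div in soup.select("div.spec-list.attributes-top")
]

print(per_container)
```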