I'm scraping text from a webpage using lxml and requests. All of the text I want is under &lt;p&gt; tags. When I use contents = tree.xpath('//*[@id="storytext"]/p/text()'), contents only includes text that is not in &lt;em&gt; or &lt;strong&gt; tags. But when I use contents = tree.xpath('//*[@id="storytext"]/p/text() | //*[@id="storytext"]/p/strong/text() | //*[@id="storytext"]/p/em/text()'), the text in &lt;em&gt; and &lt;strong&gt; tags is separated from the rest of the text in that &lt;p&gt; tag.

I would like to:

  1. scrape each <p> as a unit, including all its text (whether plain or <em> or <strong>), and

  2. keep the <em> and <strong> tags so that I can use them later to format the text I've scraped.

Sample html: <div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>

Desired output: "Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.

1 Answer

QHarr

If those are the only tags that appear between them, you could use bs4 and replace() to remove the &lt;p&gt; open and close tags:

from bs4 import BeautifulSoup as bs

html = '''
<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>
'''

soup = bs(html,'lxml')

for item in soup.select('p'):
    print(str(item).replace('<p>','').replace('</p>',''))

Using requests to source html

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')  # replace 'url' with the actual page URL
soup = bs(r.content, 'lxml')
for item in soup.select('p'):
    print(str(item).replace('<p>','').replace('</p>',''))
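As an aside, the string replace() hack can be avoided entirely: BeautifulSoup's Tag.decode_contents() returns a tag's inner HTML directly, without the enclosing &lt;p&gt;...&lt;/p&gt;. A minimal sketch against the sample HTML from the question:

```python
from bs4 import BeautifulSoup

sample = '''<div id="storytext"><p>"Go <em>away!</em>" His voice was drowned out by the mixer. She didn't even <em>hear</em> him. He could scrub it all day, probably, and Esti would just say <em>can't you do anything</em>? He scowled fiercely at the dirt.</p></div>'''

soup = BeautifulSoup(sample, 'lxml')

for item in soup.select('#storytext p'):
    # decode_contents() yields the tag's children serialized as a str,
    # so the <p> wrapper is dropped but <em>/<strong> are kept.
    inner = item.decode_contents()
    print(inner)
```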