I'm trying to extract the text content from the HTML below as one complete sentence, but I can't get it to work. I tried both BeautifulSoup's prettify() and get_text(), but those gave me three separate sentences. I would like to read the HTML below as a single proper sentence, like this:

Recognized by Microsoft & Google, Inc., offices.

<li>Recognized by   
                                    <em>Microsoft</em> &amp; 
                                    <em>Google, Inc.</em>, offices.</li>

2 Answers

ghr On

You can use an HTML parser like BeautifulSoup to extract the text without tags (soup.text), then collapse the repeated whitespace and newlines into single spaces:

input_str = '''
<li>Recognized by   
                                    <em>Microsoft</em> &amp; 
                                    <em>Google, Inc.</em>, offices.</li>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(input_str,"html.parser")
text = " ".join(soup.text.split())
print(text)

Output:

Recognized by Microsoft & Google, Inc., offices.
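
To show why the split/join step is used rather than get_text()'s own separator and strip arguments, here is a minimal sketch comparing the two on the same HTML. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<li>Recognized by
                                    <em>Microsoft</em> &amp;
                                    <em>Google, Inc.</em>, offices.</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Option A: strip each text fragment, then join the fragments with a space.
# This puts a space at every tag boundary, including before the comma.
joined_fragments = soup.get_text(" ", strip=True)

# Option B: take the raw text, then collapse every whitespace run afterwards.
# Spacing comes only from whitespace that is actually present in the markup.
collapsed = " ".join(soup.get_text().split())

print(joined_fragments)  # Recognized by Microsoft & Google, Inc. , offices.
print(collapsed)         # Recognized by Microsoft & Google, Inc., offices.
```

Option A leaves a stray space before the comma because the closing </em> counts as a fragment boundary, so Option B is the safer choice when inline tags sit next to punctuation.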

Edit: based on your comments, to get a list of strings as output (one for each li tag), you can do:

input_str = '''<ul> <li>This is sentence one in a order</li> <li>This is sentence two in a order</li> <li>This is sentence <em>Three</em> in a order </li> <li>This is sentence <em>four</em> in a order </li> </ul>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(input_str,"html.parser")

result = []
for li in soup.find_all('li'):
    text = " ".join(li.text.split())
    result.append(text)

print(result)

Output:

['This is sentence one in a order', 'This is sentence two in a order', 'This is sentence Three in a order', 'This is sentence four in a order']
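
If you need this in more than one place, the loop can be wrapped in a small helper. This is only a sketch: extract_sentences and its tag parameter are hypothetical names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def extract_sentences(html, tag="li"):
    """Return the whitespace-normalized text of every `tag` element in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    # split() with no arguments drops all whitespace runs, including newlines,
    # and " ".join() puts exactly one space back between the words.
    return [" ".join(el.get_text().split()) for el in soup.find_all(tag)]


sample = """<ul>
  <li>Recognized by
      <em>Microsoft</em> &amp;
      <em>Google, Inc.</em>, offices.</li>
  <li>This is sentence <em>Three</em> in a order </li>
</ul>"""

sentences = extract_sentences(sample)
print(sentences)
```

Passing a different tag name, for example extract_sentences(html, tag="p"), applies the same cleanup to paragraphs instead of list items.
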
Community On

I don't fully understand what you need, but this will help you extract content from a website URL:

import requests
from bs4 import BeautifulSoup

# URL from which the data will be extracted
url = "https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Text file where the content will be written; "with" closes it automatically
with open("test.txt", "w", encoding="utf-8") as file:
    # Extract the content of every <p> tag from the page.
    # You can put the desired tag here according to your need.
    for paragraph in soup.find_all('p'):
        file.write(paragraph.get_text())
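
As a possible variation on the script above, the sketch below separates parsing from downloading, collapses whitespace inside each paragraph, and writes one paragraph per line. The function names paragraphs_from_html and save_paragraphs and the 10-second timeout are assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def paragraphs_from_html(html):
    """Return the text of each <p> element with whitespace runs collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (" ".join(p.get_text().split()) for p in soup.find_all("p"))
    return [t for t in texts if t]  # drop paragraphs that contain no text


def save_paragraphs(url, path, timeout=10):
    """Download `url` and write one paragraph per line to `path`."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    paragraphs = paragraphs_from_html(response.content)
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(paragraphs))
    return len(paragraphs)
```

Calling save_paragraphs(url, "test.txt") with the URL from the script above would then write the page's paragraphs to test.txt. Keeping the parsing in its own function means it can be tried on a plain HTML string without any network access.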