How to use lxml for web scraping?

528 views Asked by At

I want to write a python script that fetches my current reputation on stack overflow --https://stackoverflow.com/users/14483205/raunanza?tab=profile

This is the code I have written.

from lxml import html 
import requests
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content) 

Now, what to do to fetch my reputation. (I can't understand how to use xpath even
after googling it.)

3

There are 3 answers

0
Ananth On BEST ANSWER

You need to make some modifications in your code to get the xpath. Below is the code:

from lxml import HTML 
import requests

page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
tree = html.fromstring(page.content) 
title = tree.xpath('//*[@id="avatar-card"]/div[2]/div/div[1]/text()')
print(title) #prints 3

You can easily get the xpath of element in chrome console(inspect option). enter image description here

To learn more about xpath you can refer: https://www.w3schools.com/xml/xpath_examples.asp

0
Tasnuva Leeya On

Simple solution using lxml and beautifulsoup:

from lxml import html
from bs4 import BeautifulSoup
import requests
page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile').text
tree = BeautifulSoup(page, 'lxml')
name = tree.find("div", {'class': 'grid--cell fw-bold'}).text
title = tree.find("div", {'class': 'grid--cell fs-title fc-dark'}).text
print("Stackoverflow reputation of {}is: {}".format(name, title))
# output: Stackoverflow reputation of Raunanza is: 3
0
mulaixi On

If you don't mind using BeautifulSoup, you can directly extract the text from the tag which contains your reputation. Of course you need to check page structure first.

from bs4 import BeautifulSoup
import requests

page = requests.get('https://stackoverflow.com/users/14483205/raunanza?tab=profile')
soup = BeautifulSoup(page.content, features= 'lxml')

for tag in soup.find_all('strong', {'class': 'ml6 fc-medium'}):
    print(tag.text)
#this will output as 3