I want to scrape product info from this website: http://megabuy.vn/Default.aspx.
My plan is to follow the site's structure: first scrape the links to the top-level categories, then go down to the subcategories, and finally to each individual product.
I am having trouble scraping the links to the top-level categories, such as:
- thiet bi van phong
- may hut am
- do da dung nha bep
etc...
I think the problem is that these links are inside a JavaScript tag.
Here is my code:
from bs4 import BeautifulSoup
import requests
import re

def web_scrape(url):
    # Download the page and parse the raw HTML (no JavaScript is executed).
    web_connect = requests.get(url)
    text = web_connect.text
    soup = BeautifulSoup(text, "html.parser")
    return soup

homepage = web_scrape("http://megabuy.vn/Default.aspx")
# Look for <a> tags whose class contains "ContentPlaceholder".
listgianhang = homepage.find_all("a", class_=re.compile("ContentPlaceholder"))
len(listgianhang)
I got the result: 0
out:
The reason you cannot find the a tags by class is that the class attribute is generated by JavaScript. The raw HTML looks like this:
The HTML that requests actually receives does not contain the class attribute.
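One possible workaround, as a rough sketch: select the links by an attribute that is already present in the raw HTML, such as href, instead of by class. The markup in RAW_HTML, the /gian-hang/ path prefix, and the helper name category_links below are all made up for illustration; print web_connect.text to see which pattern the real category links on megabuy.vn share.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML that requests receives.
# The real markup on megabuy.vn may differ; note that the <a> tags
# carry no class attribute here, as in the raw page.
RAW_HTML = """
<div id="menu">
  <a href="/gian-hang/thiet-bi-van-phong">thiet bi van phong</a>
  <a href="/gian-hang/may-hut-am">may hut am</a>
  <a href="/tin-tuc/khuyen-mai">khuyen mai</a>
</div>
"""


def category_links(html, href_pattern):
    """Return the href of every <a> tag whose href matches href_pattern."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", href=re.compile(href_pattern))
    return [a["href"] for a in anchors]


soup = BeautifulSoup(RAW_HTML, "html.parser")

# Searching by class finds nothing, because no class exists in the raw HTML.
by_class = soup.find_all("a", class_=re.compile("ContentPlaceholder"))

# Searching by the (assumed) href prefix finds the category links.
by_href = category_links(RAW_HTML, r"^/gian-hang/")

print(len(by_class), by_href)
```

If the links themselves (not just their class) turn out to be created by JavaScript, no selector on the raw HTML will find them, and the page would have to be rendered in a browser-driving tool first.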