How to scrape an <a> tag under a JavaScript tag?

I want to scrape product info from this website: http://megabuy.vn/Default.aspx.

My plan is to follow the site's structure: first scrape the links to all general categories, then go deeper into each subcategory, and finally into each individual product.

I have trouble scraping the links to the general categories, such as:

  • thiet bi van phong
  • may hut am
  • do da dung nha bep

etc...

I think the problem is that these links are under a JavaScript tag.

Here is my code:

from bs4 import BeautifulSoup
import requests
import re
def web_scrape(url):
    web_connect = requests.get(url)
    text = web_connect.text
    soup = BeautifulSoup(text,"html.parser")
    return soup
homepage = web_scrape("http://megabuy.vn/Default.aspx")
listgianhang = homepage.findAll("a", class_=re.compile("ContentPlaceholder"))
len(listgianhang)

I got the result: 0

There is 1 answer

Answered by 宏杰李
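import re

import bs4
import requests

r = requests.get('http://megabuy.vn/Default.aspx')
soup = bs4.BeautifulSoup(r.text, 'lxml')

# Locate the menu container by its id, which is present in the raw HTML.
table = soup.find(id='ctl00_ContentPlaceHolder1_TopMenu1_dlMenu')

# Collect every anchor inside the container whose href is an absolute URL.
for a in table.find_all('a', href=re.compile(r'^http')):
    link = a.get('href')
    text = a.text
    print(link, text)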

Output:

http://megabuy.vn/gian-hang/thiet-bi-van-phong THIẾT BỊ VĂN PHÒNG
http://megabuy.vn/gian-hang/may-fax  Máy Fax
http://megabuy.vn/gian-hang/may-fax/hsx/Panasonic Panasonic
http://megabuy.vn/gian-hang/may-chieu-man-chieu-phu-kien  Máy chiếu Màn chiếu Phụ kiện
http://megabuy.vn/gian-hang/may-chieu-projector  Máy chiếu projector
http://megabuy.vn/gian-hang/may-chieu-projector/hsx/Optoma Optoma
http://megabuy.vn/gian-hang/may-chieu-projector/hsx/Sony Sony
http://megabuy.vn/gian-hang/may-chieu-projector/hsx/ViewSonic ViewSonic
http://megabuy.vn/gian-hang/may-chieu-man-chieu-phu-kien  Xem thêm
http://megabuy.vn/gian-hang/may-photocopy  Máy photocopy
http://megabuy.vn/gian-hang/may-photocopy-  Máy photocopy 
http://megabuy.vn/gian-hang/may-photocopy-/hsx/Canon Canon
http://megabuy.vn/gian-hang/may-photocopy-/hsx/Ricoh Ricoh
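
The output above contains repeated URLs (the "Xem thêm" entry points at a category that is already listed) and link texts with leading spaces. If that matters for the next crawling step, the loop could be wrapped in a small helper along these lines. This is only a sketch: `extract_menu_links` and `SAMPLE_HTML` are hypothetical names, the sample markup merely imitates the menu, and the built-in `html.parser` is used instead of `lxml` so that no extra parser is required.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample that imitates the menu markup; the real page is much larger.
SAMPLE_HTML = """
<table id="ctl00_ContentPlaceHolder1_TopMenu1_dlMenu">
  <tr><td>
    <a href="http://megabuy.vn/gian-hang/thiet-bi-van-phong"><h2>THIẾT BỊ VĂN PHÒNG</h2></a>
    <a href="http://megabuy.vn/gian-hang/may-fax"> Máy Fax</a>
    <a href="http://megabuy.vn/gian-hang/may-fax"> Xem thêm</a>
    <a href="javascript:void(0)">ignored</a>
  </td></tr>
</table>
"""


def extract_menu_links(html, container_id="ctl00_ContentPlaceHolder1_TopMenu1_dlMenu"):
    """Return unique (href, text) pairs from the menu container, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        # The container is missing, e.g. the site layout changed.
        return []

    seen = set()
    links = []
    for a in container.find_all("a", href=re.compile(r"^http")):
        href = a["href"]
        if href in seen:
            continue  # keep only the first occurrence of each URL
        seen.add(href)
        links.append((href, a.get_text(strip=True)))
    return links


menu_links = extract_menu_links(SAMPLE_HTML)
for href, text in menu_links:
    print(href, text)
```

To run it against the live site, pass `requests.get(url).text` instead of `SAMPLE_HTML`.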

You cannot select the a tags by class because that class is generated by JavaScript after the page loads. requests only receives the raw HTML, which looks like this:

             <a href="http://megabuy.vn/gian-hang/thiet-bi-van-phong" style="text-decoration:none;">
              <h2>
               THIẾT BỊ VĂN PHÒNG
              </h2>

The raw HTML does not contain the class attribute.
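
A quick way to confirm this kind of mismatch is to compare a class-based search with an attribute-free one on the HTML that was actually downloaded. The sketch below uses a hypothetical inline snippet (`RAW_HTML`, modeled on the fragment above, with an invented wrapper `div`) rather than the live page, so the counts only illustrate the idea.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML that requests receives.
RAW_HTML = """
<div id="menu">
  <a href="http://megabuy.vn/gian-hang/thiet-bi-van-phong" style="text-decoration:none;">
    <h2>THIẾT BỊ VĂN PHÒNG</h2>
  </a>
</div>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The search from the question: it matches nothing, because no <a> has a class.
by_class = soup.find_all("a", class_=re.compile("ContentPlaceholder"))

# A search that relies only on attributes present in the raw HTML.
by_href = soup.find_all("a", href=True)

print(len(by_class))              # 0
print(len(by_href))               # 1
print(sorted(by_href[0].attrs))   # ['href', 'style']
```

Printing `a.attrs` for a few tags from the real response shows which attributes are safe to filter on.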