When information is parsed from a site, depending on how the code is written, the parser moves on to the next page once the data from the current one has been collected. Typically this happens with a fixed value: if the value is 21, then 21 pages will be parsed. Here is the code that scrapes information from a site with anime and animated series.
import requests
from bs4 import BeautifulSoup
import re
import os
from urllib.request import urlopen
import json
from urllib.parse import unquote
import warnings

warnings.filterwarnings("ignore")

BASE_URL = 'https://hd8.4lordserials.xyz/anime-serialy'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'

items = []
max_page = 21  # hard-coded number of pages to scrape
for page in range(1, max_page + 1):
    # The first page lives at BASE_URL, the rest at BASE_URL/page/<n>/
    url = f'{BASE_URL}/page/{page}/' if page > 1 else BASE_URL
    print(url)

    rs = session.get(url, verify=False)
    rs.raise_for_status()

    soup = BeautifulSoup(rs.content, 'html.parser')
    for item in soup.select('.th-item'):
        title = item.select_one('.th-title').text
        url = item.a['href']
        items.append({
            'title': title,
            'url': url,
        })

with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, indent=4, ensure_ascii=False)
There are 21 pages in total. But what if there are 22 pages? Or 23? I don't want to keep re-entering the value. How can I make page switching happen automatically? That is, the user should not set a value; everything should happen on its own, and the code should scrape as many pages as there are on the site.
The home page (BASE_URL) itself contains the total number of pages in its pagination block, so you can first scrape the maximum page number from it and then iterate up to that number to collect the data from all the available pages.
Here is one way to implement it.
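The following is a minimal sketch that reuses your code and adds a small get_max_page helper (an illustrative name, not part of any library). Since the exact markup of the site's pagination block isn't confirmed, the helper simply collects every link whose href ends in /page/<number>/ and takes the largest number it finds; you may need to adjust the pattern or use a more specific selector for the real pagination HTML.

import re
import json
import warnings

import requests
from bs4 import BeautifulSoup

warnings.filterwarnings("ignore")

BASE_URL = 'https://hd8.4lordserials.xyz/anime-serialy'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'


def get_max_page(soup):
    # Collect every link that looks like .../page/<number>/ and return the
    # largest number found; fall back to 1 if no pagination links exist.
    pages = [1]
    for a in soup.select('a[href]'):
        m = re.search(r'/page/(\d+)/?$', a['href'])
        if m:
            pages.append(int(m.group(1)))
    return max(pages)


items = []

# Fetch the first page once: it provides both the first batch of items
# and the pagination links that reveal the total number of pages.
rs = session.get(BASE_URL, verify=False)
rs.raise_for_status()
soup = BeautifulSoup(rs.content, 'html.parser')
max_page = get_max_page(soup)
print('max_page:', max_page)

for page in range(1, max_page + 1):
    url = f'{BASE_URL}/page/{page}/' if page > 1 else BASE_URL
    print(url)

    if page > 1:
        # Page 1 was already downloaded above; only fetch the rest.
        rs = session.get(url, verify=False)
        rs.raise_for_status()
        soup = BeautifulSoup(rs.content, 'html.parser')

    for item in soup.select('.th-item'):
        title = item.select_one('.th-title').text
        items.append({
            'title': title,
            'url': item.a['href'],
        })

with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, indent=4, ensure_ascii=False)

If the pagination markup turns out to be unreliable, an alternative is to keep requesting /page/N/ until a page no longer contains any .th-item elements; scraping the page count up front simply keeps the loop structure of your original code.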
Output: the script prints the URL of each page it fetches, and the file out.json contains the collected titles and links. I hope it solves your problem.