So I'm trying to get all the picture file names for a Wikimedia image search, but I'm only getting 10 results.
As an example, I tried running:
import json
from io import StringIO
import pandas as pd
import numpy as np
import cv2
import matplotlib.pyplot as plt
import urllib.request
import requests
import time
import shutil
from bs4 import BeautifulSoup
from newspaper import Article
import sys
import html2text
import xmltodict
from xml.etree import ElementTree
import urllib
headers = {'Accept': 'application/json', 'Content-Type': 'application/json', }
plants_df = pd.DataFrame()
pic_searches = ['blue+marble']
df_all = pd.DataFrame()
for pic_search in pic_searches:
    url = str(r'https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo|categories&generator=search&gsrsearch=File:') + str(pic_search) + str('&format=jsonfm&origin=*&iiprop=extmetadata&iiextmetadatafilter=ImageDescription|ObjectName')
    response = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(response, 'html.parser')
    spans = soup.find_all('span', {'class': 's2'})
    lines = [span.get_text() for span in spans]
    new_list = [item.replace('"', '') for item in lines]
    new_list2 = [x for x in new_list if x.startswith('File')]
    new_list3 = [x[5:] for x in new_list2]
    new_list4 = [x.replace(' ','_') for x in new_list3]
    print(new_list4)
I got the result ['Blue_Marble_2021.png', 'Blue_Marble_2022.jpg', 'Blue_Marble_Comparsion.png', 'Blue_Marble_Eastern_Hemisphere.jpg', 'Blue_Marble_Western_Hemisphere.jpg', 'Blue_Marble_transparent.png', 'The_Blue_Marble.jpg', 'The_Blue_Marble_(5052124705).jpg', 'The_Blue_Marble_White_Balancing.jpg', 'The_Earth_seen_from_Apollo_17.jpg']. But that's only 10 file names. When I type blue marble into the Wikimedia Commons image search, hundreds of results come up. How can I get all the image file names?
MediaWiki API queries are paginated. Each API call returns at most a fixed number of results, and you need to pass continuation parameters in follow-up requests to retrieve the remaining results.
The official documentation (the API:Continue page on mediawiki.org) has an example that demonstrates how to submit the continuation requests.
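For illustration, a response that still has more results available carries a continue object, roughly like this (abridged, with hypothetical values):

{
    "continue": {"gsroffset": 10, "continue": "gsroffset||"},
    "query": {"pages": {"...": "..."}}
}

Merging the contents of continue into the parameters of your next request fetches the next batch; once the key disappears from the response, you have the full result set.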
Since you are already importing requests, I would suggest using that library instead of urllib.request.urlopen for this. You definitely should not be using BeautifulSoup to parse these responses - you can specify format=json and parse the result with the json module instead. (format=jsonfm is a debugging format that wraps the JSON in syntax-highlighted HTML, which is why you ended up scraping span tags.) It will be easier to handle the continuation requests if you use a dictionary for the query params instead of manually crafting a string.
Example using Requests:
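Something along these lines should work (a sketch, so treat it as a starting point). I've dropped the imageinfo/categories props from your query since you only asked for the file names; add them back if you need the metadata. Note that gsrlimit is capped at 50 per request for anonymous users.

import requests

url = 'https://commons.wikimedia.org/w/api.php'
params = {
    'action': 'query',
    'generator': 'search',
    'gsrsearch': 'File:blue marble',   # requests URL-encodes the space for you
    'gsrlimit': 50,                    # per-request maximum for anonymous users
    'format': 'json',                  # real JSON, not the jsonfm debugging format
    'origin': '*',
}

filenames = []
while True:
    data = requests.get(url, params=params).json()
    pages = data.get('query', {}).get('pages', {})
    for page in pages.values():
        # titles come back as e.g. 'File:Blue Marble 2021.png'
        filenames.append(page['title'][len('File:'):].replace(' ', '_'))
    if 'continue' not in data:
        break
    # feed the continuation tokens back into the next request
    params.update(data['continue'])

print(len(filenames))
print(filenames)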