I'm working on a script that collects podcast feed URLs from the iTunes API using the search and lookup endpoints. However, for some podcasts the API response does not include the feed URL, so I cannot retrieve the feed for those shows.
What I've Tried:
In my code, I make a request to the iTunes API's lookup endpoint with the podcast ID and retrieve the feed URL for most podcasts. However, for some podcasts, the feedUrl field is missing from the API response. To address this, I want to explore alternative methods to obtain the missing feed URLs.
Here is my code:
import re
import requests
import json
import sqlite3
import time

def getrss(url):
    feed_url = ''
    genres = ''
    # Prefer an explicit "id<digits>" token; otherwise fall back to the first run of digits
    match = re.search(r'id(\d+)', url)
    if match:
        podID = match.group(1)
    else:
        match = re.search(r'\d+', url)
        if match:
            podID = match.group()
        else:
            print("No podcast ID found")
            # Return an empty pair so the caller can still unpack two values
            return '', ''
    params = {
        'id': int(podID),
        'entity': 'podcast'
    }
    response = requests.get('https://itunes.apple.com/lookup', params=params)
    data = response.json()
    results = data.get('results', [])
    if results:
        for result in results:
            # Only accept a result that carries both the feed URL and the genres
            if 'feedUrl' in result and 'genres' in result:
                feed_url = result['feedUrl']
                genres = result.get('genres', [])
                genres = ', '.join(genres)
                break
    rss_feed = feed_url
    return rss_feed, genres

# Connect to the SQLite database
conn = sqlite3.connect("podcasts.db")
cursor = conn.cursor()
# Create a table to store the podcast data
cursor.execute("CREATE TABLE IF NOT EXISTS podcasts (name TEXT, genres TEXT, rss_feed TEXT, UNIQUE(name, genres))")
url = "https://itunes.apple.com/fr/rss/toppodcasts/limit=200/json"
response = requests.get(url)
data = response.json()
if "feed" in data and "entry" in data["feed"]:
    podcasts = data["feed"]["entry"]
    for podcast in podcasts:
        name = podcast.get("im:name", {}).get("label")
        href = podcast.get("id", {}).get("label")
        genres = ""
        if name and href:
            rss_feed, genres = getrss(href)
            if rss_feed:
                try:
                    # Insert the podcast into the database, ignoring duplicates
                    cursor.execute("INSERT OR IGNORE INTO podcasts (name, genres, rss_feed) VALUES (?, ?, ?)", (name, genres, rss_feed))
                    if cursor.rowcount > 0:
                        time.sleep(0.1)
                        conn.commit()
                except sqlite3.IntegrityError:
                    print("Skipping duplicate entry:", name, "-", genres)
            else:
                print("Skipping entry because its RSS feed is hidden:", name, "-", genres)
        else:
            print("Skipping entry because of missing fields:", podcast)
    print("Podcasts saved to the database.")
else:
    print("No podcasts found.")
# Close the database connection
conn.close()
Expectations:
I expected the iTunes API to consistently provide the feed URL for all podcasts. However, some podcasts do not have this information available through the API. Consequently, I need to find a solution to retrieve the missing feed URLs using alternative approaches.
Actual Results:
For podcasts where the feed URL is missing from the iTunes API response, I currently have no way to obtain it, so those shows are skipped.
I came across getrssfeed.com, a website that manages to find the feed URL even when the iTunes API doesn't provide it. I'm looking for suggestions, insights, or alternative methods to retrieve the missing feed URLs reliably. Any help or guidance would be greatly appreciated.
Apple Podcasts allows show providers to hide their RSS feed using Podcasts Connect. The documentation for this is under the Distribution heading on this support page: https://podcasters.apple.com/support/900-availability-rights-and-release-date
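Since a provider can hide the feed, it may help to make the script tell "the lookup returned nothing" apart from "the show exists but exposes no feedUrl". Below is a minimal sketch of that check, operating on an already parsed lookup response. The function name `classify_lookup`, the status labels, and the sample payloads are hypothetical and only mirror the `results` and `feedUrl` keys used in the question's code. A missing `feedUrl` is consistent with a hidden feed but does not prove it.

```python
from typing import Optional, Tuple


def classify_lookup(data: dict) -> Tuple[str, Optional[str]]:
    """Classify a parsed lookup response (hypothetical helper).

    Returns (status, feed_url), where status is one of:
      "found"      - a result carries a non-empty feedUrl
      "no_feed"    - results exist but none exposes feedUrl
                     (possibly hidden by the show's provider)
      "no_results" - the lookup returned no results at all
    """
    results = data.get("results", [])
    if not results:
        return "no_results", None
    for result in results:
        feed_url = result.get("feedUrl")
        if feed_url:
            return "found", feed_url
    return "no_feed", None


# Hypothetical sample payloads, not real API output.
with_feed = {"results": [{"genres": ["News"], "feedUrl": "https://example.com/a.xml"}]}
without_feed = {"results": [{"genres": ["News"]}]}

print(classify_lookup(with_feed))
print(classify_lookup(without_feed))
print(classify_lookup({"results": []}))
```

Logging the "no_feed" cases separately gives you a list of shows whose feed is not exposed by the lookup endpoint, instead of silently mixing them with failed lookups.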