I'm trying to scrape this table: https://en.wikipedia.org/wiki/Korean_drama#List_of_highest-rated_Korean_dramas_in_cable_television. The Network column is giving me trouble.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://en.wikipedia.org/wiki/Korean_drama")
bsObj = BeautifulSoup(html, features="lxml")
kdramas = bsObj.find("span", {
    "id": "List_of_highest-rated_Korean_dramas_in_cable_television"})
list_kdramas = kdramas.parent.next_sibling.next_sibling.next_sibling.next_sibling
table = list_kdramas.find_all('tr')
final = []
for i in range(1, len(table)):
    temp = []  # temporary array for storing the subvalues of each row
    row = table[i].find_all('td')
    for k in range(len(row)-1):
        try:
            temp.append(row[k].get_text())
        except AttributeError:
            temp.append(row[k].find('a').get_text())
    final.append(temp)
for i in final:
    if len(i) == 5:
        print("Rank:{}, Show: {}, Channel: {}, Rating: {}, Date:{} ".format(
            i[0], i[1], i[2], i[3], i[4]))
    else:
        print("Rank:{}, Show: {}, Rating: {}, Date: {}".format(
            i[0], i[1], i[2], i[3]))
One of the columns, Network, isn't showing up in my output for some of the TV shows, which is why I have to check the length of each i in my final list to make sure the format doesn't get messed up.
This is the output (only the first 5 rows are shown), and you can see that some of them don't have a channel:
Rank:1 Show: The World of the Married Channel: JTBC, Rating: 28.371% Date:16 May 2020
Rank:2 Show: SKY Castle Rating: 23.779% Date: 1 February 2019
Rank:3 Show: Crash Landing on You Channel: tvN, Rating: 21.683% Date:16 February 2020
Rank:4 Show: Reply 1988 Rating: 18.803% Date: 16 January 2016
Rank:5 Show: Guardian: The Lonely and Great God Rating: 18.680% Date: 21 January 2017
This script will expand each <td rowspan=".."> cell across all of the rows it covers, so you can get the correct information for every row.
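A minimal sketch of that rowspan expansion follows. It is not the original answer's code. The helper names `expand_rowspans` and `rows_from_table` are made up for illustration, and the `rowspan` values in `sample_rows` are inferred from the output in the question rather than read from the live page. The sketch assumes a well-formed table and ignores `colspan`.

```python
def expand_rowspans(rows):
    """Copy every cell with rowspan > 1 down into the rows it covers.

    rows is a list of table rows; each row is a list of (text, rowspan)
    pairs in document order. Returns a list of rows of plain text.
    """
    carried = {}   # column index -> (text, number of rows still to fill)
    expanded = []
    for cells in rows:
        queue = list(cells)
        out = []
        col = 0
        while queue or col in carried:
            if col in carried:
                # This column is covered by a cell from an earlier row.
                text, remaining = carried[col]
                if remaining > 1:
                    carried[col] = (text, remaining - 1)
                else:
                    del carried[col]
            else:
                text, rowspan = queue.pop(0)
                if rowspan > 1:
                    carried[col] = (text, rowspan - 1)
            out.append(text)
            col += 1
        expanded.append(out)
    return expanded


def rows_from_table(table):
    """Convert a BeautifulSoup <table> tag into (text, rowspan) rows."""
    return [
        [(cell.get_text(strip=True), int(cell.get("rowspan", 1)))
         for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


# Hand-written sample mirroring the first five rows from the question.
# The rowspan values (JTBC=2, tvN=3) are assumptions based on which rows
# were missing a channel in the question's output.
sample_rows = [
    [("1", 1), ("The World of the Married", 1), ("JTBC", 2), ("28.371%", 1), ("16 May 2020", 1)],
    [("2", 1), ("SKY Castle", 1), ("23.779%", 1), ("1 February 2019", 1)],
    [("3", 1), ("Crash Landing on You", 1), ("tvN", 3), ("21.683%", 1), ("16 February 2020", 1)],
    [("4", 1), ("Reply 1988", 1), ("18.803%", 1), ("16 January 2016", 1)],
    [("5", 1), ("Guardian: The Lonely and Great God", 1), ("18.680%", 1), ("21 January 2017", 1)],
]

expanded = expand_rowspans(sample_rows)
for rank, show, channel, rating, date in expanded:
    print("Rank: {}, Show: {}, Channel: {}, Rating: {}, Date: {}".format(
        rank, show, channel, rating, date))
```

With the question's code, something like `expand_rowspans(rows_from_table(list_kdramas))` should give rows of equal length, so the `len(i) == 5` check is no longer needed. The header row and the trailing reference column would still have to be dropped separately.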