Why is BeautifulSoup sometimes finding all elements with find_all and sometimes not?

I'm trying to scrape this table: https://en.wikipedia.org/wiki/Korean_drama#List_of_highest-rated_Korean_dramas_in_cable_television. The Network column is giving me trouble.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://en.wikipedia.org/wiki/Korean_drama")
bsObj = BeautifulSoup(html, features="lxml")
kdramas = bsObj.find("span", {
    "id": "List_of_highest-rated_Korean_dramas_in_cable_television"})
list_kdramas = kdramas.parent.next_sibling.next_sibling.next_sibling.next_sibling
table = list_kdramas.find_all('tr')
final = []

for i in range(1, len(table)):
    temp = []  # temporary array for storing the subvalues of each row
    row = table[i].find_all('td')
    for k in range(len(row)-1):
        try:
            temp.append(row[k].get_text())
        except AttributeError:
            temp.append(row[k].find('a').get_text())

    final.append(temp)
for i in final:
    if len(i) == 5:
        print("Rank:{}, Show: {}, Channel: {}, Rating: {}, Date:{} ".format(
            i[0], i[1], i[2], i[3], i[4]))
    else:
        print("Rank:{}, Show: {}, Rating: {}, Date: {}".format(
            i[0], i[1], i[2], i[3]))

The Network column isn't showing up in my output for some of the TV shows, which is why I have to check the length of each i in my final list to make sure the format doesn't get messed up.

This is the output (only the first 5 rows are shown); you can see that some of them don't have a channel:

Rank:1 Show: The World of the Married Channel: JTBC, Rating: 28.371% Date:16 May 2020
 
Rank:2 Show: SKY Castle Rating: 23.779% Date: 1 February 2019

Rank:3 Show: Crash Landing on You Channel: tvN, Rating: 21.683% Date:16 February 2020
 
Rank:4 Show: Reply 1988 Rating: 18.803% Date: 16 January 2016

Rank:5 Show: Guardian: The Lonely and Great God Rating: 18.680% Date: 21 January 2017

Answer by Andrej Kesely (best answer)

This script will expand <td rowspan=".."> across multiple rows, so you can get the correct information:

import requests
from bs4 import BeautifulSoup


url = 'https://en.wikipedia.org/wiki/Korean_drama'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = soup.select_one('#List_of_highest-rated_Korean_dramas_in_cable_television').find_next('table')


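def expand_rowspans(table):
    # Repeat until no cell in the table carries a rowspan attribute.
    while table.select_one('td[rowspan]'):
        td = table.select_one('td[rowspan]')
        # 0-based position of this cell among the <td> children of its row.
        n = td.find_parent('tr').find_all('td', recursive=False).index(td)
        # Remove the attribute so the cell (and its copies) is not processed again.
        rs = int(td.attrs.pop('rowspan'))
        # Copy the cell into each of the next rs-1 rows it spans.
        for tr in td.find_parent('tr').find_next_siblings('tr')[:rs-1]:
            # nth-child is 1-based, so nth-child(n) is the cell just before
            # position n; this assumes the spanning cell is not in the first column.
            tr.select_one('td:nth-child({})'.format(n)).insert_after(BeautifulSoup(str(td), 'html.parser'))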


expand_rowspans(table)

for row in table.select('tr:has(td)'):
    tds = [td.get_text(strip=True) for td in row.select('td')]
    print("Rank:{:<3} Show: {:<40} Channel: {:<10} Rating: {:<10} Date: {:<10}".format(*tds))

Prints:

Rank:1   Show: The World of the Married                 Channel: JTBC       Rating: 28.371%    Date: 16 May 2020
Rank:2   Show: SKY Castle                               Channel: JTBC       Rating: 23.779%    Date: 1 February 2019
Rank:3   Show: Crash Landing on You                     Channel: tvN        Rating: 21.683%    Date: 16 February 2020
Rank:4   Show: Reply 1988                               Channel: tvN        Rating: 18.803%    Date: 16 January 2016
Rank:5   Show: Guardian: The Lonely and Great God       Channel: tvN        Rating: 18.680%    Date: 21 January 2017
Rank:6   Show: Mr. Sunshine                             Channel: tvN        Rating: 18.129%    Date: 30 September 2018
Rank:7   Show: Itaewon Class                            Channel: JTBC       Rating: 16.548%    Date: 21 March 2020
Rank:8   Show: 100 Days My Prince                       Channel: tvN        Rating: 14.412%    Date: 30 October 2018
Rank:9   Show: Hospital Playlist                        Channel: tvN        Rating: 14.142%    Date: 28 May 2020
Rank:10  Show: Signal                                   Channel: tvN        Rating: 12.544%    Date: 12 March 2016
Rank:11  Show: The Lady in Dignity                      Channel: JTBC       Rating: 12.065%    Date: 19 August 2017
Rank:12  Show: Hotel del Luna                           Channel: tvN        Rating: 12.001%    Date: 1 September 2019
Rank:13  Show: Reply 1994                               Channel: tvN        Rating: 11.509%    Date: 28 December 2013
Rank:14  Show: Prison Playbook                          Channel: tvN        Rating: 11.195%    Date: 18 January 2018
Rank:15  Show: The Crowned Clown                        Channel: tvN        Rating: 10.851%    Date: 4 March 2019
Rank:16  Show: My Kids Give Me a Headache               Channel: JTBC       Rating: 10.715%    Date: 17 March 2013
Rank:17  Show: Encounter                                Channel: tvN        Rating: 10.329%    Date: 24 January 2019
Rank:18  Show: Memories of the Alhambra                 Channel: tvN        Rating: 10.025%    Date: 20 January 2019
Rank:19  Show: Another Miss Oh                          Channel: tvN        Rating: 9.991%     Date: 28 June 2016
Rank:20  Show: The Light in Your Eyes                   Channel: JTBC       Rating: 9.731%     Date: 19 March 2019
Rank:21  Show: Strong Girl Bong-soon                    Channel: JTBC       Rating: 9.668%     Date: 15 April 2017
Rank:22  Show: Lawless Lawyer                           Channel: tvN        Rating: 8.937%     Date: 1 July 2018
Rank:23  Show: What's Wrong with Secretary Kim          Channel: tvN        Rating: 8.665%     Date: 26 July 2018
Rank:24  Show: Graceful Family                          Channel: MBN        Rating: 8.478%     Date: 17 October 2019
Rank:25  Show: Misty                                    Channel: JTBC       Rating: 8.452%     Date: 24 March 2018
Rank:26  Show: Misaeng: Incomplete Life                 Channel: tvN        Rating: 8.240%     Date: 20 December 2014
Rank:27  Show: Familiar Wife                            Channel: tvN        Rating: 8.210%     Date: 20 September 2018
Rank:28  Show: Dear My Friends                          Channel: tvN        Rating: 8.087%     Date: 2 July 2016
Rank:29  Show: Live                                     Channel: tvN        Rating: 7.730%     Date: 6 May 2018
Rank:30  Show: Arthdal Chronicles                       Channel: tvN        Rating: 7.705%     Date: 22 September 2019
Rank:31  Show: Stranger 2                               Channel: tvN        Rating: 7.627%     Date: (currently airing)
Rank:32  Show: The Good Detective                       Channel: JTBC       Rating: 7.609%     Date: 25 August 2020
Rank:33  Show: My Mister                                Channel: tvN        Rating: 7.352%     Date: 17 May 2018
Rank:34  Show: It's Okay to Not Be Okay                 Channel: tvN        Rating: 7.348%     Date: 9 August 2020
Rank:35  Show: Oh My Ghost                              Channel: tvN        Rating: 7.337%     Date: 22 August 2015
Rank:36  Show: Something in the Rain                    Channel: JTBC       Rating: 7.281%     Date: 19 May 2018
Rank:37  Show: Second 20s                               Channel: tvN        Rating: 7.233%     Date: 17 October 2015
Rank:38  Show: Cheese in the Trap                       Channel: tvN        Rating: 7.102%     Date: 1 March 2016
Rank:39  Show: Voice 2                                  Channel: OCN        Rating: 7.086%     Date: 16 September 2018
Rank:40  Show: A Korean Odyssey                         Channel: tvN        Rating: 6.942%     Date: 4 March 2018
Rank:41  Show: Live Up to Your Name                     Channel: tvN        Rating: 6.907%     Date: 1 October 2017
Rank:42  Show: The Cursed                               Channel: tvN        Rating: 6.721%     Date: 17 March 2020
Rank:43  Show: Romance Is a Bonus Book                  Channel: tvN        Rating: 6.651%     Date: 17 March 2019
Rank:44  Show: The K2                                   Channel: tvN        Rating: 6.636%     Date: 12 November 2016
Rank:45  Show: Watcher                                  Channel: OCN        Rating: 6.585%     Date: 25 August 2019
Rank:46  Show: Stranger                                 Channel: tvN        Rating: 6.568%     Date: 30 July 2017
Rank:47  Show: Hi Bye, Mama!                            Channel: tvN        Rating: 6.519%     Date: 19 April 2020
Rank:48  Show: Tunnel                                   Channel: OCN        Rating: 6.490%     Date: 21 May 2017
Rank:49  Show: Queen: Love and War                      Channel: TV Chosun  Rating: 6.348%     Date: 9 February 2020
Rank:50  Show: Avengers Social Club                     Channel: tvN        Rating: 6.330%     Date: 16 November 2017

Answer by Alexandra Dudkina

That happens because of the structure of the table:

tr, td {
  border: 1px solid darkgrey;
}
<table>
  <tr>
    <td>column 1, row 1</td>
    <td rowspan="2">column 2, row 1</td>
  </tr>
  <tr>
    <td>column 1, row 2</td>
  </tr>
  <tr>
    <td>column 1, row 3</td>
    <td>column 2, row 3</td>
  </tr>
  <tr>
    <td>column 1, row 4</td>
    <td>column 2, row 4</td>
  </tr>
</table>

In the "Network" column, some cells span several rows because of the "rowspan" attribute on the "td" element. This attribute defines how many rows the td element covers. The following rows covered by that cell do not contain a td element of their own for that column, which is why the channel is missing from your results as well.

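To get the rowspan value, you could use the code below. The default of 1 covers cells that have no rowspan attribute; without it, get() returns None and int() raises a TypeError.

rowspan = int(row[k].get('rowspan', 1))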
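
Once you have the text and rowspan of every cell, one possible way to fill the gaps is to carry each spanning cell down into the rows it covers. The following is only a minimal sketch, assuming a well-formed table with no colspan. The function name fill_rowspans and the (text, rowspan) tuples are my own illustration, not part of BeautifulSoup; with BeautifulSoup you could build each tuple as (td.get_text(strip=True), int(td.get('rowspan', 1))). The sample data mirrors the small table above.

```python
def fill_rowspans(rows):
    """Expand rows of (text, rowspan) cells into a full grid of strings.

    Each row lists only the cells physically present in that <tr>.
    Assumes a well-formed table without colspan (hypothetical helper).
    """
    grid = []
    pending = {}  # column index -> (text, number of rows still to fill)
    for row in rows:
        cells = iter(row)
        # Total columns = cells present in this row + cells carried from above.
        width = len(row) + len(pending)
        out = []
        for col in range(width):
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, left = pending.pop(col)
                if left > 1:
                    pending[col] = (text, left - 1)
            else:
                # Take the next real cell; pad with an empty one if the row is short.
                text, span = next(cells, ("", 1))
                if span > 1:
                    pending[col] = (text, span - 1)
            out.append(text)
        grid.append(out)
    return grid


rows = [
    [("column 1, row 1", 1), ("column 2, row 1", 2)],
    [("column 1, row 2", 1)],  # column 2 is covered by the rowspan above
    [("column 1, row 3", 1), ("column 2, row 3", 1)],
    [("column 1, row 4", 1), ("column 2, row 4", 1)],
]

grid = fill_rowspans(rows)
for line in grid:
    print(line)  # the second row now also contains 'column 2, row 1'
```

Unlike the first answer, this does not modify the parsed HTML; it only builds a list of lists, so every row ends up with the same number of values and can be passed to a single format string.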