I am trying to shorten my code and make it more efficient. However, I ran into a KeyError and can't figure out what went wrong. Could someone point out why my expression doesn't work? PS: I am a beginner.

With this code:

recommended = soup.select('table:has(font:contains("推荐主题")), '
                          'table:has(font:contains("版块主题"))')
for item in recommended:
    for i in item.select(".folder:has(a)"):

I get the following elements:

<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>

But when I add one more line,

for item in recommended:
    for i in item.select(".folder:has(a)"):
        url_tail = i['href']

I get this KeyError:

    return self.attrs[key]
KeyError: 'href'

What I am trying to extract are the href links. Thank you all.

3 Answers

2
QHarr On Best Solutions

@facelessuser has explained the error nicely (+) and given my first-choice selector. As plan Bs, there appear to be two other attribute = value selector possibilities.

Either:

[href^="thread-"]

Or:

[title="新窗口打开"]

Either can be used in a list comprehension such as:

links = [item['href'] for item in soup.select('[href^="thread-"]')]

In your code the select may need to be called on item rather than soup. If that ends up matching too broadly, you can always add the parent class: .folder [title="新窗口打开"]
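To compare how broadly each of these selectors matches, here is a minimal sketch. The first two rows are taken from the question; the third row (class "other", href "forum-1.html") is a hypothetical addition, included only to show a case where the title selector matches more than intended.

```python
from bs4 import BeautifulSoup

# Two rows from the question plus one hypothetical row (class "other")
# that shares the same title attribute but is not a thread link.
html = """
<table><tr>
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="other"><a href="forum-1.html" title="新窗口打开">elsewhere</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute "starts with" selector: only hrefs beginning with "thread-".
by_prefix = [a["href"] for a in soup.select('[href^="thread-"]')]

# Exact attribute match: every element with this title, wherever it sits.
by_title = [a["href"] for a in soup.select('[title="新窗口打开"]')]

# Same title match, narrowed to descendants of elements with class "folder".
by_title_in_folder = [a["href"] for a in soup.select('.folder [title="新窗口打开"]')]

print(by_prefix)
print(by_title)
print(by_title_in_folder)
```

In this sample the title selector on its own also picks up the hypothetical third link, while the prefix selector and the .folder-scoped selector return only the two thread links.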

2
facelessuser On

.folder:has(a) selects the td element, because that is the element with the class folder that contains an a element. It does not select the a element; it only checks that the element with .folder has an a descendant. A td has no href attribute, which is why i['href'] raises the KeyError.

Something like .folder a is probably what you want.
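The difference between the two selectors can be sketched as follows. The markup is a trimmed copy of two rows from the question (the title attribute is left out for brevity), wrapped in a table so it parses as a fragment.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the markup shown in the question.
html = """
<table><tr>
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" target="_blank"><img src="images/green001/folder_new.gif"/></a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# :has(a) only filters the .folder elements; the matches are still <td> tags,
# and a <td> here carries a class attribute but no href.
cells = soup.select(".folder:has(a)")
cell_names = [cell.name for cell in cells]
cell_has_href = ["href" in cell.attrs for cell in cells]

# The descendant combinator returns the <a> tags themselves.
anchors = soup.select(".folder a")
hrefs = [a["href"] for a in anchors]

print(cell_names, cell_has_href)
print(hrefs)
```

If the :has(a) selector is kept for some other reason, the anchor can also be reached from each matched cell, for example with i.a['href'] inside the original loop.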

0
hygull On

You can try it like this.

As I don't have the complete HTML or the URL you are requesting, I just retrieved the href values from the HTML text you pasted.

1) Import & create BeautifulSoup object »

>>> from bs4 import BeautifulSoup
>>> 
>>> html_text = """<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
... <td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
... <td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
... <td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>"""
>>> 
>>> soup = BeautifulSoup(html_text, "html.parser")
>>>
>>> soup
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
<td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
>>> 

2) Find all tds »

>>> tds = soup.find_all("td", class_="folder")
>>> tds
[<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>, <td class="folder"><a href="thread-10439293-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>, <td class="folder"><a href="thread-10439292-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>, <td class="folder"><a href="thread-10439290-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>]
>>> 

3) Inspect (Just to test)»

>>> tds[0]
<td class="folder"><a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a></td>
>>> 
>>> tds[0].a
<a href="thread-10439294-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_new.gif"/></a>
>>> 
>>> tds[0].a.get("href")
'thread-10439294-1-1.html'
>>> 

4) Finally, retrieve links (2 ways) »

>>> # Using loop
... 
>>> for td in tds:
...     print(td.a.get("href"))
... 
thread-10439294-1-1.html
thread-10439293-1-1.html
thread-10439292-1-1.html
thread-10439290-1-1.html
>>> 
>>> for td in tds:
...     print(td.a["href"])
... 
thread-10439294-1-1.html
thread-10439293-1-1.html
thread-10439292-1-1.html
thread-10439290-1-1.html
>>> 
>>>
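The two ways above differ when something is missing: td.a is None if a cell has no anchor, tag["href"] raises KeyError if the attribute is absent, and tag.get("href") returns None instead. A minimal sketch of a defensive loop, assuming some cells may lack a link (the second and third cells below are hypothetical and do not appear in the question):

```python
from bs4 import BeautifulSoup

# First cell is modelled on the question; the other two are hypothetical
# cases added to show what happens when the anchor or its href is missing.
html = """
<table><tr>
<td class="folder"><a href="thread-10439294-1-1.html"></a></td>
<td class="folder"><a name="no-link"></a></td>
<td class="folder">no anchor here</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for td in soup.find_all("td", class_="folder"):
    anchor = td.a  # None when the cell has no <a> descendant
    if anchor is None:
        continue
    href = anchor.get("href")  # None rather than KeyError when href is absent
    if href is not None:
        links.append(href)

print(links)
```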