I am trying to capture the 3 list items (li) inside a specific unordered list. Using the findAll function I am able to get what I want. However, although the returned list contains the 3 li's, everything in the list returned by findAll is treated as 1 element.

I have also tried the findChild function, and it sees 7 elements. What I am trying to do is retrieve the links, so that I can fetch their contents, along with the text contained in each list item, using findAll, findChild, or anything else.

This is originally what I have done:

 focus=soup.findAll('ul',{'class':'sub-menu'})
 #output

 #[<ul class="sub-menu">
 #<li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
 #item-20588" id="menu-item-20588"><a href="http://www.air- 
 #shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 
 #2019</a></li>
 #<li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
 #item-22412" id="menu-item-22412"><a href="http://www.air- 
 #shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow 
 #Calendar 2019</a></li>
 #<li class="menu-item menu-item-type-taxonomy menu-item-object-category 
 #menu-item-18245" id="menu-item-18245"><a href="http://www.air- 
 #shows.org.uk/category/display-team-schedule/">Latest Display Team 
 #Dates</a></li>
 #</ul>]

The length of the list is 1. However, using findChild I have the following:

for i in soup.findChild('ul',{'class':'sub-menu'}):
    print(i)
    print('==='*10)

#output

==============================
#<li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
#item-20588" id="menu-item-20588"><a href="http://www.air- 
#shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 
#2019</a></li>
==============================

==============================
#<li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
#item-22412" id="menu-item-22412"><a href="http://www.air- 
#shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow 
#Calendar 2019</a></li>
==============================

==============================
#<li class="menu-item menu-item-type-taxonomy menu-item-object-category 
#menu-item-18245" id="menu-item-18245"><a href="http://www.air- 
#shows.org.uk/category/display-team-schedule/">Latest Display Team 
#Dates</a></li>
==============================

All I want is to be able to get the URLs in the href attributes and the text within these 3 list items.

I am looking to have something like this:

www.air-shows.org.uk/2018/06/uk-airshow-calendar-2019
UK Airshow Calendar 2019

www.air-shows.org.uk/2018/07/european-airshow-calendar-2019
European Airshow Calendar 2019

2 Answers

Answer by QHarr

You could also use the following. I am assuming that the actual page does not have \n inside the text or hrefs. This also assumes that the selector .sub-menu li,.sub-menu a returns li and a elements in alternating order, so the two slices have equal lengths.

from bs4 import BeautifulSoup as bs

html = '''
<html>
 <head></head>
 <body>
  <ul class="sub-menu"> 
   <li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
 item-20588" id="menu-item-20588"><a href="http://www.air- 
 shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li> 
   <li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
 item-22412" id="menu-item-22412"><a href="http://www.air- 
 shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li> 
   <li class="menu-item menu-item-type-taxonomy menu-item-object-category 
 menu-item-18245" id="menu-item-18245"><a href="http://www.air- 
 shows.org.uk/category/display-team-schedule/">Latest Display Team Dates</a></li> 
  </ul>
 </body>
</html>
 '''

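# lxml must be installed; the built-in 'html.parser' also works here
soup = bs(html, 'lxml')

# select() returns matches in document order: li, a, li, a, ...
all_items = soup.select('.sub-menu li,.sub-menu a')
events = [item.text for item in all_items[0::2]]   # text of each li
links = [item['href'] for item in all_items[1::2]]  # href of each a
print(events, links)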
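
If the real page does turn out to contain line breaks inside the href values (as the wrapped sample above does), one possible workaround is to strip the whitespace from each URL and read the text and the link from the same a tag, so nothing depends on the two slices staying aligned. This is only a sketch: the html sample is shortened, the helper name extract_menu_links is made up for illustration, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Shortened sample; the hrefs are deliberately wrapped across lines
html = '''
<ul class="sub-menu">
 <li id="menu-item-20588"><a href="http://www.air-
 shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li>
 <li id="menu-item-22412"><a href="http://www.air-
 shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li>
</ul>
'''

def extract_menu_links(markup):
    """Return (text, url) pairs for links inside .sub-menu (hypothetical helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    pairs = []
    # a[href] skips anchors that have no href attribute
    for a in soup.select('.sub-menu li a[href]'):
        url = ''.join(a['href'].split())        # drop all whitespace in the URL
        text = ' '.join(a.get_text().split())   # collapse whitespace in the label
        pairs.append((text, url))
    return pairs

for text, url in extract_menu_links(html):
    print(url)
    print(text)
    print()
```

Removing every whitespace character from the URL is safe only on the assumption that the real links contain no intentional spaces.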

Answer by Kajal Kundu

Here you go.

from bs4 import BeautifulSoup
html='''
<li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
item-20588" id="menu-item-20588"><a href="http://www.air- 
shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
item-22412" id="menu-item-22412"><a href="http://www.air- 
shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category 
menu-item-18245" id="menu-item-18245"><a href="http://www.air- 
shows.org.uk/category/display-team-schedule/">Latest Display Team Dates</a></li>'''

soup = BeautifulSoup(html, "html.parser")
# href=True keeps only <a> tags that actually have an href attribute
for item in soup.find_all('a', href=True):
    print("link : " + item['href'])
    print("text : " + item.text)
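
The loop above prints every link in the document, which is fine for the trimmed sample but would also pick up links outside the menu on the full page. Below is a minimal sketch of a scoped version, assuming the menu is the first ul with class sub-menu; the markup is shortened and the extra link at the end is invented for illustration. It also shows where the counts 1 and 7 in the question most likely come from: find_all('ul', ...) returns a list of matching ul tags (one here), while iterating over a tag walks its direct children, including the newline strings between the li tags.

```python
from bs4 import BeautifulSoup

# Shortened sample markup; the last link is invented to show the scoping
html = '''<ul class="sub-menu">
<li><a href="http://www.air-shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li>
<li><a href="http://www.air-shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li>
<li><a href="http://www.air-shows.org.uk/category/display-team-schedule/">Latest Display Team Dates</a></li>
</ul>
<a href="http://example.com/">link outside the menu</a>'''

soup = BeautifulSoup(html, 'html.parser')

uls = soup.find_all('ul', class_='sub-menu')     # list of matching <ul> tags
menu = soup.find('ul', class_='sub-menu')        # the first matching <ul> itself
children = list(menu.children)                   # <li> tags plus newline strings
items = menu.find_all('li', recursive=False)     # only the direct <li> children

print(len(uls), len(children), len(items))

# Collect (href, text) pairs from the menu only
links = []
for li in items:
    a = li.find('a', href=True)
    if a is not None:
        links.append((a['href'], a.get_text(strip=True)))

for href, text in links:
    print(href)
    print(text)
    print()
```

Searching from menu rather than from soup is what keeps the link outside the list out of the result.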