Cleaning text with Beautiful Soup

I'm processing an HTML page with Beautiful Soup, and so far I have done the following:

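from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/" + "Category:American_football"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, "html.parser")
# Grab the div that holds the subcategory listing and flatten it to text
pages = soup.find("div", {"id": "mw-subcategories"})
cleaned = pages.get_text()
cleaned = cleaned.encode("utf-8")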

and my output looks like the following:

"\nSubcategories\nThis category has the following 26 subcategories, out of 26 total.\n\xc2\xa0\n\xe2\x96\xba  American football by city\xe2\x80\x8e (5 C)\n\n\n\xe2\x96\xba  American football by continent\xe2\x80\x8e (6 C)\n\n\n\xe2\x96\xba  American football by country\xe2\x80\x8e (41 C, 1 P)\n\n*\n\xe2\x96\xba  American football-related lists\xe2\x80\x8e (6 C, 16 P)\n\nA\n\xe2\x96\xba  American football occupations\xe2\x80\x8e (2 C, 6 P)\n\nC\n\xe2\x96\xba  American football competitions\xe2\x80\x8e (15 C, 13 P)\n\nE\n\xe2\x96\xba  American football equipment\xe2\x80\x8e (16 P)\n\nH\n\xe2\x96\xba  History of American football\xe2\x80\x8e (8 C, 14 P)\n\nI\n\xe2\x96\xba  American football incidents\xe2\x80\x8e (1 C, 45 P)\n\nM\n\xe2\x96\xba  American football media\xe2\x80\x8e (12 C, 16 P)\n\nO\n\xe2\x96\xba  American football organisations\xe2\x80\x8e (1 C, 7 P)\n\nP\n\xe2\x96\xba  American football people\xe2\x80\x8e (11 C)\n\n\n\xe2\x96\xba  American football plays\xe2\x80\x8e (68 P)\n\n\n\xe2\x96\xba  American football positions\xe2\x80\x8e (1 C, 41 P)\n\nR\n\xe2\x96\xba  American football records and statistics\xe2\x80\x8e (4 C, 8 P)\n\nS\n\xe2\x96\xba  Seasons in American football\xe2\x80\x8e (14 C)\n\n\n\xe2\x96\xba  Semi-professional American football\xe2\x80\x8e (1 C, 9 P)\n\n\n\xe2\x96\xba  American football strategy\xe2\x80\x8e (1 C, 29 P)\n\nT\n\xe2\x96\xba  American football teams\xe2\x80\x8e (10 C, 10 P)\n\n\n\xe2\x96\xba  American football terminology\xe2\x80\x8e (4 C, 127 P)\n\n\n\xe2\x96\xba  American football trophies and awards\xe2\x80\x8e (9 C, 26 P)\n\nV\n\xe2\x96\xba  Variations of American football\xe2\x80\x8e (5 C, 12 P)\n\n\n\xe2\x96\xba  American football venues\xe2\x80\x8e (2 C, 2 P)\n\nW\n\xe2\x96\xba  Women's American football\xe2\x80\x8e (3 C, 3 P)\n\n\xce\x99\n\xe2\x96\xba  American football logos\xe2\x80\x8e (3 C, 211 F)\n\n\xce\xa3\n\xe2\x96\xba  American football stubs\xe2\x80\x8e (6 C, 218 P)\n\n\n"

I'm trying to figure out how to strip out everything but the actual category names, i.e. remove fragments such as:

\xe2\x80\x8e (6 C, 218 P)\n\n\n

Is there a way to get rid of this using the Beautiful Soup library, or how should I go about further refining the text?

There is 1 answer

AudioBubble

Instead of taking the text of the whole div, select the <a> elements you want directly:

import bs4

soup = bs4.BeautifulSoup(raw, "html.parser")
# Each subcategory link carries the CategoryTreeLabel class
for cat in soup.find_all("a", {"class": "CategoryTreeLabel"}):
    print(cat.text)

Output:

American football by city
American football by continent
American football by country
American football-related lists
American football occupations
American football competitions
American football equipment
History of American football
American football incidents
American football media
American football organisations
American football people
American football plays
American football positions
American football records and statistics
Seasons in American football
Semi-professional American football
American football strategy
American football teams
American football terminology
American football trophies and awards
Variations of American football
American football venues
Women's American football
American football logos
American football stubs
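
If you only have the flattened get_text() output and cannot go back to the parsed tree, the text can also be refined after the fact. The following is a rough sketch using only the standard library, not part of the answer above. It assumes Python 3, that it runs on the decoded string (before the .encode('utf-8') step), and that every category line starts with the U+25BA arrow and ends with a left-to-right mark (U+200E) followed by the counts in parentheses, as in the output shown in the question. The names COUNTS, category_names and sample are made up for illustration.

```python
import re

# Trailing left-to-right mark (U+200E), if present, plus counts such as "(6 C, 218 P)".
COUNTS = re.compile(r"\u200e?\s*\(\d[^)]*\)\s*$")


def category_names(text):
    """Return the category names found in flattened get_text() output."""
    names = []
    for line in text.splitlines():
        # Category lines start with the U+25BA arrow; index headings like "A" do not.
        if not line.startswith("\u25ba"):
            continue
        name = line.lstrip("\u25ba").strip()
        names.append(COUNTS.sub("", name))
    return names


# A shortened sample in the same shape as the output in the question.
sample = (
    "\nSubcategories\nThis category has the following 26 subcategories, "
    "out of 26 total.\n\u00a0\n"
    "\u25ba  American football by city\u200e (5 C)\n\n"
    "A\n\u25ba  American football occupations\u200e (2 C, 6 P)\n"
)

for name in category_names(sample):
    print(name)
```

This depends on the exact layout of the flattened text, so selecting the <a> elements as shown above is likely the more robust option whenever the parsed soup is still available.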