I'm processing an HTML page with Beautiful Soup and have done the following:
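from urllib.request import urlopen  # on Python 2: from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/" + "Category:American_football"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, "html.parser")  # name the parser explicitly
pages = soup.find("div", {"id": "mw-subcategories"})
cleaned = pages.get_text()
cleaned = cleaned.encode("utf-8")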
My output looks like the following:
"\nSubcategories\nThis category has the following 26 subcategories, out of 26 total.\n\xc2\xa0\n\xe2\x96\xba American football by city\xe2\x80\x8e (5 C)\n\n\n\xe2\x96\xba American football by continent\xe2\x80\x8e (6 C)\n\n\n\xe2\x96\xba American football by country\xe2\x80\x8e (41 C, 1 P)\n\n*\n\xe2\x96\xba American football-related lists\xe2\x80\x8e (6 C, 16 P)\n\nA\n\xe2\x96\xba American football occupations\xe2\x80\x8e (2 C, 6 P)\n\nC\n\xe2\x96\xba American football competitions\xe2\x80\x8e (15 C, 13 P)\n\nE\n\xe2\x96\xba American football equipment\xe2\x80\x8e (16 P)\n\nH\n\xe2\x96\xba History of American football\xe2\x80\x8e (8 C, 14 P)\n\nI\n\xe2\x96\xba American football incidents\xe2\x80\x8e (1 C, 45 P)\n\nM\n\xe2\x96\xba American football media\xe2\x80\x8e (12 C, 16 P)\n\nO\n\xe2\x96\xba American football organisations\xe2\x80\x8e (1 C, 7 P)\n\nP\n\xe2\x96\xba American football people\xe2\x80\x8e (11 C)\n\n\n\xe2\x96\xba American football plays\xe2\x80\x8e (68 P)\n\n\n\xe2\x96\xba American football positions\xe2\x80\x8e (1 C, 41 P)\n\nR\n\xe2\x96\xba American football records and statistics\xe2\x80\x8e (4 C, 8 P)\n\nS\n\xe2\x96\xba Seasons in American football\xe2\x80\x8e (14 C)\n\n\n\xe2\x96\xba Semi-professional American football\xe2\x80\x8e (1 C, 9 P)\n\n\n\xe2\x96\xba American football strategy\xe2\x80\x8e (1 C, 29 P)\n\nT\n\xe2\x96\xba American football teams\xe2\x80\x8e (10 C, 10 P)\n\n\n\xe2\x96\xba American football terminology\xe2\x80\x8e (4 C, 127 P)\n\n\n\xe2\x96\xba American football trophies and awards\xe2\x80\x8e (9 C, 26 P)\n\nV\n\xe2\x96\xba Variations of American football\xe2\x80\x8e (5 C, 12 P)\n\n\n\xe2\x96\xba American football venues\xe2\x80\x8e (2 C, 2 P)\n\nW\n\xe2\x96\xba Women's American football\xe2\x80\x8e (3 C, 3 P)\n\n\xce\x99\n\xe2\x96\xba American football logos\xe2\x80\x8e (3 C, 211 F)\n\n\xce\xa3\n\xe2\x96\xba American football stubs\xe2\x80\x8e (6 C, 218 P)\n\n\n"
I'm trying to figure out how to strip out everything but the actual category names, i.e. remove fragments such as:
\xe2\x80\x8e (6 C, 218 P)\n\n\n
Is there a way to get rid of this with Beautiful Soup itself, or how else should I go about refining the text further?
Navigate to the <a> tags you want.
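As a minimal sketch of that idea: instead of calling get_text() on the whole div, collect the text of each <a> inside it, so the arrows, the invisible \xe2\x80\x8e marks and the "(5 C)" counts never enter the result. The SAMPLE_HTML string and the subcategory_names helper below are hypothetical stand-ins so the example runs without a network call. The real Wikipedia markup may contain other links inside the div, so the selection might need narrowing.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the Wikipedia category page.
SAMPLE_HTML = """
<div id="mw-subcategories">
  <h2>Subcategories</h2>
  <ul>
    <li><a href="/wiki/Category:American_football_by_city">American football by city</a>\u200e (5 C)</li>
    <li><a href="/wiki/Category:American_football_plays">American football plays</a>\u200e (68 P)</li>
  </ul>
</div>
"""


def subcategory_names(html):
    """Return the link text of every <a> inside div#mw-subcategories."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": "mw-subcategories"})
    if container is None:
        return []
    # Only the text inside each <a> is kept; the counts and the
    # left-to-right marks sit outside the links and are skipped.
    return [a.get_text(strip=True) for a in container.find_all("a")]


names = subcategory_names(SAMPLE_HTML)
print(names)
```

With the sample above this prints ['American football by city', 'American football plays']. To run it against the live page, pass the raw bytes from urlopen(url).read() to the same function.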