regular expression to extract data from html page

2k views Asked by At

I want to extract all anchor tags from html pages. I am using this in Linux.

lynx --source http://www.imdb.com | egrep "<a[^>]*>"

but that is not working as expected, since result contains unwanted results

<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>

I want just

<a href >...</a>

any good way ?

6

There are 6 answers

0
kurumi On

why can't you use options like --dump ?

lynx --dump --listonly http://www.imdb.com
0
Dennis Williamson On

Try grep -Eo:

$ echo '<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>' | grep -Eo '<a[^>]*>'
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">

But please read the answer that MAK linked to.

0
tchrist On

If you have a -P option in your grep so that it accepts PCRE patterns, you should be able to use better regexes. Sometimes a minimal quantifier like *? helps. Also, you’re getting the whole input line, not just the match itself; if you have a -o option to grep, it will list only the part that matches.

egrep -Po '<a[^<>]*>'

If your grep doesn’t have those options, try

perl -00 -nle 'print $1 while /(<a[^<>]*>)/gi'

Which now crosses line boundaries.

To do a real parse of HTML requires regexes subtantially more more complex than you are apt to wish to enter on the command line. Here’s one example, and here’s another. Those may not convince you to try a non-regex approach, but they should at least show you how much harder it is in the general case than in specific ones.

This answer shows why all things are possible, but not all are expedient.

0
jfs On

Here's some examples of why you should not use regex to parse html.

To extract values of 'href' attribute of anchor tags, run:

$ python -c'import sys, lxml.html as h
> root = h.parse(sys.argv[1]).getroot()
> root.make_links_absolute(base_url=sys.argv[1])
> print "\n".join(root.xpath("//a/@href"))' http://imdb.com | sort -u

Install lxml module if needed: $ sudo apt-get install python-lxml.

Output

http://askville.amazon.com
http://idfilm.blogspot.com/2011/02/another-class.html
http://imdb.com
http://imdb.com/
http://imdb.com/a2z
http://imdb.com/a2z/
http://imdb.com/advertising/
http://imdb.com/boards/
http://imdb.com/chart/
http://imdb.com/chart/top
http://imdb.com/czone/
http://imdb.com/features/hdgallery
http://imdb.com/features/oscars/2011/
http://imdb.com/features/sundance/2011/
http://imdb.com/features/video/
http://imdb.com/features/video/browse/
http://imdb.com/features/video/trailers/
http://imdb.com/features/video/tv/
http://imdb.com/features/yearinreview/2010/
http://imdb.com/genre
http://imdb.com/help/
http://imdb.com/helpdesk/contact
http://imdb.com/help/show_article?conditions
http://imdb.com/help/show_article?rssavailable
http://imdb.com/jobs
http://imdb.com/lists
http://imdb.com/media/index/rg2392693248
http://imdb.com/media/rm3467688448/rg2392693248
http://imdb.com/media/rm3484465664/rg2392693248
http://imdb.com/media/rm3719346688/rg2392693248
http://imdb.com/mymovies/list
http://imdb.com/name/nm0000207/
http://imdb.com/name/nm0000234/
http://imdb.com/name/nm0000631/
http://imdb.com/name/nm0000982/
http://imdb.com/name/nm0001392/
http://imdb.com/name/nm0004716/
http://imdb.com/name/nm0531546/
http://imdb.com/name/nm0626362/
http://imdb.com/name/nm0742146/
http://imdb.com/name/nm0817980/
http://imdb.com/name/nm2059117/
http://imdb.com/news/
http://imdb.com/news/celebrity
http://imdb.com/news/movie
http://imdb.com/news/ni7650335/
http://imdb.com/news/ni7653135/
http://imdb.com/news/ni7654375/
http://imdb.com/news/ni7654598/
http://imdb.com/news/ni7654810/
http://imdb.com/news/ni7655320/
http://imdb.com/news/ni7656816/
http://imdb.com/news/ni7660987/
http://imdb.com/news/ni7662397/
http://imdb.com/news/ni7665028/
http://imdb.com/news/ni7668639/
http://imdb.com/news/ni7669396/
http://imdb.com/news/ni7676733/
http://imdb.com/news/ni7677253/
http://imdb.com/news/ni7677366/
http://imdb.com/news/ni7677639/
http://imdb.com/news/ni7677944/
http://imdb.com/news/ni7678014/
http://imdb.com/news/ni7678103/
http://imdb.com/news/ni7678225/
http://imdb.com/news/ns0000003/
http://imdb.com/news/ns0000018/
http://imdb.com/news/ns0000023/
http://imdb.com/news/ns0000031/
http://imdb.com/news/ns0000128/
http://imdb.com/news/ns0000136/
http://imdb.com/news/ns0000141/
http://imdb.com/news/ns0000195/
http://imdb.com/news/ns0000236/
http://imdb.com/news/ns0000344/
http://imdb.com/news/ns0000345/
http://imdb.com/news/ns0004913/
http://imdb.com/news/top
http://imdb.com/news/tv
http://imdb.com/nowplaying/
http://imdb.com/photo_galleries/new_photos/2010/
http://imdb.com/poll
http://imdb.com/privacy
http://imdb.com/register/login
http://imdb.com/register/?why=footer
http://imdb.com/register/?why=mymovies_footer
http://imdb.com/register/?why=personalize
http://imdb.com/rg/NAV_TWITTER/NAV_EXTRA/http://www.twitter.com/imdb
http://imdb.com/ri/TRAILERS_HPPIRATESVID/TOP_BUCKET/102785/video/imdb/vi161323033/
http://imdb.com/search
http://imdb.com/search/
http://imdb.com/search/name?birth_monthday=02-12
http://imdb.com/search/title?sort=num_votes,desc&title_type=feature&my_ratings=exclude
http://imdb.com/sections/dvd/
http://imdb.com/sections/horror/
http://imdb.com/sections/indie/
http://imdb.com/sections/tv/
http://imdb.com/showtimes/
http://imdb.com/tiger_redirect?FT_LIC&licensing/
http://imdb.com/title/tt0078748/
http://imdb.com/title/tt0279600/
http://imdb.com/title/tt0377981/
http://imdb.com/title/tt0881320/
http://imdb.com/title/tt0990407/
http://imdb.com/title/tt1034389/
http://imdb.com/title/tt1265990/
http://imdb.com/title/tt1401152/
http://imdb.com/title/tt1411238/
http://imdb.com/title/tt1411238/trivia
http://imdb.com/title/tt1446714/
http://imdb.com/title/tt1452628/
http://imdb.com/title/tt1464174/
http://imdb.com/title/tt1464540/
http://imdb.com/title/tt1477837/
http://imdb.com/title/tt1502404/
http://imdb.com/title/tt1504320/
http://imdb.com/title/tt1563069/
http://imdb.com/title/tt1564367/
http://imdb.com/title/tt1702443/
http://imdb.com/tvgrid/
http://m.imdb.com
http://pro.imdb.com/r/IMDbTabNB/
http://resume.imdb.com
http://resume.imdb.com/
https://secure.imdb.com/register/subscribe?c=a394d4442664f6f6475627
http://twitter.com/imdb
http://wireless.amazon.com
http://www.3news.co.nz/The-Hobbit-media-conference--full-video/tabid/312/articleID/198020/Default.aspx
http://www.amazon.com/exec/obidos/redirect-home/internetmoviedat
http://www.audible.com
http://www.boxofficemojo.com
http://www.dpreview.com
http://www.endless.com
http://www.fabric.com
http://www.imdb.com/board/bd0000089/threads/
http://www.imdb.com/licensing/
http://www.imdb.com/media/rm1037220352/rg261921280
http://www.imdb.com/media/rm2695346688/tt1449283
http://www.imdb.com/media/rm3987585536/tt1092026
http://www.imdb.com/name/nm0000092/
http://www.imdb.com/photo_galleries/new_photos/2010/index
http://www.imdb.com/search/title?sort=num_votes,desc&title_type=tv_series&my_ratings=exclude
http://www.imdb.com/sections/indie/
http://www.imdb.com/title/tt0079470/
http://www.imdb.com/title/tt0079470/quotes?qt0471997
http://www.imdb.com/title/tt1542852/
http://www.imdb.com/title/tt1606392/
http://www.imdb.de
http://www.imdb.es
http://www.imdb.fr
http://www.imdb.it
http://www.imdb.pt
http://www.movieline.com/2011/02/watch-jon-hamm-talk-butthole-surfers-paul-rudd-impersonate-jay-leno-at-book-reading-1.php
http://www.movingimagesource.us/articles/un-tv-20110210
http://www.npr.org/blogs/monkeysee/2011/02/10/133629395/james-franco-recites-byron-to-the-worlds-luckiest-middle-school-journalist
http://www.nytimes.com/2011/02/06/books/review/Brubach-t.html
http://www.shopbop.com/welcome
http://www.smallparts.com
http://www.twinpeaks20.com/details/
http://www.twitter.com/imdb
http://www.vanityfair.com/hollywood/features/2011/03/lauren-bacall-201103
http://www.warehousedeals.com
http://www.withoutabox.com
http://www.zappos.com
0
tobi On

To extract values of 'href' attribute of anchor tags you may also use xmlstarlet after converting HTML to XHTML using HTML Tidy (Mac OS X version released on 25 March 2009):

curl -s www.imdb.com | 
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a/@href" -v '.' -n | 
grep '^[[:space:]]*http://' | sort -u | nl
0
greg On

On Mac OS X you may also use the command line tool linkscraper:

linkscraper http://www.imdb.com

see: http://codesnippets.joyent.com/posts/show/10772