I'm trying to extract quite a bit of data from a perfectly structured web page and struggling with Mojo::DOM
methods. I would really appreciate it if anyone could point me in the right direction.
The truncated HTML with interesting data follows:
<div class="post" data-story-id="3964117" data-visited="false">//extracting story-id
<h2 class="post_title page_title"><a href="http://example.com/story/some_url" class="to-comments">header.</a></h2>
//useless data and tags
<a href="http://example.com/story/some_url" class="b-story__show-all">
<span>useless data</span>
</a>
<div class="post_tags">
<ul>
<li class="post_tag post_tag_strawberry hidden"><a href="http://example.com/search.php?n=32&r=3"> </a></li>
<li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag1</a></li>
<li class="post_tag"><a href="http://example.com/tag/tag2/hot">tag2</a></li>
<li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag3</a></li>
</ul>
</div>
<div class="post_actions_box">
<div class="post_rating_box">
<ul data-story-id="3964117" data-vote="0" data-can-vote="true">
<li><span class="post_rating post_rating_up control"> </span></li>
<li><span class="post_rating_count control label">1956</span></li> //1956 - interesting value
<li><span class="post_rating post_rating_down control"> </span></li>
</ul>
</div>
<div class="post_more_box">
<ul>
<li>
<span class="post_more control"> </span>
</li>
<li>
<a class="post_comments_count label to-comments" href="http://example.com/story/some_url#comments">132 <i> </i></a>
</li>
</ul>
</div>
</div>
</div>
What I have right now is:
use strict;
use warnings;
use Data::Dumper;
use Mojo::DOM;
my $file = "index2.html";
local( $/, *FH ) ;
open( FH, $file ) or die "sudden flaming death\n";
my $text = <FH>;
my $dom = Mojo::DOM->new;
$dom->parse($text);
my $ids = $dom->find('div.post')
->each (sub {print $_->attr('data-story-id'), "\n";});
$dom->find('a.to-comments')->each (sub {print $_->text, "\n";});
This mess extracts data-story-id from the source, along with the header value (I tested the same with the href value), but all my other attempts fail.
3964117
Header
132
"post_rating_count control label" is not extracted. I could get the first href values with searching for a.to-comments
and returning attr('href'), but for some reason it also returns the values of the link at the end of the segment with class="post_comments_count label to-comments". The same happens with header value extraction.
In the end I am looking for an array of data structures with the following fields:
- story-id (this is a success)
- href (somehow, matching more than needed.)
- header (somehow, matching more than needed.)
- list of tags as a string (no idea how to do that)
What is more, I feel it is possible to optimize the code and make it look a bit better, but my kung-fu is not so strong.
Your HTML is malformed, as I said in my comment. I've guessed where the missing <div> might go, but I'm probably wrong. I've assumed the last </div> in the data corresponds to the first <div>, so that the whole block constitutes a single post.
The main problem you have is trying to do everything inside an each method call on your Mojo::Collection objects. It's far easier to use Perl to iterate over each collection.
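A minimal sketch of iterating with a plain Perl for loop is shown here, under a few assumptions. The inline HTML is a trimmed, repaired copy of the markup from the question, not the real page. The hash keys (story_id, href, header, tags, rating) are hypothetical names chosen for illustration. Note that a.to-comments matches both the title link and the comments-count link, because both carry the to-comments class, so the selector is narrowed to the link inside h2.post_title.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Trimmed sample based on the question's markup (an assumption, not the real page)
my $html = <<'HTML';
<div class="post" data-story-id="3964117" data-visited="false">
  <h2 class="post_title page_title"><a href="http://example.com/story/some_url" class="to-comments">header.</a></h2>
  <div class="post_tags"><ul>
    <li class="post_tag post_tag_strawberry hidden"><a href="http://example.com/search.php"> </a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag1</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag2/hot">tag2</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag3</a></li>
  </ul></div>
  <ul><li><span class="post_rating_count control label">1956</span></li></ul>
  <a class="post_comments_count label to-comments" href="http://example.com/story/some_url#comments">132 <i> </i></a>
</div>
HTML

my $dom = Mojo::DOM->new($html);
my @posts;

# each() without a callback returns the collection's elements as a list
for my $post ( $dom->find('div.post')->each ) {

    # Only the link inside the title, not every a.to-comments in the post
    my $title_link = $post->at('h2.post_title > a.to-comments');
    my $rating     = $post->at('span.post_rating_count');

    # Collect tag names, skipping links whose text is only whitespace
    my @tags = grep { /\S/ } map { $_->text } $post->find('li.post_tag > a')->each;

    push @posts, {
        story_id => $post->attr('data-story-id'),
        href     => $title_link ? $title_link->attr('href') : undef,
        header   => $title_link ? $title_link->text         : undef,
        tags     => join( ', ', @tags ),
        rating   => $rating ? $rating->text : undef,
    };
}

for my $p (@posts) {
    print join( ' | ', map { $_ // '' } @{$p}{qw(story_id href header tags rating)} ), "\n";
}
```

Searching from $post rather than from $dom keeps each field tied to its own post, and at returns undef when nothing matches, which is why the lookups are guarded before calling attr or text.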