Targeting individual elements in HTML using Perl and Mojo::DOM in well-formated HTML

318 views Asked by At

Relative begginer with Perl, with my first question here, trying the following:

I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below). The HTML data looks like this (showing only the part I'm interested in):

<!-- 
 <blahblah>
< lots of stuff here, before the interesting part>
--> 

      <div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
           aria-labelledby="PP_Class">
         <div class="panel-body">
            <dl class="NMetadata">
               <dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
               <dd xmlns="http://www.w3.org/1999/xhtml">
                  <ul>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;DC_CODED=341&amp;lang=en">
                           <span lang="en">descriptor_1</span>
                        </a>
                     </li>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;DC_CODED=5158&amp;lang=en">
                           <span lang="en">descriptor_2</span>
                        </a>
                     </li>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;DC_CODED=7983&amp;lang=en">
                           <span lang="en">descriptor_3</span>
                        </a>
                     </li>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;DC_CODED=933&amp;lang=en">
                           <span lang="en">descriptor_4</span>
                        </a>
                     </li>
                  </ul>
               </dd>
               <dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
               <dd xmlns="http://www.w3.org/1999/xhtml">
                  <ul>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;CT_CODED=BUDG&amp;lang=en">
                           <span lang="en">Subject_1</span>
                        </a>
                     </li>
                  </ul>
               </dd>
               <dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
               <dd xmlns="http://www.w3.org/1999/xhtml">
                  <ul>
                     <li>01.60.20.00 <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;CC_1_CODED=01&amp;lang=en">
                           <span lang="en">Designation_level_1</span>
                        </a> / <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;CC_2_CODED=0160&amp;lang=en">
                           <span lang="en">Designation_level_2</span>
                        </a> / <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;CC_3_CODED=016020&amp;lang=en">
                           <span lang="en">Designation_level_3</span>
                        </a>
                     </li>
                  </ul>
               </dd>
            </dl>
         </div>
      </div>
   </div>

<!-- 
<still more stuff here>
--> 

I am interested in the info contained in "PPClass_Contents" div id, which consists of 3 elements:


    - EUROVOC descriptor:
    - Subject matter:
    - Directory code:

Based on the above HTML, I would like to get the children of those 3 main elements, using Perl and Mojo, getting the result similar to this (single line text file, 3 groups separated by tabs, multiple child elements within a grup are separated by pipe characters, something like this:


    CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n

"descriptors", "Subjects" and "Designation_levels" elements (children of those 3 main groups) can be from 1 to "n", the number is not fixed, and is not known in advance.

I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:


    #!/usr/bin/perl
    # returns "Classification" descriptors for given CELEX and Language

    use strict;
    use warnings;

    use Mojo::UserAgent;

    if ($#ARGV ne "1") {
        print "Wrong number of arguments!\n";
        print "Syntax: clookup.pl Lang_ID celex_No.\n";
        exit -1;
    }

    my $lang = $ARGV[0];   
    my $celex = $ARGV[1];
    my $lclang = lc $lang;

    # fetch the eurlex page

    my $ua = Mojo::UserAgent->new;
    my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom;


    ################ let's extract interesting parts:


    my $text = $dom->at('#PPClass_Contents')->all_text;
    print "$text\n";

EDIT (added): You can try my Perl script using two arguments:

  • lang_code ("DE","EN","IT", etc.)

  • Celex number (e.g.: E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134).

For example (if you name my script "clookup.pl"): $ perl clookup.pl EN E2014C0303

So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?

Or, is there something simpler or faster (using Perl)?

1

There are 1 answers

8
simbabque On BEST ANSWER

You are on the right track. First, you need to understand the HTML inside your #PPClass_Contents. Each set of things is in a definition list. Since you only care about the definition texts, you can search directly for the <dd> elements.

$dom->at('#PPClass_Contents')->find('dd')

This will give you a Mojo::Collection, which you can iterate with ->each. We pass that an anonymous function, pretty much like a callback.

$dom->at('#PPClass_Contents')->find('dd')->each(sub {
    $_; # this is the current element
});

Each element will be passed to that sub, and can be referenced using the topic variable $_. There is an <ul> inside, and each <li> contains a <span> element with the text you want. So let's find those.

$_->find('span')

We can directly build the column in your output at this stage. Let's use the other form of ->each, which turns the Mojo::Collection returned from ->find into a normal Perl list. We can then use a regular map operation to grab each <span>'s text node and join that into a string.

 join '|', map { $_->text } $_->find('span')->each

To tie all that together, we declare an array outside this construct, and stick the $celex number in it as the first column.

my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
    push @columns, join '|', map { $_->text } $_->find('span')->each;
});

Producing the final tab-separated output is now trivial.

print join "\t", @columns;

I've done this with EN as the language and the $celex number 32006L0121, which the search used in its example tooltip. The result is this:

32006L0121 marketing standard|chemical product|approximation of laws|dangerous substance|scientific report|packaging|European Chemicals Agency|labelling Internal market - Principles|Approximation of laws|Technical barriers|Environment|Consumer protection Industrial policy and internal market|Internal market: approximation of laws|Dangerous substances