Doctype sniffing with CSS3, and specifically with Mojo::DOM


Can I use Mojo::DOM and its CSS3 selectors to figure out the DOCTYPE of an HTML document? This is related to my other question, How should I process HTML META tags with Mojo::UserAgent?, where I want to set the character set of a document. I need to know what to look at, and doctype sniffing seems to be the way to do it. HTML and HTML 5 use different meta tags for charsets when the document setting overrides the server setting (or non-setting).

I don't have a problem accomplishing the task since I can grab the raw response and play with regexes to look at the DOCTYPE. Since the browser DOMs seem to be able to get the DOCTYPE, I'm infected with the idea that I should be able to get it. However, the lack of examples leads me to think nobody does it in the way I think I should do it.

I tried lots of stupid ways but my CSS kung fu is weak:

use v5.20;

use feature qw(signatures);
no warnings qw(experimental::signatures);

use Mojo::DOM;

my $html = do { local $/; <DATA> };

my $dom = Mojo::DOM->new( $html );

say "<title> is => ", $dom->find( 'head title' )->map( 'text' )->each;

say "Doctype with find is => ", $dom->find( '!doctype' )->map( 'text' )->each;

say "Doctype with nodes is => ", $dom->[0];

__DATA__

<!DOCTYPE html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>Level 1</h1>
</body>
</html>

When I dump the $dom object, I see the DOCTYPE in the tree:

$VAR1 = bless( do{\(my $o = bless( {
                      'tree' => [
                                  'root',
                                  [
                                    'text',
                                    '',
                                    ${$VAR1}->{'tree'}
                                  ],
                                  [
                                    'doctype',
                                    ' html',
                                    ${$VAR1}->{'tree'}
                                  ],

Now how do I get at that?
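One avenue that dump suggests: selectors only match element nodes, so find() will never see the doctype, but walking the root's child nodes might. This is a minimal sketch, not a confirmed recipe. It assumes the installed Mojo::DOM provides child_nodes, type, and content (the shortened HTML string is also just for illustration):

```perl
use v5.20;
use Mojo::DOM;

# Shortened stand-in for the document in the question (assumption).
my $dom = Mojo::DOM->new(
    "<!DOCTYPE html>\n<html><head><title>This is a title</title></head></html>"
);

# Selectors skip non-element nodes, so scan the root's child nodes
# for the first one whose node type is 'doctype'.
my $doctype = $dom->child_nodes->first( sub { $_->type eq 'doctype' } );

# content returns the raw text stored in the doctype node.
say "Doctype content is => ", $doctype->content if $doctype;
```

If that works on your version, the content is whatever followed "<!DOCTYPE" in the source, which still has to be interpreted separately.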


There are 2 answers

1
Sebastian Riedel

Determining the encoding of an HTML5 document is very complex. I'm afraid Mojo::DOM is only a fragment parser, and therefore we've decided that a full implementation of the encoding sniffing algorithm would be out of scope. Most of the web is thankfully UTF-8 encoded, and I imagine that's why this question doesn't come up very often.

0
brian d foy

I still think there's hope for a better way to do this, but perhaps I'm putting too much responsibility on Mojo::UserAgent. I can build a transaction and attach a finish event handler to the response. In that handler, I sniff the content with a regular expression and add an X- header with the doctype. I could probably pass the info in some other way, but that's not the point (still taking suggestions though!).

use v5.14;

use Mojo::UserAgent;

@ARGV = qw(http://blogs.perl.org);

my $ua = Mojo::UserAgent->new;

my $tx = $ua->build_tx( GET => $ARGV[0] );
$tx->res->on( finish => sub {
    my $res = shift;
    my( $doctype ) = $res->body =~ m/\A \s* (<!DOCTYPE.*?>)/isx;
    if( $doctype ) {
        say "Found doctype => $doctype";
        $res->headers->header( 'X-doctype', $doctype );
        }
    });
$tx = $ua->start($tx);

say "-----Headers-----";
say $tx->res->headers->to_string =~ s/\R+/\n/rg;

Here's the output:

Found doctype => <!DOCTYPE html>
-----Headers-----
Connection: Keep-Alive
Server: Apache/2.2.12 (Ubuntu)
Content-Type: text/html
Content-Length: 20624
Accept-Ranges: bytes
X-doctype: <!DOCTYPE html>
Last-Modified: Wed, 16 Sep 2015 13:08:26 GMT
ETag: "26d42e8-5090-51fdcfe768680"
Date: Wed, 16 Sep 2015 13:40:02 GMT
Keep-Alive: timeout=15, max=100
Vary: Accept-Encoding

Now I have to work out how to parse the DOCTYPE values and decide, based on them, what to do with the content.
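As a starting point for that parsing step, here is a rough sketch that maps a raw DOCTYPE string (such as the one stored in the X-doctype header) to a coarse label. The helper name classify_doctype, the labels, and the patterns are all hypothetical and far from a complete catalogue of public identifiers:

```perl
use v5.14;
use warnings;

# Hypothetical helper: map a raw DOCTYPE declaration to a coarse label.
# The labels and patterns are illustrative assumptions only.
sub classify_doctype {
    my( $doctype ) = @_;
    return 'none' unless defined $doctype and length $doctype;

    # HTML5: <!DOCTYPE html>, optionally with the legacy-compat system id
    return 'html5'
        if $doctype =~ m/\A <!DOCTYPE \s+ html \s*
                         (?: SYSTEM \s+ (["'])about:legacy-compat\1 \s* )?
                         > \z/ix;

    # Older dialects carry a public identifier that names the DTD
    my( $public_id ) = $doctype =~ m/PUBLIC \s+ ["'] ([^"']*) ["']/ix;
    return 'unknown' unless defined $public_id;

    return 'xhtml' if $public_id =~ m/XHTML/i;
    return 'html4' if $public_id =~ m/DTD \s+ HTML \s+ 4/ix;
    return 'unknown';
}

say classify_doctype( '<!DOCTYPE html>' );
say classify_doctype(
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
);
```

The label could then decide which meta tag form to look for when working out the character set.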