Can I use Mojo::DOM and its CSS3 selectors to figure out the DOCTYPE of an HTML document? This is related to my other question, "How should I process HTML META tags with Mojo::UserAgent?", where I want to set the character set of a document. I need to know what to look at, and doctype sniffing seems to be the way to do it. HTML and HTML 5 use different meta tags for charsets when the document setting overrides the server setting (or non-setting).
I don't have a problem accomplishing the task since I can grab the raw response and play with regexes to look at the DOCTYPE. Since the browser DOMs seem to be able to get the DOCTYPE, I'm infected with the idea that I should be able to get it. However, the lack of examples leads me to think nobody does it in the way I think I should do it.
I tried lots of stupid ways but my CSS kung fu is weak:
use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);
use Mojo::DOM;
my $html = do { local $/; <DATA> };
my $dom = Mojo::DOM->new( $html );
say "<title> is => ", $dom->find( 'head title' )->map( 'text' )->each;
say "Doctype with find is => ", $dom->find( '!doctype' )->map( 'text' )->each;
say "Doctype with nodes is => ", $dom->[0];
__DATA__
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>Level 1</h1>
</body>
</html>
When I dump the $dom object, I see the DOCTYPE in the tree:
$VAR1 = bless( do{\(my $o = bless( {
'tree' => [
'root',
[
'text',
'',
${$VAR1}->{'tree'}
],
[
'doctype',
' html',
${$VAR1}->{'tree'}
],
Now how do I get at that?
Determining the encoding of an HTML5 document is very complex. I'm afraid Mojo::DOM is only a fragment parser, and therefore we've decided that a full implementation of the encoding sniffing algorithm would be out of scope. Most of the web is thankfully UTF-8 encoded, and I imagine that's why this question doesn't come up very often.
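The doctype node that shows up in the tree dump can probably be reached without regexes by walking nodes instead of using selectors, since CSS selectors only match elements. The following is a minimal sketch, not an official recipe. It assumes a reasonably recent Mojolicious (6.0 or later, where `type` returns the node type rather than the tag name) and uses a small inline sample document in place of the question's DATA section:

```perl
use v5.20;
use Mojo::DOM;

# Sample document (an assumption for illustration); the DOCTYPE ends up
# as a direct child of the root node, as in the dump from the question.
my $html = <<'HTML';
<!DOCTYPE html>
<html>
<head><title>This is a title</title></head>
<body><h1>Level 1</h1></body>
</html>
HTML

my $dom = Mojo::DOM->new($html);

# Selectors only see tag nodes, so look through all child nodes of the
# root and take the first one whose node type is 'doctype'.
my $node = $dom->child_nodes->first(sub { $_->type eq 'doctype' });

# For non-tag nodes, content returns the raw node text (' html' here),
# so trim the surrounding whitespace.
my $doctype = defined $node ? $node->content =~ s/^\s+|\s+$//gr : undef;

say 'Doctype is => ', $doctype // '(none)';
```

This only exposes whatever the parser stored for the doctype. It does not implement any part of the encoding sniffing algorithm, so it is at best one input to the charset decision.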