How do some sites detect my HTTP request isn't coming from a browser?


Just as coding practice, I want to write a simple web browser in Perl. I'm using LWP::UserAgent to retrieve a page, which I then want to parse and display. However, I'm already stuck at step one: retrieving the page.

The problem is that some sites (which, I think, use Cloudflare) block traffic from "bots". When I try to retrieve such a page, I get a "403 Forbidden" error.

Here's a snippet of the code I'm using:

use strict;
use warnings;
use LWP::UserAgent;

my $host   = "www.somehost.com";
my $scheme = "https";
my $pageContent = "";

my $browserObj = LWP::UserAgent->new();
$browserObj->cookie_jar( {} );
$browserObj->timeout(600);
push @{ $browserObj->requests_redirectable }, 'POST';

$browserObj->add_handler("request_send",  sub { shift->dump; return });

my $response = $browserObj->get( "$scheme://$host" );
if( $response->is_success ) {
  $pageContent = $response->decoded_content();
} else {
  print "Unable to retrieve $host.\nError: " . $response->status_line;
}

So this returns "403 Forbidden". Looking at what is actually sent, I noticed that the headers are quite different from what my browser (Firefox) sends. So I copied the headers that Firefox sends:

my $host = "www.somehost.com";
my $scheme = "https";
my $pageContent = "";

my $browserObj = LWP::UserAgent->new();
$browserObj->cookie_jar( {} );
$browserObj->timeout(600);
push @{ $browserObj->requests_redirectable }, 'POST';

$browserObj->add_handler("request_send",  sub { shift->dump; return });

# Send same headers Firefox sends:
my @header = (
           'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0',
           'Host' => $host,
           'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
           'Accept-Encoding' => 'gzip, deflate, br',
           'Accept-Language' => 'en-US,en;q=0.5',
           'Connection' => 'keep-alive',
           'DNT' => '1',
           'Sec-Fetch-Dest' => 'document',
           'Sec-Fetch-Mode' => 'navigate',
           'Sec-Fetch-Site' => 'none',
           'Sec-Fetch-User' => '?1',
           'Sec-GPC' => '1',
           'Upgrade-Insecure-Requests' => '1',
         );

my $response = $browserObj->get( "$scheme://$host", @header );
if( $response->is_success ) {
  $pageContent = $response->decoded_content();
} else {
  print "Unable to retrieve $host.\nError: " . $response->status_line;
}

That still gives me a "403 Forbidden" error. The only remaining difference I can see is the order of the headers: Firefox sends them in a different order than my Perl script does. Since you can't set the order of the headers with LWP::UserAgent (or the underlying HTTP::Request), I tried a different approach:

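use strict;
use warnings;
use IO::Socket::SSL;

my $host   = "www.somedomain.com";
my $port   = 443;
# $SSL_ERROR is exported by IO::Socket::SSL and explains handshake failures.
my $sock = IO::Socket::SSL->new("$host:$port")
    or die "Connection to $host:$port failed: $!, $SSL_ERROR";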

print $sock "GET / HTTP/1.1\r\n";
print $sock "Host: $host\r\n";
print $sock "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0\r\n";
print $sock "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8\r\n";
print $sock "Accept-Language: en-US,en;q=0.5\r\n";
print $sock "Accept-Encoding: gzip, deflate, br\r\n";
print $sock "DNT: 1\r\n";
print $sock "Sec-GPC: 1\r\n";
print $sock "Connection: keep-alive\r\n";
print $sock "Upgrade-Insecure-Requests: 1\r\n";
print $sock "Sec-Fetch-Dest: document\r\n";
print $sock "Sec-Fetch-Mode: navigate\r\n";
print $sock "Sec-Fetch-Site: none\r\n";
print $sock "Sec-Fetch-User: ?1\r\n\r\n";

print while <$sock>;

close $sock;
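
The point of the raw-socket version is that the header order is fixed by the order in which the lines are written. A minimal sketch of the same idea, assuming a hypothetical helper `build_request` (my own name, not part of any module) that takes the headers as a flat list of name/value pairs; an array keeps its order, whereas a Perl hash does not. Only a subset of the headers is shown:

```perl
use strict;
use warnings;

# Hypothetical helper: serialize an HTTP/1.1 GET request, keeping the
# headers in exactly the order given. Headers are passed as a flat list
# of name/value pairs (an array, not a hash, so the order is preserved).
sub build_request {
    my ($path, @headers) = @_;
    my $request = "GET $path HTTP/1.1\r\n";
    while (my ($name, $value) = splice(@headers, 0, 2)) {
        $request .= "$name: $value\r\n";
    }
    return $request . "\r\n";    # blank line ends the header section
}

my $request = build_request(
    '/',
    'Host'            => 'www.somedomain.com',
    'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    'Accept-Language' => 'en-US,en;q=0.5',
);

print $request;
```

The resulting string could then be written to the socket in one go with `print $sock $request;`.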

Comparing what my script sends with the raw headers from Firefox (through its developer tools), I see no difference: both requests send the same headers in exactly the same order. Yet my script still gets "403 Forbidden".
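
For reference, the comparison I did by eye could be automated roughly like this. It is only a sketch: `header_names` and `first_difference` are hypothetical helpers of my own, the two dumps are made-up placeholders rather than real captures, and the comparison is deliberately exact (case-sensitive), which is an assumption about what matters here:

```perl
use strict;
use warnings;

# Extract the header names, in order, from a raw request dump.
# The first line (the request line) is skipped.
sub header_names {
    my ($raw) = @_;
    my @lines = split /\r?\n/, $raw;
    shift @lines;    # drop the request line, e.g. "GET / HTTP/1.1"
    my @names;
    for my $line (@lines) {
        last if $line eq '';    # blank line ends the headers
        push @names, $1 if $line =~ /^([^:\s]+):/;
    }
    return @names;
}

# Return the index of the first position where the two lists differ,
# or -1 if they are identical (same names, same order, same length).
sub first_difference {
    my ($left, $right) = @_;
    my $max = @$left > @$right ? scalar @$left : scalar @$right;
    for my $i (0 .. $max - 1) {
        my $l = $left->[$i]  // '';
        my $r = $right->[$i] // '';
        return $i if $l ne $r;
    }
    return -1;
}

# Placeholder dumps for illustration only, not real captures.
my $from_script  = "GET / HTTP/1.1\r\nHost: example\r\nAccept: */*\r\nUser-Agent: x\r\n\r\n";
my $from_browser = "GET / HTTP/1.1\r\nHost: example\r\nUser-Agent: x\r\nAccept: */*\r\n\r\n";

my @script_names  = header_names($from_script);
my @browser_names = header_names($from_browser);
my $diff = first_difference(\@script_names, \@browser_names);

if ($diff < 0) {
    print "Header names and order match\n";
} else {
    printf "First difference at position %d: %s vs %s\n",
        $diff, $script_names[$diff] // '(none)', $browser_names[$diff] // '(none)';
}
```

With my real dumps pasted in, a check like this would only confirm the names and their order; it says nothing about anything below the HTTP header level.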

It's not a cookie thing. If I open a private window in Firefox and retrieve the page (so no cookies are set yet for this domain), it works fine.

So how are these websites detecting my script?
