How to access data from requests triggered by a website, using headless Chrome (or similar) in PHP?


I'm trying to scrape data off of a website. There are certain datetimes that I'm interested in, but there are two problems:

  1. They're not present in the site's original HTML; they're loaded later.

  2. When loaded, they're displayed in a relative, imprecise human-readable form: 7.3.2023 14:22 becomes 'in seven days'. So simply waiting for the page to finish loading and reading the rendered HTML is a no-go as well.

When I open the Network panel in Chrome Dev Tools, I can pinpoint a request that sends over data in the proper form.

Is there a way to programmatically access the content of these requests using headless Chrome or some other software? The best case would be a tool from the PHP ecosystem, but I guess going with JavaScript or something else is possible too, just inconvenient.

And no, I can't access the URL of the request directly. The webpage sends a ton of data I can't reasonably reproduce, not to mention there's surely going to be security preventing access from origins other than the original site.


There is 1 answer

Noximo On

Alright, this was quite a journey but I managed to nail it down.

The solution uses this excellent library: https://github.com/jakubkulhan/chrome-devtools-protocol

Here's my code:

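use ChromeDevtoolsProtocol\Context;
use ChromeDevtoolsProtocol\Instance\Launcher;
use ChromeDevtoolsProtocol\Model\Network\EnableRequest;
use ChromeDevtoolsProtocol\Model\Network\GetResponseBodyRequest;
use ChromeDevtoolsProtocol\Model\Network\ResponseReceivedEvent;
use ChromeDevtoolsProtocol\Model\Page\NavigateRequest;

// $url is the page to scrape; it must be set before this point.

// Every operation below shares this 10-second deadline.
$ctx = Context::withTimeout(Context::background(), 10);
$launcher = new Launcher();
$launcher->setExecutable('chromium');
$instance = $launcher->launch($ctx, '--no-sandbox', '--remote-allow-origins=*');
try {
    $session = $instance->createSession($ctx);
    try {
        $requestIds = [];
        $responseTexts = [];

        $session->page()->enable($ctx);
        $session->network()->enable($ctx, EnableRequest::builder()->build());

        // Remember the ID of every response the page receives.
        $session->network()->addResponseReceivedListener(function (ResponseReceivedEvent $ev) use (&$requestIds) {
            $requestIds[] = $ev->requestId;
        });
        $session->page()->navigate(
            $ctx,
            NavigateRequest::builder()
                ->setUrl($url)
                ->build()
        );
        $session->page()->awaitLoadEventFired($ctx);

        // Fetch the body of each recorded response, keyed by request ID.
        foreach ($requestIds as $id) {
            $responseBody = $session->network()->getResponseBody(
                $ctx,
                GetResponseBodyRequest::builder()->setRequestId($id)->build()
            );
            $responseTexts[$id] = $responseBody->body;
        }
    } finally {
        // Close the DevTools session.
        $session->close();
    }
} finally {
    // Shut down the Chrome process.
    $instance->close();
}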
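
The code above grabs the body of every response, so the remaining step is picking out the one request that carries the datetimes. Here's a minimal sketch of that step in plain PHP, with no Chrome involved. It assumes each captured response has been gathered into an array with `url`, `body` and `base64Encoded` keys (the URL can be recorded from `$ev->response->url` in the listener, and the response-body result carries a `base64Encoded` flag next to `body`). The helper name `pickResponseBody`, the `/api/events` URL fragment and the JSON shape with a `startsAt` field are all made up for illustration; yours will differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the decoded body of the first captured
 * response whose URL contains $needle, or null if none matches.
 *
 * @param array<int, array{url: string, body: string, base64Encoded: bool}> $responses
 */
function pickResponseBody(array $responses, string $needle): ?string
{
    foreach ($responses as $response) {
        if (strpos($response['url'], $needle) === false) {
            continue;
        }
        // Bodies flagged as base64-encoded need decoding; text comes through as-is.
        if ($response['base64Encoded']) {
            $decoded = base64_decode($response['body'], true);
            return $decoded === false ? null : $decoded;
        }
        return $response['body'];
    }
    return null;
}

// Made-up sample data standing in for what the capture loop would collect.
$captured = [
    ['url' => 'https://example.com/style.css', 'body' => 'body{}', 'base64Encoded' => false],
    ['url' => 'https://example.com/api/events', 'body' => '{"startsAt":"2023-03-07T14:22:00+01:00"}', 'base64Encoded' => false],
];

$json = pickResponseBody($captured, '/api/events');
$data = json_decode($json ?? 'null', true);

// The raw payload has the precise datetime, not the 'in seven days' rendering.
$startsAt = new DateTimeImmutable($data['startsAt']);
echo $startsAt->format('j.n.Y H:i'), PHP_EOL; // prints 7.3.2023 14:22
```

Matching on a URL fragment keeps the filter independent of the order in which responses arrive; if several requests share the fragment, you'd want a stricter match.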