I am trying to crawl MVN Repository using pyppeteer (the Python port of Puppeteer) on AWS Lambda. However, my test function runs for the full 15 minutes and then times out (see below). The browser appears to launch, but the page is never crawled.
Here is my current code:
import json
import asyncio
from pyppeteer import launch
import pyppeteer
import zipfile
import boto3
import time
# import pandas as pd
import os
import logging
import subprocess
from pyppeteer.launcher import Launcher

logger = logging.getLogger()
logger.setLevel(logging.INFO)
pyppeteer.DEBUG = True


async def main(name, url):
    browser = await launch(headless=True, args=["--no-sandbox"], executablePath="/opt/python/headless-chromium")
    page = await browser.newPage()
    await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36')
    await page.goto(url)


def lambda_handler(event, context):
    asyncio.get_event_loop().run_until_complete(main('lol','https://mvnrepository.com/artifact/com.adobe.xmp/xmpcore'))
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
The layers for this are:
- pyppeteer 0.2.2 (zipped)
- Headless Chrome, v1.0.0-55 stable release
- Python 3.7
The following is the output after the function has timed out:
Test Event Name
dd
Response
{
"errorMessage": "2022-04-22T06:28:32.470Z e9be66b9-1fd0-4df9-a0b4-9815067169cd Task timed out after 900.10 seconds"
}
Function Logs
START RequestId: e9be66b9-1fd0-4df9-a0b4-9815067169cd Version: $LATEST
[INFO] 2022-04-22T06:13:32.424Z e9be66b9-1fd0-4df9-a0b4-9815067169cd Found credentials in environment variables.
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:51625/devtools/browser/1651a2a3-9b53-4f0a-883f-4850a6d693ed
END RequestId: e9be66b9-1fd0-4df9-a0b4-9815067169cd
REPORT RequestId: e9be66b9-1fd0-4df9-a0b4-9815067169cd Duration: 900104.69 ms Billed Duration: 900000 ms Memory Size: 10240 MB Max Memory Used: 364 MB Init Duration: 490.52 ms
2022-04-22T06:28:32.470Z e9be66b9-1fd0-4df9-a0b4-9815067169cd Task timed out after 900.10 seconds
Request ID
e9be66b9-1fd0-4df9-a0b4-9815067169cd
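The log above ends right after "Browser listening on", so it does not show which of the later awaits is hanging. One way to narrow that down is to give each awaited step its own time budget, so a hang fails fast with the step name instead of running into the 15-minute Lambda limit. The following is only a minimal sketch of that idea using the standard library: `run_step` and `StepTimeout` are hypothetical names that are not part of pyppeteer, and `asyncio.sleep` stands in for a call that never returns.

```python
import asyncio


class StepTimeout(Exception):
    """Raised when a named step exceeds its time budget."""


async def run_step(name, awaitable, timeout_s):
    # Bound a single awaitable so that a hang is reported with the step name.
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout_s)
    except asyncio.TimeoutError:
        raise StepTimeout(f"step '{name}' exceeded {timeout_s}s") from None


async def demo():
    # asyncio.sleep(10) stands in for a stuck call such as page.goto(url).
    try:
        await run_step("goto", asyncio.sleep(10), timeout_s=0.1)
    except StepTimeout as exc:
        return str(exc)
    return "completed"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the handler from the question, the same wrapper could be placed around `launch(...)`, `browser.newPage()` and `page.goto(url)` in turn, each with its own step name, so the CloudWatch log would show which call never returned.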
Besides the approach above, I also tried the following tutorials and threads, to no avail:
- https://medium.com/limehome-engineering/running-pyppeteer-on-aws-lambda-with-serverless-62313b3fe3e2
- https://github.com/pyppeteer/pyppeteer/issues/108
- Pyppeteer: Browser closed unexpectedly in AWS Lambda
P.S. The same script runs without issues on my local machine.
I built a similar configuration, but with pyppeteer 1.0.2. When I tried to generate a PDF from the URL you mentioned (mvnrepository), I was served a captcha page instead of the actual content. Have you tried crawling other websites? The captcha may be what is blocking your crawl.
Please let me know if you found a workaround.
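To check quickly whether a captcha page is what comes back, the page source could be scanned for a few telltale strings before doing anything else with it. This is only a rough sketch: `looks_like_challenge` is a hypothetical helper, and the marker strings are guesses for illustration, not a verified list for mvnrepository or any other site.

```python
# Hypothetical marker strings; guesses for illustration only.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "checking your browser")


def looks_like_challenge(html, markers=CHALLENGE_MARKERS):
    """Return True if the page source contains any marker (case-insensitive)."""
    lowered = html.lower()
    return any(marker in lowered for marker in markers)


if __name__ == "__main__":
    blocked = "<html><title>Please complete the CAPTCHA</title></html>"
    normal = "<html><title>Maven Repository: xmpcore</title></html>"
    print(looks_like_challenge(blocked))
    print(looks_like_challenge(normal))
```

With pyppeteer, the HTML could come from `await page.content()` after navigation. If the check returns True for mvnrepository but False for another site, that would point at bot protection on the site. It would not explain a hang that occurs before the page loads.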