I would like to download posts through the 9gag.com API, which apparently exists but isn't documented anywhere. If you open this link in your browser, you get a JSON response: https://9gag.com/v1/tag-posts/tag/wtf/type/fresh
So this should be easy with httr2. However, the request fails:
library(httr2)
req <- request("https://9gag.com/v1/tag-posts/tag/wtf/type/fresh")
req_dry_run(req)
#> GET /v1/tag-posts/tag/wtf/type/fresh HTTP/1.1
#> Host: 9gag.com
#> User-Agent: httr2/0.1.1 r-curl/4.3.2 libcurl/7.82.0
#> Accept: */*
#> Accept-Encoding: deflate, gzip, br, zstd
result <- req_perform(req)
#> Error: HTTP 403 Forbidden.
I checked the Network tab in the Firefox developer tools and copied the GET request that is sent when I enter the URL, as a cURL command:
curl 'https://9gag.com/v1/tag-posts/tag/wtf/type/fresh' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Connection: keep-alive' -H 'Cookie: ts1=456f0f77f4669644e64d425b5f7ad509ba4f2385; ____ri=1792; ____lo=DE' -H 'Upgrade-Insecure-Requests: 1' -H 'Sec-Fetch-Dest: document' -H 'Sec-Fetch-Mode: navigate' -H 'Sec-Fetch-Site: none' -H 'Sec-Fetch-User: ?1' -H 'If-Modified-Since: Fri, 18 Mar 2022 11:04:32 GMT' -H 'Cache-Control: max-age=0' -H 'TE: trailers'
Looks rather similar, except for the headers. So let's try with all these headers:
req <- request("https://9gag.com/v1/tag-posts/tag/wtf/type/fresh") %>%
  req_headers(
    'User-Agent' = 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'Accept' = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language' = 'en-US,en;q=0.5',
    'Accept-Encoding' = 'gzip, deflate, br',
    'Connection' = 'keep-alive',
    'Cookie' = 'ts1=456f0f77f4669644e64d425b5f7ad509ba4f2385; ____ri=1792; ____lo=DE',
    'Upgrade-Insecure-Requests' = '1',
    'Sec-Fetch-Dest' = 'document',
    'Sec-Fetch-Mode' = 'navigate',
    'Sec-Fetch-Site' = 'none',
    'Sec-Fetch-User' = '?1',
    'If-Modified-Since' = 'Fri, 18 Mar 2022 11:04:32 GMT',
    'Cache-Control' = 'max-age=0',
    'TE' = 'trailers'
  )
result <- req_perform(req)
#> Error: HTTP 403 Forbidden.
Created on 2022-03-18 by the reprex package (v2.0.1)
Unfortunately, the request is still turned down. Does anybody know why?
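One way to narrow this down might be to look at the 403 response itself rather than only the error message. The following is a minimal sketch, not a fix: `inspect_response()` is a hypothetical helper name, and it assumes that the response headers or body say something about why the request was rejected, which may not be the case here.

```r
library(httr2)

# Hypothetical helper: perform the request without raising on HTTP errors,
# then return the parts of the response that may explain a 403.
inspect_response <- function(url) {
  resp <- request(url) %>%
    # Treat no status code as an error, so req_perform() returns the 403
    # response instead of throwing.
    req_error(is_error = function(resp) FALSE) %>%
    req_perform()

  list(
    status  = resp_status(resp),
    headers = resp_headers(resp),
    # Guard against an empty body before converting it to a string.
    body    = if (length(resp$body) > 0) resp_body_string(resp) else ""
  )
}
```

Calling `str(inspect_response("https://9gag.com/v1/tag-posts/tag/wtf/type/fresh"))` would then print the status code, the response headers, and whatever body the server sends along with the refusal.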