Note: I've replaced the last 5 characters of the session IDs with 'x's for obvious reasons.
I'm scraping a web site. I can see, in the browser, that logging in sets a cookie value called PHPSESSID. No problem, I can scrape that:
superagent
  .post(loginUrl)
  .send(loginDetails)
  .end(function(err, res){
    // Take the first Set-Cookie header and pull PHPSESSID out of it
    var setCookieValue = res.headers['set-cookie'][0]
    var sessionID = cookieParser.parse(setCookieValue).PHPSESSID
    console.log(sessionID)
  })
Returns:
37c3bog3tf6erp2i6ss5vxxxxx
Which looks like a PHP session ID. Great! Now to use the session ID:
superagent
  .get(loggedInURL)
  .set('Cookie', 'PHPSESSID=' + sessionID)
  .end(function(err, res){})
This redirects me to the login page. But a session ID I copied manually from the browser, in the exact same format, works fine:
var fakeSessionID = 'a1oslk341uoht8p6009q5xxxxx'
superagent
  .get(loggedInURL)
  .set('Cookie', 'PHPSESSID=' + fakeSessionID)
  .end(function(err, res){})
This returns the page at loggedInURL, with the full HTML of a logged-in user.
Why isn't the session ID I'm scraping working?
- The format is identical
- The character count is the same (26 characters)
There is nothing aside from the session ID that's different between the working and non-working code.
What could be making the difference?
PHP has some dubious extra security for sessions, such as checking the Referer header (see the session.referer_check setting). Some sites may additionally check User-Agent.
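One way to test this is to send the same Referer and User-Agent the browser sends, alongside the scraped cookie. The following is only a rough sketch: the helper names buildBrowserLikeHeaders and fetchLoggedInPage are made up for illustration, the header values are placeholders, and whether the site actually checks either header is an assumption. It relies on superagent's .set() accepting an object of header fields.

```javascript
// Sketch only: helper names are hypothetical and header values are placeholders.

// Build the headers a PHP site might compare against the browser session.
function buildBrowserLikeHeaders(sessionID, referer, userAgent) {
  return {
    'Cookie': 'PHPSESSID=' + sessionID,
    'Referer': referer,
    'User-Agent': userAgent
  };
}

// `agent` is expected to be the superagent module (or any object with the
// same get/set/end chain); it is passed in so the helper is easy to test.
function fetchLoggedInPage(agent, url, headers, callback) {
  agent
    .get(url)
    .set(headers) // .set() also accepts an object of header fields
    .end(callback);
}
```

With superagent installed, this could be called as fetchLoggedInPage(require('superagent'), loggedInURL, buildBrowserLikeHeaders(sessionID, loginUrl, userAgentFromBrowser), function(err, res){ ... }), where userAgentFromBrowser is the User-Agent string copied from the browser's request. If the request then returns the logged-in page, one of those headers was likely the difference.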