Scraping: session ID from browser works, but session ID from scraping doesn't


Note: I've replaced the last 5 chars of the session IDs with 'x's for obvious reasons

I'm scraping a web site. I can see, in the browser, that logging in sets a cookie value called PHPSESSID. No problem, I can scrape that:

superagent
    .post(loginUrl)
    .send(loginDetails)
    .end(function(err, res){
        // Take the first Set-Cookie header and parse out PHPSESSID
        var setCookieValue = res.headers['set-cookie'][0]
        var sessionID = cookieParser.parse(setCookieValue).PHPSESSID
        console.log(sessionID)
    })

This prints:

37c3bog3tf6erp2i6ss5vxxxxx

Which looks like a PHP session ID. Great! Now to use the session ID:

superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID'=sessionID)
.end(err, res)

Redirects me to the login page. But the session ID I got manually from the browser, in the exact same format, works fine:

var fakeSessionID = 'a1oslk341uoht8p6009q5xxxxx'
superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID='+fakeSessionID)

This returns the loggedInURL page, with the full HTML of a logged-in user.

Why isn't the session ID I'm scraping working?

  • The format is identical
  • The character count is the same (26 characters)

Aside from the session ID, nothing is different between the working and non-working code.

What could be making the difference?

There are 3 answers

1
Kornel On BEST ANSWER

PHP has some dubious extra security options for sessions, such as checking the Referer header (the session.referer_check setting).

Some sites may additionally check User-Agent.
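
If the site does check these headers, one way to test it is to send browser-like values with the scraped session ID. The following is only a sketch: browserLikeHeaders is a hypothetical helper (not part of superagent), the Referer URL is a placeholder, and the User-Agent value is just a sample browser string.

```javascript
// Hypothetical helper: collect the headers a browser would normally send.
// Only headers with a value are included.
function browserLikeHeaders(referer, userAgent) {
    var headers = {}
    if (referer) headers['Referer'] = referer
    if (userAgent) headers['User-Agent'] = userAgent
    return headers
}

var headers = browserLikeHeaders(
    'https://example.com/login', // placeholder: use the site's real login page
    'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
)
console.log(headers)
```

superagent's .set() also accepts an object, so the result could be passed as .set(headers) on both the login POST and the follow-up GET. Whether this helps depends entirely on how the target site is configured.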

1
Matt Fellows On

Your code looks like it isn't replacing the string "sessionID" with the actual session ID value...

superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID=sessionID')
.end(err, res)

It should be something like this:

superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID='+sessionID)
.end(err, res)

I think...
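
To rule out string-building mistakes like this, the cookie handling can be moved into small functions and checked without touching the network. This is a minimal sketch: extractCookie and cookieHeader are hypothetical helper names, and the sample Set-Cookie lines are made up for illustration (only the masked session ID comes from the question). It searches every Set-Cookie header rather than only index 0, on the assumption that PHPSESSID may not be the first cookie the server sets.

```javascript
// Hypothetical helpers; the names are illustrative, not part of superagent.

// Return the value of the named cookie from an array of Set-Cookie
// header strings, or null if it is absent. The last match wins.
function extractCookie(setCookieHeaders, name) {
    var value = null
    var list = setCookieHeaders || []
    list.forEach(function (header) {
        // The name=value pair is the first ';'-separated part
        var pair = header.split(';')[0]
        var eq = pair.indexOf('=')
        if (eq !== -1 && pair.slice(0, eq).trim() === name) {
            value = pair.slice(eq + 1).trim()
        }
    })
    return value
}

// Build the Cookie request header by concatenating name and value
function cookieHeader(name, value) {
    return name + '=' + value
}

// Made-up sample headers, shaped like res.headers['set-cookie']
var sample = [
    'lang=en; path=/',
    'PHPSESSID=37c3bog3tf6erp2i6ss5vxxxxx; path=/; HttpOnly'
]
var sessionID = extractCookie(sample, 'PHPSESSID')
console.log(cookieHeader('PHPSESSID', sessionID))
// prints PHPSESSID=37c3bog3tf6erp2i6ss5vxxxxx
```

The result would then be passed as .set('Cookie', cookieHeader('PHPSESSID', sessionID)) on the follow-up request.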

2
Michael Blankenship On

You might try setting a different User-Agent header in the superagent calls for both the GET and the POST:

  .set('User-Agent','Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0')