Scraping: session ID from browser works, but session ID from scraping doesn't


Asked by mikemaccana on 21 June 2015 at 22:06 (2k views)

Note: I've replaced the last 5 chars of the session IDs with 'x's for obvious reasons

I'm scraping a web site. I can see, in the browser, that logging in sets a cookie value called PHPSESSID. No problem, I can scrape that:

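superagent
    .post(loginUrl)
    .send(loginDetails)
    .end(function(err, res){
        // Take the first Set-Cookie header and pull out the PHPSESSID value.
        // cookieParser is a cookie-parsing module with a parse() function
        // (its require is not shown in the question).
        var setCookieValue = res.headers['set-cookie'][0]
        var sessionID = cookieParser.parse(setCookieValue).PHPSESSID
        console.log(sessionID)
    })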

Returns:

37c3bog3tf6erp2i6ss5vxxxxx

Which looks like a PHP session ID. Great! Now to use the session ID:

superagent
    .get(loggedInURL)
    .set('Cookie', 'PHPSESSID=' + sessionID)
    .end(function(err, res){
        // res is the login page here, not the logged-in page
    })

Redirects me to the login page. But the session ID I got manually from the browser, in the exact same format, works fine:

var fakeSessionID = 'a1oslk341uoht8p6009q5xxxxx'
superagent
    .get(loggedInURL)
    .set('Cookie', 'PHPSESSID=' + fakeSessionID)
    .end(function(err, res){
        // res.text is the full HTML of the logged-in page
    })

This returns the page at loggedInURL, with the full HTML for a logged-in user.

Why isn't the session ID I'm scraping working?

The format is identical
The character count is the same (26 characters)

There is nothing aside from the session ID that's different between the working and non-working code.

What could be making the difference?


There are 3 answers

Matt Fellows On 21 June 2015 at 22:33

Your code looks like you aren't replacing the string "sessionID" with the actual sessionID value...

superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID=sessionID')
.end(err, res)

Should be something like?

superagent
.get(loggedInURL)
.set('Cookie', 'PHPSESSID='+sessionID)
.end(err, res)

I think...
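
To make the difference concrete, here is a minimal sketch. buildCookieHeader is a hypothetical helper, not part of superagent; it only shows that the variable's value has to be concatenated into the header string rather than written inside the quotes. The session ID is the redacted one from the question.

```javascript
// Hypothetical helper (not a superagent API): builds a Cookie header value.
function buildCookieHeader(name, value) {
    return name + '=' + value
}

// Redacted session ID copied from the question.
var sessionID = '37c3bog3tf6erp2i6ss5vxxxxx'

// Literal string: the server receives the text "sessionID", not the ID.
var wrong = 'PHPSESSID=sessionID'

// Concatenation: the server receives the value held in the variable.
var right = buildCookieHeader('PHPSESSID', sessionID)

console.log(wrong)
console.log(right)
```

The result of buildCookieHeader can be passed straight to .set('Cookie', ...) in the request above.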

Michael Blankenship On 21 June 2015 at 22:22

You might try throwing a different user-agent attribute in the header in the call to superagent for both GET and POST:

  .set('User-Agent','Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0')


Kornel On 21 June 2015 at 22:47 BEST ANSWER

PHP has some dubious extra security for sessions, such as checking the Referer header (the session.referer_check setting).

Some sites may additionally check User-Agent.
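
One way to test this theory is to send the same Referer and User-Agent on both the login POST and the later GET, so the server sees consistent values for the whole session. The sketch below is an assumption about what such a site might check, not a confirmed fix. buildSessionHeaders is a hypothetical helper, the example.com URL is a placeholder, and the User-Agent string is the one from Michael's answer.

```javascript
// Hypothetical helper: returns request headers that stay identical across
// the login POST and later GETs, in case the server ties the session to them.
function buildSessionHeaders(sessionID, refererUrl) {
    var headers = {
        // User-Agent string taken from Michael Blankenship's answer.
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0',
        // Assumption: the site expects the login page as the referring URL.
        'Referer': refererUrl
    }
    // Only add the cookie once a session ID is known (i.e. after login).
    if (sessionID) {
        headers['Cookie'] = 'PHPSESSID=' + sessionID
    }
    return headers
}

// Placeholder URL and the redacted ID from the question, for illustration only.
var exampleHeaders = buildSessionHeaders('37c3bog3tf6erp2i6ss5vxxxxx', 'https://example.com/login')
console.log(exampleHeaders)
```

With superagent, .set() also accepts an object of header fields, so the scraper could call superagent.get(loggedInURL).set(buildSessionHeaders(sessionID, loginUrl)), and pass buildSessionHeaders(null, loginUrl) on the login POST.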
