Using R to download file from https with login credentials

I am trying to write code that will let me download an .xls file from a secured HTTPS website that requires a login. This is difficult for me, as I have no experience with web coding; all my R experience comes from econometric work with readily available datasets.

I followed this thread to help write some code, but I think I'm running into trouble because the example uses HTTP and I need HTTPS.

This is my code:

install.packages("RCurl")
library(RCurl)

curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer =  TRUE, curl = curl)

html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)

viewstate <- as.character(sub('.*id="_VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))

params <- list(
    'ct100$ContentPlaceHolder$LoginControl$txtUserID' = 'MY USERNAME',
    'ct100$ContentPlaceHolder$LoginControl$txtUserPw' = 'MY PASSWORD',
    'ct100$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
    '_VIEWSTATE' = viewstate)

html <- postForm('https://jump.valueline.com/login.aspx', .params = params, curl = curl)

When I run the line that starts with "html <- getURL(...", I get:

> html <- getURL('https://jump.valueline.com/login.aspx', curl = curl)
Error in function (type, msg, asError = TRUE)  : 
SSL certificate problem: unable to get local issuer certificate

Is there a workaround for this? How can I access the local issuer certificate?

I read that adding '.opts = list(ssl.verifypeer = FALSE)' to the curlSetOpt call would remedy this. When I add it, getURL runs, but the postForm line then gives me:

> html <- postForm('https://jump.valueline.com/login.aspx', .params = params, curl = curl)
Error: Internal Server Error

Besides that, does this code look like it will work for the website I am trying to access? I went into the browser inspector and changed all the params to match my webpage, but since I'm not well versed in web coding, I'm not 100% sure I caught the right things (particularly the VIEWSTATE). Also, is there a better, more efficient way I could approach this?

Automating this process would be huge for me, so your help is greatly appreciated.

There are 2 answers

1
hadley

Try httr:

library(httr)
html <- content(GET('https://jump.valueline.com/login.aspx'), "text")

viewstate <- as.character(sub('.*id="_VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))

params <- list(
  'ct100$ContentPlaceHolder$LoginControl$txtUserID' = 'MY USERNAME',
  'ct100$ContentPlaceHolder$LoginControl$txtUserPw' = 'MY PASSWORD',
  'ct100$ContentPlaceHolder$LoginControl$btnLogin' = 'Sign In',
  '_VIEWSTATE' = viewstate
)
POST('https://jump.valueline.com/login.aspx', body = params)

That still gives me a server error, but that's probably because you're not sending the right fields in the body.
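
One thing worth checking, offered as a sketch rather than a confirmed fix: ASP.NET Web Forms pages name the hidden state field __VIEWSTATE, with two leading underscores, and often add a second hidden field, __EVENTVALIDATION. Whether this login page uses or requires them is an assumption; inspect the form to confirm. The helper extract_hidden_field and the sample HTML below are made up for illustration and use only base R:

```r
# Pull the value of a hidden <input> by its exact id from an HTML string.
# Assumes the value attribute appears after the id attribute in the same tag.
extract_hidden_field <- function(html, id) {
  pattern <- sprintf('id="%s"[^>]*value="([^"]*)"', id)
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) {
    stop("hidden field not found: ", id)
  }
  m[2]  # first capture group: the contents of value="..."
}

# Made-up page fragment for illustration; the real login page will differ.
sample_html <- paste0(
  '<form>',
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4/abc+==" />',
  '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWBAK=" />',
  '</form>'
)

viewstate <- extract_hidden_field(sample_html, "__VIEWSTATE")
eventvalidation <- extract_hidden_field(sample_html, "__EVENTVALIDATION")
```

If the fields exist on the real page, the extracted values would go into params under the same names the form uses. Stopping with an error when a field is missing makes a wrong id visible right away, instead of silently posting the whole page as the value.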

2
Ken Yeoh

html <- getURL('https://jump.valueline.com/login.aspx', curl = curl, ssl.verifypeer = FALSE)

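This should work for you. The error you're getting is probably because libcurl doesn't know where to find the CA certificates it needs to verify the server's SSL certificate. Note that ssl.verifypeer = FALSE turns that verification off instead of fixing the lookup.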
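
A possible alternative that keeps verification on, sketched under assumptions: point libcurl at a CA bundle through the cainfo option. RCurl ships a bundle at CurlSSL/cacert.pem inside the package, though it may be out of date, and whether it contains the issuer for this site is untested here. The request itself is left commented out because it needs network access:

```r
# Sketch: keep certificate verification on and tell libcurl where the CA bundle is.
# Guarded so the snippet does nothing if RCurl is not installed.
if (requireNamespace("RCurl", quietly = TRUE)) {
  # CA bundle shipped inside the RCurl package; it may be out of date.
  ca_bundle <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

  # Same handle options as in the question, plus cainfo.
  curl <- RCurl::getCurlHandle(
    cainfo = ca_bundle,
    ssl.verifypeer = TRUE,
    cookiejar = "cookies.txt",
    followlocation = TRUE,
    autoreferer = TRUE
  )

  # Not run here because it needs network access:
  # html <- RCurl::getURL("https://jump.valueline.com/login.aspx", curl = curl)
}
```

Setting the options on the handle means the later postForm call that reuses the handle would pick up the same certificate settings.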