How can I web scrape information from this website in R?

This website, http://a810-bisweb.nyc.gov/bisweb/bispi00.jsp, is for searching NYC building application information. Under the "Application Searches" section there is a "BIS Job Number:" field. The information I want to extract is on the new page that appears after I enter a job number and click "go".

For example, from the dataset https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2, I pick job number 220286232, go to the first website, enter the number in "BIS Job Number:", and click "go". This takes me to a new page. The information I want is the "Applicant of Record Information" (including the applicant's contact information).

I'm stuck here. How can I extract this applicant information for each job number?

I am very new to web scraping. I learned how to extract information from an entire page using rvest, but I'm not familiar with web scraping across different websites.

Thank you.

Update: I tried to use the Socrata API, but I found that the applicant contact information doesn't have its own API fields. If there is no API field for that information (while other information on that page does have fields), does it mean I can't use the API to solve this problem?

Thank you!

There is 1 answer

Answered by knb

On that dataset page, at the top right, click on the "API" tab. A modal dialog box titled "Access this Dataset via SODA API" will pop up. Copy the link, in this case https://data.cityofnewyork.us/resource/rvhx-8trz.json . This is a URL that provides the data directly in machine-readable JSON format. However, only 1000 records are fetched per request.

To get the remaining records, add appropriate $offset parameters. See the Socrata documentation; the City of New York seems to use this software for its Open Data platform.

You could request the pages like this from your R script:

https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=0
https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=1000
https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=2000
https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=...

(untested for higher offsets)
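
As a rough illustration, the paged URLs could be generated in R instead of typed by hand. This is only a sketch: it assumes pages of 1000 records (the per-request figure mentioned above), and the helper name `build_page_urls` is made up for this example. `$limit` and `$offset` are standard SODA query parameters; the sketch passes `$limit` explicitly so the page size does not depend on the server default.

```r
# Base URL copied from the dataset's API dialog (see above).
base_url <- "https://data.cityofnewyork.us/resource/rvhx-8trz.json"

# Hypothetical helper: build one URL per page of `page_size` records.
build_page_urls <- function(base_url, n_pages, page_size = 1000) {
  # Offsets for each page: 0, page_size, 2 * page_size, ...
  offsets <- (seq_len(n_pages) - 1) * page_size
  # scientific = FALSE keeps large offsets from being written like 1e+05;
  # trim = TRUE stops format() from padding the numbers with spaces.
  paste0(base_url,
         "?$limit=", format(page_size, scientific = FALSE),
         "&$offset=", format(offsets, scientific = FALSE, trim = TRUE))
}

urls <- build_page_urls(base_url, n_pages = 3)
print(urls)
```

How many pages are needed depends on the size of the dataset, which this sketch does not try to discover; one option is to keep requesting pages until one comes back empty.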

Use the jsonlite package to convert the JSON into R data frames.
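
A minimal sketch of that conversion, assuming the jsonlite package is installed. The sample JSON and its field names (`job`, `borough`) are placeholders invented for this example, not the dataset's real columns, and `fetch_pages` is a hypothetical helper name. `fromJSON()` and `rbind_pages()` are jsonlite functions.

```r
library(jsonlite)

# Made-up sample shaped like a SODA response: a JSON array of records.
# Field names and values are placeholders, not real data from the dataset.
sample_json <- '[{"job": "111111111", "borough": "X"},
                 {"job": "222222222", "borough": "Y"}]'

# A JSON array of objects is simplified into a data frame.
jobs <- fromJSON(sample_json)
str(jobs)

# Hypothetical helper: download each URL and stack the pages into one data frame.
fetch_pages <- function(urls) {
  pages <- lapply(urls, fromJSON)  # fromJSON() also accepts a URL
  rbind_pages(pages)               # combines pages even if their columns differ
}

# Example call (needs network access, so it is left commented out):
# all_jobs <- fetch_pages("https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=0")
```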