Return specific data from a Wikipedia Page using API


I want to parse geographic pages (e.g. landmarks, places of interest) on Wikipedia and return a JSON file that contains only the page title and the GIS coordinates scraped from the page(s).

So for example, looking at the page: https://en.wikipedia.org/wiki/The_Sanctuary

Calling the API with https://en.wikipedia.org/w/api.php?action=query&titles=The%20Sanctuary&prop=revisions&rvprop=content&format=json returns the entire page content.

However, I just want to return the following elements:

"title":"The Sanctuary" coord|51.41000|N|1.83173|W

Please can anyone advise how to correctly structure the web service call?

This is my first attempt at scraping content from pages, so any guidance is greatly appreciated.

Answer by Tgr (best answer)

The rule of thumb for scraping is to not do it. Many things are available through the API (use the API sandbox to discover them), and for most other interesting data someone has already written a library.

In this case, action=query&titles=The_Sanctuary&prop=coordinates will get you what you want:

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "788970": {
                "pageid": 788970,
                "ns": 0,
                "title": "The Sanctuary",
                "coordinates": [
                    {
                        "lat": 51.41,
                        "lon": -1.83173,
                        "primary": "",
                        "globe": "earth"
                    }
                ]
            }
        }
    }
}
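
To reduce that response to just the title and coordinates the question asks for, something along these lines might work. This is a minimal sketch, not a definitive implementation: `build_query_url` and `extract_coordinates` are hypothetical helper names, `format=json` is added to the query on the assumption that JSON output is wanted, and the sample response is the one shown above, so the sketch runs without a network call.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(titles):
    """Build a prop=coordinates query URL for one or more page titles."""
    params = {
        "action": "query",
        "titles": "|".join(titles),  # multiple titles are pipe-separated
        "prop": "coordinates",
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_coordinates(response):
    """Return a list of {title, lat, lon} dicts from a parsed API response."""
    results = []
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        # Pages without coordinates have no "coordinates" key, so they are skipped.
        for coord in page.get("coordinates", []):
            results.append(
                {"title": page.get("title"), "lat": coord["lat"], "lon": coord["lon"]}
            )
    return results


# Sample response copied from the answer above (not fetched live).
sample = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "788970": {
                "pageid": 788970,
                "ns": 0,
                "title": "The Sanctuary",
                "coordinates": [
                    {"lat": 51.41, "lon": -1.83173, "primary": "", "globe": "earth"}
                ],
            }
        }
    },
}

print(build_query_url(["The Sanctuary"]))
print(json.dumps(extract_coordinates(sample), indent=2))
```

The fetch itself is left out; the URL from `build_query_url` could be requested with any HTTP client and the body passed through `json.loads` before calling `extract_coordinates`. A negative `lon` corresponds to a western longitude, which matches the `1.83173|W` in the page's coord template.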