I'm trying to get a data dump of every Wikipedia ID that has an associated latitude and longitude. The end goal is to quickly match a list of Wikipedia IDs mined from an NER task over a large corpus. I want to avoid making repeated API calls because I have lists of 2 million+ locations to match, so I'm looking to keep a local dataset to query.
Note: Previous Stack Overflow responses recommend downloading this dataset, but unfortunately that data dump is missing most of the locations found in our novels (for example, "Mount_Etna" and "Sicily_(Italy)" are not included). I'd really like to understand how to build our own dataset using a query.
We are trying to use SPARQL to get a JSON file with this information. The goal is a JSON file containing every unique entity that has an associated latitude and longitude. Here is what we have so far, which does not run successfully.
from SPARQLWrapper import SPARQLWrapper, JSON
import sys
user_agent = "Wikidata-Service Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
endpoint_url = "https://query.wikidata.org/sparql"
sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
sparql.setReturnFormat(JSON)
# something here is clearly wrong
sparql.setQuery("""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX psv: <http://www.wikidata.org/prop/statement/value/>
PREFIX p: <http://www.wikidata.org/prop/>
SELECT DISTINCT ?item ?itemLabel ?lat ?long
WHERE {
# also not sure that P625 is what I want
?item wdt:P625 .
?item p:P625 ?statement .
?statement psv:P625 ?coordinate_node .
?coordinate_node wikibase:geoLatitude ?lat .
?coordinate_node wikibase:geoLongitude ?long .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
)
# this is only to print out results
try:
    count = 0
    ret = sparql.queryAndConvert()
    for r in ret["results"]["bindings"]:
        count += 1
        #print(r)
    print("Results: ", str(count))
except Exception as e:
    print(e)
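
To make the goal concrete, here is a minimal sketch of what we would like to do with the results once the query works. It assumes the standard SPARQL JSON results layout (a list of bindings, each variable carrying a "value" string) and that ?item comes back as an entity URI. The helper name bindings_to_lookup and the sample rows are made up for illustration and are not real Wikidata items.

```python
import json

# Assumption: ?item values are URIs under this prefix.
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def bindings_to_lookup(bindings):
    """Collapse SPARQL JSON bindings into {entity_id: (lat, long)}."""
    lookup = {}
    for row in bindings:
        uri = row["item"]["value"]
        # Strip the URI prefix so the key is just the bare entity ID.
        if uri.startswith(ENTITY_PREFIX):
            entity_id = uri[len(ENTITY_PREFIX):]
        else:
            entity_id = uri
        lat = float(row["lat"]["value"])
        lon = float(row["long"]["value"])
        # Keep only the first coordinate pair seen for each entity.
        lookup.setdefault(entity_id, (lat, lon))
    return lookup


# Placeholder rows (hypothetical IDs and coordinates, not real data).
sample = [
    {"item": {"value": ENTITY_PREFIX + "QXXX1"},
     "lat": {"value": "10.5"}, "long": {"value": "20.25"}},
    {"item": {"value": ENTITY_PREFIX + "QXXX1"},
     "lat": {"value": "11.0"}, "long": {"value": "21.0"}},
    {"item": {"value": ENTITY_PREFIX + "QXXX2"},
     "lat": {"value": "-3.0"}, "long": {"value": "4.0"}},
]

lookup = bindings_to_lookup(sample)
print(json.dumps(lookup, indent=2))
```

With a dictionary like this saved to disk, matching the 2 million+ mined IDs would be a local lookup per ID rather than an API call per ID.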