Wiki-distance: distance between Wiki topics and categories?

Is there a (possibly directional) notion or implementation of distance between Wikipedia categories/pages?

For example consider: A) "Saint Louis University" B) "university"

Clearly "A" is a type of "B". How can you extract this from Wiki? If you extract all the categories connected to A, you get

Category:1818 establishments in Missouri Territory 
Category:Articles containing Latin-language text 
Category:Association of Catholic Colleges and Universities
Category:Commons category with local link same as on Wikidata
Category:Coordinates on Wikidata 
Category:Educational institutions established in 1818
Category:Instances of Infobox university using image size
Category:Jesuit universities and colleges in the United States
Category:Roman Catholic Archdiocese of St. Louis
Category:Roman Catholic universities and colleges in Missouri

and none of them connects directly to B (https://en.wikipedia.org/wiki/University). But if you look further, you should be able to find a path between A and B, possibly spanning multiple hops. What are the popular ways of accomplishing this?

3

There are 3 answers

0
Wasi Ahmad On

If you have the entire Wikipedia category taxonomy, then you can compute the distance (shortest path length) between two categories. If one category is an ancestor of the other, this is straightforward.

Otherwise you can find the Least Common Subsumer which is defined as follows.

The least common subsumer (LCS) of two concepts A and B is the most specific concept that is an ancestor of both A and B.

Then compute the distance between them via the LCS, i.e. the path length from A up to the LCS plus the path length from B up to the LCS.
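As a rough illustration of the idea above, here is a minimal Python sketch. It assumes you have already loaded the category taxonomy into a dictionary mapping each node to its parent categories; the `PARENTS` data, the category names in it, and the function names are all made up for the example. Note that in a graph with multiple parents, "most specific" can be defined in several ways; this sketch simply picks the common ancestor that minimizes the total number of hops.

```python
from collections import deque

# Hypothetical toy taxonomy: node -> list of parent categories.
# Real data would come from a Wikipedia category dump.
PARENTS = {
    "Saint Louis University": ["Jesuit universities", "Universities in Missouri"],
    "Washington University": ["Universities in Missouri"],
    "Jesuit universities": ["Catholic universities"],
    "Catholic universities": ["Universities"],
    "Universities in Missouri": ["Universities"],
    "Universities": ["Educational institutions"],
    "Educational institutions": [],
}

def ancestor_depths(node, parents):
    """Breadth-first walk upward; returns {ancestor: fewest hops from node}."""
    depths = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in depths:
                depths[parent] = depths[current] + 1
                queue.append(parent)
    return depths

def lcs_distance(a, b, parents):
    """Return (common ancestor, hops from a + hops from b), or (None, None)."""
    depths_a = ancestor_depths(a, parents)
    depths_b = ancestor_depths(b, parents)
    common = set(depths_a) & set(depths_b)
    if not common:
        return None, None
    # Assumption: the "most specific" subsumer is the one closest to both nodes.
    # Ties are broken alphabetically so the result is deterministic.
    lcs = min(common, key=lambda c: (depths_a[c] + depths_b[c], c))
    return lcs, depths_a[lcs] + depths_b[lcs]

print(lcs_distance("Saint Louis University", "Universities", PARENTS))
print(lcs_distance("Saint Louis University", "Washington University", PARENTS))
```

When one node is an ancestor of the other, the LCS is that ancestor itself and the distance is just the upward path length, which also gives you the directional "A is a kind of B" check: B appears in `ancestor_depths(A, ...)` but not the other way around.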

I encourage you to go through similarity measures, where you will find state-of-the-art techniques to compute semantic similarity between words.

Resource: My project on extracting Wikipedia categories/concepts might help you.

One very good related example

Compute semantic similarity between words using WordNet. WordNet organizes English words in a hierarchical fashion. See this WordNet Similarity for Java demo. It uses eight different state-of-the-art techniques to compute semantic similarity between words.

0
Daniel On

Some ideas/resources I collected. Will update this if I find more.

-- Using DBpedia: a knowledge base curated from Wiki. They provide a SPARQL endpoint to query this KB, but one has to simulate the desired similarity/distance behavior via their SPARQL interface. Some ideas are here and here, but they seem to be outdated.

-- Using UMBEL: http://umbel.org/, which is a knowledge graph of concepts. I think this knowledge graph is relatively small, but I suspect its precision is high. That said, I'm not sure how it relates to Wikipedia, if at all. They have an API for calculating a distance measure between any pair of their concepts (at the time of writing this post, their similarity API is down, so this is not a feasible solution at the moment).

-- Using http://degreesofwikipedia.com/: I don't know the details of their algorithm or how it works, but they provide a distance between Wiki concepts, and it is directional. For example, this and this.

4
Tgr On

You might be looking for the "is a" relationship: Q734774 (the Wikidata item for Saint Louis University) is a university, a building and a private not-for-profit educational institution. You can use SPARQL to query it: