I want to annotate a corpus using Freebase types, but almost every instance in Freebase has several types, so I decided to use the most common type as each instance's type. Is there a way to get a list of instance counts per type? I found this query, but it doesn't seem right because the result only has about 400 types, and I think there are far more types than that.
[{
"id": null,
"name": null,
"type": "/freebase/type_profile",
"/freebase/type_profile/instance_count": []
}]
I question the premise, but let's talk about that at the end after answering your question.
That's (close to) the correct query. When I ask for the count by adding "return" : "count", I get 17,972, which sounds about right. Perhaps your query framework is adding a "limit" : 400 somehow?
Since you want the most common types, why don't we modify the query to sort them? Due to a quirk in the sorting, nulls sort last (or first in our reversed sort), so we'll also add a qualifier to filter them out. We could use >0, but since you presumably aren't interested in low-frequency types, let's use >1000 instead.
The final query, with the descending sort and the >1000 filter added, returns an ordered list of 849 types sorted in descending order by instance count.
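The sorted, filtered query described above can be sketched as follows. This is a reconstruction under assumptions rather than a verbatim query: it assumes MQL's usual conventions of a "sort" directive with a leading "-" for descending order and a ">" suffix on the property key for the comparison. The function name `build_type_count_query` is made up for illustration, and the sketch only builds and prints the query; it does not send it anywhere.

```python
import json

# Property that holds the per-type instance count (taken from the query above).
COUNT = "/freebase/type_profile/instance_count"


def build_type_count_query(min_count=1000):
    """Build an MQL query for type profiles, most frequent first.

    Assumed MQL syntax: "<property>>" is a greater-than constraint, and
    "sort" with a leading "-" sorts in descending order.
    """
    return [{
        "id": None,
        "name": None,
        "type": "/freebase/type_profile",
        COUNT: None,               # ask for the count itself
        COUNT + ">": min_count,    # drops nulls and low-frequency types
        "sort": "-" + COUNT,       # descending by instance count
    }]


# Serialize to the JSON form that would be pasted into a query editor.
print(json.dumps(build_type_count_query(), indent=2))
```

Lowering `min_count` to 0 corresponds to the >0 variant, which only filters out the nulls.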
You'll probably want to do a little hand curation of the resulting list to eliminate things like /common/topic, /common/document, /book/isbn, /book/pagination, etc. Mediator types won't also have /common/topic, so you could filter on that first (but depending on the types of things in your corpus, they may all be topics, i.e. entities, to start with).
Now back to the premise that most frequent == best. Depending on your application, you may actually want more specific (which usually means lower frequency) types, rather than broader, high-frequency types. For example, Deceased Person rather than Person, or Politician, Author, or Athlete, in preference to Person. You may want to consider using the least frequent type (which is used at least some threshold number of times). The other thing that you may want to do is blacklist non-commons types (i.e. types rooted at /base/... or /user/...), which haven't been as carefully curated.
EDIT - word of warning:
Those counts were last updated in 2012. That should be fine for an exercise like this where you just want a rough ordering, but if you need current stats, you'll need to either count occurrences in the Freebase data dump or figure out the separate Stats API, which I'm not sure is public/documented: http://freebase-site.googlecode.com/svn/trunk/www/lib/queries/stats.sjs
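Going back to the suggestion of preferring the least frequent type that still clears a usage threshold, here is a minimal sketch of that selection. The function name `pick_type`, the default threshold, and the instance counts in the example are all made up for illustration; only the excluded type IDs and the /base/ and /user/ prefixes come from the discussion above.

```python
def pick_type(instance_types, type_counts, min_count=1000,
              excluded_prefixes=("/base/", "/user/"),
              excluded_types=("/common/topic", "/common/document",
                              "/book/isbn", "/book/pagination")):
    """Return the least frequent acceptable type for one instance, or None.

    instance_types: type IDs attached to the instance.
    type_counts:    mapping of type ID -> instance count (e.g. from the
                    type_profile query).
    """
    candidates = [
        t for t in instance_types
        if t not in excluded_types                # hand-curated exclusions
        and not t.startswith(excluded_prefixes)   # non-commons types
        and type_counts.get(t, 0) >= min_count    # used often enough
    ]
    if not candidates:
        return None
    # Lower count usually means a more specific type.
    return min(candidates, key=lambda t: type_counts[t])


# Invented counts, for illustration only.
example_counts = {
    "/common/topic": 40_000_000,
    "/people/person": 3_000_000,
    "/people/deceased_person": 500_000,
    "/base/example/thing": 2_000,
}
example_types = ["/common/topic", "/people/person",
                 "/people/deceased_person", "/base/example/thing"]
print(pick_type(example_types, example_counts))  # /people/deceased_person
```

Swapping `min` for `max` gives the original "most common type" behavior, so both strategies can be compared on the same corpus.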