How can I get a list of all film ids from Freebase?

On a project I was working on a couple of years back, I was building a set of data about movies from Freebase. A simple shell script downloaded the "film.tsv" file (from http://download.freebase.com/datadumps/latest/browse/film/film.tsv). I then used the "id" field in that file to build the necessary MQL requests for each of the films (retrieving the other properties I was interested in e.g. actors, genres).

After looking at the developer's guide today, I realise that Freebase has moved on a fair bit and, significantly, that the dump file I used before is no longer available. The dump format is now RDF and, from what I can tell, the dumps are only available as a single 22GB archive.

If at all possible, I would like to avoid downloading a 22GB file each time I want to rebuild my data set. Is it still possible to retrieve individual dump files, like the old film.tsv file?

If not, is there an alternative way to obtain a full list of movie ids?

There are 2 answers

1
Shawn Simister (BEST ANSWER)

There's no replacement planned for film.tsv right now. You can get the current list of film IDs from the RDF dump like this:

zgrep $'\ttype\.object\.type\tfilm\.film' freebase-rdf.gz

Then, when you need to update the list, you can query the MQL Read API for the films that have been added since your last update:

[{
  "type": "/film/film",
  "id": null,
  "name": null,
  "timestamp": null,
  "timestamp>=": "2013-12",
  "sort": "-timestamp"
}]

Since the API returns 200 results at a time, you'll need to use a cursor to get the full list of results.
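The cursor loop could look roughly like the sketch below. It assumes each response is a JSON object with a `result` list and a `cursor` field, and that the cursor is missing or false once the last page has been served; that stop condition is an assumption on my part. `iter_results` and `fetch_page` are hypothetical names: `fetch_page` stands for whatever function you write to send the query with a given cursor and return the decoded response.

```python
from typing import Any, Callable, Dict, Iterator


def iter_results(
    fetch_page: Callable[[str], Dict[str, Any]]
) -> Iterator[Dict[str, Any]]:
    """Yield every result across all pages of a cursor-paged query.

    fetch_page(cursor) must return the decoded response for that cursor.
    """
    cursor = ""  # an empty cursor asks for the first page
    while True:
        response = fetch_page(cursor)
        yield from response.get("result", [])
        cursor = response.get("cursor")
        if not cursor:
            # Assumed: a missing or false cursor marks the last page.
            break
```

Keeping the HTTP call out of the loop means the paging logic can be tried against canned responses before it is pointed at the real API.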

0
Günter Zöchbauer

You can try the MQL query directly by opening the following link:

https://www.googleapis.com/freebase/v1/mqlread?query=[{%22type%22:%20%22/film/film%22,%22id%22:%20null,%22limit%22:300}]&cursor=

You will have to make many requests though.

Each response includes a cursor, which you pass as the cursor= parameter of the next request. AFAIK the default limit is 200, and you can't increase the limit at will. The query could probably be optimized so that the response does not contain the type.
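To avoid hand-encoding the link for every request, the URL can be built in code. This is only a sketch: the endpoint and the `query` and `cursor` parameters are taken from the link above, while `mqlread_url` is a hypothetical helper name and the default `limit` of 200 simply mirrors the figure mentioned here.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the link in this answer.
MQLREAD_URL = "https://www.googleapis.com/freebase/v1/mqlread"


def mqlread_url(cursor: str = "", limit: int = 200) -> str:
    """Build the request URL for one page of film ids.

    Pass an empty cursor for the first page, then the cursor value
    returned by the previous response.
    """
    query = [{"type": "/film/film", "id": None, "limit": limit}]
    params = {"query": json.dumps(query), "cursor": cursor}
    # urlencode percent-escapes the JSON and keeps the empty 'cursor='.
    return MQLREAD_URL + "?" + urlencode(params)
```

`mqlread_url()` gives the first-page URL, and `mqlread_url(cursor=previous_cursor)` gives the next one.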

You can edit the query in the query editor (http://tinyurl.com/pn5o52w). In the top right corner there is a 'link' button with a 'MQLRead link' option that shows the URL to execute. I added the 'cursor=' parameter manually; I thought the query editor offered an option for this, but I couldn't find it.