How far can one retrieve data from GitHub Archive?


The GitHub Archive project states

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

This archive is also queryable through Google BigQuery. However, it looks like either I'm missing something or only a portion of the data is available.

Indeed, running the following query returns only 1,636 WatchEvents (started or stopped), whereas the Rails repository has more than 14,300 watchers.

SELECT actor_attributes_login, created_at, payload_action
FROM [githubarchive:github.timeline]
WHERE repository_name = "rails"
  AND type = "WatchEvent"
ORDER BY created_at ASC;

It looks like the oldest retrieved piece of data is more or less 2.5 months old.
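
One way to check the covered date range directly, rather than scanning the sorted rows, would be an aggregate query along these lines. This is only a sketch: it reuses the table and filters from the query above, and it assumes that created_at sorts chronologically as stored, so that MIN and MAX give the first and last recorded events.

```sql
-- Sketch: earliest and latest WatchEvent recorded for repositories named "rails".
-- Assumes created_at values sort chronologically in the form they are stored.
SELECT
  MIN(created_at) AS first_event,
  MAX(created_at) AS last_event,
  COUNT(*) AS event_count
FROM [githubarchive:github.timeline]
WHERE repository_name = "rails"
  AND type = "WatchEvent";
```

Note that the filter on repository_name alone also matches forks and unrelated repositories that happen to be called "rails", so the count is an upper bound for the main repository.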

Is the data truncated (which would seem strange for an archive)? Or is there a BigQuery limit or quota that I'm not aware of?

github-archive

Best answer, by igrigorik

That's correct. The project/crawler went live on March 11th of this year, hence the current archive starts on that day. There is a note about this on the githubarchive.org page, but I guess I should make it more visible and explicit.

There is a thread with the GitHub team about making more of their history available, but I don't have an ETA for it yet. Fingers crossed :-)