I'm attempting to measure programming language popularity via:
- The number of stars on repos in combination with...
- The programming languages used in the repo and...
- The total bytes of code in each language (recognizing that some languages are more/less verbose)
Conveniently, there is a massive trove of GitHub data provided by GitHub Archive and hosted on BigQuery. The only problem is that I don't see "language" available in any of the payloads for the various event types in GitHub Archive.
Here's the BigQuery query I've been running to find out whether, and where, language may be populated in the GitHub Archive data:
SELECT *
FROM [githubarchive:month.201612]
WHERE JSON_EXTRACT(payload, "$.repository.language") is null
LIMIT 100
Can someone please provide insight into whether I'll be able to use GitHub Archive data in this way, and how I can go about doing so? Or will I need to pursue some other approach? I see that there is also a github_repos public dataset on BigQuery, and it does have some language metrics, but those metrics seem to cover all time. I'd prefer to eventually get a monthly metric (i.e., of the repos that were "active" in a given month, which languages were the most popular).
Any advice is appreciated!
You can do this with BigQuery, using GitHub Archive and GHTorrent.
To get the languages by pull request for last December (query copied from http://mads-hartmann.com/2015/02/05/github-archive.html):
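The query itself did not survive here, so the following is only a sketch of what such a query could look like, not the one from the linked post. It assumes that the language is exposed in PullRequestEvent payloads under `$.pull_request.base.repo.language` (rather than under `$.repository.language`, which the question probes), and it uses the same legacy SQL table as the question:

```sql
-- Sketch (legacy SQL): count pull request events per language for December 2016.
-- Assumption: the JSON path below is where PullRequestEvent payloads carry
-- the repository's primary language.
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') AS lang,
  COUNT(*) AS pull_requests
FROM [githubarchive:month.201612]
WHERE type = 'PullRequestEvent'
  AND JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') IS NOT NULL
GROUP BY lang
ORDER BY pull_requests DESC
LIMIT 10
```

Note that this reports only the primary language GitHub detects for each repo, not a per-language byte breakdown.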
To find the number of stars per project:
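A minimal sketch, assuming that each star shows up in GitHub Archive as a `WatchEvent` and that the repo name lives in the `repo.name` column. It counts the stars received during the month, not the repo's all-time total:

```sql
-- Sketch (legacy SQL): stars received per repo during December 2016.
-- A WatchEvent is emitted when a user stars a repository.
SELECT
  repo.name AS repo_name,
  COUNT(*) AS stars
FROM [githubarchive:month.201612]
WHERE type = 'WatchEvent'
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 10
```

Pointing the query at a different monthly table gives the figure for another month.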
For a quick language vs bytes view, you can use GHTorrent:
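A possible sketch of that view. The table name `ghtorrent-bq:ght.project_languages` and its `language` and `bytes` columns are assumptions about the public GHTorrent dataset on BigQuery, so check the schema before relying on it:

```sql
-- Sketch (legacy SQL): total bytes of code per language in GHTorrent.
-- Assumption: the table and column names below match the public dataset.
SELECT
  language,
  SUM(bytes) AS total_bytes
FROM [ghtorrent-bq:ght.project_languages]
GROUP BY language
ORDER BY total_bytes DESC
LIMIT 10
```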
Or, to look at the actual files, see the contents of GitHub on BigQuery (the github_repos public dataset mentioned in the question).
Now you can mix these queries to get the results you want!
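As one way of mixing them, here is a sketch that weights languages by the stars their repos received in a single month, using only GitHub Archive. It relies on the same assumptions as before: stars are `WatchEvent` rows, and the primary language sits at `$.pull_request.base.repo.language` in PullRequestEvent payloads. Only repos that had at least one pull request event in the month are covered:

```sql
-- Sketch (legacy SQL): monthly stars per language, December 2016.
-- Caveat: a repo whose detected language changed during the month
-- would be counted once per language.
SELECT
  prs.lang AS lang,
  SUM(stars.c) AS total_stars
FROM (
  -- Stars received per repo in the month.
  SELECT repo.name AS name, COUNT(*) AS c
  FROM [githubarchive:month.201612]
  WHERE type = 'WatchEvent'
  GROUP BY name
) AS stars
JOIN (
  -- One row per (repo, primary language) seen in pull request events.
  SELECT
    repo.name AS name,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') AS lang
  FROM [githubarchive:month.201612]
  WHERE type = 'PullRequestEvent'
    AND JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') IS NOT NULL
  GROUP BY name, lang
) AS prs
ON stars.name = prs.name
GROUP BY lang
ORDER BY total_stars DESC
LIMIT 10
```

To bring verbosity into the picture, the per-language byte counts from GHTorrent or github_repos could be joined in the same way.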