How to measure language popularity via Github Archive data?

861 views Asked by At

I'm attempting to measure programming language popularity via:

  1. The number of stars on repos in combination with...
  2. The programming languages used in the repo and...
  3. The total bytes of code in each language (recognizing that some languages are more/less verbose)

Conveniently, there is a massive trove of Github data provided by Github Archive, and hosted by BigQuery. The only problem is that I don't see "language" available in any of the payloads for the various event types in Github Archive.

Here's the BigQuery query I've been running trying to find if, and where, language may be populated in the Github Archive data:

SELECT *
FROM [githubarchive:month.201612]
WHERE JSON_EXTRACT(payload, "$.repository.language") is null
LIMIT 100

Can someone please provide insight into whether I'll be able to utilize Github Archive data in this way, and how I can go about doing so? Or will I need to pursue some other approach? I see that there is also a github_repos public dataset on BigQuery, and it does have some language metrics, but the languages metrics seem to be over all time. I'd prefer to get some sort of monthly metric eventually (i.e., of "active" repos in a given month, what were the most popular languages).

Any advice is appreciated!

2

There are 2 answers

1
Felipe Hoffa On BEST ANSWER

With BigQuery and GitHub Archive and GHTorrent -

To get the languages by pull requests, last December (copy pasted from http://mads-hartmann.com/2015/02/05/github-archive.html):

SELECT COUNT(*) c, JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') lang
FROM [githubarchive:month.201612]
WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') IS NOT NULL
GROUP BY 2
ORDER BY 1 DESC
LIMIT 10

http://i.imgur.com/PmDxoEX.png

To find the number of stars per project:

SELECT COUNT(*) c, repo.name 
FROM [githubarchive:month.201612]
WHERE type='WatchEvent'
GROUP BY 2
ORDER BY 1 DESC
LIMIT 10

http://i.imgur.com/yXDHUlB.png

For a quick language vs bytes view, you can use GHTorrent:

SELECT language, SUM(bytes) bytes
FROM [ghtorrent-bq:ght.project_languages]
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

http://i.imgur.com/8RvrVBA.png

Or to look at the actual files, see the contents of GitHub on BigQuery.

Now you can mix these queries to get the results you want!

0
Mikhail Berlyant On
SELECT 
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
  COUNT(1) AS usage
FROM [githubarchive:month.201601] 
GROUP BY language
HAVING NOT language IS NULL
ORDER BY usage DESC