I'm new to SQL and GitHubArchieve and trying to get the list of languages and stargazers of the popular repositories on GitHub.
The information I'm looking for are repo id, repo languages (languages + percentage), repo stargazers (and their timezones).
So, my solution to this was by retrieving the most popular repositories and then parsing the languages_url
and stargazers_url
to get their information.
I started with the following query to retrieve the popular repositories:
SELECT
repo.name,
repo.id,
repo.url,
payload,
COUNT(*) watch_count,
JSON_EXTRACT_SCALAR(payload, '$.watch.action.repository.languages_url') AS repo_languages_url
FROM [githubarchive:month.201601],
WHERE type = 'WatchEvent'
GROUP BY 1,2,3,4,6
HAVING watch_count >= 2000 ORDER BY watch_count DESC
LIMIT 1000
The payload for the watch events was just: {"action":"started"}
so I didn't get any information about the languages and the contributors of the repositories.
Any suggestions to get the information I'm looking for? Is there any way for retrieving that information without parsing?
Thanks in advance