I am trying to get some data from the GitHub Archive using Google Big Query. The current amount of data I am requesting is too much for BigQuery to process (at least in the free tier) so I am trying to limit the scope of my request.
I want to limit the data such that only repositories that currently have more than 1000 stars will return historical data. Its more complicated than just saying repository_watchers > 1000 because that will exclude the historical data for the first 1000 stars a repository got.
SELECT repository_name, repository_owner, created_at, type, repository_url, repository_watchers
FROM [githubarchive:github.timeline]
WHERE type="WatchEvent"
ORDER BY created_at DESC
EDIT: Solution I used (based on answer by @Brian)
select y.repository_name, y.repository_owner, y.created_at, y.type, y.repository_url, y.repository_watchers
from [githubarchive:github.timeline] y
join (select repository_url, max(repository_watchers)
from [githubarchive:github.timeline] x
where x.type = 'WatchEvent'
group by repository_url
having max(repository_watchers) > 1000) x
on y.repository_url = x.repository_url
where y.type = 'WatchEvent'
order by y.repository_name, y.repository_owner, y.created_at desc
Try:
If that syntax is not supported you can use a 3 step solution like this:
Step 1: Find which REPOSITORY_NAME values have at least one record w/ a REPOSITORY_WATCHERS amount > 1000
Step 2: Store that result as a table, call it SUB
Step 3: Run the following against SUB (and your original table)