Google BigQuery SQL Statement

677 views Asked by At

I am trying to get some data from the GitHub Archive using Google Big Query. The current amount of data I am requesting is too much for BigQuery to process (at least in the free tier) so I am trying to limit the scope of my request.

I want to limit the data such that only repositories that currently have more than 1000 stars will return historical data. Its more complicated than just saying repository_watchers > 1000 because that will exclude the historical data for the first 1000 stars a repository got.

SELECT repository_name, repository_owner, created_at, type, repository_url, repository_watchers
FROM [githubarchive:github.timeline]
WHERE type="WatchEvent"
ORDER BY created_at DESC

EDIT: Solution I used (based on answer by @Brian)

select y.repository_name, y.repository_owner, y.created_at, y.type, y.repository_url, y.repository_watchers
  from [githubarchive:github.timeline] y
  join (select repository_url, max(repository_watchers)
          from [githubarchive:github.timeline] x
         where x.type = 'WatchEvent'
         group by repository_url
        having max(repository_watchers) > 1000) x
    on y.repository_url = x.repository_url
  where y.type = 'WatchEvent'
 order by y.repository_name, y.repository_owner, y.created_at desc
1

There are 1 answers

6
Brian DeMilia On BEST ANSWER

Try:

select y.*
  from [githubarchive :github.timeline] y
  join (select repository_name, max(repository_watchers)
          from [githubarchive :github.timeline]
         where x.type = 'WatchEvent'
         group by repository_name
        having max(repository_watchers) > 1000) x
    on y.repository_name = x.repository_name
 order by y.created_at desc

If that syntax is not supported you can use a 3 step solution like this:

Step 1: Find which REPOSITORY_NAME values have at least one record w/ a REPOSITORY_WATCHERS amount > 1000

select repository_name, max(repository_watchers) as curr_watchers
  from [githubarchive :github.timeline]
 where type = 'WatchEvent'
 group by repository_name
having max(repository_watchers) > 1000

Step 2: Store that result as a table, call it SUB

Step 3: Run the following against SUB (and your original table)

select y.*
  from [githubarchive :github.timeline] y
  join sub x
    on y.repository_name = x.repository_name
 order by y.created_at desc