How to obtain java repositories having maximum number of stars and less than 100 commits

81 views Asked by At

I am currently trying to obtain the top 100 java repositories having maximum number of stars and less than 100 commits using GitHub Archive and BigQuery. Could you please help to come up with a query for this purpose.

The initial query I have written to obtain the top 100 records for Java language as follows:

SELECT repository_name
FROM [githubarchive:github.timeline]
WHERE repository_language = 'Java' 
AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('1996-01-01 00:00:00') 
AND PARSE_UTC_USEC(repository_created_at) < PARSE_UTC_USEC('2015-05-30 00:00:00') 
GROUP BY repository_name
LIMIT 100 

1

There are 1 answers

4
Bohemian On

Add a HAVING clause to put a condition on the group:

SELECT repository_name
FROM [githubarchive:github.timeline]
WHERE repository_language = 'Java' 
AND PARSE_UTC_USEC(repository_created_at) BETWEEN PARSE_UTC_USEC('1996-01-01 00:00:00') AND PARSE_UTC_USEC('2015-05-30 00:00:00') 
GROUP BY repository_name
HAVING COUNT(*) < 100  -- Add this clause
ORDER BY COUNT(*) DESC -- Add this to make the LIMIT meaningful
LIMIT 100

I don't know if the 100 of the LIMIT is related to the 100 maximum commit count. If so, it's not needed. If not (and you actually want to limit the rows to 100) then add an ORDER BY to chose which 100 rows you want (rather than 100 random ones).

I also simplified the date range condition to a BETWEEN.


Although the above is valid SQL (tested OK in MySQL, Postgres, SQLServer and Oracle), in case bigquery can't handle order by COUNT(*) (as reported by OP), use a subquery:

SELECT * FROM (
    SELECT repository_name, COUNT(*) commit_count
    FROM [githubarchive:github.timeline]
    WHERE repository_language = 'Java' 
    AND PARSE_UTC_USEC(repository_created_at) BETWEEN PARSE_UTC_USEC('1996-01-01 00:00:00') AND PARSE_UTC_USEC('2015-05-30 00:00:00') 
    GROUP BY repository_name
    HAVING COUNT(*) < 100
) x
ORDER BY commit_count DESC
LIMIT 100