How do I programmatically extract GitHub repositories that contain a code string?

I am looking for a way to extract GitHub repositories containing files with a certain code string. I can do this manually using the GitHub search bar. For instance, if I'm looking for usages of the library pymc3, I can enter it in the search bar and then click on Code.

How does one do this programmatically?

I went through the GitHub Search API documentation. The Search Code functionality allows searching inside code, but it seems to work only when scoped to a user, organization, or repository. The Search Repositories functionality only looks at the description, title and README.

Update 1:

While browsing this post, I believe I found a way to identify some of the repositories that contain a code string.

If I write the following code -

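import requests

url = "https://api.github.com/search/code"

# Let requests URL-encode the query instead of embedding a space in the URL.
params = {"q": "pymc3 in:file"}

headers = {
  'Authorization': 'Token xxxxxxxxxxxxxxxxx'
}

response = requests.get(url, params=params, headers=headers)

print(response.text)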

I get the following result -

"total_count":43642,"incomplete_results":false,"items":[{"name":"pymc3_stoch_vol ...

However, the result gives me a bunch of information such as the git URL, HTML URL and some of the repositories that contain this string. I need to find a way to extract all the repositories that contain this string.
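A minimal sketch of pulling the repository names out of one such response. It assumes each entry in `items` carries a `repository` object with a `full_name` field, and the `sample` payload and the `repo_names` helper are made up for illustration, not taken from a real response.

```python
def repo_names(payload):
    """Return the unique repository full names in one search/code response,
    in the order they first appear."""
    seen = []
    for item in payload.get("items", []):
        name = item["repository"]["full_name"]
        if name not in seen:
            seen.append(name)
    return seen


# Hypothetical payload shaped like a parsed response (response.json()).
sample = {
    "total_count": 3,
    "incomplete_results": False,
    "items": [
        {"name": "model.py", "repository": {"full_name": "example-user/repo-a"}},
        {"name": "utils.py", "repository": {"full_name": "example-user/repo-a"}},
        {"name": "fit.py", "repository": {"full_name": "example-org/repo-b"}},
    ],
}

print(repo_names(sample))  # ['example-user/repo-a', 'example-org/repo-b']
```

Several files from the same repository can match, so the names are de-duplicated. This only covers the repositories on the page that was fetched, not every repository that contains the string.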

Update 2:

I now understand that GitHub limits results to 100 per page and 1000 results overall.
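Given those limits, a rough sketch of walking the pages could look like the following. It uses the `per_page` and `page` query parameters, and it uses `urllib` from the standard library rather than `requests` only to keep the sketch free of third-party dependencies. The helper names `fetch_page` and `collect_repos` are hypothetical, and the token is a placeholder.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.github.com/search/code"
PER_PAGE = 100      # page size limit mentioned above
MAX_RESULTS = 1000  # overall result limit mentioned above


def fetch_page(query, page, token):
    """Fetch one page of code-search results and return the parsed JSON."""
    params = urllib.parse.urlencode(
        {"q": query, "per_page": PER_PAGE, "page": page}
    )
    request = urllib.request.Request(
        f"{API_URL}?{params}",
        headers={"Authorization": f"Token {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_repos(query, token, fetch=fetch_page):
    """Collect repository names page by page until the results run out
    or the overall limit is reached."""
    repos = set()
    for page in range(1, MAX_RESULTS // PER_PAGE + 1):
        items = fetch(query, page, token).get("items", [])
        repos.update(item["repository"]["full_name"] for item in items)
        if len(items) < PER_PAGE:  # a short page means it was the last one
            break
    return sorted(repos)


# Example call (needs a real token and network access):
# print(collect_repos("pymc3 in:file", "xxxxxxxxxxxxxxxxx"))
```

With 100 results per page and a 1000-result ceiling, this makes at most 10 requests, so a query with more than 1000 matches still yields only part of the matching repositories. The search endpoints are rate limited, so a real script may need to pause between pages.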

The only remaining question is why I couldn't find this information in the GitHub Search API documentation. Please let me know if my understanding or the linked answer is wrong.

There is 1 answer

VonC (best answer)

This kind of query would be better served by the GraphQL API, but that API still does not support searching code.

Only the new code-search (presented here) might be able to provide that, but:

  • it is still in beta
  • its API is not yet public.

So for now, code search in all GitHub repositories is not supported.