Consider using github-to-sqlite to grab our activity dataset #76

Open
choldgraf opened this issue Aug 19, 2022 · 0 comments
Labels
enhancement New feature or request

choldgraf commented Aug 19, 2022

Context

This tool is basically a two-step process.

- In step 1, we grab as much data as we can about issues/PRs/comments/etc. using the GitHub API. This is done with:
  - This function:

    ```python
    def get_activity(
        target, since, until=None, repo=None, kind=None, auth=None, cache=None
    ):
        """Return issues/PRs within a date window.

        Parameters
        ----------
        target : string
            The GitHub organization/repo for which you want to grab recent
            issues/PRs. Can either be *just* an organization (e.g., `jupyter`)
            or a combination organization and repo (e.g., `jupyter/notebook`).
            If the former, all repositories for that org will be used. If the
            latter, only the specified repository will be used.
        since : string | None
            Return issues/PRs with activity since this date or git reference.
            Can be any string that is parsed with dateutil.parser.parse.
        until : string | None
            Return issues/PRs with activity until this date or git reference.
            Can be any string that is parsed with dateutil.parser.parse. If
            None, today's date will be used.
        kind : ["issue", "pr"] | None
            Return only issues or PRs. If None, both will be returned.
        auth : string | None
            An authentication token for GitHub. If None, then the environment
            variable `GITHUB_ACCESS_TOKEN` will be tried. If it does not
            exist, then attempt to infer a token from `gh auth status -t`.
        cache : bool | str | None
            Whether to cache the returned results. If None, no caching is
            performed. If True, the cache is located at
            ~/github_activity_data. It is organized as orgname/reponame
            folders with CSV files inside that contain the latest data. If a
            string, it is treated as the path to a cache folder.
        """
    ```

  - And this GraphQL query: https://github.com/executablebooks/github-activity/blob/master/github_activity/graphql.py
- In step 2, we parse the resulting data, munge it, and output markdown, statistics, etc.
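To illustrate the kind of munging step 2 performs, the date-window filtering could be sketched like this. This is a hypothetical stdlib-only helper for illustration, not the package's actual code (which accepts any string `dateutil.parser.parse` understands):

```python
from datetime import datetime

def filter_by_window(items, since, until=None):
    """Keep items whose 'updatedAt' timestamp falls within [since, until].

    Hypothetical helper -- uses ISO-format dates only, unlike the real
    get_activity, which parses dates with dateutil.
    """
    since_dt = datetime.fromisoformat(since)
    # Mirror get_activity's behavior: if `until` is None, use today's date.
    until_dt = datetime.fromisoformat(until) if until else datetime.now()
    return [
        item for item in items
        if since_dt <= datetime.fromisoformat(item["updatedAt"]) <= until_dt
    ]

issues = [
    {"title": "Recent issue", "updatedAt": "2022-08-01"},
    {"title": "Old issue", "updatedAt": "2021-06-01"},
]
recent = filter_by_window(issues, since="2022-01-01")
print([i["title"] for i in recent])  # → ['Recent issue']
```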

However, the functionality in step 1 is kind of hacky and messy, and hard to reason about.

I recently came across a tool recommended by @simonw, github-to-sqlite, which essentially replicates all of this functionality with a better-structured and better-maintained implementation:

This is a python library that will grab all of the issues, pull requests, and comments (among other things) from a repository and store them in a local sqlite database so that you can do what you want with them. They are structured to be able to work with datasette as well (though we may not have use for that in this package, just FYI).
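Once github-to-sqlite has populated a database, our downstream munging could become plain SQL. The table and column names below are assumptions based on the tool's README (its actual schema is much richer), demonstrated against a mocked-up in-memory database rather than a real export:

```python
import sqlite3

# Mock up a tiny table shaped roughly like github-to-sqlite's `issues`
# table (column names here are an assumption, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues (number INTEGER, title TEXT, state TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO issues VALUES (?, ?, ?, ?)",
    [
        (1, "Old bug", "closed", "2021-01-05T00:00:00Z"),
        (2, "Recent feature", "open", "2022-08-10T00:00:00Z"),
    ],
)

# Filtering by date window becomes a one-line query.
rows = conn.execute(
    "SELECT number, title FROM issues WHERE updated_at >= ?", ("2022-01-01",)
).fetchall()
print(rows)  # → [(2, 'Recent feature')]
```

The appeal is that the "grab everything" step and the "slice what you need" step are fully decoupled: the database is the interface between them.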

Two questions I have and am not sure of the answers to:

- How to speed it up. I'm not sure whether github-to-sqlite does any caching or allows you to filter by date. If not, then it might take quite a long time to run this interactively.
- How to run it via a Python API. All the examples use a CLI, and while this is probably fine, it would be nice if we could grab / update datasets by running this as part of other scripts.
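Absent a documented Python API, one workaround is to shell out to the CLI from our scripts. A sketch of that pattern, using a generic wrapper (the github-to-sqlite subcommand and arguments shown in the comment are assumptions and would need verifying against its docs):

```python
import subprocess
import sys

def run_cli(args):
    """Run a CLI tool, raise on non-zero exit, and return its stdout.

    Generic wrapper; nothing here is specific to github-to-sqlite.
    """
    result = subprocess.run(args, capture_output=True, text=True)
    result.check_returncode()
    return result.stdout

# For the real tool the call might look something like (unverified):
#   run_cli(["github-to-sqlite", "issues", "github.db",
#            "executablebooks/github-activity"])
# Demonstrated here with a harmless stand-in command:
out = run_cli([sys.executable, "-c", "print('ok')"])
print(out.strip())  # → ok
```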

Proposal

What do folks think about re-using github-to-sqlite for our "grab all of the activity in a repository" step, and focusing this repository on the munging / filtering by date / calculating statistics / generating markdown aspects?

I think this might be a nice way to reduce some unnecessary complexity here and to re-use code from others in the ecosystem. I also like the idea of becoming familiar with datasette structures, as it opens the possibility that we could expose this kind of data in the future for others in the community to munge and use.

At this point I'm just exploring the idea and curious what others think!

Tasks and updates

No response

@choldgraf added the `enhancement` (New feature or request) label Aug 19, 2022