Consider using github-to-sqlite to grab our activity dataset #76

Open
choldgraf opened this issue Aug 19, 2022 · 0 comments
Labels
enhancement New feature or request

choldgraf commented Aug 19, 2022

Context

This tool is basically a two-step process.

- In step 1, we grab as much data as we can about issues/PRs/comments/etc. using the GitHub API. This is done with:
  - This function:

    ```python
    def get_activity(
        target, since, until=None, repo=None, kind=None, auth=None, cache=None
    ):
        """Return issues/PRs within a date window.

        Parameters
        ----------
        target : string
            The GitHub organization/repo for which you want to grab recent
            issues/PRs. Can either be *just* an organization (e.g., `jupyter`)
            or a combination organization and repo (e.g., `jupyter/notebook`).
            If the former, all repositories for that org will be used. If the
            latter, only the specified repository will be used.
        since : string | None
            Return issues/PRs with activity since this date or git reference.
            Can be any string that is parsed with dateutil.parser.parse.
        until : string | None
            Return issues/PRs with activity until this date or git reference.
            Can be any string that is parsed with dateutil.parser.parse. If
            None, today's date will be used.
        kind : ["issue", "pr"] | None
            Return only issues or PRs. If None, both will be returned.
        auth : string | None
            An authentication token for GitHub. If None, then the environment
            variable `GITHUB_ACCESS_TOKEN` will be tried. If it does not
            exist, then attempt to infer a token from `gh auth status -t`.
        cache : bool | str | None
            Whether to cache the returned results. If None, no caching is
            performed. If True, the cache is located at
            ~/github_activity_data. It is organized as orgname/reponame
            folders with CSV files inside that contain the latest data. If a
            string, it is treated as the path to a cache folder.
        """
    ```

  - And this GraphQL query: https://github.com/executablebooks/github-activity/blob/master/github_activity/graphql.py
- In step 2, we parse the resulting data, munge it, and output markdown, statistics, etc.
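To illustrate the kind of munging step 2 performs, the date-window filtering could be sketched like this. This is a hypothetical stdlib-only helper for illustration, not the package's actual code (which accepts any string `dateutil.parser.parse` understands):

```python
from datetime import datetime

def filter_by_window(items, since, until=None):
    """Keep items whose 'updatedAt' timestamp falls within [since, until].

    Hypothetical helper -- uses ISO-format dates only, unlike the real
    get_activity, which parses dates with dateutil.
    """
    since_dt = datetime.fromisoformat(since)
    # Mirror get_activity's behavior: if `until` is None, use today's date.
    until_dt = datetime.fromisoformat(until) if until else datetime.now()
    return [
        item for item in items
        if since_dt <= datetime.fromisoformat(item["updatedAt"]) <= until_dt
    ]

issues = [
    {"title": "Recent issue", "updatedAt": "2022-08-01"},
    {"title": "Old issue", "updatedAt": "2021-06-01"},
]
recent = filter_by_window(issues, since="2022-01-01")
print([i["title"] for i in recent])  # → ['Recent issue']
```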

However, the functionality in step 1 is kind of hacky and messy, and hard to reason about.

I recently came across a tool recommended by @simonw, github-to-sqlite, which essentially replicates all of this functionality with a better-structured and better-maintained implementation:

This is a python library that will grab all of the issues, pull requests, and comments (among other things) from a repository and store them in a local sqlite database so that you can do what you want with them. They are structured to be able to work with datasette as well (though we may not have use for that in this package, just FYI).
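Once github-to-sqlite has populated a database, our downstream munging could become plain SQL. The table and column names below are assumptions based on the tool's README (its actual schema is much richer), demonstrated against a mocked-up in-memory database rather than a real export:

```python
import sqlite3

# Mock up a tiny table shaped roughly like github-to-sqlite's `issues`
# table (column names here are an assumption, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues (number INTEGER, title TEXT, state TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO issues VALUES (?, ?, ?, ?)",
    [
        (1, "Old bug", "closed", "2021-01-05T00:00:00Z"),
        (2, "Recent feature", "open", "2022-08-10T00:00:00Z"),
    ],
)

# Filtering by date window becomes a one-line query.
rows = conn.execute(
    "SELECT number, title FROM issues WHERE updated_at >= ?", ("2022-01-01",)
).fetchall()
print(rows)  # → [(2, 'Recent feature')]
```

The appeal is that the "grab everything" step and the "slice what you need" step are fully decoupled: the database is the interface between them.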

Two questions I have and am not sure of the answers to:

- How to speed it up. I'm not sure whether github-to-sqlite does any caching or allows you to filter by date. If not, then it might take quite a long time to run this interactively.
- How to run it via a Python API. All the examples use a CLI, and while this is probably fine, it would be nice if we could grab / update datasets by running this as part of other scripts.
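Absent a documented Python API, one workaround is to shell out to the CLI from our scripts. A sketch of that pattern, using a generic wrapper (the github-to-sqlite subcommand and arguments shown in the comment are assumptions and would need verifying against its docs):

```python
import subprocess
import sys

def run_cli(args):
    """Run a CLI tool, raise on non-zero exit, and return its stdout.

    Generic wrapper; nothing here is specific to github-to-sqlite.
    """
    result = subprocess.run(args, capture_output=True, text=True)
    result.check_returncode()
    return result.stdout

# For the real tool the call might look something like (unverified):
#   run_cli(["github-to-sqlite", "issues", "github.db",
#            "executablebooks/github-activity"])
# Demonstrated here with a harmless stand-in command:
out = run_cli([sys.executable, "-c", "print('ok')"])
print(out.strip())  # → ok
```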

Proposal

What do folks think about re-using github-to-sqlite for our "grab all of the activity in a repository" step, and focusing this repository on the munging / filtering by date / calculating statistics / generating markdown aspects?

I think this might be a nice way to reduce some unnecessary complexity here and to re-use code from others in the ecosystem. I also like the idea of becoming familiar with datasette structures, as it opens the possibility that we could expose this kind of data in the future for others in the community to munge and use.

At this point I'm just exploring the idea and curious what others think!

Tasks and updates

No response

@choldgraf added the `enhancement` (New feature or request) label Aug 19, 2022