From 9cd65d47c9d770959eb7bc317f6b4f17dbd2a70f Mon Sep 17 00:00:00 2001
From: Caleb Brown
Date: Fri, 29 Apr 2022 16:55:23 +1000
Subject: [PATCH 1/4] Add a milestone 1 doc.

---
 docs/design/milestone_1.md | 335 +++++++++++++++++++++++++++++++++++++
 1 file changed, 335 insertions(+)
 create mode 100644 docs/design/milestone_1.md

diff --git a/docs/design/milestone_1.md b/docs/design/milestone_1.md
new file mode 100644
index 000000000..076cc917e
--- /dev/null
+++ b/docs/design/milestone_1.md
@@ -0,0 +1,335 @@
+
+# Criticality Score Revamp: Milestone 1
+
+- Author: [calebbrown@google.com](mailto:calebbrown@google.com)
+- Updated: 2022-04-29
+
+
+## Goal
+
+Anyone can reliably generate raw signal data using the `criticality_score`
+GitHub project.
+
+Support future moves towards scaling and automating criticality score.
+
+### Non-goals
+
+**Improve how the score is calculated.**
+
+While this is overall vital, the ability to calculate the score depends on
+having reliable signals to base the score on.
+
+**Cover source repositories hosted on non-GitHub hosts.**
+
+Critical projects are hosted on GitLab, Bitbucket, or even self-hosted. These
+should be supported, but given that over 90% of open source projects are
+hosted by GitHub it seems prudent to focus efforts there first.
+
+## Background
+
+TBC
+
+## Design Overview
+
+### Project Definition
+
+A _project_ is defined as only having a _single repository_, and a _single
+issue tracker_. A project may provide multiple _packages_.
+
+There are some "umbrella projects" (e.g. Kubernetes) that have multiple
+repositories associated with them, or may use a centralized issue tracker. An
+alternative approach would be to treat a project separately to the one or
+more repositories that belong to it.
+
+However this approach has the following drawbacks:
+
+* Makes it hard to distinguish between organizations and umbrella projects
+* Raises the possibility that a part of the umbrella project that is critical
+  to OSS is missed.
+* Complicates the calculation required to aggregate signals and generate a
+  criticality score.
+
+So instead we define a project as a single repository. This provides a clear
+"primary key" we can use for collecting signals.
+
+#### Forks and Mirrors
+
+Mirrors and forks are clones of another project's repository.
+
+A mirror is usually used to provide broader access to a repository, such as
+when a self-hosted project mirrors its repository on GitHub.
+
+A fork has two primary uses:
+
+* A contributor commits changes to a fork for preparing pull-requests to the
+  main repository.
+* A fork may become its own project when the original is unmaintained, or if
+  the forker decides to head in a different direction.
+
+This raises two considerations for generating criticality scores:
+
+* Project repositories may need to be de-duped to avoid treating the original
+  source and its mirrors as separate projects.
+* Forks merely used for committing changes for a pull-request should be ignored
+  to lower the work and potential noise (fortunately these should score low
+  enough to make it easy to ignore them)
+
+### Multi Stage
+
+The design takes a multi stage approach to generating raw criticality signal
+data ready for ingestion into a BigQuery table.
+
+The stages are:
+
+* **Project enumeration** - produce a list of project repositories, focusing
+  initially on GitHub for Milestone 1.
+* **Raw signal collection** - iterate through the list of projects and query
+  various data sources for raw signals.
+* **BigQuery ingestion** - take the raw signals and import them into a BigQuery
+  table for querying and scoring.
+
+Some API efficiency is gained by collecting some raw signals during project
+enumeration. However, the ability to run stages separately and at different
+frequencies improves the overall reliability of the project, and allows for raw
+signal data to be refreshed more frequently.
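The stage boundaries described above suggest that stages can hand data to each other through plain files. The following is a minimal sketch of that idea, written in Python for brevity even though the design itself targets Go. The names `write_project_list` and `collect_signals`, the `url` field, and the injected `collect` callback are all hypothetical and not part of the project.

```python
import io
import json
from typing import Callable, Dict, Iterable, TextIO


def write_project_list(urls: Iterable[str], out: TextIO) -> None:
    """Stage 1 output (hypothetical helper): one repository URL per line."""
    for url in urls:
        out.write(url + "\n")


def collect_signals(
    project_list: TextIO,
    out: TextIO,
    collect: Callable[[str], Dict[str, object]],
) -> int:
    """Stage 2 (hypothetical helper): write one JSON record per project URL.

    `collect` stands in for real signal collection. Records are written as
    newline-delimited JSON. Returns the number of records written.
    """
    written = 0
    for line in project_list:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the project list
        record = {"url": url}
        record.update(collect(url))
        out.write(json.dumps(record, sort_keys=True) + "\n")
        written += 1
    return written


if __name__ == "__main__":
    projects = io.StringIO()
    write_project_list(
        ["https://github.com/example/a", "https://github.com/example/b"],
        projects,
    )
    projects.seek(0)

    signals = io.StringIO()
    # The lambda is a placeholder; a real run would query data sources.
    collect_signals(projects, signals, lambda url: {"stars": 0})
    print(signals.getvalue(), end="")
```

Because each stage only reads and writes files in this sketch, the collection stage could be re-run against an older project list without repeating enumeration.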
+
+## Detailed Design
+
+### Project enumeration
+
+#### Direct GitHub Enumeration
+
+##### Challenges
+
+* GitHub has a lot of repos. Over 2.5M repos with 5 or more stars, and over
+  400k repos with 50 or more stars at the time of writing.
+* GitHub's API only allows you to iterate through 1000 results.
+* GitHub's API has limited methods of sorting and filtering.
+
+Given these limitations it is difficult to extract all the repositories over
+a certain number of stars, as the number of repositories with low stars exceeds
+the 1000 result limit of GitHub's API.
+
+The lowest number of stars that returns fewer than 1000 results can be improved
+by stepping through each creation date.
+
+With a sufficiently high minimum star threshold (e.g. 20), most creation dates
+will have fewer than 1000 results in total.
+
+##### Algorithm
+
+* Set `MIN_STARS` to a value chosen such that the number of repositories with
+  that number of stars is less than 1000 for any given creation date.
+* Set `STAR_OVERLAP`, `START_DATE` and `END_DATE`
+* For each `DATE` between `START_DATE` and `END_DATE`:
+    * Set `MAX_STARS` to infinity
+    * Search for repos with a creation date of `DATE` and stars between
+      `MAX_STARS` and `MIN_STARS` inclusive, ordered from highest stars to
+      lowest.
+    * While True:
+        * For each repository (GitHub limits this to 1000 results):
+            * If the repository has not been seen:
+                * Add it to the list of repositories
+        * If there were fewer than 1000 results:
+            * Break
+        * Set `MAX_STARS` to the number of stars the last repository
+          returned + `STAR_OVERLAP`
+        * If `MAX_STARS` is the same as the previous value
+            * Break
+
+The current implementation of this algorithm has a difference between GitHub
+search of less than 0.05% for >=20 stars (GitHub search was checked ~12 hours
+after the algorithm finished) and took 4 hours with 1 worker and 1 token.
+
+
+##### Rate Limits
+
+A pool of GitHub tokens will be supported for increased performance.
+
+A single GitHub token has a limit of "5000" each hour, a single search page
+consumes "1", and returning the 1000 results from a search consumes "10". This
+allows 500 search queries per hour for a single token.
+
+
+##### Output
+
+Output from enumeration will be a text file containing a list of GitHub urls.
+
+
+#### Static Project URL Lists
+
+Rather than repeatedly query project repositories for a list of projects, use
+pre-generated static lists of project repository URLs.
+
+Sources:
+
+* Prior invocations of the enumeration tool
+* Manually curated lists of URLs
+* [GHTorrent](https://ghtorrent.org/) data dumps
+
+##### GHTorrent
+
+GHTorrent monitors GitHub's public event feed and provides a fairly
+comprehensive source of projects.
+
+Data from GHTorrent needs to be extracted from the SQL dump and filtered to
+eliminate deleted repositories.
+
+The 2021-03-06 dump includes approx 190M repositories. This many repositories
+would need to be curated to ensure each repository is still available. Culling
+for "interesting" (e.g. more than 1 star) repositories may also be useful to
+limit the amount of work generating signals.
+
+
+#### Future Sources of Projects
+
+There are many other sources of projects for future milestones that can be
+used. These are out-of-scope for Milestone 1, but worth listing.
+
+* Other source repositories such as GitLab and Bitbucket.
+* [https://deps.dev/](https://deps.dev/) projects. This source captures many
+  projects that exist in package repositories and helps connect projects to
+  their packages and dependents.
+* GHTorrent or GH Archive - these can avoid the expense of querying GitHub's
+  API directly.
+* Google dorking - use Google's advanced search capabilities to find
+  self-hosted repositories (e.g. cgit, gitea, etc)
+* JIRA, Bugzilla, etc support for issue tracking
+
+
+### Raw Signal Collection
+
+This stage is when the list of projects are iterated over and for each project
+a set of raw signal data is output.
+
+For Milestone 1, the focus will be on reproducing the existing signals
+collected by the Python implementation, and adding support for dependent data
+sourced from [deps.dev](https://deps.dev).
+
+Additionally there will be a focus on making it straightforward to add new
+signal sources and signals.
+
+
+#### Input / Output
+
+Input:
+
+* One or more text files containing a list of project urls, one URL per line
+
+Output:
+
+* Either JSON or CSV formatted records for each project in UTF-8, including
+  the project url. The output will support direct loading into BigQuery.
+
+#### Signal Collectors
+
+Signal collection will be built around multiple signal _collectors_ that
+produce one or more _signals_ per repository.
+
+Signal collectors fall into one of three categories:
+
+* Source repository and hosting signal collectors (e.g. GitHub, Bitbucket,
+  cGit)
+* Issue tracking signal collectors (e.g. GitHub, Bugzilla, JIRA)
+* Additional signal collectors (e.g. deps.dev)
+
+Each repository can have only one set of signals from a source repository
+collector and one set of signals from an issue tracking signal collector, but
+can have signals from many additional collectors.
+
+#### Repository Object
+
+During the collection process a repository object will be created and passed to
+each collector.
+
+As each part of the collection process runs, data will be fetched for a
+repository. The repository object will serve as the interface for accessing
+repository specific data so that it can be cached and limit the amount of
+additional queries that need to be executed.
+
+#### Collection Process
+
+The general process for collecting signals will do the following:
+
+* Initialize all the collectors
+* For each repository URL
+    * Gather basic data about the repository (e.g. stars, has it moved, urls)
+        * It may have been removed, in which case the repository can be
+          skipped.
+        * It may not be "interesting" (e.g. too few stars) and should be
+          skipped.
+        * It may have already been processed and should be skipped.
+    * Determine the set of collectors that apply to the repository.
+    * For each collector:
+        * Start collecting the signals for the current repository
+    * Wait for all collectors to complete
+    * Write the signals to the output.
+
+#### Signal Fields
+
+##### Naming
+
+Signal fields will fall under the general naming pattern of
+`[collector].[name]`.
+
+Where `[collector]` and `[name]` are made up of one or more of the
+following:
+
+* Lowercase characters
+* Numbers
+* Underscores.
+
+The following restrictions further apply to `[collector]` names:
+
+* Source repository signal collectors must use the `repo` collector name
+* Issue tracking signal collectors must use the `issues` collector name
+* Signals matching the original set in the Python implementation can also use
+  the `legacy` collector name
+* Additional collectors can use any other valid name.
+
+Finally, `[name]` names must include the unit value if it is not implied by
+the type, and any time constraints.
+
+* e.g. `last_update_days`
+* e.g. `comment_count_prev_year`
+
+##### Types
+
+For Milestone 1, all signal fields will be scalars. More complex data types are
+out of scope.
+
+Supported scalars can be:
+
+* Boolean
+* Int
+* Float
+* String
+* Date
+* DateTime
+
+All Dates and DateTimes must be in UTC.
+
+Strings will support Unicode.
+
+#### Batching (out of scope)
+
+More efficient usage of GitHub's APIs can be achieved by batching together
+related requests. Support for batching is considered out of scope for
+Milestone 1.
+
+### BigQuery Ingestion
+
+Ingestion into BigQuery will be done for Milestone 1 using the `bq` command
+line tool.
+
+### Language Choice
+
+The Scorecard project and Criticality Score share many of the same needs.
+
+Scorecards also interacts with the GitHub API, negotiates rate limiting and
+handles pools of GitHub tokens.
+
+Therefore it makes sense to move towards these projects sharing code.
+
+As Scorecards is a more mature project, this requires Criticality Score to be
+rewritten in Go.

From 17cd55c16039a0a2bfdf8b4e629dd7b85232933a Mon Sep 17 00:00:00 2001
From: Caleb Brown
Date: Mon, 2 May 2022 11:08:10 +1000
Subject: [PATCH 2/4] Add a glossary and improve the milestone 1 doc.

---
 docs/design/milestone_1.md | 89 ++++++++++++++++----------------------
 docs/glossary.md           | 88 +++++++++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+), 52 deletions(-)
 create mode 100644 docs/glossary.md

diff --git a/docs/design/milestone_1.md b/docs/design/milestone_1.md
index 076cc917e..04865ca10 100644
--- a/docs/design/milestone_1.md
+++ b/docs/design/milestone_1.md
@@ -4,13 +4,17 @@
 - Author: [calebbrown@google.com](mailto:calebbrown@google.com)
 - Updated: 2022-04-29
 
-
 ## Goal
 
 Anyone can reliably generate raw signal data using the `criticality_score`
 GitHub project.
 
-Support future moves towards scaling and automating criticality score.
+For this milestone, the focus will be on reproducing the existing signals
+collected by the Python implementation, and adding support for dependent data
+sourced from [deps.dev](https://deps.dev).
+
+Additionally there will be a focus on supporting future moves towards scaling
+and automating criticality score.
 
 ### Non-goals
 
@@ -25,54 +29,48 @@ Critical projects are hosted on GitLab, Bitbucket, or even self-hosted. These
 should be supported, but given that over 90% of open source projects are
 hosted by GitHub it seems prudent to focus efforts there first.
 
-## Background
+**De-dupe mirrors from origin source repositories.**
 
-TBC
+Mirrors are frequently used to provide broader access to a project. Usually
+when a self-hosted project uses a public service, such as GitHub, to host a
+mirror of the project.
 
-## Design Overview
+This milestone will not attempt to detect and canonicalize mirrors.
 
-### Project Definition
-
-A _project_ is defined as only having a _single repository_, and a _single
-issue tracker_. A project may provide multiple _packages_.
-
-There are some "umbrella projects" (e.g. Kubernetes) that have multiple
-repositories associated with them, or may use a centralized issue tracker. An
-alternative approach would be to treat a project separately to the one or
-more repositories that belong to it.
-
-However this approach has the following drawbacks:
-
-* Makes it hard to distinguish between organizations and umbrella projects
-* Raises the possibility that a part of the umbrella project that is critical
-  to OSS is missed.
-* Complicates the calculation required to aggregate signals and generate a
-  criticality score.
+## Background
 
-So instead we define a project as a single repository. This provides a clear
-"primary key" we can use for collecting signals.
+The OpenSSF has a
+[Working Group (WG) focused on Securing Critical Projects](https://github.com/ossf/wg-securing-critical-projects).
+A key part of this WG is focused on determining which Open Source projects are
+"critical". Critical Open Source projects are those which are broadly depended
+on by organizations, and present a security risk to those organizations, and
+their customers, if they are not supported.
 
-#### Forks and Mirrors
+This project is one of a small set of sources of data used to find these
+critical projects.
 
-Mirrors and forks are clones of another project's repository.
+The current Python implementation available in this repo has been stagnant for
+a while.
 
-A mirror is usually used to provide broader access to a repository, such as
-when a self-hosted project mirrors its repository on GitHub.
+It has some serious problems with how it enumerates projects on GitHub (see
+[#33](https://github.com/ossf/criticality_score/issues/33)), and lacks robust
+support for non-GitHub projects (see
+[#29](https://github.com/ossf/criticality_score/issues/29)).
 
-A fork has two primary uses:
+There are problems with the existing signals being collected (see
+[#55](https://github.com/ossf/criticality_score/issues/55),
+[#102](https://github.com/ossf/criticality_score/issues/102)) and interest in
+exploring other signals and approaches
+([#53](https://github.com/ossf/criticality_score/issues/53),
+[#102](https://github.com/ossf/criticality_score/issues/102) deps.dev,
+[#31](https://github.com/ossf/criticality_score/issues/31),
+[#82](https://github.com/ossf/criticality_score/issues/82), etc).
 
-* A contributor commits changes to a fork for preparing pull-requests to the
-  main repository.
-* A fork may become its own project when the original is unmaintained, or if
-  the forker decides to head in a different direction.
+Additionally, in [#102](https://github.com/ossf/criticality_score/issues/102) I propose an approach to improving the quality of the criticality score.
 
-This raises two considerations for generating criticality scores:
+## Design Overview
 
-* Project repositories may need to be de-duped to avoid treating the original
-  source and its mirrors as separate projects.
-* Forks merely used for committing changes for a pull-request should be ignored
-  to lower the work and potential noise (fortunately these should score low
-  enough to make it easy to ignore them)
+Please see the [glossary](../glossary.md) for the terms used in this project.
 
 ### Multi Stage
 
@@ -141,7 +139,6 @@ The current implementation of this algorithm has a difference between GitHub
 search of less than 0.05% for >=20 stars (GitHub search was checked ~12 hours
 after the algorithm finished) and took 4 hours with 1 worker and 1 token.
 
-
 ##### Rate Limits
 
 A pool of GitHub tokens will be supported for increased performance.
@@ -150,12 +147,10 @@ A single GitHub token has a limit of "5000" each hour, a single search page
 consumes "1", and returning the 1000 results from a search consumes "10". This
 allows 500 search queries per hour for a single token.
 
-
 ##### Output
 
 Output from enumeration will be a text file containing a list of GitHub urls.
 
-
 #### Static Project URL Lists
 
 Rather than repeatedly query project repositories for a list of projects, use
 pre-generated static lists of project repository URLs.
@@ -180,7 +175,6 @@ would need to be curated to ensure each repository is still available. Culling
 for "interesting" (e.g. more than 1 star) repositories may also be useful to
 limit the amount of work generating signals.
 
-
 #### Future Sources of Projects
 
 There are many other sources of projects for future milestones that can be
@@ -196,20 +190,11 @@ used. These are out-of-scope for Milestone 1, but worth listing.
   self-hosted repositories (e.g. cgit, gitea, etc)
 * JIRA, Bugzilla, etc support for issue tracking
 
-
 ### Raw Signal Collection
 
 This stage is when the list of projects are iterated over and for each project
 a set of raw signal data is output.
 
-For Milestone 1, the focus will be on reproducing the existing signals
-collected by the Python implementation, and adding support for dependent data
-sourced from [deps.dev](https://deps.dev).
-
-Additionally there will be a focus on making it straightforward to add new
-signal sources and signals.
-
-
 #### Input / Output
 
 Input:
diff --git a/docs/glossary.md b/docs/glossary.md
new file mode 100644
index 000000000..3532bc942
--- /dev/null
+++ b/docs/glossary.md
@@ -0,0 +1,88 @@
+# Glossary
+
+This document defines the meaning of various terms used by this project.
+This is to ensure they are clearly understood.
+
+Please keep the document sorted alphabetically.
+
+## Terms
+
+### Fork
+
+A _fork_, like a mirror, is a clone or copy of another project's source code or
+repository.
+
+A fork has two primary uses:
+
+* A contributor committing changes to a fork for preparing pull-requests to the
+  main repository.
+* A fork may become its own project when the original is unmaintained, or if
+  the forker decides to head in a different direction.
+
+Forks merely used for committing changes for a pull-request are not interesting
+when calculating criticality scores.
+
+See also "Mirror".
+
+### Mirror
+
+A _mirror_, like a fork, is a clone or copy of another project's source code or
+repository.
+
+A mirror is usually used to provide broader access to a repository, such as
+when a self-hosted project mirrors its repository on GitHub.
+
+Mirrors may require de-duping to avoid treating the original repository and
+its mirrors as separate projects.
+
+See also "Fork".
+
+### Project
+
+A _project_ is defined as only having a _single repository_, and a _single
+issue tracker_. A project may provide multiple _packages_.
+
+There are some "umbrella projects" (e.g. Kubernetes) that have multiple
+repositories associated with them, or may use a centralized issue tracker. An
+alternative approach would be to treat a project independently of the one or
+more repositories that belong to it.
+
+However this approach has the following drawbacks:
+
+* Makes it hard to distinguish between organizations and umbrella projects
+* Raises the possibility that a part of the umbrella project that is critical
+  to OSS is missed.
+* Complicates the calculation required to aggregate signals and generate a
+  criticality score.
+
+So instead we define a project as a single repository. This provides a clear
+"primary key" we can use for collecting signals.
+
+### Repository
+
+A _repository_ refers to the system used to store and manage access to a
+project's source code. Usually a version control system (e.g. git or mercurial)
+is used to track and manage changes to the source code.
+
+A _repository_ can be the canonical source of a project's code, or it could
+also be a _fork_ or a _mirror_.
+
+A _repository_ is usually owned by an individual or an organization, although
+the specifics of how this behaves in practice depend on the repository's host.
+
+### Repository Host
+
+A _repository host_ is the service hosting a _repository_. It may be a service
+such as GitHub, GitLab or Bitbucket. It may also be "self-hosted", where the
+infrastructure for hosting a repository is managed by the maintainers of a
+project.
+
+Self-hosted repositories often deploy an open-source application to provide
+access, such as GitLab, cGit, or Gitea.
+
+### Umbrella Project
+
+An _umbrella project_ is a group of related projects that are maintained by a
+larger community surrounding the project.
+
+See also "project".
\ No newline at end of file

From 1c7c5e23df9516107f23c2ec64da824834b8b11a Mon Sep 17 00:00:00 2001
From: Caleb Brown
Date: Mon, 2 May 2022 11:15:09 +1000
Subject: [PATCH 3/4] Tweak docs goals and design approach to make the intent clearer.

---
 docs/design/milestone_1.md | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/docs/design/milestone_1.md b/docs/design/milestone_1.md
index 04865ca10..69964ed2b 100644
--- a/docs/design/milestone_1.md
+++ b/docs/design/milestone_1.md
@@ -6,16 +6,16 @@
 
 ## Goal
 
-Anyone can reliably generate raw signal data using the `criticality_score`
-GitHub project.
+Anyone can reliably generate signal data using the `criticality_score` GitHub
+project.
+
+Additionally there will be a focus on supporting future moves towards scaling
+and automating criticality score.
 
 For this milestone, the focus will be on reproducing the existing signals
 collected by the Python implementation, and adding support for dependent data
 sourced from [deps.dev](https://deps.dev).
 
-Additionally there will be a focus on supporting future moves towards scaling
-and automating criticality score.
-
 ### Non-goals
 
 **Improve how the score is calculated.**
@@ -70,6 +70,15 @@ Additionally, in [#102](https://github.com/ossf/criticality_score/issues/102) I
 
 ## Design Overview
 
+This milestone is a fundamental rearchitecturing of the project to meet the
+
+
+this is rearchitecturing the project to enable more reliability and extensibility
+focusing on:
+reliably enumerating GitHub projects
+Reliably generating existing signals. Adding dependent information.
+Making the current criticality rankings list generated more frequently. (And more easily)
+
 Please see the [glossary](../glossary.md) for the terms used in this project.
 
 ### Multi Stage

From 3deb32389d507c88c1dc7588cdb41302e8f4946f Mon Sep 17 00:00:00 2001
From: Caleb Brown
Date: Mon, 2 May 2022 11:27:47 +1000
Subject: [PATCH 4/4] More doc wordsmithing.

---
 docs/design/milestone_1.md | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/docs/design/milestone_1.md b/docs/design/milestone_1.md
index 69964ed2b..84efb83d5 100644
--- a/docs/design/milestone_1.md
+++ b/docs/design/milestone_1.md
@@ -6,15 +6,16 @@
 
 ## Goal
 
-Anyone can reliably generate signal data using the `criticality_score` GitHub
-project.
+Anyone can reliably generate the existing set of signal data using the
+`criticality_score` GitHub project, and calculate the scores using the
+existing algorithm.
 
 Additionally there will be a focus on supporting future moves towards scaling
 and automating criticality score.
 
-For this milestone, the focus will be on reproducing the existing signals
-collected by the Python implementation, and adding support for dependent data
-sourced from [deps.dev](https://deps.dev).
+For this milestone, collecting dependent signal data sourced from
+[deps.dev](https://deps.dev) will also be added to improve the overall
+quality of the score produced.
### Non-goals @@ -71,13 +72,13 @@ Additionally, in [#102](https://github.com/ossf/criticality_score/issues/102) I ## Design Overview This milestone is a fundamental rearchitecturing of the project to meet the +goals of higher reliability, extensibility and ease of use. +The design focuses on: -this is rearchitecturing the project to enable more reliability and extensibility -focusing on: -reliably enumerating GitHub projects -Reliably generating existing signals. Adding dependent information. -Making the current criticality rankings list generated more frequently. (And more easily) +- reliable GitHub project enumeration. +- reliable signal collection, with better dependent data. +- being able to update the criticality scores and rankings more frequently. Please see the [glossary](../glossary.md) for a terms used in this project.