From 9cd65d47c9d770959eb7bc317f6b4f17dbd2a70f Mon Sep 17 00:00:00 2001
From: Caleb Brown <calebbrown@google.com>
Date: Fri, 29 Apr 2022 16:55:23 +1000
Subject: [PATCH 1/4] Add a milestone 1 doc.

---
 docs/design/milestone_1.md | 335 +++++++++++++++++++++++++++++++++++++
 1 file changed, 335 insertions(+)
 create mode 100644 docs/design/milestone_1.md

diff --git a/docs/design/milestone_1.md b/docs/design/milestone_1.md
new file mode 100644
index 000000000..076cc917e
--- /dev/null
+++ b/docs/design/milestone_1.md
@@ -0,0 +1,335 @@
+
+# Criticality Score Revamp: Milestone 1
+
+- Author: [calebbrown@google.com](mailto:calebbrown@google.com)
+- Updated: 2022-04-29
+
+
+## Goal
+
+Anyone can reliably generate raw signal data using the `criticality_score`
+GitHub project.
+
+Support future moves towards scaling and automating criticality score.
+
+### Non-goals
+
+**Improve how the score is calculated.**
+
+While improving the score is vital overall, calculating it depends on first
+having reliable signals to base the score on.
+
+**Cover source repositories hosted on non-GitHub hosts.**
+
+Critical projects are hosted on GitLab, Bitbucket, or even self-hosted. These
+should be supported, but given that over 90% of open source projects are
+hosted by GitHub it seems prudent to focus efforts there first.
+
+## Background
+
+TBC
+
+## Design Overview
+
+### Project Definition
+
+A _project_ is defined as only having a _single repository_, and a _single
+issue tracker_. A project may provide multiple _packages_.
+
+There are some "umbrella projects" (e.g. Kubernetes) that have multiple
+repositories associated with them, or may use a centralized issue tracker. An
+alternative approach would be to treat a project separately to the one or
+more repositories that belong to it.
+
+However this approach has the following drawbacks:
+
+* Makes it hard to distinguish between organizations and umbrella projects
+* Raises the possibility that a part of the umbrella project that is critical
+  to OSS is missed.
+* Complicates the calculation required to aggregate signals and generate a
+  criticality score.
+
+So instead we define a project as a single repository. This provides a clear
+"primary key" we can use for collecting signals.
+
+#### Forks and Mirrors
+
+Mirrors and forks are clones of another project's repository. 
+
+A mirror is usually used to provide broader access to a repository, such as
+when a self-hosted project mirrors its repository on GitHub.
+
+A fork has two primary uses:
+
+* A contributor commits changes to a fork for preparing pull-requests to the
+  main repository.
+* A fork may become its own project when the original is unmaintained, or if
+  the forker decides to head in a different direction.
+
+This raises two considerations for generating criticality scores:
+
+* Project repositories may need to be de-duped to avoid treating the original
+  source and its mirrors as separate projects.
+* Forks merely used for committing changes for a pull-request should be ignored
+  to lower the work and potential noise (fortunately these should score low
+  enough to make it easy to ignore them)
+
+### Multi Stage
+
+The design takes a multi stage approach to generating raw criticality signal
+data ready for ingestion into a BigQuery table.
+
+The stages are:
+
+* **Project enumeration** - produce a list of project repositories, focusing
+  initially on GitHub for Milestone 1.
+* **Raw signal collection** - iterate through the list of projects and query
+  various data sources for raw signals.
+* **BigQuery ingestion** - take the raw signals and import them into a BigQuery
+  table for querying and scoring.
+
+Some API efficiency is gained by collecting some raw signals during project
+enumeration. However, the ability to run stages separately and at different
+frequencies improves the overall reliability of the project, and allows for raw
+signal data to be refreshed more frequently.
+
+## Detailed Design
+
+### Project enumeration
+
+#### Direct GitHub Enumeration
+
+##### Challenges
+
+* GitHub has a lot of repos. Over 2.5M repos with 5 or more stars, and over
+  400k repos with 50 or more stars at the time of writing.
+* GitHub's API only allows you to iterate through 1000 results.
+* GitHub's API has limited methods of sorting and filtering.
+
+Given these limitations it is difficult to extract all the repositories over
+a certain number of stars, as the number of repositories with low stars exceeds
+the 1000 result limit of GitHub's API.
+
+The minimum star threshold that still returns fewer than 1000 results can be
+lowered by querying each creation date separately.
+
+With a sufficiently high minimum star threshold (e.g. 20), most creation dates
+will have fewer than 1000 results in total.
+
+##### Algorithm
+
+* Set `MIN_STARS` to a value chosen such that the number of repositories with
+  that number of stars is less than 1000 for any given creation date.
+* Set `STAR_OVERLAP`, `START_DATE` and `END_DATE`
+* For each `DATE` between `START_DATE` and `END_DATE`:
+    * Set `MAX_STARS` to infinity
+    * Search for repos with a creation date of `DATE` and stars between
+      `MAX_STARS` and `MIN_STARS` inclusive, ordered from highest stars to
+      lowest.
+    * While True:
+        * For each repository (GitHub limits this to 1000 results):
+            * If the repository has not been seen:
+                * Add it to the list of repositories
+        * If there were fewer than 1000 results:
+            * Break
+        * Set `MAX_STARS` to the number of stars the last repository
+          returned + `STAR_OVERLAP`
+        * If `MAX_STARS` is the same as the previous value
+            * Break
+
+The current implementation of this algorithm has a difference between GitHub
+search of less than 0.05% for >=20 stars (GitHub search was checked ~12 hours
+after the algorithm finished) and took 4 hours with 1 worker and 1 token.
+
+
+##### Rate Limits
+
+A pool of GitHub tokens will be supported for increased performance.
+
+A single GitHub token has a limit of "5000" each hour, a single search page
+consumes "1", and returning the 1000 results from a search consumes "10". This
+allows 500 search queries per hour for a single token.
+
+
+##### Output
+
+Output from enumeration will be a text file containing a list of GitHub urls.
+
+
+#### Static Project URL Lists
+
+Rather than repeatedly query project repositories for a list of projects, use
+pre-generated static lists of project repository URLs.
+
+Sources:
+
+* Prior invocations of the enumeration tool
+* Manually curated lists of URLs
+* [GHTorrent](https://ghtorrent.org/) data dumps
+
+##### GHTorrent
+
+GHTorrent monitors GitHub's public event feed and provides a fairly
+comprehensive source of projects.
+
+Data from GHTorrent needs to be extracted from the SQL dump and filtered to
+eliminate deleted repositories.
+
+The 2021-03-06 dump includes approx 190M repositories. This many repositories
+would need to be curated to ensure each repository is still available. Culling
+for "interesting" (e.g. more than 1 star) repositories may also be useful to
+limit the amount of work generating signals.
+
+
+#### Future Sources of Projects
+
+There are many other sources of projects for future milestones that can be
+used. These are out-of-scope for Milestone 1, but worth listing.
+
+* Other source repositories such as GitLab and Bitbucket.
+* [https://deps.dev/](https://deps.dev/) projects. This source captures many
+  projects that exist in package repositories and helps connect projects to
+  their packages and dependents.
+* GHTorrent or GH Archive - these can avoid the expense of querying GitHub's
+  API directly.
+* Google dorking - use Google's advanced search capabilities to find
+  self-hosted repositories (e.g. cgit, gitea, etc)
+* JIRA, Bugzilla, etc support for issue tracking
+
+
+### Raw Signal Collection
+
+This stage is when the list of projects are iterated over and for each project
+a set of raw signal data is output.
+
+For Milestone 1, the focus will be on reproducing the existing signals
+collected by the Python implementation, and adding support for dependent data
+sourced from [deps.dev](https://deps.dev).
+
+Additionally there will be a focus on making it straightforward to add new
+signal sources and signals.
+
+
+#### Input / Output
+
+Input:
+
+* One or more text files containing a list of project URLs, one URL per line
+
+Output:
+
+* Either JSON or CSV formatted records for each project in UTF-8, including
+  the project URL. The output will support direct loading into BigQuery.
+
+#### Signal Collectors
+
+Signal collection will be built around multiple signal _collectors_ that
+produce one or more _signals_ per repository.
+
+Signal collectors fall into one of three categories:
+
+* Source repository and hosting signal collectors (e.g. GitHub, Bitbucket,
+  cGit)
+* Issue tracking signal collectors (e.g. GitHub, Bugzilla, JIRA)
+* Additional signal collectors (e.g. deps.dev)
+
+Each repository can have only one set of signals from a source repository
+collector and one set of signals from an issue tracking signal collector, but
+can have signals from many additional collectors.
+
+#### Repository Object
+
+During the collection process a repository object will be created and passed to
+each collector.
+
+As each part of the collection process runs, data will be fetched for a
+repository. The repository object will serve as the interface for accessing
+repository-specific data, so that the data can be cached and the number of
+additional queries that need to be executed is limited.
+
+#### Collection Process
+
+The general process for collecting signals will do the following:
+
+* Initialize all the collectors
+* For each repository URL
+    * Gather basic data about the repository (e.g. stars, has it moved, URLs)
+        * It may have been removed, in which case the repository can be
+          skipped.
+        * It may not be "interesting" (e.g. too few stars) and should be
+          skipped.
+        * It may have already been processed and should be skipped.
+    * Determine the set of collectors that apply to the repository.
+    * For each collector:
+        * Start collecting the signals for the current repository
+    * Wait for all collectors to complete
+    * Write the signals to the output.
+
+#### Signal Fields
+
+##### Naming
+
+Signal fields will fall under the general naming pattern of
+`[collector].[name]`.
+
+Where `[collector]` and `[name]` are made up of one or more of the
+following:
+
+* Lowercase characters
+* Numbers
+* Underscores
+
+The following restrictions further apply to `[collector]` names:
+
+* Source repository signal collectors must use the `repo` collector name
+* Issue tracking signal collectors must use the `issues` collector name
+* Signals matching the original set in the Python implementation can also use
+  the `legacy` collector name
+* Additional collectors can use any other valid name.
+
+Finally, `[name]` names must include the unit value if it is not implied by
+the type, and any time constraints.
+
+* e.g. `last_update_days`
+* e.g. `comment_count_prev_year`
+
+##### Types
+
+For Milestone 1, all signal fields will be scalars. More complex data types are
+out of scope.
+
+Supported scalars can be:
+
+* Boolean
+* Int
+* Float
+* String
+* Date
+* DateTime
+
+All Dates and DateTimes must be in UTC.
+
+Strings will support Unicode.
+
+#### Batching (out of scope)
+
+More efficient usage of GitHub's APIs can be achieved by batching together
+related requests. Support for batching is considered out of scope for
+Milestone 1.
+
+### BigQuery Ingestion
+
+Ingestion into BigQuery will be done for Milestone 1 using the `bq` command
+line tool.
+
+### Language Choice
+
+The Scorecard project and Criticality Score share many of the same needs.
+
+Scorecards also interacts with the GitHub API, negotiates rate limiting and
+handles pools of GitHub tokens.
+
+Therefore it makes sense to move towards these projects sharing code.
+
+As Scorecards is a more mature project, this requires Criticality Score to be
+rewritten in Go.
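The enumeration steps listed under `##### Algorithm` in `docs/design/milestone_1.md` above can be sketched roughly as follows. This is only an illustration under stated assumptions, written in Python for brevity even though the doc proposes Go: `enumerate_repos`, its parameters, and the `search` callback are hypothetical names, and the callback stands in for a real GitHub search query (it is assumed to return results ordered by stars, highest first, capped at the result limit). The sketch also assumes the search is re-issued on every pass of the inner loop with the updated `MAX_STARS`.

```python
from datetime import date, timedelta
from typing import Callable, List, Optional, Tuple

# GitHub limits a single search to 1000 results (per the design doc).
PAGE_LIMIT = 1000

# Hypothetical callback: (creation date, min stars, max stars or None)
# -> list of (repo URL, stars), ordered from highest stars to lowest.
SearchFn = Callable[[date, int, Optional[int]], List[Tuple[str, int]]]


def enumerate_repos(
    search: SearchFn,
    start: date,
    end: date,
    min_stars: int,
    star_overlap: int,
    limit: int = PAGE_LIMIT,
) -> List[str]:
    """Return repo URLs created between start and end with >= min_stars."""
    seen = set()
    repos: List[str] = []
    day = start
    while day <= end:
        max_stars: Optional[int] = None  # None means "no upper bound"
        while True:
            results = search(day, min_stars, max_stars)
            for url, _stars in results:
                if url not in seen:
                    seen.add(url)
                    repos.append(url)
            if len(results) < limit:
                break  # this creation date is exhausted
            # Lower the ceiling to just above the last repository returned.
            next_max = results[-1][1] + star_overlap
            if next_max == max_stars:
                break  # no progress is possible; avoid looping forever
            max_stars = next_max
        day += timedelta(days=1)
    return repos
```

Passing the search function in as a parameter keeps the loop independent of any particular API client, so it can be exercised against an in-memory list before being pointed at GitHub.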

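The `[collector].[name]` naming pattern from the `##### Naming` section of `docs/design/milestone_1.md` could be checked with something like the sketch below. It is an assumption-laden illustration, not part of the proposed implementation: `parse_signal_field` and `FIELD_PATTERN` are hypothetical names, and the sketch only enforces the character rules (lowercase characters, numbers, underscores), not the restrictions on which collector may use which name.

```python
import re
from typing import Tuple

# Each part is one or more lowercase letters, digits, or underscores.
# Hypothetical pattern derived from the naming rules in the design doc.
FIELD_PATTERN = re.compile(r"([a-z0-9_]+)\.([a-z0-9_]+)")


def parse_signal_field(field: str) -> Tuple[str, str]:
    """Split a signal field into (collector, name), or raise ValueError."""
    match = FIELD_PATTERN.fullmatch(field)
    if match is None:
        raise ValueError(f"invalid signal field name: {field!r}")
    return match.group(1), match.group(2)
```

For example, `parse_signal_field("repo.last_update_days")` returns `("repo", "last_update_days")`, while `"Repo.Stars"` or `"repo.star-count"` raise `ValueError`.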
From 17cd55c16039a0a2bfdf8b4e629dd7b85232933a Mon Sep 17 00:00:00 2001
From: Caleb Brown <calebbrown@google.com>
Date: Mon, 2 May 2022 11:08:10 +1000
Subject: [PATCH 2/4] Add a glossary and improve the milestone 1 doc.

---
 docs/design/milestone_1.md | 89 ++++++++++++++++----------------------
 docs/glossary.md           | 88 +++++++++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+), 52 deletions(-)
 create mode 100644 docs/glossary.md

diff --git a/docs/design/milestone_1.md b/docs/design/milestone_1.md
index 076cc917e..04865ca10 100644
--- a/docs/design/milestone_1.md
+++ b/docs/design/milestone_1.md
@@ -4,13 +4,17 @@
 - Author: [calebbrown@google.com](mailto:calebbrown@google.com)
 - Updated: 2022-04-29
 
-
 ## Goal
 
 Anyone can reliably generate raw signal data using the `criticality_score`
 GitHub project.
 
-Support future moves towards scaling and automating criticality score.
+For this milestone, the focus will be on reproducing the existing signals
+collected by the Python implementation, and adding support for dependent data
+sourced from [deps.dev](https://deps.dev).
+
+Additionally there will be a focus on supporting future moves towards scaling
+and automating criticality score.
 
 ### Non-goals
 
@@ -25,54 +29,48 @@ Critical projects are hosted on GitLab, Bitbucket, or even self-hosted. These
 should be supported, but given that over 90% of open source projects are
 hosted by GitHub it seems prudent to focus efforts there first.
 
-## Background
+**De-dupe mirrors from origin source repositories.**
 
-TBC
+Mirrors are frequently used to provide broader access to a project, usually
+when a self-hosted project uses a public service, such as GitHub, to host a
+mirror of the project.
 
-## Design Overview
+This milestone will not attempt to detect and canonicalize mirrors.
 
-### Project Definition
-
-A _project_ is defined as only having a _single repository_, and a _single
-issue tracker_. A project may provide multiple _packages_.
-
-There are some "umbrella projects" (e.g. Kubernetes) that have multiple
-repositories associated with them, or may use a centralized issue tracker. An
-alternative approach would be to treat a project separately to the one or
-more repositories that belong to it.
-
-However this approach has the following drawbacks:
-
-* Makes it hard to distinguish between organizations and umbrella projects
-* Raises the possibility that a part of the umbrella project that is critical
-  to OSS is missed.
-* Complicates the calculation required to aggregate signals and generate a
-  criticality score.
+## Background
 
-So instead we define a project as a single repository. This provides a clear
-"primary key" we can use for collecting signals.
+The OpenSSF has a
+[Working Group (WG) focused on Securing Critical Projects](https://github.com/ossf/wg-securing-critical-projects).
+A key part of this WG is focused on determining which Open Source projects are
+"critical". Critical Open Source projects are those which are broadly depended
+on by organizations, and present a security risk to those organizations, and
+their customers, if they are not supported.
 
-#### Forks and Mirrors
+This project is one of a small set of sources of data used to find these
+critical projects.
 
-Mirrors and forks are clones of another project's repository. 
+The current Python implementation available in this repo has been stagnant for
+a while.
 
-A mirror is usually used to provide broader access to a repository, such as
-when a self-hosted project mirrors its repository on GitHub.
+It has some serious problems with how it enumerates projects on GitHub (see
+[#33](https://github.com/ossf/criticality_score/issues/33)), and lacks robust
+support for non-GitHub projects (see
+[#29](https://github.com/ossf/criticality_score/issues/29)).
 
-A fork has two primary uses:
+There are problems with the existing signals being collected (see
+[#55](https://github.com/ossf/criticality_score/issues/55),
+[#102](https://github.com/ossf/criticality_score/issues/102)) and interest in
+exploring other signals and approaches
+([#53](https://github.com/ossf/criticality_score/issues/53),
+[#102](https://github.com/ossf/criticality_score/issues/102) deps.dev,
+[#31](https://github.com/ossf/criticality_score/issues/31),
+[#82](https://github.com/ossf/criticality_score/issues/82), etc).
 
-* A contributor commits changes to a fork for preparing pull-requests to the
-  main repository.
-* A fork may become its own project when the original is unmaintained, or if
-  the forker decides to head in a different direction.
+Additionally, in [#102](https://github.com/ossf/criticality_score/issues/102) I propose an approach to improving the quality of the criticality score.
 
-This raises two considerations for generating criticality scores:
+## Design Overview
 
-* Project repositories may need to be de-duped to avoid treating the original
-  source and its mirrors as separate projects.
-* Forks merely used for committing changes for a pull-request should be ignored
-  to lower the work and potential noise (fortunately these should score low
-  enough to make it easy to ignore them)
+Please see the [glossary](../glossary.md) for a terms used in this project.
 
 ### Multi Stage
 
@@ -141,7 +139,6 @@ The current implementation of this algorithm has a difference between GitHub
 search of less than 0.05% for >=20 stars (GitHub search was checked ~12 hours
 after the algorithm finished) and took 4 hours with 1 worker and 1 token.
 
-
 ##### Rate Limits
 
 A pool of GitHub tokens will be supported for increased performance.
@@ -150,12 +147,10 @@ A single GitHub token has a limit of "5000" each hour, a single search page
 consumes "1", and returning the 1000 results from a search consumes "10". This
 allows 500 search queries per hour for a single token.
 
-
 ##### Output
 
 Output from enumeration will be a text file containing a list of GitHub urls.
 
-
 #### Static Project URL Lists
 
 Rather than repeatedly query project repositories for a list of projects, use
@@ -180,7 +175,6 @@ would need to be curated to ensure each repository is still available. Culling
 for "interesting" (e.g. more than 1 star) repositories may also be useful to
 limit the amount of work generating signals.
 
-
 #### Future Sources of Projects
 
 There are many other sources of projects for future milestones that can be
@@ -196,20 +190,11 @@ used. These are out-of-scope for Milestone 1, but worth listing.
   self-hosted repositories (e.g. cgit, gitea, etc)
 * JIRA, Bugzilla, etc support for issue tracking
 
-
 ### Raw Signal Collection
 
 This stage is when the list of projects are iterated over and for each project
 a set of raw signal data is output.
 
-For Milestone 1, the focus will be on reproducing the existing signals
-collected by the Python implementation, and adding support for dependent data
-sourced from [deps.dev](https://deps.dev).
-
-Additionally there will be a focus on making it straightforward to add new
-signal sources and signals.
-
-
 #### Input / Output
 
 Input:
diff --git a/docs/glossary.md b/docs/glossary.md
new file mode 100644
index 000000000..3532bc942
--- /dev/null
+++ b/docs/glossary.md
@@ -0,0 +1,88 @@
+# Glossary
+
+This document defines the meaning of various terms used by this project.
+This is to ensure they are clearly understood.
+
+Please keep the document sorted alphabetically.
+
+## Terms
+
+### Fork
+
+A _fork_, like a mirror, is a clone or copy of another project's source code or
+repository.
+
+A fork has two primary uses:
+
+* A contributor commits changes to a fork to prepare pull-requests for the
+  main repository.
+* A fork may become its own project when the original is unmaintained, or if
+  the forker decides to head in a different direction.
+
+Forks merely used for committing changes for a pull-request are not interesting
+when calculating criticality scores.
+
+See also "Mirror".
+
+### Mirror
+
+A _mirror_, like a fork, is a clone or copy of another project's source code or
+repository.
+
+A mirror is usually used to provide broader access to a repository, such as
+when a self-hosted project mirrors its repository on GitHub.
+
+Mirrors may require de-duping to avoid treating the original repository and
+its mirrors as separate projects.
+
+See also "Fork".
+
+### Project
+
+A _project_ is defined as only having a _single repository_, and a _single
+issue tracker_. A project may provide multiple _packages_.
+
+There are some "umbrella projects" (e.g. Kubernetes) that have multiple
+repositories associated with them, or may use a centralized issue tracker. An
+alternative approach would be to treat a project independently of the one or
+more repositories that belong to it.
+
+However this approach has the following drawbacks:
+
+* Makes it hard to distinguish between organizations and umbrella projects
+* Raises the possibility that a part of the umbrella project that is critical
+  to OSS is missed.
+* Complicates the calculation required to aggregate signals and generate a
+  criticality score.
+
+So instead we define a project as a single repository. This provides a clear
+"primary key" we can use for collecting signals.
+
+### Repository
+
+A _repository_ refers to the system used to store and manage access to a
+project's source code. Usually a version control system (e.g. git or mercurial)
+is used to track and manage changes to the source code.
+
+A _repository_ can be the canonical source of a project's code, or it could
+also be a _fork_ or a _mirror_.
+
+A _repository_ is usually owned by an individual or an organization, although
+the specifics of how this works in practice depend on the repository's host.
+
+### Repository Host
+
+A _repository host_ is the service hosting a _repository_. It may be a service
+such as GitHub, GitLab or Bitbucket. It may also be "self-hosted", where the
+infrastructure for hosting a repository is managed by the maintainers of a
+project.
+
+Self-hosted repositories often deploy an open-source application to provide
+access, such as GitLab, cGit, or Gitea.
+
+### Umbrella Project
+
+An _umbrella project_ is a group of related projects that are maintained by a
+larger community surrounding the project.
+
+See also "project".
\ No newline at end of file

From 1c7c5e23df9516107f23c2ec64da824834b8b11a Mon Sep 17 00:00:00 2001
From: Caleb Brown <calebbrown@google.com>
Date: Mon, 2 May 2022 11:15:09 +1000
Subject: [PATCH 3/4] Tweak docs goals and design approach to make the intent
 clearer.

---
 docs/design/milestone_1.md | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/docs/design/milestone_1.md b/docs/design/milestone_1.md
index 04865ca10..69964ed2b 100644
--- a/docs/design/milestone_1.md
+++ b/docs/design/milestone_1.md
@@ -6,16 +6,16 @@
 
 ## Goal
 
-Anyone can reliably generate raw signal data using the `criticality_score`
-GitHub project.
+Anyone can reliably generate signal data using the `criticality_score` GitHub
+project.
+
+Additionally there will be a focus on supporting future moves towards scaling
+and automating criticality score.
 
 For this milestone, the focus will be on reproducing the existing signals
 collected by the Python implementation, and adding support for dependent data
 sourced from [deps.dev](https://deps.dev).
 
-Additionally there will be a focus on supporting future moves towards scaling
-and automating criticality score.
-
 ### Non-goals
 
 **Improve how the score is calculated.**
@@ -70,6 +70,15 @@ Additionally, in [#102](https://github.com/ossf/criticality_score/issues/102) I
 
 ## Design Overview
 
+This milestone is a fundamental rearchitecturing of the project to meet the
+
+
+this is rearchitecturing the project to enable more reliability and extensibility
+focusing on:
+reliably enumerating GitHub projects
+Reliably generating existing signals. Adding dependent information.
+Making the current criticality rankings list generated more frequently. (And more easily)
+
 Please see the [glossary](../glossary.md) for a terms used in this project.
 
 ### Multi Stage

From 3deb32389d507c88c1dc7588cdb41302e8f4946f Mon Sep 17 00:00:00 2001
From: Caleb Brown <calebbrown@google.com>
Date: Mon, 2 May 2022 11:27:47 +1000
Subject: [PATCH 4/4] More doc wordsmithing.

---
 docs/design/milestone_1.md | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/docs/design/milestone_1.md b/docs/design/milestone_1.md
index 69964ed2b..84efb83d5 100644
--- a/docs/design/milestone_1.md
+++ b/docs/design/milestone_1.md
@@ -6,15 +6,16 @@
 
 ## Goal
 
-Anyone can reliably generate signal data using the `criticality_score` GitHub
-project.
+Anyone can reliably generate the existing set of signal data using the
+`criticality_score` GitHub project, and calculate the scores using the
+existing algorithm.
 
 Additionally there will be a focus on supporting future moves towards scaling
 and automating criticality score.
 
-For this milestone, the focus will be on reproducing the existing signals
-collected by the Python implementation, and adding support for dependent data
-sourced from [deps.dev](https://deps.dev).
+For this milestone, collecting dependent signal data sourced from
+[deps.dev](https://deps.dev) will also be added to improve the overall
+quality of the score produced.
 
 ### Non-goals
 
@@ -71,13 +72,13 @@ Additionally, in [#102](https://github.com/ossf/criticality_score/issues/102) I
 ## Design Overview
 
 This milestone is a fundamental rearchitecturing of the project to meet the
+goals of higher reliability, extensibility and ease of use.
 
+The design focuses on:
 
-this is rearchitecturing the project to enable more reliability and extensibility
-focusing on:
-reliably enumerating GitHub projects
-Reliably generating existing signals. Adding dependent information.
-Making the current criticality rankings list generated more frequently. (And more easily)
+- reliable GitHub project enumeration.
+- reliable signal collection, with better dependent data.
+- being able to update the criticality scores and rankings more frequently.
 
 Please see the [glossary](../glossary.md) for a terms used in this project.