docs: 📝 move modular design and requirements into design landing page (…

…#979)

## Description

I'm going through the design docs again as I extend on to the resources section. If we go with seedcase-project/design#168 (which I think we should), this simplifies the design description to be about core and what we want with it.

This PR needs an in-depth review.

---------

Co-authored-by: martonvago <[email protected]>
Co-authored-by: Signe Kirk Brødbæk <[email protected]>
seedcase-project · Jan 22, 2025 · f3ef3ce
1 parent d3d8722
commit f3ef3ce
Showing 3 changed files with 87 additions and 123 deletions.
diff --git a/docs/design/architecture/modular-design.qmd b/docs/design/architecture/modular-design.qmd
diff --git a/docs/design/architecture/requirements.qmd b/docs/design/architecture/requirements.qmd
diff --git a/docs/design/index.qmd b/docs/design/index.qmd
@@ -2,23 +2,10 @@
 title: "Design"
 ---
 
-The core aim of Sprout is to **take data and metadata and convert them
-into a standardized and organized storage structure** that follows best
-practices for data engineering, particularly with a focus on research
-contexts. Specifically, Sprout aims to:
-
-1.  Take generated data from various source locations (such as clinics
-    or laboratories), which may be distributed geographically or
-    organizationally, and store it in a standardized and efficient
-    format.
-2.  Ensure that metadata is included for the data and organized in a
-    standardized format, explicitly and programmatically linking the
-    metadata to the data.
-
-The purpose of these documents is to describe the design of
-Sprout in enough detail to help us develop it in a way that is
-sustainable (i.e., maintainable over the long term) and that ensures we
-as a team have a shared understanding of what Sprout is and is not.
+The purpose of these documents is to describe the design of Sprout in
+enough detail to help us develop it in a way that is sustainable (i.e.,
+maintainable over the long term) and that ensures we as a team have a
+shared understanding of what Sprout is and is not.
 
 Sprout's design builds off of our overall Seedcase [design principles
 and patterns](https://design.seedcase-project.org/), in that any design
@@ -35,3 +22,86 @@ There are two main sections of this design documentation:
     more detailed description of the software interface for the
     architecture. The detail is at the level of exact Python functions
     and CLI commands, as well as their arguments, inputs, and outputs.
+
+## Purpose
+
+The overall aim of Sprout is to **take data and metadata and convert
+them into a standardized and organized storage structure** that follows
+best practices for data engineering, particularly with a focus on
+research contexts. Specifically, Sprout aims to:
+
+1.  Take generated data from various source locations (such as clinics
+    or laboratories), which may be distributed geographically or
+    organizationally, and store it in a standardized and efficient
+    format.
+2.  Ensure that metadata is included for the data and organized in a
+    standardized format, explicitly and programmatically linking the
+    metadata to the data.
+
+In line with our modular design pattern of building software that has a
+focused, narrow scope and that can be easily extended or customized,
+*Sprout core* (this package) has an additional aim to:
+
+1.  Build a flexible, extensible, and fine-grained set of
+    functionalities that implement Sprout's overall design, enabling
+    users to configure and customize Sprout to their own needs.
+    Specifically, this functionality should support us in building
+    more opinionated interfaces and extensions of Sprout.
+
+## Requirements
+
+Overall, Sprout must:
+
+-   Run on Windows, macOS, and Linux: Our potential users work on any of
+    these systems (including servers), so we need to ensure
+    compatibility across the most commonly used operating systems.
+-   Integrate GDPR, privacy, and security compliance: Many of our users
+    work with health or personally sensitive data.
+-   Run remotely on servers and locally on computers: The location where
+    data are stored should be flexible based on the needs and
+    restrictions of the user.
+-   Be able to handle a variety of data file sizes: While the size of
+    research data usually does not compare to that found in industry, it can
+    still become large enough that it requires special care and
+    handling.
+-   Store data in a format that is open source, integrates with many
+    tools, and is storage efficient: Sprout is first and foremost a data
+    engineering tool for research data storage and distribution (or at
+    least, easier sharing).
+-   Store, organize, and manage multiple distinct data sources per user
+    or group of users (for example in a server setting): Researchers
+    rarely collect and work on one data source at any given time. So,
+    Sprout must be able to handle multiple distinct data sources.
+-   Upload and update data: Data can be added to Sprout in batches or on
+    a more frequent basis. We anticipate that batch uploads will be the
+    most common.
+-   Store, organize, and manage metadata connected to the data: Metadata
+    are vital to understanding the data and its context, without which
+    data can be near useless. Sprout needs to make managing and
+    organizing metadata fundamental to its functionality.
+-   Track changes to the data in a changelog and versioning system: Data
+    are not static and can change over time. Sprout must track these
+    changes and provide a way to show, track, and manage versions of the
+    data. This is also necessary for legal compliance, auditing and
+    record-keeping.
+
+In general, Sprout will not:
+
+-   Run any analytic computing or data science work: While some data
+    processing and analysis will occur, it will be limited to running
+    checks on the quality of the data and metadata.
+
+Aside from the overall requirements for Sprout, the requirements for
+this *core* module of Sprout are:
+
+-   Individual pieces of functionality should be independent: To keep it
+    flexible and extensible, each functionality should (ideally) be able
+    to run on its own without needing to depend on other functionality.
+-   Assume as little as possible about the environment: For each
+    functionality, ideally assume as little about the current state of
+    the environment as possible. For instance, do not assume an exact
+    path on the filesystem or where in the filesystem the software runs.
+-   Make as few assumptions and expectations as is reasonable: This
+    package should not be designed to be (too) opinionated about the
+    order in which steps are taken, what steps are taken, where they
+    are taken, and other specific details.
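
The core-module requirements above (independent functionality, few assumptions about the environment) could look like the following in practice. This is a minimal Python sketch, not Sprout's actual API: `write_properties`, `read_properties`, and the JSON file layout are hypothetical names chosen for illustration. Each function receives its path explicitly instead of relying on the current working directory, and neither needs the other to have run in the same process.

```python
import json
from pathlib import Path


def write_properties(path: Path, properties: dict) -> Path:
    """Write metadata to an explicitly given path (hypothetical helper).

    The caller decides where the file lives; nothing is inferred from the
    current working directory or from earlier function calls.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(properties, indent=2), encoding="utf-8")
    return path


def read_properties(path: Path) -> dict:
    """Read metadata back from an explicitly given path (hypothetical helper).

    Works on any file with the expected content, regardless of which tool
    or process created it.
    """
    return json.loads(path.read_text(encoding="utf-8"))
```

Because the path is always an argument, the same functions can be called from a CLI, a web app, or a test that points them at a temporary directory.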