From f3ef3ce4c3a539382591120ba88c25d4da9193fb Mon Sep 17 00:00:00 2001
From: "Luke W. Johnston"
Date: Wed, 22 Jan 2025 18:48:12 +0100
Subject: [PATCH] docs: :memo: move modular design and requirements into design landing page (#979)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Description

I'm going through the design docs again as I extend on to the resources section. If we go with seedcase-project/design#168 (which I think we should), this simplifies the design description to be about core and what we want with it.

This PR needs an in-depth review.

---------

Co-authored-by: martonvago <57952344+martonvago@users.noreply.github.com>
Co-authored-by: Signe Kirk Brødbæk <40836345+signekb@users.noreply.github.com>
---
 docs/design/architecture/modular-design.qmd |  61 ------------
 docs/design/architecture/requirements.qmd   |  45 ---------
 docs/design/index.qmd                       | 104 ++++++++++++++++----
 3 files changed, 87 insertions(+), 123 deletions(-)
 delete mode 100644 docs/design/architecture/modular-design.qmd
 delete mode 100644 docs/design/architecture/requirements.qmd

diff --git a/docs/design/architecture/modular-design.qmd b/docs/design/architecture/modular-design.qmd
deleted file mode 100644
index 96cea4add..000000000
--- a/docs/design/architecture/modular-design.qmd
+++ /dev/null
@@ -1,61 +0,0 @@
----
-title: "Modular design"
----
-
-While we describe how we implement aspects of modular design across all
-Seedcase products in our [design
-documentation](https://design.seedcase-project.org), this document
-explains this split between Python code, CLI, web app, and web API in
-the context of Sprout.
-
-## Python modules
-
-In order to achieve our aims, the main functionality is designed and
-implemented as a Python package and can be used as a Python package.
-This code is stored in the Python module (folder location)
-`seedcase_sprout/core/` and is accessible via the Python code
-`import seedcase_sprout.core as sp` statement. Only external-facing functions
-would be exposed to the user and to other programs.
-
-The `core` functions and classes aim to have as few external assumptions
-and dependencies as possible. For instance, input and output don't
-depend on the state of any other function. This way, we can have a
-highly flexible, modular, and easily testable `core` that other
-developers (and potentially users) can easily configure and customize to
-their own needs. This modular and flexible "lower level" design allows
-us to create a variety of interfaces and extensions to Sprout.
-
-Any extensions would then only need to incorporate and depend on the
-`seedcase_sprout.core` package to get Sprout's functionality. Each
-extension would be its own module within Sprout, with modules named like
-`seedcase_sprout/extension_name/`, for instance, `seedcase_sprout/cli/`.
-That way, within the extension's module, the logic that we implement is
-only specific to creating the CLI and not to the actual functionality of
-Sprout.
-
-Other potential extensions would follow the same or similar pattern.
-This allows other developers to create their own extensions and
-interfaces to Sprout that suit their particular needs.
-
-Our [Guide](/docs/guide/index.qmd) has examples and tutorials on how
-each split is used.
-
-- A higher-level, opinionated Python abstraction of `core`, named
-  `lib` and stored in `seedcase_sprout/lib/`. This module abstracts
-  away and simplifies many of the steps and functions available in
-  `core`, with the consequence of being less flexible and less
-  customizable. This module is intended for users who want to use
-  Sprout without having to worry about the finer details of the exact
-  implementation. The internals of the `lib` functions would be strict
-  and opinionated steps for how we envision a data package to be
-  created, structured, and managed.
-
-- The CLI is placed in the module `seedcase_sprout/cli/`. Each
-  command imports the relevant functions from `seedcase_sprout.core`,
-  along with decorators (from other Python packages) to convert the
-  Python code into a CLI.
-
-- The web app is in `seedcase_sprout/app/`. Each page of the web app
-  imports the required functions from `seedcase_sprout.core`, with
-  functions from other Python packages that convert that code into a
-  web app.
diff --git a/docs/design/architecture/requirements.qmd b/docs/design/architecture/requirements.qmd
deleted file mode 100644
index b9f94d47c..000000000
--- a/docs/design/architecture/requirements.qmd
+++ /dev/null
@@ -1,45 +0,0 @@
----
-title: "Requirements"
----
-
-Sprout must:
-
-- Run on Windows, MacOS, and Linux (likely on servers): Our potential
-  users work on any of these systems, so we need to ensure compatibility across
-  most commonly used operating systems.
-- Integrate GDPR, privacy, and security compliance: Our target users
-  work with health data, so this is vital to consider
-- Run remotely on servers and locally on computers: The location where
-  data are stored should be flexible based on the needs and
-  restrictions of the user.
-- Be able to handle a variety of data file sizes: While the size of
-  research data does not compare to those found in industry, it can
-  still become large enough that it requires special care and
-  handling.
-- Store data in a format that is open source, integrates with many
-  tools, and is storage efficient: Sprout is first and foremost a data
-  engineering tool for research data storage and distribution (or at
-  least, easier sharing).
-- Store, organize, and manage multiple distinct data sources per user
-  or group of users (for example in a server setting): Researchers rarely
-  collect and work on one data source at any
-  given time. So, Sprout must be able to handle multiple distinct data
-  sources.
-- Upload and update data: Data can be added to Sprout could happen in
-  batches or on a more frequent basis. We anticipate that batch
-  uploads will be the most common.
-- Store, organize, and manage metadata connected to the data: Metadata
-  are vital to understanding the data and its context, without which
-  data can be near useless. Sprout needs to make managing and
-  organizing metadata fundamental to its functionality.
-- Track changes to the data in a changelog and versioning system: Data
-  are not static and can change over time. Sprout must
-  track these changes and provide a way to show, track, and manage
-  versions of the data. This is also necessary for legal compliance
-  for auditing and record-keeping.
-
- Sprout will not:
-
-- Run any analytic computing or data science work: While some
-  data processing and analysis will occur, it will be limited
-  to running checks on the quality of the data and metadata.
diff --git a/docs/design/index.qmd b/docs/design/index.qmd
index dc1c14265..7a5fabdfd 100644
--- a/docs/design/index.qmd
+++ b/docs/design/index.qmd
@@ -2,23 +2,10 @@
 title: "Design"
 ---
 
-The core aim of Sprout is to **take data and metadata and convert them
-into a standardized and organized storage structure** that follows best
-practices for data engineering, particularly with a focus on research
-contexts. Specifically, Sprout aims to:
-
-1. Take generated data from various source locations (such as clinics
-   or laboratories), which may be distributed geographically or
-   organizationally, and store it in a standardized and efficient
-   format.
-2. Ensure that metadata is included for the data and organized in a
-   standardized format, explicitly and programmatically linking the
-   metadata to the data.
-
-The purpose of these documents is to describe the design of
-Sprout in enough detail to help us develop it in a way that is
-sustainable (i.e., maintainable over the long term) and that ensures we
-as a team have a shared understanding of what Sprout is and is not.
+The purpose of these documents is to describe the design of Sprout in
+enough detail to help us develop it in a way that is sustainable (i.e.,
+maintainable over the long term) and that ensures we as a team have a
+shared understanding of what Sprout is and is not.
 
 Sprout's design builds off of our overall Seedcase [design principles
 and patterns](https://design.seedcase-project.org/), in that any design
@@ -35,3 +22,86 @@ There are two main sections of this design documentation:
   more detailed description of the software interface for the
   architecture. The detail is at the level of exact Python functions
   and CLI commands, as well as their arguments, inputs, and outputs.
+
+## Purpose
+
+The overall aim of Sprout is to **take data and metadata and convert
+them into a standardized and organized storage structure** that follows
+best practices for data engineering, particularly with a focus on
+research contexts. Specifically, Sprout aims to:
+
+1. Take generated data from various source locations (such as clinics
+   or laboratories), which may be distributed geographically or
+   organizationally, and store it in a standardized and efficient
+   format.
+2. Ensure that metadata is included for the data and organized in a
+   standardized format, explicitly and programmatically linking the
+   metadata to the data.
+
+Aligning with our modular design pattern to build software that has a
+focused and narrow scope as well as that can be easily extended or
+customized, *Sprout core* (this package) has an additional aim to:
+
+1. Build a flexible, extensible, and fine-grained set of
+   functionalities that implement Sprout's overall design, enabling
+   users to configure and customize Sprout to their own needs.
+   Specifically, to build functionality that supports us in building
+   more opinionated interfaces and extensions of Sprout.
+
+## Requirements
+
+Overall, Sprout must:
+
+- Run on Windows, MacOS, and Linux: Our potential users work on any of
+  these systems (including servers), so we need to ensure
+  compatibility across the most commonly used operating systems.
+- Integrate GDPR, privacy, and security compliance: Many of our users
+  work with health or personally sensitive data.
+- Run remotely on servers and locally on computers: The location where
+  data are stored should be flexible based on the needs and
+  restrictions of the user.
+- Be able to handle a variety of data file sizes: While the size of
+  research data usually does not compare to that found in industry, it can
+  still become large enough that it requires special care and
+  handling.
+- Store data in a format that is open source, integrates with many
+  tools, and is storage efficient: Sprout is first and foremost a data
+  engineering tool for research data storage and distribution (or at
+  least, easier sharing).
+- Store, organize, and manage multiple distinct data sources per user
+  or group of users (for example in a server setting): Researchers
+  rarely collect and work on one data source at any given time. So,
+  Sprout must be able to handle multiple distinct data sources.
+- Upload and update data: Data can be added to Sprout in batches or on
+  a more frequent basis. We anticipate that batch uploads will be the
+  most common.
+- Store, organize, and manage metadata connected to the data: Metadata
+  are vital to understanding the data and its context, without which
+  data can be near useless. Sprout needs to make managing and
+  organizing metadata fundamental to its functionality.
+- Track changes to the data in a changelog and versioning system: Data
+  are not static and can change over time. Sprout must track these
+  changes and provide a way to show, track, and manage versions of the
+  data. This is also necessary for legal compliance, auditing and
+  record-keeping.
+
+In general, Sprout will not:
+
+- Run any analytic computing or data science work: While some data
+  processing and analysis will occur, it will be limited to running
+  checks on the quality of the data and metadata.
+
+Aside from the overall requirements for Sprout, the requirements for
+this *core* module of Sprout are:
+
+- Individual pieces of functionality should be independent: To keep it
+  flexible and extensible, each functionality should (ideally) be able
+  to run on its own without needing to depend on other functionality.
+- Assume as little as possible about the environment: For each
+  functionality, ideally assume as little about the current state of
+  the environment as possible. For instance, what the exact path is on
+  the filesystem or where in the filesystem the software is running.
+- Make as few assumptions and expectations as is reasonable: This
+  package should not be designed to be (too) opinionated about the
+  order steps are taken, what steps are taken, where they are taken,
+  and other specific details.
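
The three *core* requirements added in this patch (independent functionality, few assumptions about the environment, and little opinion about the order of steps) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `build_properties`, `write_properties`, and the `datapackage.json` file name are hypothetical and are not taken from Sprout's actual API.

```python
import json
import tempfile
from pathlib import Path


def build_properties(name: str, title: str) -> dict:
    """Hypothetical helper: return metadata as a plain dict.

    It reads no files and no global state, so it can run on its own.
    """
    return {"name": name, "title": title}


def write_properties(properties: dict, path: Path) -> Path:
    """Hypothetical helper: write properties as JSON to an explicit path.

    The caller passes the full path, so nothing here depends on the
    current working directory or on where the software is running.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(properties, indent=2), encoding="utf-8")
    return path


# The caller decides which steps to run and in what order.
properties = build_properties("example-package", "Example package")
with tempfile.TemporaryDirectory() as tmp:
    written = write_properties(properties, Path(tmp) / "pkg" / "datapackage.json")
    round_trip = json.loads(written.read_text(encoding="utf-8"))
```

Because each function takes everything it needs as arguments, a more opinionated interface (such as a CLI) could presumably wrap them without changing them.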