Commit

docs: 📝 move modular design and requirements into design landing page (#979)

## Description

I'm going through the design docs again as I extend on to the resources
section. If we go with seedcase-project/design#168 (which I think we
should), this simplifies the design description to be about core and
what we want with it.

<!-- Select quick/in-depth as necessary -->
This PR needs an in-depth review.

---------

Co-authored-by: martonvago <[email protected]>
Co-authored-by: Signe Kirk Brødbæk <[email protected]>
3 people authored Jan 22, 2025
1 parent d3d8722 commit f3ef3ce
Showing 3 changed files with 87 additions and 123 deletions.
61 changes: 0 additions & 61 deletions docs/design/architecture/modular-design.qmd

This file was deleted.

45 changes: 0 additions & 45 deletions docs/design/architecture/requirements.qmd

This file was deleted.

104 changes: 87 additions & 17 deletions docs/design/index.qmd
@@ -2,23 +2,10 @@
title: "Design"
---

The core aim of Sprout is to **take data and metadata and convert them
into a standardized and organized storage structure** that follows best
practices for data engineering, particularly with a focus on research
contexts. Specifically, Sprout aims to:

1. Take generated data from various source locations (such as clinics
or laboratories), which may be distributed geographically or
organizationally, and store it in a standardized and efficient
format.
2. Ensure that metadata is included for the data and organized in a
standardized format, explicitly and programmatically linking the
metadata to the data.

The purpose of these documents is to describe the design of
Sprout in enough detail to help us develop it in a way that is
sustainable (i.e., maintainable over the long term) and that ensures we
as a team have a shared understanding of what Sprout is and is not.
The purpose of these documents is to describe the design of Sprout in
enough detail to help us develop it in a way that is sustainable (i.e.,
maintainable over the long term) and that ensures we as a team have a
shared understanding of what Sprout is and is not.

Sprout's design builds off of our overall Seedcase [design principles
and patterns](https://design.seedcase-project.org/), in that any design
@@ -35,3 +22,86 @@ There are two main sections of this design documentation:
more detailed description of the software interface for the
architecture. The detail is at the level of exact Python functions
and CLI commands, as well as their arguments, inputs, and outputs.

## Purpose

The overall aim of Sprout is to **take data and metadata and convert
them into a standardized and organized storage structure** that follows
best practices for data engineering, particularly with a focus on
research contexts. Specifically, Sprout aims to:

1. Take generated data from various source locations (such as clinics
or laboratories), which may be distributed geographically or
organizationally, and store it in a standardized and efficient
format.
2. Ensure that metadata is included for the data and organized in a
standardized format, explicitly and programmatically linking the
metadata to the data.

In line with our modular design pattern of building software that has a
focused, narrow scope and that can be easily extended or customized,
*Sprout core* (this package) has an additional aim to:

1. Build a flexible, extensible, and fine-grained set of
functionalities that implement Sprout's overall design, enabling
users to configure and customize Sprout to their own needs.
Specifically, to build functionality that supports us in building
more opinionated interfaces and extensions of Sprout.

## Requirements

Overall, Sprout must:

- Run on Windows, macOS, and Linux: Our potential users work on any of
these systems (including servers), so we need to ensure
compatibility across the most commonly used operating systems.
- Integrate GDPR, privacy, and security compliance: Many of our users
work with health or personally sensitive data.
- Run remotely on servers and locally on computers: The location where
data are stored should be flexible based on the needs and
restrictions of the user.
- Be able to handle a variety of data file sizes: While the size of
research data usually does not compare to that found in industry, it can
still become large enough that it requires special care and
handling.
- Store data in a format that is open source, integrates with many
tools, and is storage efficient: Sprout is first and foremost a data
engineering tool for research data storage and distribution (or at
least, easier sharing).
- Store, organize, and manage multiple distinct data sources per user
or group of users (for example in a server setting): Researchers
rarely collect and work on one data source at any given time. So,
Sprout must be able to handle multiple distinct data sources.
- Upload and update data: Data can be added to Sprout in batches or on
a more frequent basis. We anticipate that batch uploads will be the
most common.
- Store, organize, and manage metadata connected to the data: Metadata
are vital to understanding the data and their context, without which
data can be nearly useless. Sprout needs to make managing and
organizing metadata fundamental to its functionality.
- Track changes to the data in a changelog and versioning system: Data
are not static and can change over time. Sprout must track these
changes and provide a way to show, track, and manage versions of the
data. This is also necessary for legal compliance, auditing, and
record-keeping.
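
As a rough illustration of the changelog requirement, the sketch below
records a new version entry only when a data file's contents have
changed. The function names, the JSON changelog layout, and the use of a
SHA-256 checksum are assumptions made for this example; they are not
Sprout's actual interface or a decided design.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_change(data_path: Path, changelog_path: Path, note: str) -> bool:
    """Append a changelog entry if the data differ from the last entry.

    Hypothetical helper for illustration only. Returns True when a new
    entry was written and False when the data were unchanged.
    """
    entries = []
    if changelog_path.exists():
        entries = json.loads(changelog_path.read_text(encoding="utf-8"))

    checksum = file_checksum(data_path)
    if entries and entries[-1]["checksum"] == checksum:
        # Same contents as the latest recorded version: nothing to log.
        return False

    entries.append(
        {
            "version": len(entries) + 1,
            "checksum": checksum,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
            "note": note,
        }
    )
    changelog_path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return True
```

Both paths are passed in explicitly, so the same function would work for
a local folder or a server location.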

In general, Sprout will not:

- Run any analytic computing or data science work: While some data
processing and analysis will occur, it will be limited to running
checks on the quality of the data and metadata.

Aside from the overall requirements for Sprout, the requirements for
this *core* module of Sprout are:

- Individual pieces of functionality should be independent: To keep it
flexible and extensible, each functionality should (ideally) be able
to run on its own without needing to depend on other functionality.
- Assume as little as possible about the environment: For each
functionality, ideally assume as little about the current state of
the environment as possible. For instance, do not assume what the
exact path is on the filesystem or where in the filesystem the
software is running.
- Make as few assumptions and expectations as is reasonable: This
package should not be designed to be (too) opinionated about the
order in which steps are taken, what steps are taken, where they are
taken, and other specific details.
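
One way to read the core requirements above is sketched below: two small
functions that each run on their own and take every path as an explicit
argument instead of relying on the current working directory. The names
`write_properties` and `read_properties` and the JSON layout are
hypothetical and chosen only for this example; they are not taken from
Sprout's interface.

```python
import json
from pathlib import Path


def write_properties(properties: dict, path: Path) -> Path:
    """Write metadata to the given path and return that path.

    Hypothetical helper: the caller decides where the file lives, so
    nothing is assumed about the working directory.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(properties, indent=2), encoding="utf-8")
    return path


def read_properties(path: Path) -> dict:
    """Read metadata back from the given path.

    Independent of write_properties: it only needs a JSON file to exist.
    """
    return json.loads(path.read_text(encoding="utf-8"))
```

Because neither function calls the other or depends on a fixed folder
structure, a more opinionated interface could combine them in whatever
order and location it needs.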
