…prout into refactor/edit-package-properties
lwjohnst86 committed Jan 27, 2025
2 parents 82c753d + ffd83aa commit 54bea5f
Showing 24 changed files with 641 additions and 604 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,15 @@
## 0.13.0 (2025-01-24)

### Feat

- ✨ output everything from code cells in Reference docs (#997)

## 0.12.2 (2025-01-24)

### Refactor

- :recycle: `path_package_properties()` -> `path_properties()` (#996)

## 0.12.1 (2025-01-23)

### Refactor
5 changes: 2 additions & 3 deletions _quarto.yml
@@ -52,9 +52,8 @@ website:
- section: "Architecture"
href: docs/design/architecture/index.qmd
contents:
- docs/design/architecture/requirements.qmd
- docs/design/architecture/naming.qmd
- docs/design/architecture/modular-design.qmd
- docs/design/architecture/input-data.qmd
- section: "Interface"
href: docs/design/interface/index.qmd
contents:
@@ -94,9 +93,9 @@ quartodoc:
desc: "Functions to work with and manage data resources found within a data package."
package: "seedcase_sprout.core"
contents:
- write_resource_properties
- create_resource_properties
- create_resource_structure
- write_resource_properties

- subtitle: "Property dataclasses"
desc: "Dataclasses that support creating correct data package properties."
15 changes: 14 additions & 1 deletion _renderer.py
@@ -12,7 +12,20 @@


 class Renderer(MdRenderer):
-    style = "fix-output-of-returns-and-raises-without-names"
+    style = "seedcase"
+
+    @dispatch
+    def render_header(self, el: layout.Doc) -> str:
+        """Render the header of a docstring, including any anchors."""
+        _str_dispname = el.name
+
+        _anchor = f"{{ #{el.obj.path} }}"
+
+        # For lvl 1 headers, add a yml header with the ipynb-shell-interactivity setting
+        # to get all output from the cell
+        if self.crnt_header_level == 1:
+            return f"---\nipynb-shell-interactivity: all\ntitle: {_str_dispname}\n---"
+        return f"{'#' * self.crnt_header_level} {_str_dispname} {_anchor}"

     # returns ----
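
The header logic in this hunk can be tried outside quartodoc with a minimal sketch. The standalone function below is an assumption for illustration: it takes plain strings and an integer instead of a `layout.Doc` element and the renderer's `crnt_header_level`, and the object path used in the example calls is only illustrative.

```python
def render_header(name: str, path: str, level: int) -> str:
    """Mimic the header logic from the diff using plain values (hypothetical helper)."""
    # Quarto-style anchor, e.g. "{ #package.module.function }".
    anchor = f"{{ #{path} }}"
    if level == 1:
        # Level 1 headers become YAML front matter so that all output
        # from code cells is shown, rather than a Markdown heading.
        return f"---\nipynb-shell-interactivity: all\ntitle: {name}\n---"
    # Deeper levels stay ordinary Markdown headings with an anchor.
    return f"{'#' * level} {name} {anchor}"


# Illustrative path only; not taken from the rendered reference docs.
print(render_header("path_properties", "seedcase_sprout.core.path_properties", 1))
print(render_header("path_properties", "seedcase_sprout.core.path_properties", 2))
```

Note that the level 1 branch drops the anchor entirely, since the page title is carried by the front matter instead of a heading.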

96 changes: 96 additions & 0 deletions docs/design/architecture/input-data.qmd
@@ -0,0 +1,96 @@
---
title: "Input data"
---

This section describes the types of data we expect to be input into
Sprout. We design Sprout with these data types and formats in mind.

## Domain-specific types of data

Currently, we only have experience with health data, so we have a bias
towards that type of data.

### Health research

Health research data tends to consist of these types of data:

- **Clinical**: This data is typically collected during patient visits
to doctors. Depending on the country or administrative region, there
will likely already be well-established data processing and storage
pipelines in place.
- **Register**: This type of data is highly dependent on the country
or region. Generally, this data is collected for national or
regional administrative purposes, such as recording employment
status, income, address, medication purchases, and diagnoses. As
with routine clinical data, the pipelines for processing and
storing this data are usually extensive and well established.
- **Biological sample data**: This type of data is generated from
biological samples, like blood, saliva, semen, hair, or urine. Sample
analytic techniques often produce large volumes of data per person.
The data may be generated in large, established laboratories or in
smaller research groups, depending on what analytic technology is
used and how new it is. The structure and format of the generated
data also tend to be highly variable and depend heavily on the
technology used, sometimes requiring specialized software to process
and output.
- **Survey or questionnaire**: This type of data is often collected based
on a given study's aims and research questions. There are hundreds
of different questionnaires that can have highly specific purposes
and uses for their data. The volume and format of the collected data
also vary considerably between surveys.

## File and data formats

While we aim to handle a wide variety of data types, we will start with
the most common formats. We also require that the data format is open
rather than proprietary, since we cannot process data if we don't have
the software to read it.

The file formats we expect to work with are text (`.txt`) files, various
forms of comma-separated value (`.csv`) files, Excel (`.xls` or `.xlsx`)
files (technically closed source but practically easy to read), images,
audio, XML, JSON, and potentially some SQL databases.
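
As a rough illustration of the list above, the sketch below maps file extensions to the formats named in this section. The mapping and the `guess_format()` helper are hypothetical and not part of Sprout. Images, audio, and SQL databases are left out because this section does not name specific extensions for them.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping for illustration only; not part of Sprout's API.
EXPECTED_FORMATS = {
    ".txt": "text",
    ".csv": "comma-separated values",
    ".xls": "Excel",
    ".xlsx": "Excel",
    ".xml": "XML",
    ".json": "JSON",
}


def guess_format(path: str) -> Optional[str]:
    """Return the expected format label for a file, or None if unrecognised."""
    # Compare extensions case-insensitively so "DATA.CSV" matches ".csv".
    return EXPECTED_FORMATS.get(Path(path).suffix.lower())


print(guess_format("clinic/visits.CSV"))
print(guess_format("survey/answers.sav"))
```

Returning `None` for an unknown extension keeps the decision about unsupported files with the caller, instead of guessing at a format.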

## Flow or frequency of data collection

In research (and even in most industry settings), we rarely encounter
truly real-time data collection. Most data collection is done in
"batches", with data being collected at irregular and inconsistent
intervals and then stored to be processed later. This batch
collection can be broken down into two categories based on its
frequency:

- *Routine or continuous collection*, where data is collected on a
more regular interval and in smaller batches of "observational
units"[^1]. Ingestion or processing of this type of data may happen
on a more regular basis. Clinical data as well as survey or
questionnaire data will likely fall under this category, for example
data collected on a few patients seen at a clinic during one day.
- *Grouped collection*, where data is collected from many observational
units during a short period of time at very irregular intervals or
potentially only once. Data ingestion or processing occurs some time
after all the data has been collected. Biological sample data
would fall under this category, since laboratories usually run
several samples at once and input data after internal quality
control checks and machine-specific data processing. While
register-based and clinical data usually get collected
continuously, direct access to them is only given on a batch and
infrequent basis, so they may also fall under this category. Survey
data may also come in batches, depending on the questionnaire and
software used for its collection.

[^1]: Observational unit is the "entity" that the data was collected
from at a given point in time, such as a human participant in a
cohort study or a rat in an animal study at a specific time point.

Regardless of the flow or frequency of data generation and collection,
the ability to automatically ingest the data into Sprout will vary widely
based on the data source, the organization that generates the data, and
their technical expertise. Some data sources may have well-established,
but not always programmatic or automatic, workflows and processes.
Others may have no workflow at all, leaving ingestion an extremely
manual process.
2 changes: 1 addition & 1 deletion docs/design/architecture/naming.qmd
@@ -63,7 +63,7 @@ We may also occasionally use "properties" to refer to the file itself.
| Action | Description |
|----------------------------|--------------------------------------------|
| create | Create a new object. |
| construct | Construct or reconstruct a data file (like the Parquet file). |
| build | Build implies either creating a new object or recreating an existing one, e.g. (re-)build a file like the README or Parquet file. |
| view | View details about an object. |
| list | List basic details about many objects. |
| edit | Edit an object, specifically the properties object. |
209 changes: 0 additions & 209 deletions docs/design/architecture/runtime-view.qmd

This file was deleted.
