Commit
Merge branch 'main' of https://github.com/seedcase-project/seedcase-sprout into refactor/edit-package-properties
Showing 24 changed files with 641 additions and 604 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
--- | ||
title: "Input data" | ||
--- | ||
|
||
This section describes the types of data we expect to be input into | ||
Sprout. We design Sprout with these types of data and formats in mind. | ||
|
||
## Domain-specific types of data | ||
|
||
Currently, we only have experience with health data, so we have a bias | ||
towards that type of data. | ||
|
||
### Health research | ||
|
||
Health research data tends to fall into these types: | ||
|
||
- **Clinical**: This data is typically collected during patient visits | ||
to doctors. Depending on the country or administrative region, there | ||
will likely already be well-established data processing and storage | ||
pipelines in place. | ||
- **Register**: This type of data is highly dependent on the country | ||
or region. Generally, this data is collected for national or | ||
regional administrative purposes, such as recording employment | ||
status, income, address, medication purchases, and diagnoses. As | ||
with routine clinical data, the pipelines in place for processing and | ||
storing this data are usually very extensive and well | ||
established. | ||
- **Biological sample data**: This type of data is generated from | ||
biological samples, like blood, saliva, semen, hair, or urine. | ||
Sample analytic techniques often produce large volumes of data per | ||
person. The data may be generated in larger established laboratories | ||
or in smaller research groups, depending on | ||
what analytic technology is used and how new it is. The structure | ||
and format of the generated data also tend to be highly variable | ||
and depend heavily on the technology used, sometimes requiring | ||
specialized software to process and output. | ||
- **Survey or questionnaire**: This type of data is often collected based | ||
on a given study's aims and research questions. There are hundreds | ||
of different questionnaires that can have highly specific purposes | ||
and uses for their data. The volume and format of the data collected | ||
also vary widely from survey to survey. | ||
|
||
## File and data formats | ||
|
||
While we aim to handle a wide variety of data types, we will start with | ||
the most common formats. We also require that the data format be | ||
open rather than proprietary, since we cannot process data if we | ||
don't have software to read it. | ||
|
||
The file formats we expect to work with are text (`.txt`) files, various | ||
forms of comma-separated value (`.csv`) files, Excel (`.xls` or `.xlsx`) | ||
files (technically closed source but practically easy to read), images, | ||
audio, XML, JSON, and potentially some SQL databases. | ||
|
||
## Flow or frequency of data collection | ||
|
||
In research (and even in most industry settings), we rarely encounter | ||
truly real-time data collection. Most data collection is done in | ||
"batches", with data being collected at irregular and inconsistent | ||
intervals and then stored to be processed later. This batch | ||
collection can be broken down into two categories based on its | ||
frequency: | ||
|
||
- *Routine or continuous collection*, where data is collected at | ||
more regular intervals and in smaller batches of "observational | ||
units"[^1]. Ingestion or processing of this type of data may happen | ||
on a more regular basis. Clinical data as well as survey or | ||
questionnaire data likely fall under this category. An example is | ||
data collected on a few patients seen during the day at a clinic. | ||
- *Grouped collection*, where data is collected from many observational | ||
units during a short period of time at very irregular intervals or | ||
potentially only once. Data ingestion or processing occurs some time | ||
after all the data has been collected. Biological sample data | ||
would fall under this category, since laboratories usually run | ||
several samples at once and input data after internal quality | ||
control checks and machine-specific data processing. While | ||
register-based and clinical data are usually collected | ||
continuously, direct access to them is only given infrequently and | ||
in batches, so they may also fall under this category. Survey | ||
data may also come in batches, depending on the questionnaire and | ||
software used for its collection. | ||
|
||
[^1]: An observational unit is the "entity" that the data was collected | ||
from at a given point in time, such as a human participant in a | ||
cohort study or a rat in an animal study at a specific time point. | ||
|
||
Regardless of the flow or frequency of data generation and collection, | ||
the ability to automatically ingest the data into Sprout will vary widely | ||
based on the data source, the organization that generates the data, and | ||
its technical expertise. Some data sources may have well-established, | ||
but not always programmatic or automatic, workflows and processes. | ||
Others may have no workflow at all and rely on an extremely manual | ||
process. |
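As a rough illustration of the "File and data formats" section in the file above, the expected formats could be checked by file extension before ingestion. This is only a minimal sketch: the names `EXPECTED_FORMATS` and `classify_input_file` are hypothetical and not part of Sprout, and images and audio are left out because the text does not name specific extensions for them.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to the format families named in
# the text. This is an assumption for illustration, not Sprout's behaviour.
EXPECTED_FORMATS = {
    ".txt": "text",
    ".csv": "delimited",
    ".xls": "excel",
    ".xlsx": "excel",
    ".xml": "xml",
    ".json": "json",
}


def classify_input_file(path: str) -> Optional[str]:
    """Return the format family for a path, or None if it is not expected."""
    # Compare extensions case-insensitively so "DATA.CSV" matches ".csv".
    return EXPECTED_FORMATS.get(Path(path).suffix.lower())
```

A file with an unlisted extension, or with no extension, returns `None`, which a caller could treat as "needs manual review" rather than as an error.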
This file was deleted.