Write a flowchart of computation #61
Comments
flowchart TB
raw_data[raw data.parquet]
class raw_data dataNode;
cached_data[.cache/nisapi/clean]
class cached_data dataNode;
clean_data[clean data as lazy data frame]
class clean_data dataNode;
cached_all_datasets.py
class cached_all_datasets.py funcNode;
get_nis.py
class get_nis.py funcNode;
preprocess.py
class preprocess.py funcNode;
ready_data[clean data as data frame]
class ready_data dataNode;
data_validate1[validate schema]
class data_validate1 validateNode;
data_validate2[validate schema]
class data_validate2 validateNode;
pred_validate[validate schema and quantile]
class pred_validate validateNode;
config([config])
class config configNode;
train_data[train data as IncidentUptakeData]
class train_data objectNode;
incident_predict[incident prediction as PointForecast]
class incident_predict objectNode;
test_data[test data as IncidentUptakeData]
class test_data objectNode;
eval[eval.py: mspe, bias, end-of-season uptake error]
class eval funcNode;
model[model.py:LinearIncidentUptakeModel]
class model funcNode;
subgraph NIS-API
raw_data --> cached_all_datasets.py --> cached_data --> get_nis.py --> clean_data
end
clean_data --> preprocess.py --> ready_data
subgraph main1.py["main.py 'projection' "]
ready_data --> train_data --> model --> incident_predict
data_validate1 --> train_data
pred_validate --> incident_predict
end
config --> main1.py
subgraph main2.py["main.py 'evaluation' "]
incident_predict --> eval
ready_data --> test_data --> eval
data_validate2 --> test_data
end
config --> main2.py
classDef dataNode fill:#00cfcc;
classDef funcNode fill:#42b3f5;
classDef configNode fill:#f58742;
classDef objectNode fill:#bac5e8;
classDef validateNode fill:#b00afc;
|
My vote is to keep the entire pipeline in
If/when these changes make the run-time of |
As for the flow diagram, I agree with @swo that the granular steps inside NIS_API can be left out, and that the low contrast between font color and compartment color makes it hard to read. Additionally, removing the "projection" vs. "evaluation" split to make [model fitting -> making projections -> evaluation metrics] a single linear process, as discussed in PR #82, will change the structure of the flow diagram a bit. But I think this is a very good start, and I like your color-coding of data objects vs. key operations vs. validation, etc. Maybe you can teach me how to make these flowcharts sometime? |
After our discussion, I agree. It's OK to keep everything in one script for now, if it's fast. Once it starts slowing down, think harder about breaking up computational actions into functions, splitting them across scripts, and using a Makefile to manage the pipeline. |
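For illustration, here is a minimal sketch of what "one script, with each computational action in its own function" could look like. All names here (`fit`, `project`, `mspe`, `run_pipeline`) and the toy data are hypothetical, and the least-squares line is only a stand-in for the real model, not the project's `LinearIncidentUptakeModel`. The point is that once each step is a function, moving steps into separate scripts later is mostly mechanical.

```python
from statistics import mean


def fit(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares (stand-in model)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope


def project(params, xs):
    """Make point predictions at new x values from fitted parameters."""
    intercept, slope = params
    return [intercept + slope * x for x in xs]


def mspe(predicted, observed):
    """Mean squared prediction error between predictions and observations."""
    return mean((p - o) ** 2 for p, o in zip(predicted, observed))


def run_pipeline(train, test):
    """Single linear process: fit -> project -> evaluate."""
    params = fit(*train)
    predictions = project(params, test[0])
    return {"params": params, "mspe": mspe(predictions, test[1])}


if __name__ == "__main__":
    # Toy data, purely illustrative.
    train = ([0, 1, 2], [1, 3, 5])
    test = ([3, 4], [7, 9])
    print(run_pipeline(train, test))
```

If the script later becomes slow, each function is a natural candidate for its own script with file inputs and outputs, which is the shape Make expects.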
Lurking here... In R, the targets package (and drake before that) helped with running pipelines. For my own understanding, is there something similar in Python, or is the Makefile approach the way to go? |
targets is unusual in that it's R code running R functions, which introduces a number of limitations (e.g., it makes parallelization or cloud computation really challenging). Make runs command line commands (which, in this case, will be Python scripts). Python has Snakemake, which lets you write makefile-like "snakefiles" that can take advantage of full Python logic (e.g., read in a list of targets from a yaml), but it still ultimately issues command line commands, so you could, e.g., use a Snakefile to run an R pipeline (if you had R scripts that accepted command line arguments). |
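To make the "ultimately issues command line commands" point concrete, here is a rough sketch of the core idea behind a Make-style rule: rerun a command only when the target file is missing or older than one of its dependencies. This is not how Make or Snakemake is actually implemented; `is_stale` and `run_rule` are hypothetical helpers written just for this illustration.

```python
import subprocess
import sys
from pathlib import Path


def is_stale(target: Path, deps: list[Path]) -> bool:
    """A target is stale if it is missing or older than any dependency."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(dep.stat().st_mtime > target_mtime for dep in deps)


def run_rule(target: Path, deps: list[Path], command: list[str]) -> bool:
    """Run `command` as a subprocess only if `target` is stale.

    Returns True if the command was run, False if the target was up to date.
    """
    if not is_stale(target, deps):
        return False
    # The tool does not care what language the command is written in;
    # it only launches a process and checks the exit code.
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    # Illustrative usage: "build" out.txt from in.txt with a one-line command.
    source, output = Path("in.txt"), Path("out.txt")
    source.write_text("raw")
    copy_cmd = [
        sys.executable,
        "-c",
        "import sys, pathlib; "
        "pathlib.Path(sys.argv[2]).write_text(pathlib.Path(sys.argv[1]).read_text())",
        str(source),
        str(output),
    ]
    print(run_rule(output, [source], copy_cmd))
```

Because the rule only launches a command, the same mechanism would work for an R script that accepts command line arguments.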
@swo Thanks! That is cool to learn about. I hadn't heard of Snakemake! |
See also https://teams.microsoft.com/l/message/19:[email protected]/1734462023465?tenantId=9ce70869-60db-44fd-abe8-d2767077fc8f&groupId=4f34eba7-8fcb-4a26-9b34-a434ea777f0c&parentMessageId=1734462023465&teamName=CFA-Predict&channelName=Topic%20-%20Dev&createdTime=1734462023465 for a discussion about similar tools; it seems like Nextflow and Metaflow are common alternatives |
Cf. https://github.com/cdcent/fall-virus-model?tab=readme-ov-file#file-and-script-organization for an example