
Write a flowchart of computation #61

Open
swo opened this issue Nov 25, 2024 · 9 comments

@swo
Collaborator

swo commented Nov 25, 2024

  • Inputs (and their types)
  • Where does validation happen
  • Which scripts are involved, and when
  • Add validations where needed (cf. Data refactor #65 (comment))

Cf. https://github.com/cdcent/fall-virus-model?tab=readme-ov-file#file-and-script-organization for an example

swo mentioned this issue Dec 4, 2024
@Fuhan-Yang
Contributor

Fuhan-Yang commented Dec 17, 2024

```mermaid
flowchart TB

    %% data nodes
    raw_data["raw data.parquet"]
    cached_data[".cache/nisapi/clean"]
    clean_data[clean data as lazy data frame]
    ready_data[clean data as data frame]
    class raw_data,cached_data,clean_data,ready_data dataNode;

    %% script/function nodes
    cache_all[cached_all_datasets.py]
    get_nis[get_nis.py]
    preprocess[preprocess.py]
    eval["eval.py: mspe, bias, end-of-season uptake error"]
    model["model.py: LinearIncidentUptakeModel"]
    class cache_all,get_nis,preprocess,eval,model funcNode;

    %% validation nodes
    data_validate1[validate schema]
    data_validate2[validate schema]
    pred_validate[validate schema and quantile]
    class data_validate1,data_validate2,pred_validate validateNode;

    %% config and intermediate objects
    config([config])
    class config configNode;
    train_data[train data as IncidentUptakeData]
    test_data[test data as IncidentUptakeData]
    incident_predict[incident prediction as PointForecast]
    class train_data,test_data,incident_predict objectNode;

    subgraph NIS-API
        raw_data --> cache_all --> cached_data --> get_nis --> clean_data
    end

    clean_data --> preprocess --> ready_data

    subgraph main1["main.py 'projection'"]
        ready_data --> train_data --> model --> incident_predict
        data_validate1 --> train_data
        pred_validate --> incident_predict
    end

    config --> main1

    subgraph main2["main.py 'evaluation'"]
        incident_predict --> eval
        ready_data --> test_data --> eval
        data_validate2 --> test_data
    end

    config --> main2

    classDef dataNode fill:#00cfcc;
    classDef funcNode fill:#42b3f5;
    classDef configNode fill:#f58742;
    classDef objectNode fill:#bac5e8;
    classDef validateNode fill:#b00afc;
```






@swo
Collaborator Author

swo commented Dec 17, 2024

  • The big question is whether you want a single main.py to do everything, or whether to break this up into different pieces. I guess it's fine to start with everything in one place, and then split it up once needing to re-run everything starts slowing you down. Do you have a sense of how long it takes main.py to run now?
  • You can drop everything that's inside of nisapi; you can just start from nisapi.get_nis() (see the sketch below)
  • I find it hard to read some of these colors: [screenshot]
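
To be concrete about the second point, a hypothetical sketch (this assumes, per the flowchart above, that nisapi.get_nis() returns the clean data as a polars LazyFrame; the no-argument call signature here is an assumption, not the package's documented API):

```python
# Hypothetical: start the pipeline directly from nisapi.get_nis(),
# skipping the caching internals. Assumes get_nis() returns the clean
# data as a polars LazyFrame, as the flowchart above suggests.
import nisapi

clean_data = nisapi.get_nis()      # lazy frame of clean NIS data
ready_data = clean_data.collect()  # materialize before preprocessing
```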

@eschrom
Collaborator

eschrom commented Dec 17, 2024

My vote is to keep the entire pipeline in main.py for the time being. Before PR #82, main.py takes < 1 second. That said, several upcoming changes will cause this to increase:

  • Replacing scikit-learn models that generate point estimates with numpyro models that generate full posteriors (see the sketch below)
  • Using multiple models rather than just one
  • Refitting the model each time the forecast date changes, to include all available data, as discussed in PR #82 (Add evaluation to configuration)

If/when these changes make the run-time of main.py intolerably long, that's when I suggest we switch to a Makefile to compartmentalize the steps. But personally, for the sake of scientific progress, I'd prefer to focus on these three items themselves before refactoring our scripts.
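
For concreteness, here is a minimal sketch of the kind of scikit-learn-to-numpyro swap meant by the first item above (the model structure, variable names, and priors are illustrative assumptions, not the repo's actual LinearIncidentUptakeModel):

```python
# Hypothetical sketch: a numpyro linear model yielding a full posterior
# instead of a scikit-learn point estimate. Names and priors are
# illustrative, not the repo's actual LinearIncidentUptakeModel.
import numpy as np
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def linear_uptake(x, y=None):
    # Weakly informative priors on the regression parameters
    intercept = numpyro.sample("intercept", dist.Normal(0.0, 10.0))
    slope = numpyro.sample("slope", dist.Normal(0.0, 10.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("obs", dist.Normal(intercept + slope * x, sigma), obs=y)

# Toy data standing in for incident uptake
x = np.arange(10.0)
y = 0.5 * x + np.random.normal(0.0, 0.1, size=10)

mcmc = MCMC(NUTS(linear_uptake), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), x=x, y=y)
posterior = mcmc.get_samples()  # full posterior draws, not a point estimate
```

The run-time cost is visible here too: MCMC sampling is the kind of step that would push main.py past its current sub-second run time.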

@eschrom
Collaborator

eschrom commented Dec 17, 2024

As for the flow diagram, I agree with @swo that the granular steps inside NIS-API can be left out, and that the low contrast between the font color and the compartment color is hard to read. Additionally, removing the "projection" vs. "evaluation" split so that [model fitting -> making projections -> evaluation metrics] becomes a single linear process, as discussed in PR #82, will change the structure of the flow diagram a bit.

But I think this is a very good start, and I like your color-coding of data objects vs. key operations vs. validation steps. Maybe you can teach me how to make these flowcharts sometime?

@swo
Collaborator Author

swo commented Dec 20, 2024

> My vote is to keep the entire pipeline in main.py for the time being. Before PR #82, main.py takes < 1 second.

After our discussion, I agree. It's OK to keep everything in one script for now, if it's fast. Once it starts slowing down, think harder about breaking up computational actions into functions, splitting them across scripts, and using a Makefile to manage the pipeline.

@cherz4

cherz4 commented Dec 20, 2024

Lurking here... In R, the targets package (and drake before that) helped with pipeline running. For my own understanding: is there something similar to that in Python, or is the Makefile approach the way to go?

@swo
Collaborator Author

swo commented Dec 20, 2024

> Lurking here... In R, the targets package (and drake before that) helped with pipeline running. For my own understanding: is there something similar to that in Python, or is the Makefile approach the way to go?

targets is unusual in that it's R code running R functions, which introduces a number of limitations (e.g., it makes parallelization or cloud computation really challenging).

Make runs command-line commands (which, in this case, will be Python scripts).

Python has Snakemake, which lets you write makefile-like "snakefiles" that can take advantage of full Python logic (e.g., reading a list of targets from a YAML file), but it still ultimately issues command-line commands, so you could, e.g., use a Snakefile to run an R pipeline (if you had R scripts that accepted command-line arguments).
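
For illustration, a minimal hypothetical Snakefile along these lines (the rule names, file paths, and config keys are all made up, not this repo's actual layout):

```snakemake
# Hypothetical Snakefile sketch; file names and config keys are
# illustrative. Note the full Python logic at the top level.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# The final targets are read from the config, per the YAML example above
rule all:
    input:
        expand("results/{target}.parquet", target=config["targets"])

# Each rule still ultimately issues a command-line command,
# so the script being run could just as well be R
rule preprocess:
    input:
        "data/raw/{target}.parquet"
    output:
        "results/{target}.parquet"
    shell:
        "python preprocess.py {input} {output}"
```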

@cherz4

cherz4 commented Dec 20, 2024

@swo Thanks! That's cool to learn about. I hadn't heard of Snakemake!

@swo
Collaborator Author

swo commented Dec 27, 2024
