
Write a flowchart of computation #61

Open
swo opened this issue Nov 25, 2024 · 9 comments

@swo
Collaborator

swo commented Nov 25, 2024

  • Inputs (and their types)
  • Where does validation happen
  • Which scripts are involved, and when
  • Add validations where needed (cf. Data refactor #65 (comment))

Cf. https://github.com/cdcent/fall-virus-model?tab=readme-ov-file#file-and-script-organization for an example

swo mentioned this issue Dec 4, 2024
@Fuhan-Yang
Contributor

Fuhan-Yang commented Dec 17, 2024

```mermaid
flowchart TB

    %% data nodes
    raw_data["raw data.parquet"]
    cached_data[".cache/nisapi/clean"]
    clean_data[clean data as lazy data frame]
    ready_data[clean data as data frame]
    class raw_data,cached_data,clean_data,ready_data dataNode;

    %% script/function nodes
    cache_all[cached_all_datasets.py]
    get_nis[get_nis.py]
    preprocess[preprocess.py]
    eval["eval.py: mspe, bias, end-of-season uptake error"]
    model["model.py: LinearIncidentUptakeModel"]
    class cache_all,get_nis,preprocess,eval,model funcNode;

    %% validation nodes
    data_validate1[validate schema]
    data_validate2[validate schema]
    pred_validate[validate schema and quantile]
    class data_validate1,data_validate2,pred_validate validateNode;

    %% config and intermediate objects
    config([config])
    class config configNode;
    train_data[train data as IncidentUptakeData]
    test_data[test data as IncidentUptakeData]
    incident_predict[incident prediction as PointForecast]
    class train_data,test_data,incident_predict objectNode;

    subgraph NIS-API
        raw_data --> cache_all --> cached_data --> get_nis --> clean_data
    end

    clean_data --> preprocess --> ready_data

    subgraph main1["main.py 'projection'"]
        ready_data --> train_data --> model --> incident_predict
        data_validate1 --> train_data
        pred_validate --> incident_predict
    end

    config --> main1

    subgraph main2["main.py 'evaluation'"]
        incident_predict --> eval
        ready_data --> test_data --> eval
        data_validate2 --> test_data
    end

    config --> main2

    classDef dataNode fill:#00cfcc;
    classDef funcNode fill:#42b3f5;
    classDef configNode fill:#f58742;
    classDef objectNode fill:#bac5e8;
    classDef validateNode fill:#b00afc;
```






@swo
Collaborator Author

swo commented Dec 17, 2024

  • The big question is whether you want a single main.py to do everything, or whether to break this up into different pieces. I guess it's fine to start with everything in one place, and then split it up once needing to re-run everything starts slowing you down. Do you have a sense of how long it takes main.py to run now?
  • You can drop everything that's inside of nisapi; you can just start from nisapi.get_nis() (see the sketch below)
  • I find it hard to read some of these colors: [screenshot]
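
To be concrete about the second point, a hypothetical sketch (this assumes, per the flowchart above, that nisapi.get_nis() returns the clean data as a polars LazyFrame; the no-argument call signature here is an assumption, not the package's documented API):

```python
# Hypothetical: start the pipeline directly from nisapi.get_nis(),
# skipping the caching internals. Assumes get_nis() returns the clean
# data as a polars LazyFrame, as the flowchart above suggests.
import nisapi

clean_data = nisapi.get_nis()      # lazy frame of clean NIS data
ready_data = clean_data.collect()  # materialize before preprocessing
```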

@eschrom
Collaborator

eschrom commented Dec 17, 2024

My vote is to keep the entire pipeline in main.py for the time being. Before PR #82, main.py takes < 1 second. That said, several upcoming changes will cause this to increase:

  • Replacing scikit-learn models that generate point estimates with numpyro models that generate full posteriors (see the sketch below)
  • Using multiple models rather than just one
  • Refitting the model each time the forecast date changes, to include all available data, as discussed in PR #82 (Add evaluation to configuration)

If/when these changes make the run-time of main.py intolerably long, that's when I suggest we switch to a Makefile to compartmentalize the steps. But personally, for the sake of scientific progress, I'd prefer to focus on these three items themselves before refactoring our scripts.
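
For concreteness, here is a minimal sketch of the kind of scikit-learn-to-numpyro swap meant by the first item above (the model structure, variable names, and priors are illustrative assumptions, not the repo's actual LinearIncidentUptakeModel):

```python
# Hypothetical sketch: a numpyro linear model yielding a full posterior
# instead of a scikit-learn point estimate. Names and priors are
# illustrative, not the repo's actual LinearIncidentUptakeModel.
import numpy as np
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def linear_uptake(x, y=None):
    # Weakly informative priors on the regression parameters
    intercept = numpyro.sample("intercept", dist.Normal(0.0, 10.0))
    slope = numpyro.sample("slope", dist.Normal(0.0, 10.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("obs", dist.Normal(intercept + slope * x, sigma), obs=y)

# Toy data standing in for incident uptake
x = np.arange(10.0)
y = 0.5 * x + np.random.normal(0.0, 0.1, size=10)

mcmc = MCMC(NUTS(linear_uptake), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), x=x, y=y)
posterior = mcmc.get_samples()  # full posterior draws, not a point estimate
```

The run-time cost is visible here too: MCMC sampling is the kind of step that would push main.py past its current sub-second run time.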

@eschrom
Collaborator

eschrom commented Dec 17, 2024

As for the flow diagram, I agree with @swo that the granular steps inside NIS-API can be left out, and that the low contrast between the font color and the compartment color is hard to read. Additionally, removing the "projection" vs. "evaluation" split so that [model fitting -> making projections -> evaluation metrics] becomes a single linear process, as discussed in PR #82, will change the structure of the flow diagram a bit.

But I think this is a very good start, and I like your color-coding of data objects vs. key operations vs. validation steps. Maybe you can teach me how to make these flowcharts sometime?

@swo
Collaborator Author

swo commented Dec 20, 2024

> My vote is to keep the entire pipeline in main.py for the time being. Before PR #82, main.py takes < 1 second.

After our discussion, I agree. It's OK to keep everything in one script for now, if it's fast. Once it starts slowing down, think harder about breaking up computational actions into functions, splitting them across scripts, and using a Makefile to manage the pipeline.

@cherz4

cherz4 commented Dec 20, 2024

Lurking here... In R, the targets package (and drake before that) helped with pipeline running. For my own understanding: is there something similar to that in Python, or is the Makefile approach the way to go?

@swo
Collaborator Author

swo commented Dec 20, 2024

> Lurking here... In R, the targets package (and drake before that) helped with pipeline running. For my own understanding: is there something similar to that in Python, or is the Makefile approach the way to go?

targets is unusual in that it's R code running R functions, which introduces a number of limitations (e.g., it makes parallelization or cloud computation really challenging).

Make runs command-line commands (which, in this case, will be Python scripts).

Python has Snakemake, which lets you write makefile-like "snakefiles" that can take advantage of full Python logic (e.g., reading a list of targets from a YAML file), but it still ultimately issues command-line commands, so you could, e.g., use a Snakefile to run an R pipeline (if you had R scripts that accepted command-line arguments).
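
For illustration, a minimal hypothetical Snakefile along these lines (the rule names, file paths, and config keys are all made up, not this repo's actual layout):

```snakemake
# Hypothetical Snakefile sketch; file names and config keys are
# illustrative. Note the full Python logic at the top level.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# The final targets are read from the config, per the YAML example above
rule all:
    input:
        expand("results/{target}.parquet", target=config["targets"])

# Each rule still ultimately issues a command-line command,
# so the script being run could just as well be R
rule preprocess:
    input:
        "data/raw/{target}.parquet"
    output:
        "results/{target}.parquet"
    shell:
        "python preprocess.py {input} {output}"
```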

@cherz4

cherz4 commented Dec 20, 2024

@swo Thanks! That's cool to learn about. I hadn't heard of Snakemake!

@swo
Collaborator Author

swo commented Dec 27, 2024
