Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make TC Analysis parallel #631

Merged
merged 1 commit into from
Oct 15, 2024
Merged

Make TC Analysis parallel #631

merged 1 commit into from
Oct 15, 2024

Conversation

forsyth2
Copy link
Collaborator

@forsyth2 forsyth2 commented Oct 15, 2024

Issue resolution

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

1. Does this do what we want it to do?

Objectives:

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Discussed these design decisions with @chengzhuzhang
  • Testing: I have added or modified at least one "min-case" configuration file to test this change. Every objective above is represented in at least one cfg.
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.
  • Left comments, no block here requires review.

3. Is this well documented?

Required:

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • Pre-commit checks: All the pre-commits checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.
  • Discussed these design decisions with @chengzhuzhang

@forsyth2 forsyth2 added the semver: small improvement Small improvement (will increment patch version) label Oct 15, 2024
@forsyth2 forsyth2 self-assigned this Oct 15, 2024
@forsyth2
Copy link
Collaborator Author

Unit tests pass (https://github.com/E3SM-Project/zppy/actions/runs/11355423070/job/31584722994?pr=631).

I didn't run into any of the issues from #180. Those errors were intermittent, so I do have some concern they just didn't happen to happen this time. That said, it's very possible underlying things have changed since those bugs were first noticed. For reference, those bugs were:

  • GenerateConnectivityFile: error while loading shared libraries: libnetcdf.so.11: cannot open shared object file: No such file or directory previously occurred when multiple year sets of tc_analysis were run simultaneously. This was tested here with test_min_case_e3sm_diags_tc_analysis_parallel_chrysalis.cfg.
  • socket.gaierror: [Errno -2] Name or service not known previously occurred in E3SM Diags when running the tc_analysis set in parallel. This was also tested here with test_min_case_e3sm_diags_tc_analysis_parallel_chrysalis.cfg.
Full testing details
# zppy dev environment
mamba clean --all
mamba env create -f conda/dev.yml -n zppy_dev_make-tc-parallel-20241015
conda activate zppy_dev_make-tc-parallel-20241015
pre-commit run --all-files
pip install .
python -u -m unittest tests/test_*.py

# e3sm_diags dev environment
cd ../e3sm_diags
git checkout main
git fetch upstream
git reset --hard upstream/main
git log # Should match https://github.com/E3SM-Project/e3sm_diags/commits/main
mamba clean --all
mamba env create -f conda-env/dev.yml -n e3sm_diags_20241015
conda activate e3sm_diags_20241015
pip install .

cd ../zppy
conda activate zppy_dev_make-tc-parallel-20241015
pip install .

zppy -c tests/integration/generated/test_min_case_tc_analysis_simultaneous_1_chrysalis.cfg
zppy -c tests/integration/generated/test_min_case_tc_analysis_simultaneous_2_chrysalis.cfg 
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_tc_analysis_simultaneous_1_output/test-631-v2/v2.LR.historical_0201/post/scripts/
grep -v "OK" *status
# No errors
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_tc_analysis_simultaneous_2_output/test-631-v2/v2.LR.historical_0201/post/scripts/
grep -v "OK" *status
# No errors

zppy -c tests/integration/generated/test_min_case_e3sm_diags_tc_analysis_chrysalis.cfg
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_e3sm_diags_tc_analysis_output/test-631-v2/v2.LR.historical_0201/post/scripts
grep -v "OK" *status
# No errors
# https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.forsyth2/zppy_min_case_e3sm_diags_tc_analysis_www/test-631-v2/v2.LR.historical_0201/e3sm_diags/atm_monthly_180x360_aave_tc_analysis/model_vs_obs_1985-1986/viewer/tc_analysis/index.html has plots

zppy -c tests/integration/generated/test_min_case_e3sm_diags_tc_analysis_mvm_1_chrysalis.cfg
# Wait to complete
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_e3sm_diags_tc_analysis_mvm_1_output/test-631-v2/v2.LR.historical_0201/post/scripts/
grep -v "OK" *status
# No errors
cd -
zppy -c tests/integration/generated/test_min_case_e3sm_diags_tc_analysis_mvm_2_chrysalis.cfg
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_e3sm_diags_tc_analysis_mvm_2_output/test-631-v2/v2.LR.historical_0201/post/scripts/
grep -v "OK" *status
# No errors
# https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.forsyth2/zppy_min_case_e3sm_diags_tc_analysis_mvm_2_www/test-631-v2/v2.LR.historical_0201/e3sm_diags/atm_monthly_180x360_aave_mvm/model_vs_model_1995-1996/viewer/tc_analysis/index.html has plots

zppy -c tests/integration/generated/test_min_case_e3sm_diags_tc_analysis_parallel_chrysalis.cfg
cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_e3sm_diags_tc_analysis_parallel_output/test-631-v2/v2.LR.historical_0201/post/scripts/
grep -v "OK" *status
# No errors
# https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.forsyth2/zppy_min_case_e3sm_diags_tc_analysis_parallel_www/test-631-v2/v2.LR.historical_0201/e3sm_diags/atm_monthly_180x360_aave_tc_analysis/model_vs_obs_1985-1986/viewer/tc_analysis/index.html has plots
# https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.forsyth2/zppy_min_case_e3sm_diags_tc_analysis_parallel_www/test-631-v2/v2.LR.historical_0201/e3sm_diags/atm_monthly_180x360_aave_tc_analysis/model_vs_obs_1987-1988/viewer/tc_analysis/index.html has plots

@forsyth2 forsyth2 marked this pull request as ready for review October 15, 2024 22:48
Copy link
Collaborator Author

@forsyth2 forsyth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should resolve: #180, #622. However, #623 will still need to be done in some form to account for dependency handling in other tasks. Furthermore, we still need to implement the equivalent changes of E3SM-Project/e3sm_diags#824.

@@ -1,20 +1,20 @@
[default]
case = "#expand case_name#"
case = "#expand case_name_v2#"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any test using tc_analysis now uses the v2 input data because of E3SM-Project/e3sm_diags#866 (comment)

@@ -0,0 +1,36 @@
[default]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New min-case test where E3SM Diags runs in parallel to generate tc_analysis plots for two year sets.

for c in tasks:
dependencies: List[str] = []
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relevant changes from #623. We want to reset the dependencies list at the beginning of each task (i.e., nothing should be carried over).


# Due to a `socket.gaierror: [Errno -2] Name or service not known` error when running e3sm_diags with tc_analysis
# on multiple year_sets, if tc_analysis is in sets, then e3sm_diags should be run sequentially.
if "tc_analysis" in c["sets"]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer running e3sm_diags sequentially

for c in tasks:

dependencies: List[str] = []
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relevant change from #623

# Note that this line should still be executed even if jobid == -1
# The later tc_analysis tasks still depend on this task (and thus will also fail).
# Add to the dependency list
dependencies.append(status_file)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer running tc_analysis sequentially.

@forsyth2 forsyth2 merged commit d5c8005 into main Oct 15, 2024
4 checks passed
@forsyth2 forsyth2 deleted the make-tc-parallel branch October 15, 2024 23:03
@chengzhuzhang
Copy link
Collaborator

@forsyth2 thank you for working on the PR and the comments. I visually checked and it looks good. I'm happy to review any upcoming TC analysis related PRs, so feel free to tag me.

@forsyth2
Copy link
Collaborator Author

Thanks @chengzhuzhang!

@forsyth2 forsyth2 mentioned this pull request Oct 16, 2024
11 tasks
zhangshixuan1987 pushed a commit to zhangshixuan1987/zppy that referenced this pull request Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: small improvement Small improvement (will increment patch version)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix shared library error in tc_analysis
2 participants