DAOS-16362: pydaos.torch checkpointing #15691
base: master
Errors are component not formatted correctly, Ticket number suffix is not a number. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments, Unable to load ticket data
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/1/execution/node/133/log
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/2/execution/node/134/log
91c5ab5 to 9edb95b
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/3/execution/node/134/log
60a4612 to e72b63d
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/4/execution/node/135/log
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15691/3/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15691/3/display/redirect
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15691/3/display/redirect
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15691/3/display/redirect
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/5/execution/node/135/log
Mostly nits. Overall looks good!
Since `directory_tree.py` was modified, we should run the test that uses it. You can do that by including this string in future commit messages:
Features: DfuseFind
the C shim part looks good to me. no changes requested; just some clarifications / comments added.
assert(hdl->dfs != NULL);
int rc = dfs_lookup(hdl->dfs, path, O_RDONLY, &obj, NULL, &st);
we can optimize this a bit: if the dir path is cached, we can just call dfs_stat. but for now this is fine.
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/6/execution/node/133/log
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/7/execution/node/134/log
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/8/execution/node/134/log
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/9/execution/node/134/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/9/execution/node/1229/log
Functional tests for the pydaos.torch module are now available. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
No more negative error values. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Checkpoint writes can now be done in chunks in parallel. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
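As an illustration of what writing in chunks in parallel can look like, here is a minimal sketch that is not taken from this PR. It assumes a local POSIX file as a stand-in for the checkpoint target and uses only the standard library. The names `write_in_chunks` and `_pwrite_all`, and the default chunk size, are hypothetical and chosen for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def _pwrite_all(fd, chunk, offset):
    """Write the whole chunk at `offset`, retrying on short writes."""
    view = memoryview(chunk)
    written = 0
    while written < len(view):
        written += os.pwrite(fd, view[written:], offset + written)
    return written


def write_in_chunks(path, data, chunk_size=1 << 20, workers=4):
    """Write `data` to `path` as fixed-size chunks submitted in parallel.

    Hypothetical helper: a local file stands in for the real checkpoint
    target. Returns the total number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Each task writes one slice at its own offset, so the order in
            # which the tasks finish does not affect the resulting file.
            futures = [
                pool.submit(_pwrite_all, fd, data[off:off + chunk_size], off)
                for off in range(0, len(data), chunk_size)
            ]
            return sum(f.result() for f in futures)
    finally:
        os.close(fd)
```

Because every chunk carries its own offset, no shared file position is involved and the workers need no locking between them. Whether the PR uses threads, processes, or another mechanism for this is not stated in the commit message.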
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Did not take into account that the timeout is shared across all tests in the suite. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Co-authored-by: Dalton Bohning <[email protected]> Signed-off-by: enakta <[email protected]>
Features: DfuseFind,PytorchCheckpointTest,PytorchDatasetsTest Signed-off-by: Denis Barakhtanov <[email protected]>
EL8 has a somewhat old PyTorch version. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
5ea3a6a to 82c629f
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/16/execution/node/1220/log
# (C) Copyright 2024 Google LLC
# (C) Copyright 2024 Enakta Labs Ltd
# (C) Copyright 2024-2025 Intel Corporation.
# (C) Copyright 2025 Hewlett Packard Enterprise Development LP
You should probably not touch those.
Linting was failing without these 🤷‍♂️
utils/cq/check_update_copyright.sh
Just add handling for Enakta there and you should be fine
we can still land it with copyright issues though if you just want to change them
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15691/17/testReport/
Looks like lazy Python shim module initialisation does not always work correctly, most likely the same issue as #15277.
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
EL8.8 has Python 3.6, whose ProcessPoolExecutor does not have the `initializer` argument, which makes it impossible to use because of forking and the need for a `daos_reinit` call. This commit replaces the ProcessPoolExecutor API with its underlying multiprocessing API. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
3c0067d to 5843c45
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
5843c45 to f19fe56
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/21/execution/node/1176/log
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/22/execution/node/1565/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15691/23/testReport/
Features: DfuseFind Allow-unstable-test: true Signed-off-by: Denis Barakhtanov <[email protected]>
Introduces a PyTorch checkpoint interface and user documentation for the `pydaos.torch` module.
Before requesting gatekeeper:
`Features:` (or `Test-tag*`) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: