Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16897 object: add N+3 EC object class #15649

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

wangshilong
Copy link
Contributor

@wangshilong wangshilong commented Dec 19, 2024

currently, DAOS supports EC (Erasure Coding) object classes with redundancy levels of N+1 and N+2.
In certain scenarios, users may wish to use N+3 for enhanced redundancy and safety. Generally,
DAOS’s EC and rebuild mechanisms are designed to handle various parity levels effectively.

With the introduction of new object classes, comprehensive testing should be conducted to
ensure that these changes do not introduce any unexpected issues or disrupt existing functionalities.

Extend test cases to cover EC_4P3X object classes as a min test coverage.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Copy link

github-actions bot commented Dec 19, 2024

Ticket title is 'Add new N+3 EC object class'
Status is 'In Progress'
Labels: 'LRZ'
https://daosio.atlassian.net/browse/DAOS-16897

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15649/1/execution/node/1506/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15649/2/execution/node/1607/log

Currently, DAOS supports EC (Erasure Coding) object classes with redundancy levels of N+1 and N+2.
In certain scenarios, users may wish to use N+3 for enhanced redundancy and safety. Generally,
DAOS’s EC and rebuild mechanisms are designed to handle various parity levels effectively.

With the introduction of new object classes, comprehensive testing should be conducted to
ensure that these changes do not introduce any unexpected issues or disrupt existing functionalities.

Extend test cases to cover EC_4P3X object classes as a min test coverage.

Required-githooks: true
Signed-off-by: Wang Shilong <[email protected]>
@wangshilong wangshilong marked this pull request as ready for review January 2, 2025 06:35
@wangshilong wangshilong requested review from a team as code owners January 2, 2025 06:35
liuxuezhao
liuxuezhao previously approved these changes Jan 2, 2025
@@ -65,4 +65,5 @@ ior:
#- [EC_Object_Class, Minimum number of servers]
- ["EC_2P2G1", 6]
- ["EC_4P2G1", 8]
- ["EC_4P3G1", 9]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't that be 10 instead of 9?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the comment says "Minimum number of servers", the code implements "Exact number of servers" because it uses ==

if oclass[1] == self.server_count:

So we should actually set this to 12 so the test runs with 4P3 at all.
(because above in this file there is 6, 8, 12 server variants)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And please update the comment above :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just note 4P3G1 with 7 shards per object. for placement consideration 7 engines is enough, may need to add a few more spare engines if kills some ranks in the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is supposed to survive 3 engine failures. so we need 3 spares.

@@ -58,4 +58,5 @@ mdtest:
#- [EC_Object_Class, Minimum number of servers]
- ["EC_2P2GX", 6]
- ["EC_4P2GX", 8]
- ["EC_4P3GX", 9]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also here for RF3 case with ECxP3, should we set dir_oclass to RP4 for this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how to expand this in yaml file. because there are different object classes which has different RP levels for directory now, so I leave it as RP3 for now.

A further think tests now at mostly killed two ranks(it assumed 2 parities before) I suppose it should be smarter to kill different number of ranks according to different object classes. I suppose this is not an easy change for me, it will be nice that someone in CI could help improve this in the PR(or different PR later).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok maybe i misunderstood the purpose of this PR. i thought we want to test that EC xP3 will work with 3 concurrent fault domain failures?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to test EC*PX with X failures we'll need to make some utility changes. It shouldn't be too difficult but does affect several tests. I can help with that if needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the dir_oclass to match the RF is also doable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also FWIW many of these tests with P2 only test with 1 failure

Copy link
Contributor

@mchaarawi mchaarawi Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i think we need to improve that to properly test against data loss for rf2 and now rf3. for this PR though i think the changes made are sufficient. but still need a +1 from @daltonbohning

Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to fix some of the ftest since the comments in the configs are wrong. Otherwise ftest LGTM

src/tests/ftest/erasurecode/online_rebuild_mdtest.yaml Outdated Show resolved Hide resolved
src/tests/ftest/ior/hard_rebuild.yaml Outdated Show resolved Hide resolved
Signed-off-by: Wang Shilong <[email protected]>
liuxuezhao
liuxuezhao previously approved these changes Jan 3, 2025
mchaarawi
mchaarawi previously approved these changes Jan 3, 2025
daltonbohning
daltonbohning previously approved these changes Jan 3, 2025
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15649/8/testReport/

Signed-off-by: Wang Shilong <[email protected]>
Skip-nlt: true
Signed-off-by: Wang Shilong <[email protected]>
@wangshilong
Copy link
Contributor Author

Fixed copyright and skip NLT since it is a pain for now(I don't think this PR could break anything for NLT).

@wangshilong
Copy link
Contributor Author

ping other reviewers..

@wangshilong wangshilong requested a review from a team January 10, 2025 06:29
@wangshilong wangshilong added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.
Development

Successfully merging this pull request may close these issues.

5 participants