DAOS-16897 object: add N+3 EC object class #15649
base: master
Ticket title is 'Add new N+3 EC object class'
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15649/1/execution/node/1506/log
b15b79d to c4b3e71
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15649/2/execution/node/1607/log
Currently, DAOS supports EC (Erasure Coding) object classes with redundancy levels of N+1 and N+2. In certain scenarios, users may wish to use N+3 for enhanced redundancy and safety. Generally, DAOS's EC and rebuild mechanisms are designed to handle various parity levels effectively.

With the introduction of new object classes, comprehensive testing should be conducted to ensure that these changes do not introduce unexpected issues or disrupt existing functionality. Extend the test cases to cover EC_4P3X object classes as minimum test coverage.

Required-githooks: true

Signed-off-by: Wang Shilong <[email protected]>
c4b3e71 to 7d3241b
@@ -65,4 +65,5 @@ ior:
#- [EC_Object_Class, Minimum number of servers]
- ["EC_2P2G1", 6]
- ["EC_4P2G1", 8]
- ["EC_4P3G1", 9]
shouldn't that be 10 instead of 9?
Although the comment says "Minimum number of servers", the code implements "Exact number of servers" because it uses ==:
if oclass[1] == self.server_count:
So we should actually set this to 12 so that the test runs with 4P3 at all (the server variants defined above in this file are 6, 8, and 12).
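The difference between the two readings can be sketched as follows. This is a minimal illustration, not the ftest code: `EC_OCLASSES`, `select_exact`, and `select_minimum` are hypothetical names, and only the `==` comparison and the 6/8/12 server variants come from the thread.

```python
# Hypothetical stand-in for the yaml list: [EC object class, server count].
EC_OCLASSES = [["EC_2P2G1", 6], ["EC_4P2G1", 8], ["EC_4P3G1", 9]]

# Server-count variants mentioned in the review comment.
SERVER_VARIANTS = (6, 8, 12)


def select_exact(oclasses, server_count):
    """Mirror the quoted check: run a class only on an exact match."""
    return [name for name, servers in oclasses if servers == server_count]


def select_minimum(oclasses, server_count):
    """What the yaml comment describes: run a class when enough servers exist."""
    return [name for name, servers in oclasses if servers <= server_count]


def never_selected(oclasses, variants):
    """Classes that no variant would run under the exact-match rule."""
    return [name for name, servers in oclasses if servers not in variants]


if __name__ == "__main__":
    for count in SERVER_VARIANTS:
        print(count, select_exact(EC_OCLASSES, count), select_minimum(EC_OCLASSES, count))
    print("never run with ==:", never_selected(EC_OCLASSES, SERVER_VARIANTS))
```

With an exact match, an entry whose count is not one of the variants is silently skipped, which is why the suggestion is to use a count that matches an existing variant.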
And please update the comment above :)
Just a note: 4P3G1 has 7 shards per object. For placement, 7 engines are enough, but a few more spare engines may be needed if the test kills some ranks.
This is supposed to survive 3 engine failures, so we need 3 spares.
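The arithmetic behind the two comments above, as a small sketch. It assumes a single group (G1) with one shard per engine and one spare engine per tolerated failure, which is how the reviewers count; `engines_needed` is a hypothetical helper, not a DAOS function.

```python
def engines_needed(data_cells, parity_cells, spares=0):
    """Engines for one EC group: one per shard, plus optional spares."""
    shards = data_cells + parity_cells
    return shards + spares


if __name__ == "__main__":
    # EC_4P3G1: 4 data + 3 parity shards.
    print("placement only:", engines_needed(4, 3))
    print("with 3 spares:", engines_needed(4, 3, spares=3))
```

Under these assumptions, placement alone needs 7 engines and surviving 3 failures with room to rebuild needs 10, which matches the earlier question about using 10 instead of 9.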
@@ -58,4 +58,5 @@ mdtest:
#- [EC_Object_Class, Minimum number of servers]
- ["EC_2P2GX", 6]
- ["EC_4P2GX", 8]
- ["EC_4P3GX", 9]
Also, for the RF3 case with ECxP3, should we set dir_oclass to RP4 here?
I am not sure how to expand this in the yaml file, because the different object classes now need different RP levels for the directory, so I have left it as RP3 for now.
On further thought, the tests currently kill at most two ranks (they assumed 2 parities). It would be smarter to kill a different number of ranks depending on the object class. This is not an easy change for me; it would be nice if someone from CI could help improve this in this PR (or in a later one).
ok maybe i misunderstood the purpose of this PR. i thought we want to test that EC xP3 will work with 3 concurrent fault domain failures?
If we want to test EC*PX with X failures we'll need to make some utility changes. It shouldn't be too difficult but does affect several tests. I can help with that if needed
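One possible shape for such a utility, sketched under the assumption that EC class names follow the `EC_<data>P<parity>G<groups>` pattern seen in the yaml above. `parse_ec_oclass` and `ranks_to_kill` are hypothetical names and do not exist in the ftest utilities.

```python
import re

# Pattern assumed from names in this PR such as EC_4P3G1 and EC_4P3GX.
_EC_PATTERN = re.compile(r"EC_(\d+)P(\d+)G(\d+|X)")


def parse_ec_oclass(name):
    """Return (data_cells, parity_cells, groups) for an EC class name."""
    match = _EC_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not an EC object class name: {name!r}")
    data, parity, groups = match.groups()
    return int(data), int(parity), groups


def ranks_to_kill(name):
    """Kill as many ranks as the class has parity cells."""
    _, parity, _ = parse_ec_oclass(name)
    return parity


if __name__ == "__main__":
    for oclass in ("EC_2P2GX", "EC_4P2GX", "EC_4P3GX"):
        print(oclass, "->", ranks_to_kill(oclass), "ranks")
```

A test could then ask for `ranks_to_kill(oclass)` instead of hard-coding two ranks, so the P3 classes are exercised with three failures.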
Setting the dir_oclass to match the RF is also doable
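A rough sketch of that mapping, assuming the directory class needs one more replica than the number of failures to tolerate (RP3 for 2 parities, RP4 for 3, as discussed above). The `RP_<n>G1` spelling and the helper name `dir_oclass_for_parity` are assumptions for illustration; the actual dir_oclass values used by the tests are not shown in this thread.

```python
def dir_oclass_for_parity(parity_cells, groups="G1"):
    """Pick a replicated class that tolerates as many failures as the EC class."""
    replicas = parity_cells + 1  # n failures need n + 1 copies
    return f"RP_{replicas}{groups}"


if __name__ == "__main__":
    for parity in (1, 2, 3):
        print(f"ECxP{parity} -> {dir_oclass_for_parity(parity)}")
```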
Also FWIW many of these tests with P2 only test with 1 failure
ok i think we need to improve that to properly test against data loss for rf2 and now rf3. for this PR though i think the changes made are sufficient. but still need a +1 from @daltonbohning
Need to fix some of the ftest since the comments in the configs are wrong. Otherwise ftest LGTM
Signed-off-by: Wang Shilong <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15649/8/testReport/
Signed-off-by: Wang Shilong <[email protected]>
490a5e3
Skip-nlt: true Signed-off-by: Wang Shilong <[email protected]>
Fixed the copyright and skipped NLT, since it is a pain for now (I don't think this PR could break anything for NLT).
Ping other reviewers.