-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-6611 control: Enable dmg pool exclude,drain,reint on multiple ranks #15731
base: master
Are you sure you want to change the base?
Conversation
Ticket title is 'Add support to reintegration, drain (dmg) commands for multiple ranks on command line.' |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/1/execution/node/370/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/1/execution/node/375/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/1/execution/node/353/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/1/execution/node/261/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/1/execution/node/354/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/1/execution/node/301/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/2/execution/node/364/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/2/execution/node/375/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/2/execution/node/344/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/2/execution/node/345/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/2/execution/node/319/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/2/execution/node/314/log |
d47692a
to
4509d9b
Compare
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/3/execution/node/308/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/3/execution/node/349/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/3/execution/node/334/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/3/execution/node/331/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/3/execution/node/401/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/3/execution/node/388/log |
4509d9b
to
0041dcb
Compare
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/4/execution/node/334/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/4/execution/node/327/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/4/execution/node/326/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/4/execution/node/386/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/4/execution/node/373/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/4/execution/node/482/log |
0041dcb
to
9352123
Compare
Signed-off-by: Tom Nabarro <[email protected]>
…ainpool-multirank Features: pool Allow-unstable-test: true Signed-off-by: Tom Nabarro <tom.nabarrointel.com>
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15731/8/testReport/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ftest LGTM
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15731/8/testReport/ |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/9/execution/node/371/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/9/execution/node/376/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/9/execution/node/346/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/9/execution/node/345/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/9/execution/node/365/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15731/9/execution/node/349/log |
Features: pool Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>
1810850
to
79da39d
Compare
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15731/10/testReport/ |
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15731/10/testReport/ |
Features: pool Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15731/11/testReport/ |
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15731/11/testReport/ |
Features: pool Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15731/12/testReport/ |
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15731/12/testReport/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it planned to have one or more functional tests using these new features ?
Do we have some protocol version information to update as this new feature will change the RPC content and not just extend it ?
type PoolRanksResp struct { | ||
Status int32 `json:"status"` | ||
ID string `json:"id"` | ||
FailedRank ranklist.Rank `json:"failed_rank"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am probably missing something, but do not see obvious reasons why no more than one rank could fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the design chosen is to iterate through ranklist in the engine dRPC handler to update each rank's state. the first failure will break out of the loop and return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it would be more future-proof to make it a list. Even if the current implementation can only ever return a single rank, if the implementation is changed to perform some of these operations in parallel, then it won't require updates to the messages, UI, etc.
@@ -831,9 +927,9 @@ type PoolExtendReq struct { | |||
// This should automatically start the rebalance process. | |||
// Returns an error (including any DER code from DAOS). | |||
func PoolExtend(ctx context.Context, rpcClient UnaryInvoker, req *PoolExtendReq) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, even if PoolExtend
is already using a list of ranks, why do not generalise the usage of PoolRanksReq
and remove PoolExtendReq
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started to do this but PoolExtendResp contains an extra field tier_bytes
and I didn't want to add this field to the common PoolRanksResp just for the extend specific case.
Features: pool Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>
I imagine there will be more than one functional test added, I think @rpadma2 will be adding these. The protocol changes are dmg <-> daos_server <-> engine which at the moment all have to be at the same version IIUC, am I understanding that correct @mjmac @kjacque @NiuYawei @liw or is there some compatibility related work to be done? |
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15731/13/testReport/ |
type PoolRanksResp struct { | ||
Status int32 `json:"status"` | ||
ID string `json:"id"` | ||
FailedRank ranklist.Rank `json:"failed_rank"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it would be more future-proof to make it a list. Even if the current implementation can only ever return a single rank, if the implementation is changed to perform some of these operations in parallel, then it won't require updates to the messages, UI, etc.
return cmd.OutputJSON(resp, err) | ||
} | ||
|
||
cmd.Debugf("%T: %+v, %T: %+v", req, req, resp, resp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these debug lines necessary? I think you should already have debug printing from the control library.
That's correct. There's no interoperability concern here because all of those components must be the same version. At some point, when rolling upgrades are supported, we'll need to ensure that dmg can always talk to the -1/+1 versions of daos_server, though. |
Enable --ranks option on excluded drain and reintegrate dmg pool cmds.
Uses common request and response type for pool-ranks operations where
possible. Reuse common display format for exclude/drain/reint system
and pool cmds. Improvement of workflow and unit test coverage.
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: