-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16251 cart: Ignore expired bulks and rpcs #14971
base: master
Are you sure you want to change the base?
Conversation
- Switch rpc headers to transfer deadline instead of a timeout - Add checks at the start and end of bulk transfer to ensure deadline has not expired. - Add deadline expiration checks in all places where rpc_priv timeout is initialized Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Ticket title is 'DAOS 2.4.2-4: Errored DAOS engine 0 exited unexpectedly on daos_user' |
Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14971/3/testReport/ |
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14971/3/testReport/ |
Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Allow-unstable-test: true Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Allow-unstable-test: true Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
src/cart/crt_bulk.c
Outdated
if (crt_bulk_desc_expired(bulk_desc)) | ||
D_GOTO(out, rc = -DER_DEADLINE_EXPIRED); | ||
|
||
D_ALLOC_PTR(verify_arg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should avoid allocating anything here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, this is for now for experimentation, not happy with this alloc either, but cant think of cleaner way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorporated verify_cb into the existing cb info structs, avoiding extra allocations on this path
src/cart/crt_hg_proc.c
Outdated
@@ -583,7 +583,7 @@ crt_proc_in_common(crt_proc_t proc, crt_rpc_input_t *data) | |||
); | |||
hdr->cch_dst_tag = rpc_priv->crp_pub.cr_ep.ep_tag; | |||
|
|||
hdr->cch_src_timeout = rpc_priv->crp_timeout_sec; | |||
hdr->cch_src_deadline_sec = timeout_to_deadline(rpc_priv->crp_timeout_sec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in general we would want to use half of the timeout value, based on what Sean had mentioned in the past, ie. client timeout value is double the server timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not sure using half is the right approach. i could see using "deadline - [small amount]" to account for the response delay, but for example, expiring a "60second timeout" rpc in 30 seconds (vs say 59 seconds on the receiver) would cause a lot of test issues and weird behaviors.
Skip-nlt: true Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14971/7/execution/node/1053/log |
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14971/7/execution/node/1141/log |
Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Skip-nlt: true Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14971/10/execution/node/1104/log |
Required-githooks: true Skip-nlt: true Signed-off-by: Alexander A Oganezov <[email protected]>
Skip-nlt: true Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14971/12/execution/node/1105/log |
Skin-nlt: true Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14971/13/execution/node/1183/log |
of the second, causing rpc to expire prematurely. Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Allow-unstable-test: true Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
- add crp_deadline_sec to rpc_priv. rpc header is now populated from it, instead of timeout being converted to deadline at the time of the encoding. - deadline is now set at crt_req_send() time. for collective calls, if deadline is already set prior to crt_req_send(), then deadline is left alone. Otherwise a timeout is converted to the deadline. This is done for children of collective rpcs to properly inherit the parents deadline. - colletive rpcs will also set local expiration time now based on deadline, instead of previously incorrect behavior with each child having its own timeout starting from 'now'. - 1 second test timeouts expanded to 3 seconds, to allow for room due to 1 sec expiration granularity. - fix: realtime clock is used now in bulk expiration handling similar to rpc deadline handling Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Required-githooks: true
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14971/17/testReport/ |
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14971/17/execution/node/1206/log |
Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14971/30/testReport/ |
Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
Required-githooks: true Signed-off-by: Alexander A Oganezov <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ftest LGTM
ping reviewers |
Signed-off-by: Alexander A Oganezov <[email protected]>
Required-githooks: true
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: