-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16469 dtx: optimize DTX CoS cache #15089
Conversation
Ticket title is 'IOR Easy performance low with EC_16P2GX' |
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15089/1/testReport/ |
41685c0
to
4e2b075
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/2/execution/node/1462/log |
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/2/execution/node/1416/log |
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/2/execution/node/1608/log |
853a223
to
41ffc35
Compare
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/4/execution/node/1509/log |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/4/execution/node/1477/log |
41ffc35
to
707064e
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/5/execution/node/1462/log |
a752a00
to
00b6c56
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/7/execution/node/1551/log |
b2e9019
to
6d792d6
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/9/execution/node/1492/log |
6d792d6
to
7991462
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15089/10/display/redirect |
7991462
to
2f479fb
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15089/11/display/redirect |
2f479fb
to
5825c3c
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/12/execution/node/1508/log |
5825c3c
to
ac36370
Compare
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15089/13/execution/node/1492/log |
If there are a lot of committable DTX entries in DTX CoS cache, then it may be inefficient to locate the DTX entry in CoS cache with given oid + dkey_hash, that may happen under the case of that DTX batched commit is blocked (such as because of network trouble) as to trigger DTX refresh (for DTX cleanup) on other related engines. If that happened, it will increase the system load on such engine and slow down DTX commit further more. The patch reduces unnecessary search operation inside CoS cache. Other changes: 1. Metrics (io/dtx/async_cmt_lat/tgt_id) for DTX asynchronously commit latency (with unit ms). 2. Fix a bug in sched_ult2xs() with multiple numa sockets for DSS_XS_OFFLOAD case. 3. Delay commit (or abort) collective DTX on the leader target to handle resent race. 4. Avoid blocking dtx_req_wait() if chore failed to send out some DTX RPC. 5. Some cleanup for error handling. Signed-off-by: Fan Yong <[email protected]>
ac36370
to
8845d0e
Compare
Ping reviewers, thanks! |
@@ -1487,15 +1485,9 @@ dtx_leader_end(struct dtx_leader_handle *dlh, struct ds_cont_hdl *coh, int resul | |||
/* If piggyback DTX has been done everywhere, then need to handle CoS cache. | |||
* It is harmless to keep some partially committed DTX entries in CoS cache. | |||
*/ | |||
if (result == 0 && dth->dth_cos_done) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just confirm why this branch removed? thx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition will checked via the parameter for dtx_cos_put_piggyback().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ftest LGTM
Ping @daos-stack/daos-gatekeeper , thanks! |
If there are a lot of committable DTX entries in DTX CoS cache, then it may be inefficient to locate the DTX entry in CoS cache with given oid + dkey_hash, that may happen under the case of that DTX batched commit is blocked (such as because of network trouble) as to trigger DTX refresh (for DTX cleanup) on other related engines. If that happened, it will increase the system load on such engine and slow down DTX commit further more. The patch reduces unnecessary search operation inside CoS cache.
Other changes:
Metrics (io/dtx/async_cmt_lat/tgt_id) for DTX asynchronously commit latency (with unit ms).
Fix a bug in sched_ult2xs() with multiple numa sockets for DSS_XS_OFFLOAD case.
Delay commit (or abort) collective DTX on the leader target to handle resent race.
Avoid blocking dtx_req_wait() if chore failed to send out some DTX RPC.
Some cleanup for error handling.
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: