DAOS-16501 build: Add libsanitize #15105

Draft: wants to merge 29 commits into master

@knard38
Contributor

knard38 commented Sep 9, 2024

Description

Add a SCons build option, SANITIZERS, that enables the GCC sanitizer runtimes such as libasan. The option takes a list of sanitizer tools, for example AddressSanitizer (-fsanitize=address), ThreadSanitizer (-fsanitize=thread) and LeakSanitizer (-fsanitize=leak).

The available sanitizer tools, and which of them can be combined, are listed in the GCC man page under the -fsanitize options.
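The PR's own SCons code is not shown here, so the following is only a hypothetical sketch of how a comma-separated SANITIZERS value could be turned into compiler and linker flags. The names `sanitizer_flags`, `KNOWN_SANITIZERS` and `INCOMPATIBLE` are made up for illustration; the incompatible pairs follow the GCC manual, which states that -fsanitize=thread cannot be combined with -fsanitize=address or -fsanitize=leak.

```python
"""Hypothetical sketch: turning a SANITIZERS list into GCC flags.

Not the implementation from this PR; the mapping and the
incompatibility table are assumptions based on the GCC manual.
"""

# A subset of the sanitizer names accepted by GCC's -fsanitize= option.
KNOWN_SANITIZERS = {"address", "thread", "leak", "undefined"}

# Pairs that GCC refuses to combine in a single build.
INCOMPATIBLE = {
    frozenset({"address", "thread"}),
    frozenset({"leak", "thread"}),
}


def sanitizer_flags(value):
    """Return compiler/linker flags for a comma-separated SANITIZERS value."""
    # Split, trim and de-duplicate while keeping the user's order.
    names = list(dict.fromkeys(n.strip() for n in value.split(",") if n.strip()))
    if not names:
        return []
    unknown = sorted(set(names) - KNOWN_SANITIZERS)
    if unknown:
        raise ValueError(f"unknown sanitizer(s): {', '.join(unknown)}")
    for pair in INCOMPATIBLE:
        if pair <= set(names):
            raise ValueError(f"cannot combine: {' and '.join(sorted(pair))}")
    # The same flags must be passed to both the compiler and the linker.
    flags = [f"-fsanitize={','.join(names)}"]
    # Keep frame pointers so sanitizer stack traces stay readable.
    flags.append("-fno-omit-frame-pointer")
    return flags


if __name__ == "__main__":
    print(sanitizer_flags("address,leak"))
    # ['-fsanitize=address,leak', '-fno-omit-frame-pointer']
```

Rejecting incompatible combinations up front gives a clearer error than letting GCC fail partway through a long build.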

Before requesting gatekeeper:

  • Two review approvals have been given and any prior change requests have been resolved.
  • Testing is complete and all tests passed, or there is a reason documented in the PR why it should be force-landed and the forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that the user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

TODO

Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <[email protected]>

github-actions bot commented Sep 9, 2024

Ticket title is 'LRZ: m02r01s10dao coredump - invalid free'
Status is 'Resolved'
Labels: 'lrz'
https://daosio.atlassian.net/browse/DAOS-16501

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/1/execution/node/328/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/1/execution/node/336/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/1/execution/node/396/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/1/execution/node/340/log

@daosbuild1
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/1/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/1/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/1/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/1/display/redirect

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/1/display/redirect

utils/rpms/daos.spec (outdated, resolved)
Miscellaneous fixes.

Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <[email protected]>
@daosbuild1
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/2/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/2/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/2/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/2/display/redirect

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/2/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/3/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/3/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/3/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/3/display/redirect

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/3/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/4/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/4/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/4/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/4/display/redirect

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/4/display/redirect

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/6/execution/node/360/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/38/execution/node/341/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/38/execution/node/339/log

Fixing invalid compilation errors reported by gcc on el9.

Signed-off-by: Cedric Koch-Hofer <[email protected]>
@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/40/execution/node/340/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/41/execution/node/352/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/41/execution/node/308/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/41/execution/node/355/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/41/execution/node/329/log

Fixing invalid compilation errors reported by gcc on el9.

Signed-off-by: Cedric Koch-Hofer <[email protected]>
@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/43/execution/node/358/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/43/execution/node/359/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/43/execution/node/345/log

@mchaarawi
Contributor

TBH, I have not done any performance tests, as my initial aim was not to use it in production but to help identify threads causing memory corruption (such as buffer overflows).

Yes, for sure this is just for dev, but when I tried enabling the sanitizer on the server side a while ago, the server just hung. Maybe I did something wrong and it works for you.
Does this also work for the dependencies? For example, does it also report issues in Mercury?

Fixing invalid compilation errors reported by gcc on el9.

Signed-off-by: Cedric Koch-Hofer <[email protected]>
@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/45/execution/node/346/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/45/execution/node/344/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/45/execution/node/345/log

Fixing invalid compilation errors reported by gcc on el9.

Signed-off-by: Cedric Koch-Hofer <[email protected]>
@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/45/execution/node/337/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/47/execution/node/341/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/47/execution/node/353/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/47/execution/node/354/log

@knard38
Contributor Author

knard38 commented Jan 7, 2025

I was not planning to handle the direct DAOS dependencies, such as Mercury.
I plan to look at them in a second step... if I am able to obtain an acceptable result with this PR 🤞

@liw
Contributor

liw commented Jan 8, 2025

@mchaarawi, ASan requires an LD_LIBRARY_PATH covering the engine modules and a larger default Argobots stack size in your engine YAML file(s). Without them, engines either fail to start because they can't find librdb.so, or overrun their stacks during pool creation. The performance has been amazing compared to Valgrind: I barely notice a difference when running daos_test on Wolf.
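The two engine settings described above can be sketched as a small helper that produces the extra environment entries for an engine. This is a hypothetical illustration, not part of this PR. It assumes that the engine section of the server YAML accepts an `env_vars` list of KEY=VALUE strings, that the module directory and the 1 MiB stack size below are placeholders to adapt, and that Argobots' `ABT_THREAD_STACKSIZE` variable is how the default ULT stack size gets raised. The `ASAN_OPTIONS=detect_leaks=0` entry is one assumed way to silence leak reports while they are being ignored.

```python
"""Hypothetical sketch: extra engine environment for an ASan build.

Assumptions (not taken from this PR): the engine YAML section takes an
env_vars list of KEY=VALUE strings, MODULE_DIR holds the engine modules
(e.g. librdb.so), and ABT_THREAD_STACKSIZE raises the default Argobots
ULT stack size.
"""

import os

# Hypothetical install location of the engine modules; adjust to your build.
MODULE_DIR = "/opt/daos/lib64/daos_srv"

# Hypothetical stack size in bytes; ASan redzones make stack frames larger.
ASAN_STACK_SIZE = 1 << 20  # 1 MiB


def asan_engine_env(module_dir, stack_size, base_ld_path=""):
    """Build the KEY=VALUE strings to add to an engine's env_vars list."""
    # Put the module directory first, then any pre-existing search path.
    ld_path = os.pathsep.join(p for p in (module_dir, base_ld_path) if p)
    return [
        f"LD_LIBRARY_PATH={ld_path}",
        f"ABT_THREAD_STACKSIZE={stack_size}",
        # Assumed way to silence leak reports while they are ignored.
        "ASAN_OPTIONS=detect_leaks=0",
    ]


def as_yaml_fragment(env):
    """Render the entries as an env_vars block for an engine YAML section."""
    lines = ["env_vars:"]
    lines += [f"  - {item}" for item in env]
    return "\n".join(lines)


if __name__ == "__main__":
    print(as_yaml_fragment(asan_engine_env(MODULE_DIR, ASAN_STACK_SIZE)))
```

The printed fragment would still need to be indented under the matching engine entry by hand; generating it only keeps the three values in one place.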

Mercury already has sanitizer support; @soumagne must have already fixed the easily found issues. The same goes for Argobots.

I used ASan with DAOS plus daos_test in a rush before 2.6.1 and managed to fix the following product issues.

And the following test issues.

In the process, @knard38 and I discovered each other's work on ASan. We planned to land the support (this PR), present an introduction, and see whether it is robust enough for some regular CI runs. Beyond that, we plan to look into the memory leak reports (ignored so far), TSan, etc.

@mchaarawi
Contributor

Thanks for the detailed reply!
You obviously spent more time getting it to work than I did :-)

@liw
Contributor

liw commented Jan 8, 2025

You're very welcome to work with Cedric and me, if you like. We still don't know whether ASan will turn out to be robust enough: I believe Argobots should cause some false positives, but I've seen none, which puzzles me.

Fixing invalid compilation errors reported by gcc on el9.

Signed-off-by: Cedric Koch-Hofer <[email protected]>
Fix libasan version for ubuntu image used with GHA
Fix invalid copyright

Signed-off-by: Cedric Koch-Hofer <[email protected]>
Labels
None yet
7 participants