DAOS-16585 tests: Fix NLT handling of __fxstat detection #15150

Open
wants to merge 6 commits into
base: master

techbasset
Contributor

Use strace to determine whether calls to __fxstat actually happen when using a utility/command the IL is being tested on, and stop treating it as an error to not see __fxstat when it's not used.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Ticket title is 'NLT test failures under Ubuntu 22.04'
Status is 'Open'
Labels: 'google-cloud-daos'
https://daosio.atlassian.net/browse/DAOS-16585

@techbasset techbasset force-pushed the ncmurphy/master-DAOS-16585 branch 2 times, most recently from 14f3e0e to 6e23f42 Compare September 18, 2024 22:48
@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect

@techbasset techbasset force-pushed the ncmurphy/master-DAOS-16585 branch from 6e23f42 to 9a53b93 Compare September 19, 2024 17:34
Use strace to determine whether calls to __fxstat actually happen
when using a utility/command the IL is being tested on, and stop
treating it as an error to not see __fxstat when it's not used.

Signed-off-by: Nicholas Murphy <[email protected]>
Required-githooks: true
Run-GHA: true
@techbasset techbasset force-pushed the ncmurphy/master-DAOS-16585 branch from 9a53b93 to ad18961 Compare September 19, 2024 17:45
@daosbuild1
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect

@@ -6327,6 +6327,23 @@ def server_fi(args):
server.set_fi(probability=0)


def look_for_library_call(conf, cmd, library_str):
    """Look for library_str in the strace call stack of running cmd."""
    tmpfile = tempfile.NamedTemporaryFile(mode='r',
Contributor

I don't recall if pylint will still complain on every PR thereafter but it used to be the case. You could either remove strace from the comment or add a comment such as

# pylint: disable=wrong-spelling-in-comment

def look_for_library_call(conf, cmd, library_str):
    """Look for library_str in the strace call stack of running cmd."""
    tmpfile = tempfile.NamedTemporaryFile(mode='r',
                                          prefix='dnt_assess_',
Contributor

this one is simple enough to address, then you don't need explicit close
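
For illustration, the suggestion could look roughly like the sketch below, which wraps the temporary file in a `with` block so that it is closed (and deleted) automatically. The names `look_for_library_call_sketch` and `output_mentions_symbol` and the exact strace options are assumptions for this sketch, not the PR's code. Note that `strace -k` only prints stack traces when strace was built with unwinding support.

```python
import shutil
import subprocess
import tempfile


def output_mentions_symbol(trace_text, symbol):
    """Return True if any line of the strace output mentions symbol."""
    return any(symbol in line for line in trace_text.splitlines())


def look_for_library_call_sketch(cmd, symbol):
    """Run cmd under 'strace -k' and report whether symbol shows up.

    Hypothetical stand-in for the PR's look_for_library_call(); the real
    function also takes conf and may pass different options.
    """
    if shutil.which('strace') is None:
        raise RuntimeError('strace is not installed')
    # Leaving the with block closes and removes the file, so no explicit
    # close() call is needed.
    with tempfile.NamedTemporaryFile(mode='r',
                                     prefix='dnt_assess_') as tmpfile:
        # -k prints a stack trace after each syscall, -o writes to a file.
        subprocess.run(['strace', '-k', '-o', tmpfile.name] + list(cmd),
                       check=False)
        return output_mentions_symbol(tmpfile.read(), symbol)
```

Keeping the text matching in its own small function means it can be checked without strace being installed.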

Required-githooks: true
Required-githooks: true
Required-githooks: true
Contributor

@ashleypittman ashleypittman left a comment

This would work inasmuch as, if glibc is behaving differently, the specific check that's failing will be disabled. What we should be testing, however, is whether fstat is being intercepted properly. Historically, and in the current code, this is done by checking the logs for the wrapper function. However, dfuse now has per-operation statistics available to the client, so a more comprehensive solution would be to sample the fstat count before and after the command is invoked as a way of knowing whether it had been intercepted.

One complexity here is that the first fstat of every file is forwarded so that the st_dev value can be loaded/cached, so to fix this properly il_stat may need to be passed the number of files that are accessed. I'll see if I can get a PR together on this basis.
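
A minimal sketch of the counter-based check described here, assuming the dfuse-side fstat count can be read before and after the command. How the counters are obtained is not shown, and the function and parameter names are hypothetical. Since the first fstat of each file is forwarded, the count may legitimately rise by up to the number of files accessed.

```python
def fstat_was_intercepted(count_before, count_after, files_accessed):
    """Decide from dfuse-side fstat counts whether the IL intercepted fstat.

    count_before / count_after: fstat operations seen by dfuse, sampled
        around the command (assumed to come from dfuse statistics).
    files_accessed: number of files the command opens; the first fstat
        of each one is expected to reach dfuse.
    """
    forwarded = count_after - count_before
    if forwarded < 0:
        raise ValueError('fstat counter went backwards')
    # Anything beyond one forwarded call per file means some fstat calls
    # bypassed the interception library.
    return forwarded <= files_accessed
```

One limitation of this sketch: a command that calls fstat only once per file looks the same whether or not interception works, so the test command would need to stat each file more than once.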

Comment on lines +1524 to +1525
check_fstat = check_fstat and not self.caching and \
    look_for_library_call(self.conf, cmd, '__fxstat')
Contributor

This would be called for every il_cmd invocation and there are probably dozens so it could/should be saved in conf

Contributor Author

I thought about that and wanted to not assume different executables ended up using the same libraries. And, I don't think it's going to affect overall runtime to just do this every time. Do you feel differently?

Contributor

I'd missed that it was running the actual command rather than just a generic unix command here. This means the command has to be idempotent and perform the same operations on subsequent invocations, but looking at the places where this is called it seems it probably is.
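
One possible middle ground between the two positions in this thread, sketched under the assumption that a given executable uses the same stat entry point regardless of its arguments: cache the probe result per executable rather than globally in conf. The class name `LibraryCallCache` and the `probe` callable are hypothetical and not part of NLT.

```python
class LibraryCallCache:
    """Remember, per executable, whether a library symbol was seen."""

    def __init__(self, probe):
        # probe is any callable(cmd, symbol) -> bool, for example a
        # function that runs cmd under strace and scans the output.
        self._probe = probe
        self._seen = {}

    def lookup(self, cmd, symbol):
        """Return the cached result, probing only on first use."""
        key = (cmd[0], symbol)
        if key not in self._seen:
            self._seen[key] = self._probe(cmd, symbol)
        return self._seen[key]
```

With this shape each executable is traced at most once per symbol, which also reduces how often the command under test is run a second time.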

Comment on lines +17 to +23
# hack to install 24.04's golang-go on 22.04:
apt-get install -y software-properties-common
add-apt-repository "deb http://archive.ubuntu.com/ubuntu noble main"
apt-get update
apt-get install -y golang-go
add-apt-repository -r "deb http://archive.ubuntu.com/ubuntu noble main"

Contributor

Is this a part of the fix or something else?

Contributor

> Is this a part of the fix or something else?

There was a recent change that made Go 1.22 a requirement, and Ubuntu 22 has 1.18. Not sure why the builds didn't fail on that patch.

Contributor

This would probably be better landing as part of #15174 where the go developers can review it.

That said, this script normally just installs packages; https://github.com/daos-stack/daos/blob/master/utils/docker/Dockerfile.ubuntu would be a better place for this code, or perhaps a utils/scripts/helpers/repo-helper-debian.sh script to match what rocky does.
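
As a side note on the version mismatch discussed in this thread, a build script could fail early when the installed Go is too old. The sketch below is only an illustration: the function names are hypothetical, the 1.22 minimum is taken from the comment above, and the input is assumed to be the text printed by `go version`.

```python
import re


def parse_go_version(version_output):
    """Extract (major, minor) from text such as 'go version go1.18.1 ...'."""
    match = re.search(r'\bgo(\d+)\.(\d+)', version_output)
    if match is None:
        raise ValueError('no Go version found in: ' + version_output)
    return int(match.group(1)), int(match.group(2))


def go_is_new_enough(version_output, minimum=(1, 22)):
    """Return True if the reported Go version meets the minimum."""
    return parse_go_version(version_output) >= minimum
```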

@techbasset
Contributor Author

> This would work inasmuch as, if glibc is behaving differently, the specific check that's failing will be disabled. What we should be testing, however, is whether fstat is being intercepted properly. Historically, and in the current code, this is done by checking the logs for the wrapper function. However, dfuse now has per-operation statistics available to the client, so a more comprehensive solution would be to sample the fstat count before and after the command is invoked as a way of knowing whether it had been intercepted.
>
> One complexity here is that the first fstat of every file is forwarded so that the st_dev value can be loaded/cached, so to fix this properly il_stat may need to be passed the number of files that are accessed. I'll see if I can get a PR together on this basis.

FWIW I kind of like the strace approach as a general solution as it lets you establish a ground truth about what's actually happening without making assumptions. One can imagine extending this to have the strace dictate all the calls you should be intercepting. @jolivier23 pointed out, for instance, that "newfstat" shows up and is probably not being intercepted right now? shrug

Meantime we (Google) would like some fix here ASAP to unblock our own client testing. So: request to separate a short term fix from a longer term more complete solution?

@ashleypittman
Contributor

> FWIW I kind of like the strace approach as a general solution as it lets you establish a ground truth about what's actually happening without making assumptions. One can imagine extending this to have the strace dictate all the calls you should be intercepting. @jolivier23 pointed out, for instance, that "newfstat" shows up and is probably not being intercepted right now? shrug

I think we're arguing different points but with the same end in mind. The current code checks whether a particular glibc implementation of fstat is being intercepted; if it's not, is that because the interception is broken or because a different implementation is in use? Using strace will hide the second failure mode.

Overall this is really just a quick smoke test, and tracking filenames in the log file is a bit of a hack; to do this properly we'd write custom C code and call it from ftest with an appropriate harness. For now, and to make progress here, I'd be happy to simply disable the check_fstat check. I can re-work NLT to re-enable this smoke test using more modern dfuse/ioil features, and having it disabled for a short period of time won't be an issue IMO.

> Meantime we (Google) would like some fix here ASAP to unblock our own client testing. So: request to separate a short term fix from a longer term more complete solution?

6 participants