
Fix flakiness on TestFilestreamMetadataUpdatedOnRename #42221

Open · wants to merge 3 commits into base: 8.x
Conversation

belimawr
Contributor

@belimawr belimawr commented Jan 6, 2025

Proposed commit message

For some reason this test became flaky. The root of the flakiness is not in the test itself, it is in how a rename operation is detected. Even though the test uses os.Rename, the rename does not seem to be an atomic operation: https://www.man7.org/linux/man-pages/man2/rename.2.html does not make it clear whether 'renameat' (used by os.Rename) is atomic.

On a flaky execution, the file is perceived as removed and then a new file is created, both with the same inode. This happens on a system that does not reuse inodes as soon as they're freed. Because the file is detected as removed, its state is also removed. Then, when more data is added, only the offset of the new data is tracked by the registry, causing the test to fail.

A workaround for this is to not remove the state when the file is removed, hence clean_removed: false is set in the test config.
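For reference, the workaround is a single option in the test's input configuration. A minimal sketch of a filestream input with it set (the ID and path here are illustrative, not the test's actual values):

```yaml
filebeat.inputs:
  - type: filestream
    id: test-filestream            # illustrative ID
    paths:
      - /tmp/*.log                 # illustrative glob
    # The aggressive interval used by the test; docs recommend >= 1s.
    prospector.scanner.check_interval: 1ms
    # Workaround: keep registry state even when a file briefly
    # disappears from the glob snapshot during a rename.
    clean_removed: false
```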

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.


How to test this PR locally

Run TestFilestreamMetadataUpdatedOnRename and ensure it does not fail

cd filebeat
go test -tags integration -run=TestFilestreamMetadataUpdatedOnRename -v -count=100 ./input/filestream/.

Related issues

I first saw this test failing in #41954; however, the issue/fix does not seem related to that PR, so I'm creating a PR with the standalone fix so it can be easily backported to 8.x

Buildkite failure


@belimawr belimawr requested a review from a team as a code owner January 6, 2025 16:34
@belimawr belimawr requested review from rdner and faec and removed request for a team January 6, 2025 16:34
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jan 6, 2025
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jan 6, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jan 6, 2025
Comment on lines +107 to +112
// On a flaky execution, the file is actually perceived as removed
// and then a new file is created, both with the same inode. This
// happens on a system that does not reuse inodes as soon they're
// freed. Because the file is detected as removed, it's state is also
// removed. Then when more data is added, only the offset of the new
// data is tracked by the registry, causing the test to fail.
Member

@mauri870 mauri870 Jan 6, 2025

Could this be a bug with the file system watcher implementation? AFAIK rename is atomic on Linux.

The man page says:

If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which both oldpath and newpath refer to the file being renamed.

Given this limitation, the file watcher should have a debounce mechanism to avoid changing state multiple times within a short period.

Contributor Author

I saw this bit of the man page, however, my understanding is that it is atomic only when "newpath" already exists:

If newpath already exists, it will be atomically replaced

The documentation makes no statement about the behaviour when "newpath" does not exist.

For this limitation the file watcher should have a debounce mechanism to avoid changing states multiple times within a short period.

We kind of have a debounce approach: we only check the file system once every 10s (the default value for prospector.scanner.check_interval). At that point we list the files matched by the glob and diff that snapshot against the list of files we are currently harvesting; if a file is not in the file system but is in our list, the file has been removed.

There hasn't been any significant change to the file watcher in a while. It is curious that this started happening now.

Member

How are we sure that this happens only in testing and our customers don't face the same issue running in production?

If our customers experience the same issue in their environments, it's not a flaky test, it's a bug in our implementation that needs to be fixed.

Contributor Author

How are we sure that this happens only in testing and our customers don't face the same issue running in production?

Honestly, we don't know. If that happens in a user's deployment, the observable effect will be data duplication for that specific file, which is not a good outcome, but at least it is not data loss.

One particularity of this test, which is very likely what makes it flaky, is the fact that prospector.scanner.check_interval is set to 1ms. Our docs are clear: we do not recommend using values < 1s. The reason for that is that most file systems track modification times in seconds, so scanning faster than that can have unexpected results.

Honestly, if a rename operation takes more than 10s (the default value) in a real production environment, there are likely bigger problems happening there.

I do not believe it's a bug in our implementation. We take a "snapshot" of the files (well, of the globs defined in paths) and compare it with the list of files we know. If a file is not on the file system but it is on our list, it has been removed. There isn't much more (if anything) we can do without getting into the inner workings of the kernel and file system. We have reasonable defaults and document the possible pitfalls users might fall into. Our implementation also behaves the same across all OSes and file systems.

Even inotify does not handle renames directly:

Rename events are not handled directly; i.e., inotify issues two separate events that must be examined and matched in a context of potential race conditions.

Which is pretty much the issue this PR is circumventing.

Contributor

How are we sure that this happens only in testing and our customers don't face the same issue running in production?

If our customers experience the same issue on their environments, it's not a flaky test it's a bug in our implementation that needs to be fixed.

Could be wrong but isn't this a limitation of how the native ID interacts with inode re-use? In which case the "fix" is to use fingerprint which shouldn't be getting confused about a rename.

Contributor Author

This case has nothing to do with inode reuse; I validated that manually. The file is correctly renamed, the inode is preserved, the content is preserved, etc. I ran some tests quickly deleting and creating different files, and the inodes were never reused.

The problem is that when we resolve the glob, none of the files are there: neither the original nor the "destination".

With a prospector.scanner.check_interval so small, there is a small chance that when we resolve the glob it resolves to nothing, but then later the renamed file appears.

Contributor

This case has nothing to do with inode reuse, I manually validated that. The file is correctly renamed, inode is preserved, the content is preserved, etc. I did some tests of quickly deleting and creating a different file and the inodes are never re-used.

The problem is that when we resolve the glob, none of the files are there, neither the original, nor the "destination".

With a prospector.scanner.check_interval so small, there is a small chance that when we resolve the glob it resolves to nothing, but then later the renamed file appears.

OK, so I misunderstood the comment about inodes. Maybe we should document this possibility in the prospector.scanner.check_interval docs.

Contributor

mergify bot commented Jan 6, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b FixflakinessonTestFilestreamMetadataUpdatedOnRename-8.x upstream/FixflakinessonTestFilestreamMetadataUpdatedOnRename-8.x
git merge upstream/8.x
git push upstream FixflakinessonTestFilestreamMetadataUpdatedOnRename-8.x

@belimawr
Contributor Author

belimawr commented Jan 7, 2025

The CI is failing due to another flaky test; I created a PR to skip that one: #42245
