Fix flakiness on TestFilestreamMetadataUpdatedOnRename #42221
base: 8.x
Conversation
For some reason this test became flaky. The root of the flakiness is not in the test; it is in how a rename operation is detected. Even though this test uses `os.Rename`, it does not seem to be an atomic operation. https://www.man7.org/linux/man-pages/man2/rename.2.html does not make it clear whether `renameat` (used by `os.Rename`) is atomic.

On a flaky execution, the file is actually perceived as removed and then a new file is created, both with the same inode. This happens on a system that does not reuse inodes as soon as they're freed. Because the file is detected as removed, its state is also removed. Then, when more data is added, only the offset of the new data is tracked by the registry, causing the test to fail.

A workaround for this is to not remove the state when the file is removed, hence `clean_removed: false` is set in the test config.
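For reference, a minimal sketch of the resulting test configuration, written as the kind of Go map such tests typically build their config from; everything except `prospector.scanner.check_interval` and `clean_removed` is an illustrative placeholder:

```go
package main

import "fmt"

func main() {
	// Sketch of the test's input configuration with the workaround
	// applied. The id and path values are placeholders.
	cfg := map[string]interface{}{
		"id":    "fake-ID",
		"paths": []string{"/tmp/test.log"},
		// The aggressive scan interval is what makes the rename race
		// observable in the first place.
		"prospector.scanner.check_interval": "1ms",
		// The workaround: keep the registry state when a file disappears
		// from a scan, so a rename perceived as remove + create does not
		// reset the tracked offset.
		"clean_removed": false,
	}
	fmt.Printf("%v\n", cfg)
}
```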
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
// On a flaky execution, the file is actually perceived as removed
// and then a new file is created, both with the same inode. This
// happens on a system that does not reuse inodes as soon as they're
// freed. Because the file is detected as removed, its state is also
// removed. Then when more data is added, only the offset of the new
// data is tracked by the registry, causing the test to fail.
Could this be a bug with the file system watcher implementation? AFAIK rename is atomic on Linux.
The man page says:
If newpath already exists, it will be atomically replaced, so
that there is no point at which another process attempting to
access newpath will find it missing. However, there will
probably be a window in which both oldpath and newpath refer to
the file being renamed.
To work around this limitation, the file watcher should have a debounce mechanism to avoid changing state multiple times within a short period; see the sketch below.
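For illustration, here is a minimal Go sketch of the kind of debounce meant here; the `Event` type and the keying by file identity are assumptions for the example, not the watcher's real types:

```go
package main

import (
	"fmt"
	"time"
)

// Event is a simplified file-system event keyed by file identity
// (e.g. device+inode); the real watcher has its own types, so these
// names are illustrative only.
type Event struct {
	ID string
	Op string // "remove" or "create"
}

// debounce holds every remove for one window; if a create for the same
// identity arrives within the window, the pair is dropped as a
// rename-like no-op instead of producing two state changes.
func debounce(in <-chan Event, window time.Duration) <-chan Event {
	out := make(chan Event)
	go func() {
		defer close(out)
		pending := map[string]Event{} // ID -> held remove event
		timer := time.NewTimer(window)
		defer timer.Stop()
		for {
			select {
			case ev, ok := <-in:
				if !ok {
					for _, held := range pending {
						out <- held
					}
					return
				}
				switch {
				case ev.Op == "remove":
					pending[ev.ID] = ev
					timer.Reset(window)
				case ev.Op == "create" && pending[ev.ID].Op == "remove":
					delete(pending, ev.ID) // cancel the remove+create pair
				default:
					out <- ev
				}
			case <-timer.C:
				for id, held := range pending {
					out <- held
					delete(pending, id)
				}
			}
		}
	}()
	return out
}

func main() {
	in := make(chan Event, 2)
	in <- Event{ID: "dev1-inode42", Op: "remove"}
	in <- Event{ID: "dev1-inode42", Op: "create"}
	close(in)
	for ev := range debounce(in, 50*time.Millisecond) {
		fmt.Println("emit:", ev) // nothing is emitted: the pair cancels out
	}
}
```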
I saw this bit of the man page; however, my understanding is that it is atomic only when "newpath" already exists:
If newpath already exists, it will be atomically replaced
The documentation makes no statement about the behaviour when "newpath" does not exist.
To work around this limitation, the file watcher should have a debounce mechanism to avoid changing state multiple times within a short period.
We kinda have a debounce approach already: we only check the file system once every 10s (the default value for `prospector.scanner.check_interval`). At that point we list the files pointed to by the glob, then diff that against our current list of files being harvested; if a file is not in the file system but is in our list, then the file has been removed.
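A minimal sketch of that scan-and-diff, using `filepath.Glob`; the helper name and data shapes here are illustrative, not the prospector's actual code:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// detectRemoved resolves the glob, then reports any previously known
// file that no longer appears in the snapshot.
func detectRemoved(glob string, known map[string]bool) ([]string, error) {
	snapshot, err := filepath.Glob(glob)
	if err != nil {
		return nil, err
	}
	onDisk := make(map[string]bool, len(snapshot))
	for _, p := range snapshot {
		onDisk[p] = true
	}
	var removed []string
	for p := range known {
		if !onDisk[p] {
			// In the flaky case this fires for a file that is merely
			// mid-rename: neither the old nor the new path is visible
			// at the instant the glob is resolved.
			removed = append(removed, p)
		}
	}
	return removed, nil
}

func main() {
	known := map[string]bool{"/tmp/logs/app.log": true}
	removed, err := detectRemoved("/tmp/logs/*.log", known)
	if err != nil {
		panic(err)
	}
	fmt.Println("removed:", removed)
}
```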
There hasn't been any significant change to the file watcher in a while. It is curious that this started happening now.
How are we sure that this happens only in testing and our customers don't face the same issue running in production?
If our customers experience the same issue in their environments, it's not a flaky test, it's a bug in our implementation that needs to be fixed.
How are we sure that this happens only in testing and our customers don't face the same issue running in production?
Honestly, we don't know. If that happens on a user's deployment, the observable effect will be data duplication of that specific file, which is not a good outcome, but at least it is not data loss.
One particularity of this test, which is very likely what makes it flaky, is the fact that `prospector.scanner.check_interval` is set to `1ms`. Our docs are clear: we do not recommend using values < `1s`. The reason for that is that most file systems track modification times in seconds, so scanning faster than that can have unexpected results.
Honestly, if in a real production environment a rename operation is taking more than `10s` (the default value), there are likely bigger problems happening there.
I do not believe it's a bug in our implementation. We take a "snapshot" of the files (well, of the globs defined in `paths`) and compare it with the list of files we know. If a file is not on the file system but it is on our list, it has been removed. There isn't much more (if anything) that we can do without getting into the inner workings of the kernel and file system. We have reasonable defaults and document possible pitfalls users might fall into. Our implementation also behaves the same across all OSes and file systems.
Even inotify does not handle renames directly:
Rename events are not handled directly; i.e., inotify issues two separate events that must be examined and matched in a context of potential race conditions.
Which is pretty much the issue this PR is circumventing.
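To illustrate the matching a consumer of inotify would need, here is a hedged sketch; the event struct is a stand-in for the real `unix.InotifyEvent`, and the race handling is deliberately simplified:

```go
package main

import "fmt"

// inotify reports a rename as two events, IN_MOVED_FROM and
// IN_MOVED_TO, linked only by a shared cookie.
type inotifyEvent struct {
	Cookie uint32
	Op     string // "MOVED_FROM" or "MOVED_TO"
	Name   string
}

// pairRenames matches MOVED_FROM with MOVED_TO by cookie. Events whose
// partner never arrives (the race the man page warns about) are left
// unmatched and must be treated as a plain remove or create.
func pairRenames(events []inotifyEvent) (renames map[string]string, unmatched []inotifyEvent) {
	renames = map[string]string{}
	from := map[uint32]inotifyEvent{}
	for _, ev := range events {
		switch ev.Op {
		case "MOVED_FROM":
			from[ev.Cookie] = ev
		case "MOVED_TO":
			if src, ok := from[ev.Cookie]; ok {
				renames[src.Name] = ev.Name // old path -> new path
				delete(from, ev.Cookie)
			} else {
				unmatched = append(unmatched, ev)
			}
		default:
			unmatched = append(unmatched, ev)
		}
	}
	for _, ev := range from {
		unmatched = append(unmatched, ev)
	}
	return renames, unmatched
}

func main() {
	events := []inotifyEvent{
		{Cookie: 42, Op: "MOVED_FROM", Name: "app.log"},
		{Cookie: 42, Op: "MOVED_TO", Name: "app.log.1"},
	}
	renames, unmatched := pairRenames(events)
	fmt.Println(renames, unmatched)
}
```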
How are we sure that this happens only in testing and our customers don't face the same issue running in production?
If our customers experience the same issue in their environments, it's not a flaky test, it's a bug in our implementation that needs to be fixed.
Could be wrong, but isn't this a limitation of how the native ID interacts with inode re-use? In which case the "fix" is to use fingerprint, which shouldn't get confused by a rename.
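For reference, content-based identity works along these lines; this is a simplified sketch, not filestream's actual implementation (which by default hashes a prefix of the file, 1024 bytes):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// fingerprint hashes the first n bytes of a file. Identifying files by
// content this way survives renames and inode reuse, at the cost of
// requiring files to be at least n bytes long before they are picked up.
func fingerprint(path string, n int64) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.CopyN(h, f, n); err != nil && err != io.EOF {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	fp, err := fingerprint("/var/log/app.log", 1024)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("fingerprint:", fp)
}
```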
This case has nothing to do with inode reuse; I manually validated that. The file is correctly renamed, the inode is preserved, the content is preserved, etc. I did some tests of quickly deleting and creating a different file, and the inodes are never re-used.
The problem is that when we resolve the glob, none of the files are there: neither the original, nor the "destination".
With a `prospector.scanner.check_interval` so small, there is a small chance that when we resolve the glob, it resolves to nothing, but then later the renamed file appears.
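A minimal probe of that window, independent of the test harness; on most systems this may well print 0, which is consistent with the race being rare:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Rename a file in one goroutine while resolving the glob concurrently,
// and count the scans that see neither the old nor the new name.
func main() {
	dir, err := os.MkdirTemp("", "renamerace")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	misses := 0
	for i := 0; i < 1000; i++ {
		old := filepath.Join(dir, "a.log")
		renamed := filepath.Join(dir, "b.log")
		if err := os.WriteFile(old, []byte("data\n"), 0o644); err != nil {
			panic(err)
		}
		done := make(chan struct{})
		go func() {
			_ = os.Rename(old, renamed)
			close(done)
		}()
		matches, _ := filepath.Glob(filepath.Join(dir, "*.log"))
		if len(matches) == 0 {
			misses++ // the scan observed neither name
		}
		<-done
		_ = os.Remove(renamed)
	}
	fmt.Printf("scans that saw no file: %d/1000\n", misses)
}
```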
This case has nothing to do with inode reuse; I manually validated that. The file is correctly renamed, the inode is preserved, the content is preserved, etc. I did some tests of quickly deleting and creating a different file, and the inodes are never re-used.
The problem is that when we resolve the glob, none of the files are there: neither the original, nor the "destination".
With a `prospector.scanner.check_interval` so small, there is a small chance that when we resolve the glob, it resolves to nothing, but then later the renamed file appears.
OK, so I misunderstood the comment about inodes. Maybe we should mention this possibility in the `prospector.scanner.check_interval` docs.
This pull request is now in conflicts. Could you fix it? 🙏
The CI is failing due to another flaky test; I created a PR to skip the flaky test: #42245
Proposed commit message
For some reason this test became flaky. The root of the flakiness is not in the test; it is in how a rename operation is detected. Even though this test uses `os.Rename`, it does not seem to be an atomic operation. https://www.man7.org/linux/man-pages/man2/rename.2.html does not make it clear whether `renameat` (used by `os.Rename`) is atomic.

On a flaky execution, the file is actually perceived as removed and then a new file is created, both with the same inode. This happens on a system that does not reuse inodes as soon as they're freed. Because the file is detected as removed, its state is also removed. Then, when more data is added, only the offset of the new data is tracked by the registry, causing the test to fail.

A workaround for this is to not remove the state when the file is removed, hence `clean_removed: false` is set in the test config.

Checklist
- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

Disruptive User Impact

Author's Checklist

How to test this PR locally
Run `TestFilestreamMetadataUpdatedOnRename` and ensure it does not fail.
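For example, something along these lines from the repository root, assuming the test lives under `filebeat/input/filestream` (if the suite is behind build tags, `-tags integration` may also be needed):

```
go test -run TestFilestreamMetadataUpdatedOnRename -count=50 ./filebeat/input/filestream/...
```

`-count=50` repeats the test so residual flakiness is more likely to show up.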
Related issues

I first saw this test failing on #41954; however, the issue/fix does not seem related to that PR, so I'm creating a PR with the standalone fix so it can be easily backported to 8.x.
Buildkite failure