
Fix flakiness on TestFilestreamMetadataUpdatedOnRename #42221

Open · wants to merge 3 commits into base: 8.x
Conversation

belimawr
Contributor

@belimawr belimawr commented Jan 6, 2025

Proposed commit message

For some reason this test became flaky. The root of the flakiness is not in the test itself, it is in how a rename operation is detected. Even though the test uses os.Rename, the rename does not seem to be an atomic operation: https://www.man7.org/linux/man-pages/man2/rename.2.html does not make it clear whether 'renameat' (used by os.Rename) is atomic.

On a flaky execution, the file is perceived as removed and then a new file is created, both with the same inode. This happens on a system that does not reuse inodes as soon as they're freed. Because the file is detected as removed, its state is also removed. Then, when more data is added, only the offset of the new data is tracked by the registry, causing the test to fail.

A workaround for this is to not remove the state when the file is removed, hence clean_removed: false is set in the test config.
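For reference, the workaround is a single option in the test's input configuration. A minimal sketch of a filestream input with it set (the ID and path here are illustrative, not the test's actual values):

```yaml
filebeat.inputs:
  - type: filestream
    id: test-filestream            # illustrative ID
    paths:
      - /tmp/*.log                 # illustrative glob
    # The aggressive interval used by the test; docs recommend >= 1s.
    prospector.scanner.check_interval: 1ms
    # Workaround: keep registry state even when a file briefly
    # disappears from the glob snapshot during a rename.
    clean_removed: false
```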

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.


How to test this PR locally

Run TestFilestreamMetadataUpdatedOnRename and ensure it does not fail

cd filebeat
go test -tags integration -run=TestFilestreamMetadataUpdatedOnRename -v -count=100 ./input/filestream/.

Related issues

I first saw this test failing in #41954; however, the issue/fix does not seem related to that PR, so I'm creating a PR with the standalone fix so it can be easily backported to 8.x

Buildkite failure


@belimawr belimawr requested a review from a team as a code owner January 6, 2025 16:34
@belimawr belimawr requested review from rdner and faec and removed request for a team January 6, 2025 16:34
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jan 6, 2025
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jan 6, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jan 6, 2025
Comment on lines +107 to +112
// On a flaky execution, the file is actually perceived as removed
// and then a new file is created, both with the same inode. This
// happens on a system that does not reuse inodes as soon they're
// freed. Because the file is detected as removed, it's state is also
// removed. Then when more data is added, only the offset of the new
// data is tracked by the registry, causing the test to fail.
Member

@mauri870 mauri870 Jan 6, 2025

Could this be a bug with the file system watcher implementation? AFAIK rename is atomic on Linux.

The man page says:

If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which both oldpath and newpath refer to the file being renamed.

Given this limitation, the file watcher should have a debounce mechanism to avoid changing state multiple times within a short period.

Contributor Author

I saw this bit of the man page, however, my understanding is that it is atomic only when "newpath" already exists:

If newpath already exists, it will be atomically replaced

The documentation makes no statement about the behaviour when "newpath" does not exist.

For this limitation the file watcher should have a debounce mechanism to avoid changing states multiple times within a short period.

We kind of have a debounce approach: we only check the file system once every 10s (the default value for prospector.scanner.check_interval). At that point we list the files matched by the glob and diff that snapshot against the list of files we are currently harvesting; if a file is not in the file system but is in our list, the file has been removed.

There hasn't been any significant change to the file watcher in a while. It is curious that this started happening now.

Member

How are we sure that this happens only in testing and our customers don't face the same issue running in production?

If our customers experience the same issue in their environments, it's not a flaky test, it's a bug in our implementation that needs to be fixed.

Contributor Author

How are we sure that this happens only in testing and our customers don't face the same issue running in production?

Honestly, we don't know. If that happens in a user's deployment, the observable effect will be data duplication for that specific file, which is not a good outcome, but at least it is not data loss.

One particularity of this test, which is very likely what makes it flaky, is the fact that prospector.scanner.check_interval is set to 1ms. Our docs are clear: we do not recommend using values < 1s. The reason for that is that most file systems track modification times in seconds, so scanning faster than that can have unexpected results.

Honestly, if a rename operation takes more than 10s (the default value) in a real production environment, there are likely bigger problems happening there.

I do not believe it's a bug in our implementation. We take a "snapshot" of the files (well, of the globs defined in paths) and compare it with the list of files we know. If a file is not on the file system but it is on our list, it has been removed. There isn't much more (if anything) we can do without getting into the inner workings of the kernel and file system. We have reasonable defaults and document the possible pitfalls users might fall into. Our implementation also behaves the same across all OSes and file systems.

Even inotify does not handle renames directly:

Rename events are not handled directly; i.e., inotify issues two separate events that must be examined and matched in a context of potential race conditions.

Which is pretty much the issue this PR is circumventing.

Contributor

How are we sure that this happens only in testing and our customers don't face the same issue running in production?

If our customers experience the same issue on their environments, it's not a flaky test it's a bug in our implementation that needs to be fixed.

Could be wrong but isn't this a limitation of how the native ID interacts with inode re-use? In which case the "fix" is to use fingerprint which shouldn't be getting confused about a rename.

Contributor Author

This case has nothing to do with inode reuse; I validated that manually. The file is correctly renamed, the inode is preserved, the content is preserved, etc. I ran some tests quickly deleting and creating different files, and the inodes were never reused.

The problem is that when we resolve the glob, none of the files are there: neither the original nor the "destination".

With a prospector.scanner.check_interval so small, there is a small chance that when we resolve the glob it resolves to nothing, but then later the renamed file appears.

Contributor

This case has nothing to do with inode reuse, I manually validated that. The file is correctly renamed, inode is preserved, the content is preserved, etc. I did some tests of quickly deleting and creating a different file and the inodes are never re-used.

The problem is that when we resolve the glob, none of the files are there, neither the original, nor the "destination".

With a prospector.scanner.check_interval so small, there is a small chance that when we resolve the glob it resolves to nothing, but then later the renamed file appears.

OK, so I misunderstood the comment about inodes. Maybe we should document this possibility in the prospector.scanner.check_interval docs.

Contributor

mergify bot commented Jan 6, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b FixflakinessonTestFilestreamMetadataUpdatedOnRename-8.x upstream/FixflakinessonTestFilestreamMetadataUpdatedOnRename-8.x
git merge upstream/8.x
git push upstream FixflakinessonTestFilestreamMetadataUpdatedOnRename-8.x

@belimawr
Contributor Author

belimawr commented Jan 7, 2025

The CI is failing due to another flaky test; I created a PR to skip that one: #42245
