Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in_tail: Fix a stall bug on !follow_inode case #4327

Merged
merged 3 commits into from
Oct 27, 2023
Merged

Conversation

ashie
Copy link
Member

@ashie ashie commented Oct 18, 2023

Which issue(s) this PR fixes:
Fixes #3614
Closes #4264

What this PR does / why we need it:

Although known stall issues of in_tail on follow_inode case are fixed
in v1.16.2, it has still a similar problem on !follow_inode case.

In this case, a tail watcher is possible to mark the position entry as
unwatched if it's tansitioned to rotate_wait state by
refresh_watcher even if another newer tail watcher is managing it.
It's hard to occur in usual because stat_watcher will be called
immediately after the file is changed while refresh_wather is called
every 60 seconds in default. However, there is a rare possibility that
this order might be swapped especillay if in_tail is busy on processing
large amount of logs. Because in_tail is single threadied, event queues
such as timers or inotify will be stucked in this case.

There is no such problem on follow_inode case because position entries
are always marked as unwatched before entering rotate_wait state.

Docs Changes:
N/A

Release Note:
Same with the title

daipom and others added 2 commits October 18, 2023 17:33
Currently, this fails.
We need to fix this problem.

Signed-off-by: Daijiro Fukuda <[email protected]>
Fix #3614

Although known stall issues of in_tail on `follow_inode` case are fixed
in v1.16.2, it has still a similar problem on `!follow_inode` case.

In this case, a tail watcher is possible to mark the position entry as
`unwatched` if it's tansitioned to `rotate_wait` state by
`refresh_watcher` even if another newer tail watcher is managing it.
It's hard to occur in usual because `stat_watcher` will be called
immediately after the file is changed  while `refresh_wather` is called
every 60 seconds by default. However, there is a rare possibility that
this order might be swapped especillay if in_tail is busy on processing
large amount of logs. Because in_tail is single threadied, event queues
such as timers or inotify will be stucked in this case.

There is no such problem on `follow_inode` case because position entries
are always marked as `unwatched` before entering `rotate_wait` state.

Signed-off-by: Takuro Ashie <[email protected]>
@ashie
Copy link
Member Author

ashie commented Oct 18, 2023

https://github.com/fluent/fluentd/actions/runs/6558310821/job/17811619099?pr=4327

io_handler_opened_statuses seems unstable on Windows, sometimes it's expected but there are also failed cases like this:

2023-10-18T08:53:50.6823554Z ===============================================================================
2023-10-18T08:53:50.6825206Z Failure: test_refreshTW_during_rotation(TailInputTest::Update watchers for rotation without follow_inodes)
2023-10-18T08:53:50.6844439Z D:/a/fluentd/fluentd/test/plugin/test_in_tail.rb:3101:in `test_refreshTW_during_rotation'
2023-10-18T08:53:50.6845382Z      3098:         end
2023-10-18T08:53:50.6845968Z      3099:       end
2023-10-18T08:53:50.6846518Z      3100: 
2023-10-18T08:53:50.6847244Z   => 3101:       assert_equal(
2023-10-18T08:53:50.6847886Z      3102:         {
2023-10-18T08:53:50.6848685Z      3103:           record_values: ["file1 log1", "file1 log2", "file2 log1", "file2 log2", "file3 log1", "file3 log2"],
2023-10-18T08:53:50.6849785Z      3104:           tail_watcher_paths: ["#{@tmp_dir}/tail.txt0", "#{@tmp_dir}/tail.txt0", "#{@tmp_dir}/tail.txt0"],
2023-10-18T08:53:50.6850760Z <{:position_entries=>
2023-10-18T08:53:50.6851951Z   [["D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0",
2023-10-18T08:53:50.6852987Z     "0000000000000016",
2023-10-18T08:53:50.6853652Z     4741643960320]],
2023-10-18T08:53:50.6854343Z  :record_values=>
2023-10-18T08:53:50.6855399Z   ["file1 log1",
2023-10-18T08:53:50.6856034Z    "file1 log2",
2023-10-18T08:53:50.6856713Z    "file2 log1",
2023-10-18T08:53:50.6857366Z    "file2 log2",
2023-10-18T08:53:50.6858022Z    "file3 log1",
2023-10-18T08:53:50.6858631Z    "file3 log2"],
2023-10-18T08:53:50.6859548Z  :tail_watcher_inodes=>[4728759713792, 4737349255168, 4741643960320],
2023-10-18T08:53:50.6860651Z  :tail_watcher_io_handler_opened_statuses=>[false, false, false],
2023-10-18T08:53:50.6861582Z  :tail_watcher_paths=>
2023-10-18T08:53:50.6862594Z   ["D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0",
2023-10-18T08:53:50.6863930Z    "D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0",
2023-10-18T08:53:50.6865416Z    "D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0"]}> expected but was
2023-10-18T08:53:50.6866642Z <{:position_entries=>
2023-10-18T08:53:50.6867654Z   [["D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0",
2023-10-18T08:53:50.6868665Z     "0000000000000016",
2023-10-18T08:53:50.6869580Z     4741643960320]],
2023-10-18T08:53:50.6870261Z  :record_values=>
2023-10-18T08:53:50.6870976Z   ["file1 log1",
2023-10-18T08:53:50.6873162Z    "file1 log2",
2023-10-18T08:53:50.6874008Z    "file2 log1",
2023-10-18T08:53:50.6874668Z    "file2 log2",
2023-10-18T08:53:50.6875391Z    "file3 log1",
2023-10-18T08:53:50.6876069Z    "file3 log2"],
2023-10-18T08:53:50.6884221Z  :tail_watcher_inodes=>[4728759713792, 4737349255168, 4741643960320],
2023-10-18T08:53:50.6885920Z  :tail_watcher_io_handler_opened_statuses=>[false, true, false],
2023-10-18T08:53:50.6886930Z  :tail_watcher_paths=>
2023-10-18T08:53:50.6888137Z   ["D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0",
2023-10-18T08:53:50.6889589Z    "D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0",
2023-10-18T08:53:50.6890948Z    "D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0"]}>
2023-10-18T08:53:50.6891904Z 
2023-10-18T08:53:50.6892173Z diff:
2023-10-18T08:53:50.6892755Z   {:position_entries=>
2023-10-18T08:53:50.6893503Z     [["D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0",
2023-10-18T08:53:50.6894312Z       "0000000000000016",
2023-10-18T08:53:50.6894777Z       4741643960320]],
2023-10-18T08:53:50.6895266Z    :record_values=>
2023-10-18T08:53:50.6895910Z     ["file1 log1",
2023-10-18T08:53:50.6896312Z      "file1 log2",
2023-10-18T08:53:50.6896753Z      "file2 log1",
2023-10-18T08:53:50.6897195Z      "file2 log2",
2023-10-18T08:53:50.6898257Z      "file3 log1",
2023-10-18T08:53:50.6899149Z      "file3 log2"],
2023-10-18T08:53:50.6900039Z    :tail_watcher_inodes=>[4728759713792, 4737349255168, 4741643960320],
2023-10-18T08:53:50.6901396Z ?  :tail_watcher_io_handler_opened_statuses=>[false, false, false],
2023-10-18T08:53:50.6908389Z ?                                                    tru           
2023-10-18T08:53:50.6911302Z ?                                                    ????           
2023-10-18T08:53:50.6912446Z    :tail_watcher_paths=>
2023-10-18T08:53:50.6913579Z     ["D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0",
2023-10-18T08:53:50.6914598Z      "D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0",
2023-10-18T08:53:50.6915606Z      "D:/a/fluentd/fluentd/test/plugin/../tmp/tail/1cadfdd8b26c210a9541/tail.txt0"]}
2023-10-18T08:53:50.6916393Z ===============================================================================

@ashie
Copy link
Member Author

ashie commented Oct 20, 2023

https://github.com/fluent/fluentd/actions/runs/6558310821/job/17811619099?pr=4327

io_handler_opened_statuses seems unstable on Windows, sometimes it's expected but there are also failed cases like this:

Sometimes it's also reproduced on my local GNU/Linux environment.
Probably it's under rotate_wait state and not yet timed out, because current in_tail's implementation doesn't manage tail watchers under rotate_wait state.

@ashie ashie marked this pull request as ready for review October 23, 2023 00:06
@ashie ashie requested a review from daipom October 23, 2023 00:06
@ashie ashie added this to the v1.16.3 milestone Oct 23, 2023
@ashie ashie added the bug Something isn't working label Oct 23, 2023
@ashie
Copy link
Member Author

ashie commented Oct 23, 2023

We found another issue while checking this patch.
When a tail watcher under rotate_wait state detects rotation, it calls detach_watcher_after_rotate_wait again, so that detach is processed twice for a same watcher. In addition, when rotations are occurred multiple times while rotate_wait, duplicate watcher might be created for a same file (not confirmed yet).
Although it would be better to fix such issues too, this PR focused on only fixing the stall issue.

@daipom
Copy link
Contributor

daipom commented Oct 25, 2023

Thanks for this fix!

Indeed, there seems to be no need to unwatch the target existing in @tails, at least for follow_inodes false.
And, this fixes the follow_inodes false issue in #3614.

if @pf && tw.unwatched && (@follow_inode || !@tails[tw.path])
target_info = TargetInfo.new(tw.path, ino)
@pf.unwatch(target_info)
end

I think we will need more fundamental fixes in the future, but currently, we need this fix to solve this issue while minimizing the impact on existing logic.

@daipom
Copy link
Contributor

daipom commented Oct 25, 2023

I'll check the behavior a little more tomorrow.

Copy link
Contributor

@daipom daipom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Thanks!

@daipom daipom merged commit 52e46f0 into master Oct 27, 2023
13 of 16 checks passed
@daipom daipom deleted the fix-issue-3614 branch October 27, 2023 10:17
daipom added a commit to daipom/fluentd that referenced this pull request Oct 31, 2023
Fix fluent#3614

Although known stall issues of in_tail on `follow_inode` case are fixed
in v1.16.2, it has still a similar problem on `!follow_inode` case.

In this case, a tail watcher is possible to mark the position entry as
`unwatched` if it's tansitioned to `rotate_wait` state by
`refresh_watcher` even if another newer tail watcher is managing it.
It's hard to occur in usual because `stat_watcher` will be called
immediately after the file is changed  while `refresh_wather` is called
every 60 seconds by default. However, there is a rare possibility that
this order might be swapped especillay if in_tail is busy on processing
large amount of logs. Because in_tail is single threadied, event queues
such as timers or inotify will be stucked in this case.

There is no such problem on `follow_inode` case because position entries
are always marked as `unwatched` before entering `rotate_wait` state.

---------

Signed-off-by: Takuro Ashie <[email protected]>
Co-authored-by: Daijiro Fukuda <[email protected]>
daipom added a commit to daipom/fluentd that referenced this pull request Nov 9, 2023
For `refresh_watcher`, if an exisiting TailWatcher already
follows a target path with the different inode, it means that
the TailWatcher following the rotated file still exists.

In this case, `refresh_watcher` can't start the new TailWatcher
for the new current file.
So, we should output a warning log in order to prevent silent
collection stops, such as fluent#4327.

The similar warning may work for follow_inodes too.
Just limiting the case to suppress the impact to existing logic.

Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit to daipom/fluentd that referenced this pull request Nov 10, 2023
For `refresh_watcher`, if an exisiting TailWatcher already
follows a target path with the different inode, it means that
the TailWatcher following the rotated file still exists.

In this case, `refresh_watcher` can't start the new TailWatcher
for the new current file.
So, we should output a warning log in order to prevent silent
collection stops, such as fluent#4327.

The similar warning may work for follow_inodes too.
Just limiting the case to suppress the impact to existing logic.

Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit that referenced this pull request Nov 10, 2023
For `refresh_watcher`, if an exisiting TailWatcher already
follows a target path with the different inode, it means that
the TailWatcher following the rotated file still exists.

In this case, `refresh_watcher` can't start the new TailWatcher
for the new current file.
So, we should output a warning log in order to prevent silent
collection stops, such as #4327.

The similar warning may work for follow_inodes too.
Just limiting the case to suppress the impact to existing logic.

Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit to daipom/fluentd that referenced this pull request Nov 10, 2023
)

For `refresh_watcher`, if an exisiting TailWatcher already
follows a target path with the different inode, it means that
the TailWatcher following the rotated file still exists.

In this case, `refresh_watcher` can't start the new TailWatcher
for the new current file.
So, we should output a warning log in order to prevent silent
collection stops, such as fluent#4327.

The similar warning may work for follow_inodes too.
Just limiting the case to suppress the impact to existing logic.

Signed-off-by: Daijiro Fukuda <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fluentd in_tail "unreadable" file causes "following tail of <file>" to stop and no logs pushed
2 participants