Zero downtime update #713

Merged
merged 29 commits into from
Dec 11, 2024

@daipom daipom commented Nov 5, 2024

This changes how the package behaves when it is updated while the service is active, by using Fluentd's new zero downtime restart feature.
If you stop the service before the update, the behavior does not change.

Change of specification:

  • Changes auto restart conditions for update
  • Adds FLUENT_PACKAGE_SERVICE_RESTART environmental variable
  • Automate plugin install for update on demand

Changes auto restart conditions for update

RPM:

  • Note: The specification of FROM-side version is applied. (It differs from DEB)
    • Ex. In case of v5.0.x -> v5.2.x, the spec of v5.0.x is applied.
  • Before
    • The service automatically restarts only when it was active before update.
  • After
    • The service automatically restarts with the zero downtime restart feature only when all of the following conditions are met.
      • It was active before update.
      • FLUENT_PACKAGE_SERVICE_RESTART setting is auto (default).
      • Both sides are version 5.2.0 or higher (Both sides support this feature).
    • Otherwise, the service does not automatically restart on update.

DEB:

  • Note: The specification of TO-side version is applied. (It differs from RPM)
    • Ex. In case of v5.0.x -> v5.2.x, the spec of v5.2.x is applied.
  • Before
    • The service automatically restarts when it was active or enabled before update
  • After
    • The same as the RPM.
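
The "After" conditions, which apply to both RPM and DEB, can be sketched as a small shell predicate. This is only an illustration of the stated rule: the function names and argument layout are hypothetical, `sort -V` is assumed to be available (GNU coreutils), and the maintainer scripts in this PR rely on a tmp-file handoff between packages rather than an explicit version comparison.

```bash
# Sketch only: names are hypothetical, not taken from the package scripts.

# version_ge A B: succeed if version A >= version B (assumes GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# should_auto_restart WAS_ACTIVE SETTING FROM_VERSION TO_VERSION
# Succeeds only when every condition for the automatic restart holds.
should_auto_restart() {
    [ "$1" = "yes" ] || return 1            # service was active before update
    [ "${2:-auto}" = "auto" ] || return 1   # setting is auto (the default)
    version_ge "$3" "5.2.0" || return 1     # FROM side supports the feature
    version_ge "$4" "5.2.0" || return 1     # TO side supports the feature
    return 0
}
```

For example, `should_auto_restart yes auto 5.0.5 5.2.0` fails because the FROM side predates 5.2.0, so the service would be left untouched.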

Adds FLUENT_PACKAGE_SERVICE_RESTART environmental variable

Set this in the following file:

  • RPM: /etc/sysconfig/fluentd
  • DEB: /etc/default/fluentd
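
As a minimal sketch of how a script could read the effective setting, the helper below sources the environment file and falls back to `auto` when the variable is unset. The function name is hypothetical, and it assumes the file contains plain `KEY=VALUE` lines that a POSIX shell can source.

```bash
# Example line in /etc/sysconfig/fluentd or /etc/default/fluentd:
#   FLUENT_PACKAGE_SERVICE_RESTART=manual

# read_restart_setting FILE: print the effective value, defaulting to "auto".
# Hypothetical helper. The body runs in a subshell so that sourcing FILE
# does not leak variables into the caller.
read_restart_setting() (
    unset FLUENT_PACKAGE_SERVICE_RESTART
    [ -r "$1" ] && . "$1"
    printf '%s\n' "${FLUENT_PACKAGE_SERVICE_RESTART:-auto}"
)
```

Calling `read_restart_setting /etc/default/fluentd` on a DEB system would print `manual` only if the file sets it explicitly.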

Value:

  • auto (default):
    • The service automatically restarts with the zero downtime restart feature when all of the following conditions are met.
      • The service was active before update.
      • Both sides are version 5.2.0 or higher (Both sides support this feature).
    • Plugins are automatically reinstalled at that time.
      • Requires an online environment.
      • See "Automate plugin install for update on demand" for details.
  • manual:
    • If the service was active before update, the service does not restart automatically. The old processes continue to work after update.
    • You need to restart service manually as follows:
      • The zero downtime restart: Send SIGUSR2 to the supervisor process.
      • Normal restart: Restart service normally, such as systemctl restart.
    • Please use this when you want to manage plugins manually, or the environment is offline.
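
The manual zero downtime restart can be sketched as follows. The helper names are hypothetical, and the sketch assumes the unit is named `fluentd` and that systemd reports the supervisor process as the unit's `MainPID`; check this against your own unit file before relying on it.

```bash
# supervisor_pid: print the PID that systemd tracks as the main process.
# Assumption: this is the Fluentd supervisor for the "fluentd" unit.
supervisor_pid() {
    systemctl show --property=MainPID --value fluentd
}

# zero_downtime_restart PID: send SIGUSR2 to the given supervisor process.
zero_downtime_restart() {
    kill -USR2 "$1"
}
```

Run as root, `zero_downtime_restart "$(supervisor_pid)"` would trigger the zero downtime restart, while `systemctl restart fluentd` remains the normal restart.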

Automate plugin install for update on demand

When the service automatically restarts with the zero downtime restart feature, the missing plugins are automatically detected and reinstalled before restarting.

Previously, an automatic restart after the update was not recommended.
The problem is that manually installed plugins must be reinstalled before restarting whenever the version of the embedded Ruby is updated.

So, for automatic restart, this feature provides automatic plugin installation.
This works as follows:

  1. Collect the current plugin list before the update.
    • Gems whose names begin with the prefix fluent-plugin- are recognized as plugins.
  2. After installing the new package, detect missing plugins by comparing the list with the default plugins.
  3. Install the missing plugins.
  4. Restart.
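
Steps 1 and 2 above can be sketched with standard text tools. The function names and the file-based layout are hypothetical; the input format assumes the usual `name (version)` lines printed by `fluent-gem list`.

```bash
# list_plugins: read gem names on stdin, one per line, for example
# "fluent-plugin-foo (1.2.3)". Keep only names that begin with
# fluent-plugin- and print them sorted and unique.
list_plugins() {
    awk '{print $1}' | grep '^fluent-plugin-' | sort -u
}

# missing_plugins BEFORE_FILE DEFAULT_FILE: print the plugins listed in
# BEFORE_FILE but absent from DEFAULT_FILE. Both files must be sorted,
# as produced by list_plugins.
missing_plugins() {
    comm -23 "$1" "$2"
}
```

Each name printed by `missing_plugins` would then be passed to `fluent-gem install` before the restart, which is why this path needs network access.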

If you need to manage plugins manually (e.g. to pin versions, or in an offline environment), set FLUENT_PACKAGE_SERVICE_RESTART to manual.
Automatic plugin installation and automatic restart are then disabled.
After the update, you can install plugins manually and send SIGUSR2 to trigger the zero downtime restart.

Caution: if you use a custom unit file, you need to migrate it to use this feature safely

If you use a custom unit file, such as /etc/systemd/system/fluentd.service, please remove these 2 lines.

Environment=GEM_HOME=/opt/fluentd/lib/ruby/gems/...
Environment=GEM_PATH=/opt/fluentd/lib/ruby/gems/...

GEM_HOME and GEM_PATH are no longer needed.
They will be removed after v5.2.0, but if you use a custom unit file, you need to remove them yourself.
If these variables are set, the zero downtime restart feature may not work correctly.
This is because, with this feature, the new process inherits the environment variables of the original process.
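
A quick way to check a custom unit file for the two lines is sketched below. The helper names are hypothetical, and the pattern only matches the unquoted `Environment=GEM_HOME=...` form shown above.

```bash
# has_gem_env UNIT_FILE: succeed if the file still sets GEM_HOME or GEM_PATH.
has_gem_env() {
    grep -Eq '^[[:space:]]*Environment=GEM_(HOME|PATH)=' "$1"
}

# strip_gem_env UNIT_FILE: print the file without those Environment= lines.
# The original file is not modified.
strip_gem_env() {
    grep -Ev '^[[:space:]]*Environment=GEM_(HOME|PATH)=' "$1"
}
```

After editing the real unit file, run `systemctl daemon-reload` so that systemd picks up the change.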

Mechanism of automatic restart with zero downtime

RPM

  1. to-pre(2): Collect plugin-list.
  2. Install TO-package
  3. to-post(2): Do nothing.
  4. from-preun(1):
    • Check whether the setting is auto.
    • Leave the plugin-install flag and pid if needed.
  5. Uninstall FROM-package
  6. from-postun(1): Disable %systemd_postun_with_restart.
  7. to-posttrans: Install plugins and send SIGUSR2 if needed.

DEB

  1. from-prerm(upgrade):
    • Check whether the setting is auto.
    • Leave the plugin-list and pid if needed.
  2. to-preinst(upgrade): Set tmp files for TO-package.
  3. Install TO-package
  4. from-postrm(upgrade): Clean tmp files of FROM-package.
  5. Uninstall FROM-package
  6. to-postinst(configure): Install plugins and send SIGUSR2 if needed.
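
The handoff between the FROM and TO scripts can be sketched as a pair of functions sharing a state directory. Everything here is hypothetical, including the directory, the file name, and the function names; it only illustrates the idea that the TO side acts solely on what the FROM side left behind.

```bash
# Hypothetical location; the real maintainer scripts may use another path.
STATE_DIR="${STATE_DIR:-/tmp/fluentd-update-sketch}"

# leave_handoff SETTING PID
# FROM side (prerm / preun): record the supervisor PID only when the
# setting is auto. With manual, nothing is left behind.
leave_handoff() {
    [ "$1" = "auto" ] || return 0
    mkdir -p "$STATE_DIR" && printf '%s\n' "$2" > "$STATE_DIR/pid"
}

# consume_handoff
# TO side (postinst / posttrans): print the recorded PID and clean up,
# or fail when the FROM side left nothing to act on.
consume_handoff() {
    [ -s "$STATE_DIR/pid" ] || return 1
    cat "$STATE_DIR/pid"
    rm -f "$STATE_DIR/pid"
}
```

With this shape, a FROM package that predates the feature never writes the file, so the TO package skips plugin installation and the SIGUSR2 step without needing a version check.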

@daipom daipom added the enhancement New feature or request label Nov 5, 2024
@kenhys kenhys force-pushed the feature-nodowntime branch 2 times, most recently from 91799bd to 112d5c5 Compare November 19, 2024 07:53
@daipom daipom force-pushed the feature-nodowntime branch from 303e0b1 to 6f781e5 Compare November 29, 2024 04:30
@daipom daipom force-pushed the feature-nodowntime branch 4 times, most recently from 2ee719f to e5a6e5a Compare December 6, 2024 07:09
kenhys and others added 19 commits December 10, 2024 10:00
deb:
 * preinst: detect locally installed plugins
   * collect plugin information via fluent-diagtool
 * postinst: install locally installed plugins
   * if network access is denied, give up installing plugins

rpm:
 * %pre $1 == 2 (upgrade): detect locally installed plugins
   * collect plugin information via fluent-diagtool
 * %post $1 == 2 (upgrade): install locally installed plugins
   * if network access is denied, give up installing plugins
Signed-off-by: Kentaro Hayashi <[email protected]>
auto: Automatically restart service without downtime.
      service restart will be fired during %preun (upgrade).

manual: Manually restart service by user. If user select manual,
      suppress auto-restart with automatically generated service hook
      %post and %systemd_postun_with_restart.
      when downgrading from v6, use manual and uninstall v6 then
      reinstall v5.

Signed-off-by: Kentaro Hayashi <[email protected]>
A debug print in preinst was accidentally pushed in 26785ce.

Signed-off-by: Kentaro Hayashi <[email protected]>
When upgrading a package, the hooks are executed in the following order.
Plugins should be installed before restarting the service.

Before:

* old prerm
* new preinst collect plugin information
* old postrm auto restart
* new postinst install plugins

After:

* old prerm collect plugin information
* new preinst
* old postrm
* new postinst install plugins and auto restart

Signed-off-by: Kentaro Hayashi <[email protected]>
fluentd_auto_restart will be launched during
"configure" phase, so there is no need to check that action.

This is a result of migrating the restart logic from postrm to postinst.

Signed-off-by: Kentaro Hayashi <[email protected]>
Signed-off-by: Shizuo Fujita <[email protected]>
To verify major upgrade in workflow, it needs
more package build time.

Signed-off-by: Kentaro Hayashi <[email protected]>
This PR adds testing to ensure that package upgrades succeed without
data loss.
At present, the following plugins will be checked:

- in_tcp
- in_udp
- in_syslog

Signed-off-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Before:

* missing fluent-plugin was installed

After:

* missing fluent-plugin was installed
* missing dependency gem was also installed

NOTE:

* if a missing gem requires development packages to build, installing it
will fail.
* fluent-diagtool depends heavily on the systemd service, so it is simpler
to use fluent-gem to detect missing gems.

---------

Signed-off-by: Kentaro Hayashi <[email protected]>
add test to update with auto / manual feature.

Signed-off-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Even if the needrestart package is installed, the service will not be
restarted outside of the maintainer script (hook).

Signed-off-by: Kentaro Hayashi <[email protected]>
Update the test to ensure:

* There is no issue even if an upgrade is performed that includes a Ruby
major version update.
* There is no data loss even if data is sent until the main process is
replaced.

---------

Signed-off-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Signed-off-by: Shizuo Fujita <[email protected]>
Signed-off-by: Shizuo Fujita <[email protected]>
Signed-off-by: Kentaro Hayashi <[email protected]>
Watson1978 and others added 2 commits December 10, 2024 10:24
This PR allows switching the auto/manual restart setting with the DEB
package even while Fluentd is running.

---------

Signed-off-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
@daipom daipom force-pushed the feature-nodowntime branch from 685ba07 to 55c021b Compare December 10, 2024 01:25
Watson1978 and others added 3 commits December 10, 2024 10:26
Add missing `prerm` script to include it in debian package

Signed-off-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
This allows users to manually manage plugins by setting
FLUENT_PACKAGE_SERVICE_RESTART to manual.

For example, there will be cases where a user wants to use a particular
plugin version.

Signed-off-by: Daijiro Fukuda <[email protected]>
…l phase (#758)

We were going to support two methods of downgrading with zero downtime.

1. Running v6.x => Install v5.x package by overwriting
2. Running v6.x => Replace `FLUENT_PACKAGE_SERVICE_RESTART` value to
`manual` => uninstall v6.x => Install v5.x

The second method is to keep the Fluentd process running after
uninstalling it.
We decided to remove `2.` method because it may cause confusion for
users.

Signed-off-by: Shizuo Fujita <[email protected]>
@daipom daipom force-pushed the feature-nodowntime branch from 55c021b to 4fed1fe Compare December 10, 2024 01:26
daipom and others added 2 commits December 10, 2024 11:05
We don't need to suppress this macro because this macro handles preset,
not restart.

I have confirmed that this macro of the package for RHEL 9 is expanded
as follows.

```bash
if [ $1 -eq 1 ] && [ -x "/usr/lib/systemd/systemd-update-helper" ]; then
    # Initial installation
    /usr/lib/systemd/systemd-update-helper install-system-units fluentd.service || :
fi
```

Signed-off-by: Daijiro Fukuda <[email protected]>
…#761)

Signed-off-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Co-authored-by: Daijiro Fukuda <[email protected]>
@kenhys kenhys mentioned this pull request Dec 11, 2024
Points

* FROM-package just leaves tmp files if it supports the features.
* TO-package triggers the features if those tmp files exist.
* Thus, we don't need to check the version.
* Make installing plugin and restarting the same condition.

Before

1. from-prerm(upgrade): Do nothing.
2. to-preinst(upgrade): Collect plugin-list. Confirm version.
3. Install TO-package
4. from-postrm(upgrade): Do nothing.
5. Uninstall FROM-package
6. to-postinst(configure): Install plugin and restart if need.

After

1. from-prerm(upgrade):
   * Check auto or not.
   * Leave plugin-list and pid if need.
2. to-preinst(upgrade): Set tmp files for TO-package.
3. Install TO-package
4. from-postrm(upgrade): Clean tmp files of FROM-package.
5. Uninstall FROM-package
6. to-postinst(configure): Install plugin and restart if need.

Signed-off-by: Daijiro Fukuda <[email protected]>
Points

* FROM-package just leaves tmp files if it supports the features.
* TO-package triggers the features if those tmp files exist.
* Thus, can ensure that both FROM and TO support the feature.
* Make installing plugin and restarting the same condition.
* Disable `%systemd_postun_with_restart` completely to align
specifications with DEB.

Before

1. to-pre(2): Collect plugin-list.
2. Install TO-package
3. to-post(2): Install plugin.
4. from-preun(1): Do nothing.
5. Uninstall FROM-package
6. from-postun(1): Restart if need.

After

1. to-pre(2): Collect plugin-list.
2. Install TO-package
3. to-post(2): Do nothing.
4. from-preun(1):
   * Check auto or not.
   * Leave plugin-install flag and pid if need.
5. Uninstall FROM-package
6. from-postun(1): Disable `%systemd_postun_with_restart`.
7. to-posttrans: Install plugin and restart if need.

Signed-off-by: Daijiro Fukuda <[email protected]>
@daipom daipom marked this pull request as ready for review December 11, 2024 08:29
@daipom daipom changed the title Update/Reload without downtime Zero downtime update Dec 11, 2024
daipom commented Dec 11, 2024

@kenhys kenhys force-pushed the feature-nodowntime branch from 7e511e9 to 33c1c1c Compare December 11, 2024 11:33
kenhys commented Dec 11, 2024

Fixed DCO.

@kenhys kenhys left a comment

LGTM.

kenhys commented Dec 11, 2024

All checks have passed.

@kenhys kenhys merged commit 7a46ab5 into master Dec 11, 2024
208 checks passed
@kenhys kenhys deleted the feature-nodowntime branch December 11, 2024 13:29