Zookeeper SSL Failures When Certificate Is Rolled #359

Open
d80tb7 opened this issue Apr 25, 2023 · 6 comments

Comments

d80tb7 commented Apr 25, 2023

Describe the bug

Zookeeper doesn't handle SSL certificate rolling gracefully. Specifically, if a certificate is rolled, Zookeeper continues to use the old cert until it is restarted. Once the old cert expires, this can lead to an outage because other components are unable to communicate with it.

I'm not sure if this is an issue with the Pulsar Helm chart, or with Pulsar itself. If the latter, please let me know and I'll raise the issue there.

To Reproduce

Steps to reproduce the behavior:
The following is valid for Pulsar 2.92 using Helm chart 2.92

  1. Deploy Pulsar into a K8s cluster using the Helm chart, with TLS enabled for Zookeeper and certs managed by cert-manager
  2. Wait for the certificate to be rolled
  3. See connections to Zookeeper fail with "io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed...Caused by java.security.cert.CertificateExpiredException"
  4. Restart the Pulsar pods and see that the errors go away

Expected behavior
Pulsar should continue to operate normally when a certificate is rolled.

bhavyaravilla commented Jan 10, 2024

I have the same issue. Zookeeper keeps failing with the errors below:

2024-01-10T11:30:44,178+0000 [epollEventLoopGroup-7-1] ERROR org.apache.zookeeper.server.NettyServerCnxnFactory - Unsuccessful handshake with session 0x0
2024-01-10T11:30:44,178+0000 [epollEventLoopGroup-7-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Exception caught
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_expired
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499) ~[io.netty-netty-codec-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[io.netty-netty-codec-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) ~[io.netty-netty-transport-classes-epoll-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499) ~[io.netty-netty-transport-classes-epoll-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[io.netty-netty-transport-classes-epoll-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty-netty-common-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty-netty-common-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.93.Final.jar:4.1.93.Final]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_expired
    at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
    at sun.security.ssl.Alert.createSSLException(Alert.java:117) ~[?:?]
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:365) ~[?:?]
    at sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293) ~[?:?]
    at sun.security.ssl.TransportContext.dispatch(TransportContext.java:204) ~[?:?]
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:172) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:736) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:691) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:506) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:482) ~[?:?]
    at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:679) ~[?:?]
    at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:297) ~[io.netty-netty-handler-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1353) ~[io.netty-netty-handler-4.1.93.Final.jar:4.1.93.Final]

d80tb7 (Author) commented Apr 27, 2024

I think the issue here is that although the Pulsar Helm Chart sets the zookeeper.client.certReload property, this isn't enough. All that property does is make Zookeeper reload its certs when the truststore or keystore files change. When cert-manager updates the certs, the cert files in pulsar/certs/zookeeper/ are updated, but nothing updates the keystore.

The other Pulsar components (e.g. the bookie) solve this by having code inside them that watches the files under /pulsar/certs/ and then updates the keystore accordingly. Zookeeper doesn't have such code and therefore it seems to me that the certs will never be refreshed.
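
A minimal Java sketch of the watch-and-refresh idea described above. It is not Pulsar's or Zookeeper's actual code: `CertChangeDetector`, `changedSinceLastCheck`, and `startPolling` are hypothetical names, and the `onChange` callback stands in for whatever would rebuild the keystore. It hashes the files in a cert directory and reports when their names or contents change.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: detects when the files in a cert directory change. */
final class CertChangeDetector {
    private final Path certDir;
    private byte[] lastDigest;

    CertChangeDetector(Path certDir) throws IOException {
        this.certDir = certDir;
        this.lastDigest = digestOf(certDir);
    }

    /** Returns true once after the file names or contents differ from the previous check. */
    synchronized boolean changedSinceLastCheck() throws IOException {
        byte[] current = digestOf(certDir);
        boolean changed = !Arrays.equals(current, lastDigest);
        lastDigest = current;
        return changed;
    }

    /** Polls on a fixed period and runs onChange (e.g. a keystore rebuild) after each change. */
    ScheduledExecutorService startPolling(long periodSeconds, Runnable onChange) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                if (changedSinceLastCheck()) {
                    onChange.run();
                }
            } catch (IOException e) {
                // The files may be mid-update; the next poll tries again.
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    /** SHA-256 over the name and bytes of every regular file directly inside dir. */
    private static byte[] digestOf(Path dir) throws IOException {
        MessageDigest sha256;
        try {
            sha256 = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            // Sort so the digest does not depend on directory listing order.
            files = listing.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        for (Path file : files) {
            sha256.update(file.getFileName().toString().getBytes(StandardCharsets.UTF_8));
            sha256.update(Files.readAllBytes(file));
        }
        return sha256.digest();
    }
}
```

A caller could, for instance, construct the detector on the mounted cert directory and pass `startPolling` a callback that regenerates the keystore. The directory, the polling interval, and the rebuild step are all assumptions left to the deployment. Hashing file contents is used here instead of filesystem events so the check does not depend on how the mount delivers updates.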

Loahrs commented Apr 29, 2024

I am encountering the same issue with version 3.3.0 of the Helm chart. The Pulsar pods threw an SSL exception ("notAfter: 15.04.2024").

Restarting the pods solved the issue.

@HaimKortovich

Restarting zookeeper did not fix the error:

Caused by: java.security.cert.CertificateExpiredException: NotAfter: Tue Oct 01 17:19:15 UTC 2024
    at sun.security.x509.CertificateValidity.valid(CertificateValidity.java:277) ~[?:?]
    at sun.security.x509.X509CertImpl.checkValidity(X509CertImpl.java:621) ~[?:?]
    at sun.security.provider.certpath.BasicChecker.verifyValidity(BasicChecker.java:190) ~[?:?]
    at sun.security.provider.certpath.BasicChecker.check(BasicChecker.java:144) ~[?:?]
    at sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(PKIXMasterCertPathValidator.java:125) ~[?:?]
    at sun.security.provider.certpath.PKIXCertPathValidator.validate(PKIXCertPathValidator.java:224) ~[?:?]
    at sun.security.provider.certpath.PKIXCertPathValidator.validate(PKIXCertPathValidator.java:144) ~[?:?]
    at sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(PKIXCertPathValidator.java:83) ~[?:?]
    at java.security.cert.CertPathValidator.validate(CertPathValidator.java:309) ~[?:?]
    at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:364) ~[?:?]
    at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:275) ~[?:?]
    at sun.security.validator.Validator.validate(Validator.java:264) ~[?:?]
    at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:285) ~[?:?]
    at sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(X509TrustManagerImpl.java:138) ~[?:?]
    at io.netty.handler.ssl.EnhancingX509ExtendedTrustManager.checkClientTrusted(EnhancingX509ExtendedTrustManager.java:62) ~[io.netty-netty-handler-4.1.108.Final.jar:4.1.108.Final]
    at sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkClientCerts(CertificateMessage.java:1273) ~[?:?]
    at sun.security.ssl.CertificateMessage$T13CertificateConsumer.onConsumeCertificate(CertificateMessage.java:1198) ~[?:?]
    at sun.security.ssl.CertificateMessage$T13CertificateConsumer.consume(CertificateMessage.java:1175) ~[?:?]
    at sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:396) ~[?:?]
    at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:480) ~[?:?]
    at sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1277) ~[?:?]
    at sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1264) ~[?:?]
    at java.security.AccessController.doPrivileged(AccessController.java:712) ~[?:?]
    at sun.security.ssl.SSLEngineImpl$DelegatedTask.run(SSLEngineImpl.java:1209) ~[?:?]
    at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1651) ~[io.netty-netty-handler-4.1.108.Final.jar:4.1.108.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1497) ~[io.netty-netty-handler-4.1.108.Final.jar:4.1.108.Final]
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) ~[io.netty-netty-handler-4.1.108.Final.jar:4.1.108.Final]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) ~[io.netty-netty-handler-4.1.108.Final.jar:4.1.108.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530) ~[io.netty-netty-codec-4.1.108.Final.jar:4.1.108.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469) ~[io.netty-netty-codec-4.1.108.Final.jar:4.1.108.Final]
    ... 15 more

lhotari (Member) commented Oct 30, 2024

Another issue report: #524

beelis commented Jan 10, 2025

Same issue with version 3.7.0 (Pulsar 4.0.0). After a restart of all Pulsar pods, Zookeeper recovered.

After that, the Pulsar functions were unable to connect to the broker; they also seem to have the old cert cached:

2025/01/10 08:31:36.353 asm_amd64.s:1700: [info] Connecting to broker remote_addr=pulsar+ssl://pulsar-broker:6651
2025/01/10 08:31:36.368 asm_amd64.s:1700: [warning] Failed to connect to broker. error=tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2025-01-10T08:31:36Z is after 2025-01-05T18:12:06Z remote_addr=pulsar+ssl://pulsar-broker:6651

Restarting the Pulsar function pod did not help; only a redeploy of the function solved the problem.
