This repository has been archived by the owner on Apr 1, 2024. It is now read-only.
ISSUE-19568: [Bug] DeadLock risk of producer when resend messages in exception. #5540
Open
Original Issue: apache#19568
Search before asking
Version
2.9
Minimal reproduce step
When I tried to add support for resending messages that have timed out, I modified `PerformanceProducer` like below, and soon encountered a deadlock problem.
I emulated the risk of message loss with the `tc` command, for example:
This command drops some of the responses that the broker sends to the producer. So when the producer sends messages to the broker, about 30% of the send acks may be dropped. As a consequence, the messages in `pendingMessages` may time out, and a TimeoutException is thrown, which triggers the resending logic I added above. But soon the whole process got stuck, so I dumped the stack and found the cause of the problem.
The timer task for `triggerBatchMessageAndSend` is stuck because another thread holds the lock on the `ProducerImpl` object. The timer task for `triggerBatchMessageAndSend` batches messages and sends them to the broker; the broker then sends the ack back to the producer, and the producer releases the semaphore used to control the size of `pendingMessages`.
I found that the thread below holds the `ProducerImpl` object and is waiting for the semaphore that is released by the thread above, which results in the deadlock.
The stacks above help with further analysis. When message timeouts occur, thread `pulsar-client-io` holds the `ProducerImpl` object, tries to fail the timed-out messages, and throws TimeoutException, which is handled in the `CompletableFuture.exceptionally` block. Because I added the resending logic in the `CompletableFuture.exceptionally` block, thread `pulsar-client-io` tries to acquire the semaphore synchronously; if the number of `permits` of the semaphore is zero, thread `pulsar-client-io` gets stuck. The timer task that releases the semaphore is also stuck, because it is waiting to acquire the `ProducerImpl` object, which is held by thread `pulsar-client-io`.
What did you expect to see?
The producer keeps producing messages without errors.
What did you see instead?
The producing process is stuck.
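The lock ordering described in the reproduce step can be imitated without Pulsar. The following is only a minimal sketch using plain `java.util.concurrent`: `ProducerLockDemo`, `producerLock`, and `pendingPermits` are hypothetical stand-ins for the `ProducerImpl` object and the `pendingMessages` semaphore, not the client's real code. To make the demo terminate, the blocking acquire is swapped for `tryAcquire` with a timeout; with a plain `acquire()` both threads would wait forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ProducerLockDemo {
    // Hypothetical stand-in for the ProducerImpl object that both threads lock.
    private final Object producerLock = new Object();
    // Hypothetical stand-in for the pendingMessages semaphore; zero permits = queue full.
    private final Semaphore pendingPermits = new Semaphore(0);

    /** Returns true if the "io" thread obtained a permit within waitMillis. */
    public boolean resendWhileHoldingLock(long waitMillis) throws InterruptedException {
        CountDownLatch ioHoldsLock = new CountDownLatch(1);
        boolean[] acquired = new boolean[1];

        // Plays the role of pulsar-client-io: holds the lock, then waits for a permit.
        Thread io = new Thread(() -> {
            synchronized (producerLock) {
                ioHoldsLock.countDown();
                try {
                    // A blocking acquire() here would never return.
                    acquired[0] = pendingPermits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "io-thread");

        // Plays the role of the timer task: needs the same lock before it can release a permit.
        Thread timer = new Thread(() -> {
            try {
                ioHoldsLock.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (producerLock) {
                pendingPermits.release();
            }
        }, "timer-thread");

        io.start();
        timer.start();
        io.join();
        timer.join();
        return acquired[0];
    }

    public static void main(String[] args) throws InterruptedException {
        boolean acquired = new ProducerLockDemo().resendWhileHoldingLock(200);
        // Prints "permit acquired while holding lock: false"
        System.out.println("permit acquired while holding lock: " + acquired);
    }
}
```

The permit only becomes available after the first thread gives up and leaves the `synchronized` block, which is the same circular wait the stack dumps show between `pulsar-client-io` and the `triggerBatchMessageAndSend` timer task.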
Anything else?
No response
Are you willing to submit a PR?