Infinite retry loop in RetryableJob because the canRetry/attempt not obeyed when Job/Worker segfaults #354

ldkafka · 2019-09-23T04:16:00Z

What steps will reproduce the problem?

I am working on getting this info. It happens on a live system with a few thousand jobs per day where a few hundred segfault and get re-queued indefinitely.

The job implements \yii\queue\RetryableJobInterface and has:
    public function canRetry($attempt, $error)
    {
        return ($attempt < 3) && ($error instanceof TemporaryException);
    }

What's expected?

I'm not sure whether the segfault is a Queue issue, but at least the "Attempts" mechanism should work so we do not end up in an infinite loop. A job should never be retried more than twice, but I see the attempt counter (in the logs) climb to 400+, at which point I have to flush the queue to stop it.

What do you get instead?

Infinite re-queuing. The segfault must happen at a very awkward point, between the attempt counter being incremented and the canRetry call.

Additional info

Using Redis queue.

Q	A
Yii version	2.0.27
PHP version	v7.0.33-0+deb9u5
Operating system	Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3 (2019-09-02) x86_64 GNU/Linux

ldkafka · 2019-09-23T08:45:14Z

A lot of jobs are left in the reserved state, which is also where the attempt counter is incremented via hincrby in the Redis driver. I believe these are all the jobs that segfaulted and then get re-run.
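To make the suspected mechanism concrete, here is a toy in-memory model of it. This is a sketch under assumptions, not the yii2-queue Redis driver: `ToyQueue`, `runWorker` and the `$crashes` flag are hypothetical names. The only behaviour taken from the thread is that the attempt counter is bumped when the job is reserved, and that canRetry is only consulted when a failure is actually reported.

```php
<?php
// Toy in-memory model of the failure mode; NOT the yii2-queue Redis driver.
// All names here (ToyQueue, runWorker, $crashes) are hypothetical.

final class ToyQueue
{
    /** @var array<int,int> attempt counter per job id */
    public $attempts = [];

    /** Reserving a job bumps its counter before the job runs. */
    public function reserve(int $id): int
    {
        $this->attempts[$id] = ($this->attempts[$id] ?? 0) + 1;
        return $this->attempts[$id];
    }
}

/**
 * Reserves one job repeatedly until it is dropped or $maxCycles is reached.
 * $crashes = true models a worker that dies before it can report the failure,
 * so canRetry is never consulted and the job stays reserved.
 */
function runWorker(ToyQueue $queue, int $id, callable $canRetry, bool $crashes, int $maxCycles): int
{
    for ($cycle = 0; $cycle < $maxCycles; $cycle++) {
        $attempt = $queue->reserve($id);
        if ($crashes) {
            continue; // process died: the job is picked up again later
        }
        // Normal failure path: the error is caught and canRetry decides.
        if (!$canRetry($attempt, new RuntimeException('temporary'))) {
            break; // job is removed from the queue
        }
    }
    return $queue->attempts[$id];
}

// Same attempt limit as in the report, minus the exception type check.
$canRetry = function (int $attempt, $error): bool {
    return $attempt < 3;
};

$caught  = runWorker(new ToyQueue(), 1, $canRetry, false, 400);
$crashed = runWorker(new ToyQueue(), 1, $canRetry, true, 400);
echo "caught failure: {$caught} attempts\n";
echo "crashing worker: {$crashed} attempts\n";
```

In this model a job that fails with a catchable error stops after its third attempt, while a job whose worker dies every time is limited only by how many times it gets reserved.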

ldkafka · 2019-09-25T05:39:15Z

It seems that the segfault occurs after the job finishes (at the garbage collection stage) in the Zend memory manager, similar to documented bugs such as https://bugs.php.net/bug.php?id=71662

Switching off the Zend_MM with USE_ZEND_ALLOC=0 stops the segfaults.

The remaining question is whether the queue manager can deal with a segfault in the job and still behave as expected in terms of queue/attempt management.

samdark · 2019-09-30T07:45:57Z

No, it can't. A segfault can't be caught.

ldkafka · 2019-09-30T07:56:53Z

I don't think the segfault needs to be caught. My thoughts are more along the lines of adjusting the attempt increment/retry logic, so that there is a safeguard before the job runs, not after.
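One way to picture such a safeguard is a hard cap checked at reservation time, before any job code (and therefore any crash) can run. The sketch below is purely illustrative: `mayExecute`, `simulateCrashingJob` and the cap of 3 are hypothetical and not part of yii2-queue, and whether the driver could apply such a check is not settled in this thread.

```php
<?php
// Hypothetical guard, not part of yii2-queue: a hard cap applied at
// reservation time, before the job body gets a chance to crash the worker.

/**
 * Decides whether a freshly reserved job may execute.
 * $attempt is the counter value after the reserve-time increment.
 */
function mayExecute(int $attempt, int $maxAttempts): bool
{
    return $attempt <= $maxAttempts;
}

/**
 * Simulates a job whose worker crashes on every run, so no failure handler
 * ever executes. Returns how many times the job body was actually started.
 */
function simulateCrashingJob(int $maxAttempts, int $reservations): int
{
    $attempt = 0;
    $started = 0;
    for ($i = 0; $i < $reservations; $i++) {
        $attempt++; // counter is bumped when the job is reserved
        if (!mayExecute($attempt, $maxAttempts)) {
            break; // drop the job instead of running it again
        }
        $started++; // job body starts, then the process dies
    }
    return $started;
}

$started = simulateCrashingJob(3, 400);
echo "job body started {$started} times\n";
```

Because the check depends only on the counter that is already incremented at reserve time, it does not rely on the worker surviving long enough to call canRetry. The trade-off is that the cap is independent of the error type, unlike the job's own canRetry.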

samdark · 2019-09-30T08:19:04Z

Do you have an idea about implementation?

ldkafka · 2019-10-02T04:34:44Z

I'll have a look

samdark closed this as completed Sep 30, 2019

samdark reopened this Sep 30, 2019

bizley added the "status:to be verified" label (Needs to be reproduced and validated.) Jun 17, 2021
