Infinite retry loop in RetryableJob because the canRetry/attempt not obeyed when Job/Worker segfaults #354

Open
ldkafka opened this issue Sep 23, 2019 · 6 comments
Labels
status:to be verified Needs to be reproduced and validated.

ldkafka commented Sep 23, 2019

What steps will reproduce the problem?

I am working on getting this info. It happens on a live system with a few thousand jobs per day where a few hundred segfault and get re-queued indefinitely.

The job implements \yii\queue\RetryableJobInterface and has:
    public function canRetry($attempt, $error)
    {
        return ($attempt < 3) && ($error instanceof TemporaryException);
    }

What's expected?

I'm not sure whether the segfault itself is a queue issue, but the attempts mechanism should at least keep working so that we do not end up in an infinite loop. With the canRetry() above, a job should never be retried more than twice, yet the attempt counter in the logs climbs past 400, at which point I have to flush the queue to stop it.

What do you get instead?

Infinite re-queuing. The segfault must happen at a very awkward point, somewhere between the attempt counter being incremented and the canRetry() call.

Additional info

Using Redis queue.

| Q                | A
| ---------------- | ---
| Yii version      | 2.0.27
| PHP version      | v7.0.33-0+deb9u5
| Operating system | Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3 (2019-09-02) x86_64 GNU/Linux

Author

ldkafka commented Sep 23, 2019

A lot of jobs are left in the reserved state, which is also where the attempt counter is incremented (via hincrby in the Redis driver). I believe these are all jobs that segfaulted and then get re-run.
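
To make the suspected failure mode concrete, here is a minimal in-memory sketch. It is not the yii2-queue source: `ToyQueue`, `runOne()`, `requeueExpired()`, `simulate()` and the 400-round cap are hypothetical names and values chosen for illustration. It assumes the counter is bumped when a job is reserved, that the retry decision is only made after a catchable error, and that a job left in the reserved state is eventually put back on the waiting list.

```php
<?php
// Simplified model for illustration only; all names here are hypothetical.

final class TemporaryException extends RuntimeException {}

final class ToyQueue
{
    public $attempts = []; // attempt counter per job id (stands in for the Redis hash)
    public $reserved = []; // ids currently marked as reserved
    public $waiting = [];  // ids waiting to be picked up

    public function push($id)
    {
        $this->waiting[] = $id;
    }

    // Reserve the next job: mark it reserved and bump its counter BEFORE it runs.
    public function reserve()
    {
        $id = array_shift($this->waiting);
        if ($id === null) {
            return null;
        }
        $this->reserved[$id] = true;
        $this->attempts[$id] = ($this->attempts[$id] ?? 0) + 1;
        return [$id, $this->attempts[$id]];
    }

    // Jobs whose worker died stay reserved; here they are simply re-queued.
    public function requeueExpired()
    {
        foreach (array_keys($this->reserved) as $id) {
            unset($this->reserved[$id]);
            $this->waiting[] = $id;
        }
    }

    // $outcome is 'exception' (catchable error) or 'crash' (simulated segfault).
    public function runOne($outcome, callable $canRetry)
    {
        $job = $this->reserve();
        if ($job === null) {
            return;
        }
        list($id, $attempt) = $job;
        if ($outcome === 'crash') {
            return; // process is gone: no cleanup, no canRetry(), job stays reserved
        }
        unset($this->reserved[$id]);
        if ($outcome === 'exception' && $canRetry($attempt, new TemporaryException())) {
            $this->waiting[] = $id; // normal, bounded retry path
        }
    }
}

// Same rule as the canRetry() in the report.
$canRetry = function ($attempt, $error) {
    return ($attempt < 3) && ($error instanceof TemporaryException);
};

// Run one job until the queue is empty or $maxRounds is hit; return its attempt counter.
function simulate($outcome, callable $canRetry, $maxRounds)
{
    $queue = new ToyQueue();
    $queue->push(1);
    for ($round = 0; $round < $maxRounds && $queue->waiting; $round++) {
        $queue->runOne($outcome, $canRetry);
        $queue->requeueExpired();
    }
    return $queue->attempts[1];
}

$withException = simulate('exception', $canRetry, 400); // stops after the third attempt
$withCrash = simulate('crash', $canRetry, 400);         // only stops at the round cap
```

In this model the exception path ends after three attempts because `canRetry()` is consulted each time, while the crash path never reaches that check and keeps incrementing the counter until the artificial cap stops it.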

Author

ldkafka commented Sep 25, 2019

It seems that the segfault occurs after the job finishes, during garbage collection in the Zend memory manager. This is similar to documented bugs such as https://bugs.php.net/bug.php?id=71662

Switching off the Zend_MM with USE_ZEND_ALLOC=0 stops the segfaults.

The remaining question is whether the queue manager can cope with a segfault in the job and still handle re-queuing and attempt counting as expected.

Member

samdark commented Sep 30, 2019

No, it can't. A segfault can't be caught.

samdark closed this as completed Sep 30, 2019

Author

ldkafka commented Sep 30, 2019

I don't think the segfault needs to be caught. My thoughts are more along the lines of adjusting the attempt increment/retry logic, so that there is a safeguard before the job runs rather than only after it.
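
One possible reading of this suggestion, sketched under assumptions: check the attempt counter against a hard limit before the job body starts, so that a job which keeps killing its worker is eventually dropped even though `canRetry()` is never reached. The names `shouldExecute()` and `simulateCrashingJob()`, and the limit of 3 (chosen to mirror `$attempt < 3`, which allows three runs in total), are hypothetical and not part of yii2-queue.

```php
<?php
// Hypothetical sketch of a pre-execution guard; none of these names come from yii2-queue.

/**
 * Decide, before the job body runs, whether this reservation is still allowed.
 * $attempt is the counter value after the reserve-time increment (1 for the first run).
 */
function shouldExecute($attempt, $maxAttempts)
{
    return $attempt <= $maxAttempts;
}

/**
 * Simulate a job that crashes the worker on every run.
 * Returns [number of times the job body started, final attempt counter].
 */
function simulateCrashingJob($maxAttempts, $maxRounds)
{
    $attempt = 0;
    $executions = 0;
    for ($round = 0; $round < $maxRounds; $round++) {
        $attempt++; // incremented when the job is reserved
        if (!shouldExecute($attempt, $maxAttempts)) {
            break; // guard fires: drop the job instead of running it again
        }
        $executions++; // job body starts, then the worker "segfaults"
        // canRetry() is never reached here; the job is simply re-queued after the crash
    }
    return [$executions, $attempt];
}

list($executions, $attempt) = simulateCrashingJob(3, 400);
```

Because the check runs before the job, it cannot inspect the error the way `canRetry($attempt, $error)` does, so it would have to be a separate per-job or per-queue limit rather than a replacement for `canRetry()`.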

samdark reopened this Sep 30, 2019

Member

samdark commented Sep 30, 2019

Do you have an idea about implementation?

Author

ldkafka commented Oct 2, 2019

I'll have a look

bizley added the "status:to be verified" label (Needs to be reproduced and validated.) Jun 17, 2021