[Feature Request/Question] Allow failed jobs to be retried on different workers when using linear backoff with zero delay #2789

Open
dzmm opened this issue Sep 26, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

dzmm commented Sep 26, 2024

Currently, when using linear backoff with a delay of 0, failed jobs are retried on the same worker. However, in some scenarios, a worker might be on a malfunctioning machine, and we need the ability to retry the job on a different worker.

Current Behavior

With linear backoff and zero delay, failed jobs are always retried on the same worker that initially failed to process them.

Desired Behavior

Even with linear backoff and zero delay, failed jobs should have the option to be retried on different workers, allowing for better fault tolerance and recovery from worker-specific issues.

Is there any way I can do this on the current version of BullMQ?

Contributor

manast commented Sep 26, 2024

I do not think that, by design, it works the way you are describing.

Most likely, since the worker has just finished processing this job and the delay is zero, it is the one that gets to pick the job up again. If the worker were malfunctioning, it would not pick the same job again. But there is a chance that some other worker that is also idle picks it up.

In any case it would be impossible to guarantee that the same worker that failed the job would not pick it again, so there really is not a lot we can do here.

manast added the enhancement label Nov 16, 2024
Contributor

manast commented Nov 16, 2024

For this to work, a worker would have to keep some kind of list of the jobIds of jobs it recently failed, so that it ignores them and gives other workers a chance to pick them up. It is not completely trivial to implement, though, and this list of jobs would have to be passed to the moveToActive Lua script in every call, or be stored in some specific Redis key...
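
To make the idea above concrete, here is a minimal in-memory sketch of such a per-worker list. It is not BullMQ code: `RecentFailures`, `pickNextJob`, the TTL and the size cap are all hypothetical names and choices made for illustration. A real implementation would need the skip check to happen inside the moveToActive Lua script (or against a Redis key), as noted in the comment above.

```typescript
// Hypothetical sketch: none of these names are part of BullMQ's API.

/** Remembers the ids of jobs this worker failed recently, with expiry. */
class RecentFailures {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;
  private readonly maxSize: number;

  constructor(ttlMs: number, maxSize: number = 1000) {
    this.ttlMs = ttlMs;
    this.maxSize = maxSize;
  }

  /** Record that this worker failed `jobId` at time `now` (ms). */
  add(jobId: string, now: number = Date.now()): void {
    this.prune(now);
    // Re-insert so the id moves to the end of the Map's insertion order.
    this.expiresAt.delete(jobId);
    this.expiresAt.set(jobId, now + this.ttlMs);
    // Bound memory: evict the oldest entries first.
    while (this.expiresAt.size > this.maxSize) {
      const oldest = this.expiresAt.keys().next().value as string;
      this.expiresAt.delete(oldest);
    }
  }

  /** True if `jobId` failed on this worker and the entry has not expired. */
  has(jobId: string, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(jobId);
    if (expiry === undefined) return false;
    if (expiry <= now) {
      this.expiresAt.delete(jobId);
      return false;
    }
    return true;
  }

  /** Drop all expired entries. */
  private prune(now: number): void {
    this.expiresAt.forEach((expiry, id) => {
      if (expiry <= now) this.expiresAt.delete(id);
    });
  }
}

/**
 * Pick the first waiting job that this worker has not recently failed.
 * Returns undefined when every waiting job is on the skip list.
 */
function pickNextJob(
  waiting: readonly string[],
  failures: RecentFailures,
  now: number = Date.now(),
): string | undefined {
  return waiting.find((id) => !failures.has(id, now));
}
```

One consequence of this design: if the only waiting job is one the worker just failed, `pickNextJob` returns `undefined`, so the job sits in the queue until another worker takes it or the entry expires. The TTL therefore acts as an implicit minimum retry delay on a single-worker deployment, which is part of why this is not trivial to get right.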
