
Grouped workflows appear to be stuck and maxed out #61

Open
hardcodet opened this issue Oct 20, 2023 · 7 comments

Comments

@hardcodet

I upgraded Bull from 6.3.4 to 6.6.1 recently and noticed that our scheduling went completely haywire :)
Downgrading today didn't fix anything, so while it would be a big coincidence, the upgrade may be a red herring.

We are using Bull to schedule "workflows", in most cases for immediate execution, but sometimes at a specific time. We often use groups with a concurrency of 1 to ensure sequential execution in certain contexts and avoid race conditions.
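For reference, the setup looks roughly like this (a minimal sketch, not our actual code; the connection is a placeholder, the queue name and group id are taken from the sample output further down, and the group options reflect my reading of the BullMQ Pro API):

import { QueuePro, WorkerPro } from '@taskforcesh/bullmq-pro';

const connection = { host: 'localhost', port: 6379 }; // placeholder connection

const queue = new QueuePro('LIAM API.INT', { connection });

// All workflows of one context share a group id, so they are serialized.
await queue.add(
    'LIAM API.INT',
    { workflowId: '65310c5893ebef7622ae1ac8', fingerprint: '1697713242404' },
    {
        group: { id: '642ac25f4b2ea56c60f5f157' },
        attempts: 20,
        backoff: { type: 'fixed', delay: 10000 },
        removeOnComplete: true,
    },
);

// A group concurrency of 1 means only one job per group runs at a time.
const worker = new WorkerPro(
    'LIAM API.INT',
    async (job) => {
        // execute the workflow referenced by job.data.workflowId
    },
    { connection, group: { concurrency: 1 } },
);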

We also have a routine that detects workflows that were scheduled but never executed. If it finds any, it reschedules them for immediate execution (in batches of at most 500 workflows at a time). This typically doesn't happen, but since the package upgrade (which was deployed a bit after 20:00), it has skyrocketed:

[image: metrics chart showing the number of rescheduled workflows spiking after the ~20:00 deployment]
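(For context, a simplified sketch of that rescue routine; findOverdueWorkflows and scheduleWorkflow are hypothetical placeholders for internal helpers, not the actual implementation:)

// Find workflows whose scheduled time has passed but which never ran,
// and re-enqueue them for immediate execution, at most 500 per pass.
async rescheduleOverdueWorkflows() {
    const batchSize = 500;
    // findOverdueWorkflows: placeholder for our internal lookup of overdue workflows
    const overdue = await this.findOverdueWorkflows(batchSize);
    for (const workflow of overdue) {
        // scheduleWorkflow: placeholder that re-adds the corresponding job with delay 0
        await this.scheduleWorkflow(workflow, { delay: 0 });
    }
    return overdue.length;
}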

I have a testing endpoint that returns all groups. Fetching groups often returns just an empty array (which is good, as it indicates there are simply no pending groups), but I am now also frequently getting groups like these:

[
    {
        "id": "6530246b25b8336ddc285058",
        "status": "maxed",
        "jobs": []
    },
    {
        "id": "642ac25f4b2ea56c60f5f157",
        "status": "maxed",
        "jobs": [
            {
                "name": "LIAM API.INT",
                "data": {
                    "workflowId": "65310c5893ebef7622ae1ac8",
                    "fingerprint": "1697713242404"
                },
                "opts": {
                    "attempts": 20,
                    "delay": 0,
                    "group": {
                        "id": "642ac25f4b2ea56c60f5f157"
                    },
                    "backoff": {
                        "delay": 10000,
                        "type": "fixed"
                    },
                    "jobId": "LIAM API.INT_64396127-32ce-4352-8013-056402e7e3f6",
                    "removeOnComplete": true
                },
                "id": "LIAM API.INT_64396127-32ce-4352-8013-056402e7e3f6",
                "progress": 0,
                "returnvalue": null,
                "stacktrace": [],
                "attemptsMade": 0,
                "delay": 0,
                "timestamp": 1697713242793,
                "queueQualifiedName": "bull:LIAM API.INT",
                "gid": "642ac25f4b2ea56c60f5f157"
            }
        ]
    },
    {
        "id": "65304473b145909c2307b358",
        "status": "maxed",
        "jobs": []
    }
]

As you can see, they are all maxed, often without any jobs at all. Now, we also had a lot of load during that time from an integration team, which might have played into those maxed-out groups, but they still look super fishy.
Any idea what I could do to help triangulate this?
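For reference, the testing endpoint is essentially a thin wrapper around the group getters, along the lines of the sketch below (getGroups and getGroupJobs reflect my understanding of the BullMQ Pro groups API; exact method names and signatures may differ):

import { QueuePro } from '@taskforcesh/bullmq-pro';

// Returns every known group together with its status and currently queued jobs.
// Assumes getGroups() / getGroupJobs() as exposed by the Pro groups API.
async function listGroups(queue: QueuePro) {
    const groups = await queue.getGroups();
    return Promise.all(
        groups.map(async (group) => ({
            id: group.id,
            status: group.status,
            jobs: await queue.getGroupJobs(group.id),
        })),
    );
}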

@roggervalf
Collaborator

roggervalf commented Oct 20, 2023

Hey @hardcodet, I took a look at our changelog and at the changes between those versions, but so far I don't see anything unusual: https://docs.bullmq.io/bullmq-pro/changelog. One question: did the upgrade include any other changes? I'm asking in order to isolate this.

@hardcodet
Author

Hi Rogger
No, it was a pretty isolated change; I didn't update any other packages either. But as mentioned, I'm still seeing maxed-out groups today, even though I've long since gone back to the old version. It may really just be a coincidence: we also had massive load on that integration system due to an external team running heavy automation tests, so the issue might have been there before.

@hardcodet
Author

hardcodet commented Oct 20, 2023

It's also noteworthy that those maxed out groups disappear after a while. I usually have a few when querying for groups, but never too many. So they are not indefinitely stuck (even though I just manually deleted one that seems to have been around for a few hours already).

One part of it may be the automated cleanup routines I run on groups: a maintenance routine of mine detects grouped workflows that were due for execution but didn't run. I assume that routine would also clear out those empty groups without any jobs, though.

This is the routine that I wrote a while back with your help, which sometimes kicks in due to stuck groups (and kicked in a lot in the last few days):

import { Redis } from 'ioredis';

/**
 * Resets the status of a given Bull group, which should force it to
 * resume any (stuck) jobs.
 */
async resetGroup(groupId: string, throwException: boolean) {
    const queue = this.getQueue(this.config.qualifiedAppName);

    const redis = new Redis(this.config.redisUri, {
        maxRetriesPerRequest: 20,
        enableReadyCheck: false,
        retryStrategy: (times: number) => Math.min(Math.exp(times), 15000),
    });

    const groupPrefix = `bull:${queue.name}:groups`;
    // reset the group's active job count to zero
    await redis.hdel(`${groupPrefix}:active:count`, groupId);
    this.logger.info(`Sent redis HDEL in order to reset the count of stuck group ${groupId}.`);
    // change the status of the group: first remove it from the "max" set ...
    await redis.zrem(`${groupPrefix}:max`, groupId);
    this.logger.info(`Sent redis ZREM in order to reset the maxed status of stuck group ${groupId}.`);
    // ... then reinsert it into the groups set (we use a high score so that this group starts next):
    await redis.zadd(`${groupPrefix}`, 9999999, groupId);
    this.logger.info(`Sent redis ZADD in order to force BullMQ to resume processing of stuck group ${groupId}. The group's jobs should start executing again now.`);

    // close the ad-hoc connection so it doesn't leak
    await redis.quit();

    return true;
}
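For what it's worth, a read-only companion check can help confirm what state a group is actually in before resetting it (a minimal sketch; inspectGroup is a hypothetical helper, but it only reads the same keys the reset routine above writes):

/**
 * Read-only diagnostic: inspects the same keys that resetGroup touches,
 * without modifying anything.
 */
async inspectGroup(groupId: string) {
    const queue = this.getQueue(this.config.qualifiedAppName);
    const redis = new Redis(this.config.redisUri);

    const groupPrefix = `bull:${queue.name}:groups`;
    // number of jobs the group is currently counted as having active
    const activeCount = await redis.hget(`${groupPrefix}:active:count`, groupId);
    // non-null score means the group is currently marked as maxed
    const maxedScore = await redis.zscore(`${groupPrefix}:max`, groupId);
    // non-null score means the group is still queued for processing
    const groupScore = await redis.zscore(`${groupPrefix}`, groupId);

    await redis.quit();
    return { groupId, activeCount, maxedScore, groupScore };
}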

@hardcodet
Author

hardcodet commented Oct 20, 2023

FYI, after simply deleting most pending workflows, and thus killing the rescheduling cycle, things have normalized again. There was one last empty maxed group that didn't want to go away, so I deleted it manually, and that was the end of it. I'm also back on 6.6.1 and haven't seen any more issues (it's safe to say that I will wait quite a while before rolling that out to PROD though; we'll have to monitor the situation over the next few weeks).
So it might actually have been the sheer load of jobs that overloaded the system. Still, it would be good to know why it happened (and also what was going on with those empty maxed-out groups; I can't say whether they are to be expected, or a helpful indicator of something).

@roggervalf
Collaborator

Hi @hardcodet, we refactored how we track maxed groups in v7.9.2, in case you want to give it a try.

@TNonet

TNonet commented Nov 14, 2024

Is there an update on this situation? Does upgrading to v7.9.2+ fix the issue or just decrease the likelihood of this situation?

@manast
Contributor

manast commented Nov 14, 2024

> Is there an update on this situation? Does upgrading to v7.9.2+ fix the issue or just decrease the likelihood of this situation?

AFAIK there is no issue currently, but if this has happened to you using one of the latest versions we would like to know.
