
"Has not waited the lock" on queue with supervisor numprocs more than one #234

m1roff opened this issue May 8, 2018 · 41 comments

@m1roff

m1roff commented May 8, 2018

How can I solve this? Or maybe I'm doing something wrong?

If I set numprocs=1 in the supervisor config, there are no errors!

What steps will reproduce the problem?

config from common/config/main.php

'queue' => [
            'class' => \yii\queue\db\Queue::class,
            'as log' => \yii\queue\LogBehavior::class,
            'db' => 'db', // DB connection component or its config
            'tableName' => '{{%queue}}', // Table name
            'channel' => 'default', // Queue channel key
            'mutex' => \yii\mutex\PgsqlMutex::class, // Mutex used to sync queries
            'mutexTimeout' => 0,
            'ttr' => 5 * 60, // Max time for job handling
            'attempts' => 5, // Max number of attempts
        ],

supervisor config

[program:m-prod-yii-queue-worker]
command=/usr/bin/php /www/m/http/yii queue/listen --verbose=1 --color=0
autostart=true
autorestart=true
numprocs=2
process_name = %(program_name)s_%(process_num)02d
redirect_stderr=true
stdout_logfile=/www/m/log/yii-queue-worker.log

Error trace

yii\base\Exception: Has not waited the lock. in /www/m/http/vendor/yiisoft/yii2-queue/src/drivers/db/Queue.php:179
Stack trace:
#0 [internal function]: yii\queue\db\Queue->yii\queue\db\{closure}(Object(yii\db\Connection))
#1 /www/m/http/vendor/yiisoft/yii2/db/Connection.php(1059): call_user_func(Object(Closure), Object(yii\db\Connection))
#2 /www/m/http/vendor/yiisoft/yii2-queue/src/drivers/db/Queue.php(211): yii\db\Connection->useMaster(Object(Closure))
#3 /www/m/http/vendor/yiisoft/yii2-queue/src/drivers/db/Queue.php(78): yii\queue\db\Queue->reserve()
#4 [internal function]: yii\queue\db\Queue->yii\queue\db\{closure}(Object(Closure))
#5 /www/m/http/vendor/yiisoft/yii2-queue/src/cli/Queue.php(117): call_user_func(Object(Closure), Object(Closure))
#6 /www/m/http/vendor/yiisoft/yii2-queue/src/drivers/db/Queue.php(93): yii\queue\cli\Queue->runWorker(Object(Closure))
#7 /www/m/http/vendor/yiisoft/yii2-queue/src/drivers/db/Command.php(76): yii\queue\db\Queue->run(true, 3)
#8 [internal function]: yii\queue\db\Command->actionListen(3)
#9 /www/m/http/vendor/yiisoft/yii2/base/InlineAction.php(57): call_user_func_array(Array, Array)
#10 /www/m/http/vendor/yiisoft/yii2/base/Controller.php(157): yii\base\InlineAction->runWithParams(Array)
#11 /www/m/http/vendor/yiisoft/yii2/console/Controller.php(148): yii\base\Controller->runAction('listen', Array)
#12 /www/m/http/vendor/yiisoft/yii2/base/Module.php(528): yii\console\Controller->runAction('listen', Array)
#13 /www/m/http/vendor/yiisoft/yii2/console/Application.php(180): yii\base\Module->runAction('queue/listen', Array)
#14 /www/m/http/vendor/yiisoft/yii2/console/Application.php(147): yii\console\Application->runAction('queue/listen', Array)
#15 /www/m/http/vendor/yiisoft/yii2/base/Application.php(386): yii\console\Application->handleRequest(Object(yii\console\Request))
#16 /www/m/http/yii(27): yii\base\Application->run()
#17 {main}

Additional info

Q | A
Yii version | 2.0.16-dev
PHP version | 7.1.10
Operating system | Ubuntu 16.04.1
@kaspirovski

Which PostgreSQL version?

@m1roff
Author

m1roff commented Jul 2, 2018

psql (PostgreSQL) 10.3 (Ubuntu 10.3-1.pgdg16.04+1)

@kaspirovski

kaspirovski commented Jul 4, 2018

I have the same problem on psql (PostgreSQL) 10.4 (Ubuntu 10.4-0ubuntu0.18.04), so it seems to be a problem with the PostgreSQL version. It works great on PostgreSQL 9.4.

@akorz

akorz commented Jul 6, 2018

the same story with MySQL

@rob006

rob006 commented Jul 6, 2018

How large is your queue?

@akorz

akorz commented Jul 6, 2018

for me, it's 5 workers

@rob006

rob006 commented Jul 6, 2018

I mean: how many jobs do you have in the queue table?

@akorz

akorz commented Jul 6, 2018

Right now I have very little load, maybe 1 job per minute.

@rob006

rob006 commented Jul 6, 2018

I have ~300k waiting jobs in the queue and the DB driver (MariaDB) becomes unusable - it takes 2-3 seconds to reserve a job (which executes in less than 0.1 second, so the queue spends most of its time reserving jobs).

@SamMousa
Contributor

You are getting lock contention, meaning that all workers are trying to obtain the lock at the same time.
For those kinds of loads it makes more sense to use a real queue like beanstalkd.
(Note that it is really easy, zero-configuration, to set up, and it will make your life easier and your queue faster.)
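
For reference, a minimal sketch of a beanstalkd-backed queue component, following the yii2-queue Beanstalk driver guide (host, port and tube are example values, not specific to this project):

'queue' => [
    'class' => \yii\queue\beanstalk\Queue::class,
    'as log' => \yii\queue\LogBehavior::class,
    'host' => 'localhost', // beanstalkd host (example)
    'port' => 11300,       // default beanstalkd port
    'tube' => 'queue',     // tube (channel) name (example)
],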

@samokspv

samokspv commented Jan 4, 2019

the same story with MySQL

+1

@geopamplona

I have the same problem on a PostgreSQL database: "PostgreSQL 10.4 (Debian 10.4-2.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit".

It seems to be a matter of database configuration.

Is there any idea how to solve this? What effects can it have?
Does this error leave tasks unprocessed?

@JorgePalaciosZaratiegui

+1
Why does the error 'Has not waited the lock' occur? What is the cause?

@rob006

rob006 commented Jan 20, 2019

The OP's problem is related to mutex settings, see this answer on SO.

In general, if you get such errors, you should switch the mutex backend to a more reliable implementation (MysqlMutex works fine for me), and/or increase mutexTimeout (setting it to 0 will throw this exception on every concurrency issue, which is very impractical for a real queue).
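
For illustration, starting from the config in the original post, a sketch along those lines (the 30-second timeout is an arbitrary example value, not a recommendation):

'queue' => [
    'class' => \yii\queue\db\Queue::class,
    'db' => 'db',
    'tableName' => '{{%queue}}',
    'channel' => 'default',
    'mutex' => \yii\mutex\PgsqlMutex::class, // or \yii\mutex\MysqlMutex::class on MySQL/MariaDB
    'mutexTimeout' => 30, // wait up to 30 seconds for the lock instead of failing immediately
    'ttr' => 5 * 60,
    'attempts' => 5,
],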

@samdark samdark added the type:docs label Jan 20, 2019
@samdark
Member

samdark commented Jan 20, 2019

It doesn't seem there's anything to fix in the code, but it's definitely worth documenting.

@samdark samdark added the status:ready for adoption label Jan 20, 2019
@darrylkuhn

We too have been bitten by the "Has not waited the lock" exception. Our queue is currently configured as follows:

'queue' => [
    'class' => \yii\queue\db\Queue::class,
    'db' => 'queue-db',
    'tableName' => '{{%queue}}',
    'channel' => 'default',
    'attempts' => 3, // Max number of attempts
    'ttr' => 60, // Maximum duration a job should run before it is considered abandoned
    'mutex' => \yii\mutex\MysqlMutex::class,
    'as jobMonitor' => \zhuravljov\yii\queue\monitor\JobMonitor::class,
    'as workerMonitor' => \zhuravljov\yii\queue\monitor\WorkerMonitor::class,
]

We are processing ~15,000-20,000 jobs an hour with 8 concurrent workers. We see that when DB load gets high, the lock takes longer and times out. Our current thinking is to move the mutex to something other than MySQL (Redis in our case) so that high DB load does not impact the workers' ability to take/release locks. In my evaluation of the code it seems fine to have a queue back end different from the mutex provider (e.g. keep the MySQL back end, but move the mutex to Redis). Just wanted to ping the community to get your thoughts on this approach - any red flags?
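
A sketch of what that split might look like, assuming the yiisoft/yii2-redis extension is installed and a 'redis' connection component exists (component names and the timeout are examples):

'queue' => [
    'class' => \yii\queue\db\Queue::class,
    'db' => 'queue-db',
    'tableName' => '{{%queue}}',
    'channel' => 'default',
    // lock coordination moved off the queue database
    'mutex' => [
        'class' => \yii\redis\Mutex::class,
        'redis' => 'redis', // Redis connection component (assumed to be configured)
    ],
    'mutexTimeout' => 30, // example value; 0 makes every contention throw the exception
],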

@rob006

rob006 commented Oct 23, 2019

@darrylkuhn It will probably not change anything, because it is highly unlikely that the mutex is the bottleneck here. The problem is in the process which holds the lock. The DB driver has known performance issues - in big queues reserving a job may take a while. If you want to change something, I would rather replace the DB queue driver with something else.

@SamMousa
Contributor

Use a better driver like beanstalk for job queues. The database is a bad place for a job queue.

@darrylkuhn

@rob006 and @SamMousa understood we'll probably end up taking that approach. Thanks for the feedback.

@mathematicalman

Fix it by changing two methods - https://github.com/yiisoft/yii2-queue/pull/362/files

@samdark samdark added this to the 2.3.1 milestone Nov 30, 2019
@samdark samdark removed the status:ready for adoption Feel free to implement this issue. label Nov 30, 2019
@rowansimms

Fix it by changing two methods - https://github.com/yiisoft/yii2-queue/pull/362/files

Absolutely confirmed fix for my MariaDB instance.

@darrylkuhn

We had this problem 8 months ago and moved to Amazon SQS; however, the lack of visibility and the inability to delay more than 15 minutes have us looking for another back end again. Can anybody speak to whether or not the lock issue here is present with the Redis driver?

@SamMousa
Contributor

SamMousa commented Jul 2, 2020

Have you tried beanstalkd?

@darrylkuhn

@SamMousa No, we haven't. We already have Redis infrastructure and were hoping to leverage that; however, we're processing ~20,000 jobs an hour. Given that workload, am I correct in assuming that Redis is also not an appropriate back end? Thanks.

@SamMousa
Contributor

SamMousa commented Jul 2, 2020

I don't think that's much, to be honest... but Redis is not a job queue. If queue jobs are ephemeral, beanstalkd is very simple to set up though.

@rob006

rob006 commented Jul 2, 2020

@darrylkuhn Redis does not have this performance problem (I'm using it to handle hundreds of thousands of jobs per hour), as long as you configure your mutex correctly (mutexTimeout is not 0). But it has some other limitations, like lack of priorities, and it is not really atomic, so in an unstable environment you may end up with an inconsistent queue.
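
For reference, a minimal sketch of the Redis driver configuration along the lines of the yii2-queue guide (component and channel names are examples; whether mutex-related options are available depends on the driver version in use):

'queue' => [
    'class' => \yii\queue\redis\Queue::class,
    'redis' => 'redis',   // Redis connection component or its config
    'channel' => 'queue', // queue channel key (example)
    'as log' => \yii\queue\LogBehavior::class,
],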

@darrylkuhn

@SamMousa and @rob006 thanks both for the input - think I'll probably suck it up and spin up beanstalkd 👍

@ixapek

ixapek commented Oct 30, 2020

In my case, the problem was the absence of an index on the done_at field.

UPDATE queue SET reserved_at=null WHERE reserved_at < :time - ttr and done_at is null;
This query is initiated by the moveExpired() method and runs slowly if deleteReleased=false; the mutex expires, the worker crashes with the exception, is restarted by supervisor, and crashes again.

After creating a done_at index everything works fine.
I think the documentation should be updated to mention this index: https://github.com/yiisoft/yii2-queue/blob/master/docs/guide/driver-db.md
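
For anyone adding that index through a Yii migration, a minimal sketch (the class and index names are examples; adjust them to your schema):

<?php

use yii\db\Migration;

/**
 * Adds an index on queue.done_at to speed up the UPDATE issued by moveExpired().
 */
class m000000_000000_add_queue_done_at_index extends Migration
{
    public function safeUp()
    {
        $this->createIndex('idx-queue-done_at', '{{%queue}}', 'done_at');
    }

    public function safeDown()
    {
        $this->dropIndex('idx-queue-done_at', '{{%queue}}');
    }
}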

@samdark samdark modified the milestones: 2.3.1, 2.3.2 Dec 23, 2020
@samdark samdark removed this from the 2.3.2 milestone Oct 23, 2021
@freddokresna

freddokresna commented Apr 7, 2022

In my case, the problem was the absence of an index on the done_at field.

UPDATE queue SET reserved_at=null WHERE reserved_at < :time - ttr and done_at is null; This query is initiated by the moveExpired() method and runs slowly if deleteReleased=false; the mutex expires, the worker crashes with the exception, is restarted by supervisor, and crashes again.

After creating a done_at index everything works fine. I think the documentation should be updated to mention this index: https://github.com/yiisoft/yii2-queue/blob/master/docs/guide/driver-db.md

I think this query needs channel as an additional filter, since this UPDATE takes 8 seconds in my case with a 500k-row queue.
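
As an illustration of that suggestion (this is not the query the driver actually runs; column names follow the default queue migration, and the index name is an example):

-- hypothetical variant of the clean-up query, restricted to one channel
UPDATE queue
SET reserved_at = NULL
WHERE channel = 'default'
  AND reserved_at < :time - ttr
  AND done_at IS NULL;

-- composite index such a query could use
CREATE INDEX idx_queue_channel_done_at ON queue (channel, done_at);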

@fl0v

fl0v commented Apr 9, 2022

I think if you have 500k jobs then the DB driver is not the best choice (especially if you use several channels); use a specialized messaging service instead.
The DB driver is meant for a small number of concurrent jobs, where an index is irrelevant on select and inserting/updating jobs should be as fast as possible.

@freddokresna

freddokresna commented Apr 9, 2022

I think if you have 500k jobs then the DB driver is not the best choice (especially if you use several channels); use a specialized messaging service instead. The DB driver is meant for a small number of concurrent jobs, where an index is irrelevant on select and inserting/updating jobs should be as fast as possible.

I had been dealing with this performance problem for a week, and today I reached several conclusions worth noting; maybe they will help someone else using this queue:

  1. I'm running on a VM (Proxmox 6.4-4) with PHP 8.1 and a standard CPU allocation; once I increased the CPU allocation, the performance problem disappeared (with the standard allocation I got 1 job per 3 seconds; after increasing the CPU, 1 job per second).
  2. This DB queue performs better when we use a separate table per channel instead of one table with several channels, in the case of huge job counts (see the sketch after this comment).
  3. Always remove the vendor folder and get the updated version (I don't know exactly why reinstalling had such a significant effect; maybe the updated version had bug fixes for the queue).
  4. 3 concurrent jobs per queue table.

Hope this helps others who have the same problem.
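
A sketch of what "one table per channel" could look like as two separate queue components (component names, table names and channels are example values; each component also needs its own worker process):

'components' => [
    'queueEmails' => [
        'class' => \yii\queue\db\Queue::class,
        'db' => 'db',
        'tableName' => '{{%queue_emails}}', // dedicated table for this channel
        'channel' => 'emails',
    ],
    'queueReports' => [
        'class' => \yii\queue\db\Queue::class,
        'db' => 'db',
        'tableName' => '{{%queue_reports}}', // dedicated table for this channel
        'channel' => 'reports',
    ],
],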

@fl0v

fl0v commented Apr 9, 2022

  1. is a very good idea
  2. Not sure it had anything to do with it; a composer update on dev should be enough, and after all new updates are tested, a composer install on production should be enough. As far as I know, Composer itself has no issues installing new versions of your packages. (Also, after composer install on production you should always run migrations as part of the deploy process, just in case new migrations come from required packages.)

@BenasPaulikas

I'm getting this error at the same time as I'm running mysqldump. So I'll just increase mutexTimeout for now.

@BenasPaulikas

Running backups with nice -n 19 mysqldump --single-transaction eliminated my issue. Maybe this will be helpful for someone

@i-internet

It has been more than a year now waiting for this to be fixed.

@samdark
Member

samdark commented Sep 15, 2022

@i-internet which way would you fix it?

@gb5256

gb5256 commented May 5, 2023

To add to the original post: I use numprocs=1 and now get these errors on my dev server as well.
The funny thing is that the dev server has no queue jobs at all most of the time. But every time the "listen" starts, it throws the "has not waited the lock" error. Again, the queue table is empty. Not sure what to do from here. Somebody in another issue posted that downgrading to 2.3.3 worked for them; I will try this now as well.

@gb5256

gb5256 commented May 5, 2023

I cannot (!) confirm that downgrading to 2.3.3 fixes it. Again, my dev server has no jobs at all, but every 15 to 30 minutes it throws the lock error.

@samdark
Member

samdark commented May 6, 2023

We have an idea for the fix. Discussing it internally.

@samdark samdark added the type:bug label and removed the type:docs label May 6, 2023
@optmsp

optmsp commented Nov 13, 2023

As a quick note to others on our solution:

Our queue is relatively small (500-1000 jobs at any given time), and we still had this issue. If you can't immediately switch to another driver, increasing your job runners + mutex timeout can help greatly.

Some of our jobs can take up to 15-20 minutes. Mostly, that's because they are working with remote APIs and doing a lot of transformations, and so are slow. The benefit of this, however, is that the jobs don't actually consume much local CPU or disk IO. This allowed us to run a lot of job runners at a time, but that in turn caused us to hit the DB driver's 'has not waited the lock' issue.

We now run 8-16 job runners, depending on the queue channel in question, and we increased our mutex timeout to 60 seconds. Before, we would only have 2-3 active jobs, but with this setting we generally have 80% of our job runners actively processing a job. The default mutex timeout of 3 seconds on the DB driver does not work well if you have more than a few jobs, because of the lock contention in the DB driver. You must increase it, sometimes dramatically.
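
For illustration, that setup corresponds roughly to something like the following supervisor program (name, path and worker count are examples, not the poster's actual files), combined with 'mutexTimeout' => 60 on the queue component:

[program:app-queue-default]
command=/usr/bin/php /path/to/yii queue/listen --verbose=1
process_name=%(program_name)s_%(process_num)02d
numprocs=8
autostart=true
autorestart=true
redirect_stderr=true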
