Bug Report: infinite loop for "schema engine altered" #17458
Comments
Thank you, @derekperkins! I'm assuming that simply creating the table does not cause the same CPU usage? |
Looking at the code, this is where we are ending up with that table.
You will want to look at the underlying query, execute it, and see if anything looks off with that.
|
@mattlord the table has existed for years, I just included the
vitess/go/vt/vttablet/tabletserver/schema/engine.go (lines 486 to 487, at b8b0383)
@deepthi I think this comment right above that line does match what happened. We ran a big DML that then caused an immediate crash (hopefully fixed in the future by #17239). I tried adding a comment to the |
@derekperkins the workflow would be trying to read the table from the cache, and in v21 the cache is a read-through cache for VReplication, since we want to ensure that we get the current table schema (#15912). So I suspect there's a bug here in the schema engine, and the workflow stream trying to regularly start is triggering it repeatedly. These look to be the relevant queries here:
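A rough sketch of the comparison those queries make, assuming (this is an illustration, not the exact queries behind that reference) that the schema engine compares the CREATE_TIME it cached in _vt.tables against the live value in information_schema.tables:

-- Illustrative only; not the engine's actual queries.
SELECT TABLE_NAME, UNIX_TIMESTAMP(CREATE_TIME) AS CREATE_TIME
FROM information_schema.tables
WHERE TABLE_SCHEMA = DATABASE();

SELECT TABLE_NAME, CREATE_TIME
FROM _vt.`tables`;

-- If the two values for a table disagree, the table would be treated as
-- changed and reloaded.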
Does the output look correct in your case? I wonder if the times are somehow not aligned? Notice that we use |
Our servers are all on UTC, so it seems unlikely to be timezone related. Looks like they are 67 seconds apart. That seems like a realistic time between the OOM and coming back online, so maybe it stored the time it saw the change rather than what information_schema said? |
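One generic way to rule out a timezone mismatch (my suggestion, not something run in the thread) is to check the server, session, and system time zones directly, since UNIX_TIMESTAMP() interprets a datetime argument in the session time zone:

-- Generic sanity check, not from the thread.
SELECT @@global.time_zone, @@session.time_zone, @@system_time_zone;

If these all resolve to UTC, a timezone offset can't explain a 67-second skew anyway, since timezone errors show up as offsets of whole or fractional hours, not tens of seconds.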
This table hasn't been altered in the last 4 days, but the timestamp is changing... In fact, now that I'm checking, both tables are changing every second.
|
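One way to confirm that the reported create_time itself is moving (a generic sketch, not something run in the thread) is to sample the value twice with a pause in between:

-- Generic check, not from the thread: sample, wait two seconds, sample again.
SELECT UNIX_TIMESTAMP(create_time) FROM information_schema.tables
WHERE table_schema = 'workspaces' AND table_name = 'workspaces_rankings__pulls';
DO SLEEP(2);
SELECT UNIX_TIMESTAMP(create_time) FROM information_schema.tables
WHERE table_schema = 'workspaces' AND table_name = 'workspaces_rankings__pulls';
-- If the value advances by roughly 2 between the reads, create_time is moving
-- on its own rather than being stale in a cache.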
The
It seems likely that I'm either hitting a native MySQL bug or somehow dealing with a corrupt table. I'm leaning towards this not being Vitess related, and if that turns out to be the case, I'm happy to go ahead and close this issue. |
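If corruption were the culprit, one generic check (my suggestion, not something run in the thread) is MySQL's CHECK TABLE, which scans the table and reports any errors it finds:

-- Generic integrity check, not from the thread; can take a while on large tables.
CHECK TABLE workspaces.workspaces_rankings__pulls;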
I tried this again, and it still works. When the stream isn't running, the |
I bootstrapped a brand new replica from backup, and it didn't change anything about the situation. It's my understanding that xtrabackup checksums everything, so I believe that eliminates table corruption. |
This started on Percona Server v8.0.39, and I attempted to upgrade to v8.0.40 to see if that might fix it, but that didn't change anything.
I updated the cache as discussed on the call today, and it didn't make a difference.

UPDATE _vt.tables
SET CREATE_TIME = (
    SELECT UNIX_TIMESTAMP(create_time)
    FROM information_schema.tables t
    WHERE t.table_schema = 'workspaces' AND TABLE_NAME = 'workspaces_rankings__pulls'
)
WHERE table_name = 'workspaces_rankings__pulls';

That made me wonder if the initial skew of 67 seconds was just the time between me running the two queries. If I run them in a single query, they do match. Maybe that is due to the update above, but my gut feeling says that the cache table is being updated correctly. When I run the same query every second for 5 seconds, both values increase by 1 second each time, so there doesn't appear to be any lag in the cache.

SELECT 'INFORMATION_SCHEMA' AS source, UNIX_TIMESTAMP(create_time) AS CREATE_TIME
FROM information_schema.tables t
WHERE t.table_schema = 'workspaces' AND TABLE_NAME = 'workspaces_rankings__pulls'
UNION ALL
SELECT '_vt.tables', CREATE_TIME
FROM _vt.`tables`
WHERE table_name = 'workspaces_rankings__pulls';

+--------------------+-------------+
| source             | CREATE_TIME |
+--------------------+-------------+
| INFORMATION_SCHEMA |  1737051236 |
| _vt.tables         |  1737051236 |
+--------------------+-------------+
|
Interesting... 🤔 I thought the issue was here: vitess/go/vt/vttablet/tabletserver/schema/engine.go (lines 486 to 492, at b8b0383)
But perhaps I was starting from an invalid position/assumption. Either way, we need a repeatable test case to really dig further. Have you tried using the local examples and SIGKILLing the vttablets involved in a workflow (./101_initial_cluster.sh && mysql < ../common/insert_commerce_data.sql && ./201_customer_tablets.sh && ./202_move_tables.sh)? |
Assigned this to myself since I've been working with Derek on it in Slack. So far I cannot reproduce it, and I don't see a clear Vitess bug. The issue went away when we disabled on_ddl:EXEC for the workflow, but we didn't see where these (phantom) DDL events were coming from.
Overview of the Issue
I have a tablet that is infinitely looping, reporting that a single table has been altered. It triggers this every ~100ms, using about 4 CPU cores while everything else is relatively idle. This has been happening for about two weeks now, and I can't get it to stop. I've tried v21.0.1, v21.0.0, and v20.0.4, and they all behave identically. I can get the log message to stop by stopping the Materialize workflow that references the table, but that doesn't change the CPU utilization.
I'm reasonably confident that there were no schema changes made to the table.
Reproduction Steps
Binary Version
Operating System and Environment details
Log Fragments