Bug Report: infinite loop for "schema engine altered" #17458

Open
derekperkins opened this issue Jan 2, 2025 · 13 comments

@derekperkins
Member

Overview of the Issue

I have a tablet that is infinitely looping, logging that a single table has been altered. It triggers every ~100ms and uses about 4 CPU cores while everything else is relatively idle. This has been happening for about two weeks now, and I can't get it to stop. I've tried v21.0.1, v21.0.0, and v20.0.4, and they all behave identically. I can get the log to stop by stopping the Materialize workflow that references the table, but that doesn't change the CPU utilization.

I'm reasonably confident that there were no schema changes made to the table.

Reproduction Steps

CREATE TABLE `workspaces_rankings__pulls` (
  `pull_id` bigint NOT NULL,
  `workspace_id` bigint NOT NULL,
  `requested` bigint NOT NULL,
  `keyword_id` bigint NOT NULL,
  `is_priority` tinyint(1) NOT NULL,
  `deadline_at` bigint NOT NULL,
  `requested_at` bigint NOT NULL,
  `expected_by` bigint NOT NULL DEFAULT '0',
  `ranking_id` bigint NOT NULL,
  `tracking_id` bigint DEFAULT NULL,
  `billed_at` bigint DEFAULT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`pull_id`),
  UNIQUE KEY `requested__keyword_id__is_priority__workspace_id__uq` (`requested`,`keyword_id`,`is_priority`,`workspace_id`),
  UNIQUE KEY `ranking_id__workspace_id__uq` (`ranking_id`,`workspace_id`),
  UNIQUE KEY `workspaces_rankings__pulls__requested__workspace_id__kid__uq` (`requested`,`workspace_id`,`keyword_id`),
  KEY `workspaces_rankings__pulls__workspace_id__pull_id__idx` (`workspace_id`,`pull_id`),
  KEY `workspaces_rankings__pulls__workspace_id__pull_id__ranking_id` (`workspace_id`,`pull_id`,`ranking_id`),
  KEY `workspaces_rankings__pulls__keyword_id__requested__idx` (`keyword_id`,`requested`),
  KEY `workspaces_rankings__pulls__tracking_id__workspace_id__idx` (`tracking_id`,`workspace_id`),
  KEY `workspace_id__requested` (`workspace_id`,`requested` DESC),
  CONSTRAINT `fk__workspaces_rankings__pulls__workspaces` FOREIGN KEY (`workspace_id`) REFERENCES `workspaces` (`workspace_id`),
  CONSTRAINT `workspaces_rankings__pulls__ranking_id__positive__chk` CHECK ((`ranking_id` > 0))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED
{
  "cluster": {
    "id": "nozzle",
    "name": "Nozzle"
  },
  "keyspace": "workspaces",
  "workflow": {
    "name": "billing__usage_by_workspace",
    "source": {
      "keyspace": "workspaces",
      "shards": [
        "0"
      ]
    },
    "target": {
      "keyspace": "workspaces",
      "shards": [
        "0"
      ]
    },
    "shard_streams": {
      "0/uscentral1-0089375700": {
        "streams": [
          {
            "id": "78",
            "shard": "0",
            "tablet": {
              "cell": "uscentral1",
              "uid": 89375700
            },
            "binlog_source": {
              "keyspace": "workspaces",
              "shard": "0",
              "filter": {
                "rules": [
                  {
                    "match": "billing__usage_by_workspace",
                    "filter": "SELECT\n  workspace_id,\n  DATE(FROM_UNIXTIME(requested)) AS requested_date,\n  DATE(created_at) AS created_date,\n  1 AS type,\n  COUNT(*) AS usage\nFROM workspaces_rankings__pulls\nGROUP BY workspace_id, requested_date, created_date, type"
                  }
                ]
              },
              "on_ddl": "EXEC"
            },
            "position": "14264bab-1073-11ee-9d52-1a45877650b9:1-166458145,22acb7ef-5e2e-11ea-8055-f68e2a07ff5c:1-1188536448,23e612e9-5e2f-11ea-ae47-e2928e017498:1-90433677,34c0d728-e87b-11ed-b517-e297c59df937:1-632358534,358d48ae-5ea7-11ee-8f72-7ef6fb8306d3:1-5266853941,4d969c08-cdab-11ed-91a3-46ef6989d3da:1-1352687286,7106820a-e88c-11ea-ba90-6258029302fe:1-3978801218,9ab83c60-5eb3-11ee-9558-3a9e7effcd64:1-4200624,c63402fa-c652-11ec-a20b-06e64b3746d3:1-2426982202,c6be3a55-6713-11e9-92ff-0a580a300906:1-10775216,c7e128c3-6713-11e9-933f-0a580a301704:1-202,d8b2a3c2-819d-11ec-9a1e-c69cb6174bd0:1-758534679,f503ab19-ca9b-11e9-bc0a-626c2195e13f:1-170430446,fd13667c-7b78-11ed-a42a-e661db7a7325:1-750371887",
            "state": "Running",
            "db_name": "workspaces",
            "transaction_timestamp": {
              "seconds": "1735840096"
            },
            "time_updated": {
              "seconds": "1735840096"
            },
            "tags": [
              ""
            ],
            "rows_copied": "1578371030",
            "throttler_status": {
              "time_throttled": {}
            },
            "cells": [
              ""
            ]
          }
        ],
        "is_primary_serving": true
      }
    },
    "workflow_type": "Materialize",
    "workflow_sub_type": "None"
  }
}

Binary Version

v21.0.1, v21.0.0, and v20.0.4

Operating System and Environment details

GKE v1.31

Log Fragments

I1228 21:00:22.242713 2291461 engine.go:558] schema engine created [], altered [workspaces_rankings__pulls], dropped []
I1228 21:00:22.277673 2291461 engine.go:558] schema engine created [], altered [workspaces_rankings__pulls], dropped []
I1228 21:00:22.306701 2291461 engine.go:558] schema engine created [], altered [workspaces_rankings__pulls], dropped []
I1228 21:00:22.352062 2291461 engine.go:558] schema engine created [], altered [workspaces_rankings__pulls], dropped []
I1228 21:00:22.385178 2291461 engine.go:558] schema engine created [], altered [workspaces_rankings__pulls], dropped []
I1228 21:00:22.406095 2291461 engine.go:558] schema engine created [], altered [workspaces_rankings__pulls], dropped []
I1228 21:00:22.427299 2291461 engine.go:558] schema engine created [], altered [workspaces_rankings__pulls], dropped []
I1228 21:00:22.459470 2291461 engine.go:558] schema engine created [], altered [workspaces_rankings__pulls], dropped []
derekperkins added the Type: Bug and Needs Triage labels on Jan 2, 2025
@mattlord
Contributor

mattlord commented Jan 2, 2025

Thank you, @derekperkins !

I'm assuming that the CREATE TABLE command can be executed with vtctldclient ApplySchema. Can you please share the actual vtctldclient Materialize create command that you used?

I'm assuming that simply creating the table does not cause the same CPU usage?

deepthi added the Component: VReplication label and removed the Needs Triage label on Jan 3, 2025
@deepthi
Member

deepthi commented Jan 3, 2025

Looking at the code, this is where we end up with that table:

mismatchTables, err := se.getMismatchedTableNames(ctx, conn.Conn, shouldUseDatabase)

You will want to look at the underlying query, execute it, and see if anything looks off with it:

readTableCreateTimesQuery := sqlparser.BuildParsedQuery(readTableCreateTimes, sidecar.GetIdentifier()).Query
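
Assuming the default _vt sidecar database, that parsed query should expand to roughly the following (a sketch), which you can run directly against the tablet's MySQL and compare with information_schema:

-- approximate expansion of readTableCreateTimes with the default _vt sidecar database
SELECT TABLE_NAME, CREATE_TIME FROM _vt.`tables`;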

@derekperkins
Member Author

Can you please share the actual vtctldclient Materialize create command that you used?

@mattlord the table has existed for years; I just included the CREATE TABLE in case it was useful. The JSON blob is the VReplication state for the only stream that references the workspaces_rankings__pulls table. The repeated schema engine altered log line stops being emitted when I stop that stream, so it must be related.

// mismatchTables stores the tables whose createTime in our cache doesn't match the createTime stored in the database.
// This can happen if a primary crashed right after a DML succeeded, before it could reload its state. If all the replicas

@deepthi I think the comment right above that line does match what happened. We ran a big DML that then caused an immediate crash (hopefully fixed in the future by #17239). I tried adding a comment to the workspaces_rankings__pulls table to see if that would trigger the cache to fix itself, but it didn't seem to affect anything.
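
For reference, the comment change I tried was along these lines (a paraphrase, so the exact statement may have differed slightly):

-- hypothetical reconstruction of the comment change; 'test' is the value that later shows up in TABLE_COMMENT
ALTER TABLE `workspaces_rankings__pulls` COMMENT = 'test';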

@mattlord
Contributor

@derekperkins the workflow would be trying to read the table from the cache, and in v21 the cache is a read-through cache for VReplication, since we want to ensure that we get the current table schema (#15912). So I suspect there's a bug in the schema engine here, and the workflow stream's regular attempts to start are triggering it repeatedly.

These look to be the relevant queries here:

mysql> SELECT t.table_name, t.table_type, UNIX_TIMESTAMP(t.create_time), t.table_comment FROM information_schema.tables t WHERE t.table_schema = database();
+------------+------------+-------------------------------+---------------+
| TABLE_NAME | TABLE_TYPE | UNIX_TIMESTAMP(t.create_time) | TABLE_COMMENT |
+------------+------------+-------------------------------+---------------+
| corder     | BASE TABLE |                    1736527575 |               |
| customer   | BASE TABLE |                    1736527575 |               |
| product    | BASE TABLE |                    1736527575 |               |
+------------+------------+-------------------------------+---------------+
3 rows in set (0.00 sec)

mysql> SELECT TABLE_NAME, CREATE_TIME FROM _vt.`tables`;
+------------+-------------+
| TABLE_NAME | CREATE_TIME |
+------------+-------------+
| corder     |  1736527575 |
| customer   |  1736527575 |
| product    |  1736527575 |
+------------+-------------+
3 rows in set (0.00 sec)

Does the output look correct in your case? I wonder if the times are somehow not aligned? Notice that we use UNIX_TIMESTAMP in one query and not the other. UNIX_TIMESTAMP can transform the stored value based on the session time zone, so perhaps this is TZ related?
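
If you want to rule out the time zone angle, a quick check (just a sketch) would be to compare the session and global time zones and see how UNIX_TIMESTAMP converts the stored create_time in your session:

-- show the time zone settings that affect UNIX_TIMESTAMP's conversion
SELECT @@global.time_zone, @@session.time_zone;

-- convert the same create_time in this session, alongside the raw value
SELECT t.create_time, UNIX_TIMESTAMP(t.create_time)
FROM information_schema.tables t
WHERE t.table_schema = 'workspaces' AND t.table_name = 'workspaces_rankings__pulls';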

@derekperkins
Member Author

+----------------------------+------------+-------------------------------+---------------+
| TABLE_NAME                 | TABLE_TYPE | UNIX_TIMESTAMP(t.create_time) | TABLE_COMMENT |
+----------------------------+------------+-------------------------------+---------------+
| workspaces_rankings__pulls | BASE TABLE | 1736530133                    | test          |
+----------------------------+------------+-------------------------------+---------------+

+----------------------------+-------------+
| TABLE_NAME                 | CREATE_TIME |
+----------------------------+-------------+
| workspaces_rankings__pulls |  1736530200 |
+----------------------------+-------------+

Our servers are all on UTC, so it seems unlikely to be timezone related. The two values are 67 seconds apart. That seems like a realistic gap between the OOM and the tablet coming back online, so maybe it stored the time it saw the change rather than what information_schema said?

@derekperkins
Member Author

derekperkins commented Jan 14, 2025

This table hasn't been altered in the last 4 days, but the timestamp is changing... In fact, now that I'm checking, both values (information_schema and _vt.tables) are changing every second:

+----------------------------+------------+-------------------------------+---------------+
| TABLE_NAME                 | TABLE_TYPE | UNIX_TIMESTAMP(t.create_time) | TABLE_COMMENT |
+----------------------------+------------+-------------------------------+---------------+
| workspaces_rankings__pulls | BASE TABLE |                    1736880383 | test          |
+----------------------------+------------+-------------------------------+---------------+

+----------------------------+-------------+
| TABLE_NAME                 | CREATE_TIME |
+----------------------------+-------------+
| workspaces_rankings__pulls |  1736880391 |
+----------------------------+-------------+

@derekperkins
Member Author

The CREATE_TIME is after the UPDATE_TIME, for which I found a MySQL bug report.

SELECT * FROM information_schema.tables t WHERE t.table_schema = 'workspaces' AND TABLE_NAME = 'workspaces_rankings__pulls'\G
*************************** 1. row ***************************
  TABLE_CATALOG: def
   TABLE_SCHEMA: workspaces
     TABLE_NAME: workspaces_rankings__pulls
     TABLE_TYPE: BASE TABLE
         ENGINE: InnoDB
        VERSION: 10
     ROW_FORMAT: Compressed
     TABLE_ROWS: 1101919768
 AVG_ROW_LENGTH: 69
    DATA_LENGTH: 76867952640
MAX_DATA_LENGTH: 0
   INDEX_LENGTH: 206635966464
      DATA_FREE: 5767168
 AUTO_INCREMENT: NULL
    CREATE_TIME: 2025-01-14 18:52:41
    UPDATE_TIME: 2025-01-13 20:12:26
     CHECK_TIME: NULL
TABLE_COLLATION: utf8mb4_unicode_ci
       CHECKSUM: NULL
 CREATE_OPTIONS: row_format=COMPRESSED
  TABLE_COMMENT: test
1 row in set (0.00 sec)
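
A narrower query makes the inversion easier to watch over time (just a sketch; TIMESTAMPDIFF here is positive when CREATE_TIME is later than UPDATE_TIME):

SELECT t.TABLE_NAME, t.CREATE_TIME, t.UPDATE_TIME,
       TIMESTAMPDIFF(SECOND, t.UPDATE_TIME, t.CREATE_TIME) AS create_minus_update_secs
FROM information_schema.tables t
WHERE t.table_schema = 'workspaces' AND t.TABLE_NAME = 'workspaces_rankings__pulls';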

It seems likely that I'm either hitting a MySQL native bug, or am somehow dealing with a corrupt table. I'm leaning towards this not being Vitess related, and if that seems to be the case, I'm happy to go ahead and close this issue.

@derekperkins
Member Author

I can get the log to stop when I stop the materialize workflow that is referencing it, but it doesn't change the CPU utilization

I tried this again, and it still holds: when the stream isn't running, the CREATE_TIME of the table in the MySQL information_schema also stops updating. It seems that something about that Materialize stream is causing the table to register as altered? This is very confusing to me.

@derekperkins
Member Author

I bootstrapped a brand new replica from backup, and it didn't change anything about the situation. It's my understanding that xtrabackup checksums everything, so I believe that eliminates table corruption.

@derekperkins
Member Author

This started on Percona Server v8.0.39, and I attempted to upgrade to v8.0.40 to see if that might fix it, but it didn't change anything.

@derekperkins
Member Author

I updated the cache as discussed on the call today, and it didn't make a difference.

UPDATE _vt.tables
SET CREATE_TIME = (
  SELECT UNIX_TIMESTAMP(create_time)
  FROM information_schema.tables t
  WHERE t.table_schema = 'workspaces' AND TABLE_NAME = 'workspaces_rankings__pulls'
)
WHERE table_name = 'workspaces_rankings__pulls';

That made me wonder whether the initial skew of 67 seconds was just the time between my running the two queries. If I run them as a single query, they do match. Maybe that is due to the UPDATE above, but my gut feeling is that the cache table is being updated correctly. When I run the same query every second for 5 seconds, both values increase by 1 second each time, so there doesn't appear to be any lag in the cache.

SELECT 'INFORMATION_SCHEMA' AS source, UNIX_TIMESTAMP(create_time) AS CREATE_TIME FROM information_schema.tables t WHERE t.table_schema = 'workspaces' AND TABLE_NAME = 'workspaces_rankings__pulls'
UNION ALL
SELECT '_vt.tables', CREATE_TIME FROM _vt.`tables` where table_name = 'workspaces_rankings__pulls';

+--------------------+-------------+
| source             | CREATE_TIME |
+--------------------+-------------+
| INFORMATION_SCHEMA |  1737051236 |
| _vt.tables         |  1737051236 |
+--------------------+-------------+

@mattlord
Contributor

Interesting... 🤔 I thought the issue was here:

// mismatchTables stores the tables whose createTime in our cache doesn't match the createTime stored in the database.
// This can happen if a primary crashed right after a DML succeeded, before it could reload its state. If all the replicas
// are able to reload their cache before one of them is promoted, then the database information would be out of sync.
mismatchTables, err := se.getMismatchedTableNames(ctx, conn.Conn, shouldUseDatabase)
if err != nil {
	return err
}

But perhaps I was starting from an invalid position/assumption. Either way, we need a repeatable test case to really dig further. Have you tried using the local examples and SIGKILLing vttablets involved in a workflow (./101_initial_cluster.sh && mysql < ../common/insert_commerce_data.sql && ./201_customer_tablets.sh && ./202_move_tables.sh)?

@mattlord mattlord self-assigned this Jan 21, 2025
@mattlord
Contributor

Assigned this to myself since I've been working with Derek on it in Slack. So far I cannot reproduce it, and I don't see a clear Vitess bug.

The issue went away when we disabled on_ddl:EXEC for the workflow. But we didn't see where these (phantom) DDL events were coming from.
