Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-35693: Improve SSS column sizes #3772

Open
wants to merge 2 commits into
base: 11.7
Choose a base branch
from
Open

MDEV-35693: Improve SSS column sizes #3772

wants to merge 2 commits into from

Conversation

ParadoxV5
Copy link
Contributor

@ParadoxV5 ParadoxV5 commented Jan 17, 2025

  • The Jira issue number for this PR is: MDEV-35693

Description

Resize SHOW REPLICA STATUS’s (technically INFORMATION_SCHEMA.SLAVE_STATUS’s) columns to better match their possible values

In case of intentionally but absurdly long lists, text columns that list an uncapped number of elements have expanded to accept as many bytes as we could support.

While here, I found all our integer columns are unsigned, so I updated ones that weren’t configured as unsigned before for both column types and the field storing function (grep \(\*field\+\+\)->store\(.+(?<!false|true|my_charset_bin)\); minus the one (double)).

In response to ‘MAX_SLAVE_ERRMSG’ was not declared in this scope in Embedded builds, a new #ifdef HAVE_REPLICATION guard wraps slave_status_info to skip this unused data in Replication-less builds.

What problem is the patch trying to solve?

Particularly, the first-gen Replicate_ filters were incorrectly typed as singlular Name()s during MDEV-33526.
Under Names’ 64-char limit, they could overflow (read: truncate) even before their lengths got absurd.

Do you think this patch might introduce side-effects in other parts of the server?

Let me know if resizing IS column types is a significant backward-incompatibility.

Release Notes

Increased VARCHAR width limits of SHOW SLAVE STATUS so it doesn’t truncate long lists that should be long.

Optionally list any https://mariadb.com/kb/ pages that need changing.

By the way, these three columns are missing on https://mariadb.com/kb/en/show-replica-status/#column-descriptions:

Replicate_Do_Domain_Ids
Replicate_Ignore_Domain_Ids
Parallel_Mode

How can this PR be tested?

🤔 How shall we test this? Cook up a replica so its status report has all columns filled to the brim?

Sample manual test from MDEV-35693

Of course, the original issue about Replicate_ fields specifically is reproducible by setting them to anything longer than 64 bytes.

set @@global.replicate_do_table='db_to_filter.0123456789_1,db_to_filter.0123456789_2,db_to_filter.0123456789_3,db_to_filter.0123456789_4,db_to_filter.0123456789_5,db_to_filter.0123456789_6,db_to_filter.0123456789_7';
 
show slave status\G

Unexpected: Replicate_Do_Table: db_to_filter.0123456789_7,db_to_filter.0123456789_1,db_to_filter

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

sql/sql_show.cc Outdated
Comment on lines 10666 to 10668
Column("Gtid_IO_Pos", Longtext(), NOT_NULL),
Column("Replicate_Do_Domain_Ids", Longtext(), NOT_NULL),
Column("Replicate_Ignore_Domain_Ids", Longtext(), NOT_NULL),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we continue truncating those not-Name() ones, just at a more lenient limit?
How about the others?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all probably ok as-is. I wouldn't expect these do/ignore_domain_ids lists to ever be that long. Then for Gtid_IO_Pos, each GTID is %u-%u-%llu, so that'd be what, 10+1+10+1+20=42 max-len each? At which point we could show 23 max-len GTIDs (which would be rare), but say the more likely length of a GTID would be 20 chars long, leading to ~50 being shown. I haven't heard of such configurations, but I suppose if someone really knew what they were doing (and potentially using domain ids specifically to enhance replication performance via parallelization), they could hit that.. I guess I'm ok either way for Gtid_IO_Pos.

sql/sql_show.cc Outdated Show resolved Hide resolved
Column("Last_Errno", SLong(4), NOT_NULL),
Column("Last_Error", Varchar(20), NULLABLE),
Column("Last_Error", Varchar(MAX_SLAVE_ERRMSG), NULLABLE),
Column("Skip_Counter", ULong(10), NOT_NULL),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relay_log_info::slave_skip_counter is a ulonglong, but

(*field++)->store((uint32) mi->rli.slave_skip_counter);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sql_slave_skip_counter system variable is max UINT_MAX:

static Sys_var_multi_source_ulonglong Sys_slave_skip_counter(
       "sql_slave_skip_counter", "Skip the next N events from the master log",
       SESSION_VAR(slave_skip_counter), NO_CMD_LINE,
       &Master_info::get_slave_skip_counter,
       VALID_RANGE(0, UINT_MAX), DEFAULT(0), BLOCK_SIZE(1),
       ON_UPDATE(update_slave_skip_counter));

Perhaps the variable type itself should be fixed (Relay_log_info::slave_skip_counter and system_variables::slave_skip_counter). I don't think we'd need to do that in any GA release (better not to touch it if we don't need to), but I'd think we can do it here (11.7) or in the next release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, another one similar to MDEV-35706 Shrink my_thread_id to uint32_t?

sql/sql_show.cc Outdated Show resolved Hide resolved
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

U/Long(void) and U/Longlong(void) already exists, but they use MY_INT32_NUM_DECIMAL_DIGITS (11) and MY_INT32_NUM_DECIMAL_DIGITS (21) respectively for both signed and unsigned, which doesn’t match sql/sql_show.cc’s ULong(10) and ULonglong(20).

09876543210987654321 width
18446744073709551616 1<<64
          4294967296 1<<32

So, what excactly are those constant macros?

sql/sql_show.cc Show resolved Hide resolved
sql/sql_show.cc Outdated Show resolved Hide resolved
sql/sql_show.cc Show resolved Hide resolved
Column("Gtid_IO_Pos", Longtext(), NOT_NULL),
Column("Replicate_Do_Domain_Ids", Longtext(), NOT_NULL),
Column("Replicate_Ignore_Domain_Ids", Longtext(), NOT_NULL),
Column("Parallel_Mode", Varchar(sizeof("conservative")-1), NOT_NULL),
Column("SQL_Delay", ULong(10), NOT_NULL),
Copy link
Contributor Author

@ParadoxV5 ParadoxV5 Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm?

(*field++)->store((uint32) mi->rli.get_sql_delay());

Not only Relay_log_info::get_sql_delay() returns a signed int, but it’s also missing , true; this would invoke store(double nr) instead of store(longlong nr, bool unsigned_val).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the , true


According to rpl.rpl_delayed_slave, SQL_Delay’s unsigned in practice.
So changing int to unsigned int is not a bug but a clarifying refactor. I won’t include it here.
Same for checking whether each of the (uint32) casts make sense. We could instead use more Field::store() overloads – for the various integer sizes as well as separate handlers for unsigned ones.

@ParadoxV5 ParadoxV5 requested review from bnestere and removed request for andrelkin January 17, 2025 04:36
Copy link
Contributor

@bnestere bnestere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ParadoxV5 , you have some great finds here! See my notes.

Also, can you add a new test which fails before your patch to show the problems, which then succeeds with your patch?

Then for the git commit message, for the title, SSS is only common for us casually speaking on the replication team. Can you spell it out?

sql/sql_show.cc Outdated Show resolved Hide resolved
Column("Last_Errno", SLong(4), NOT_NULL),
Column("Last_Error", Varchar(20), NULLABLE),
Column("Last_Error", Varchar(MAX_SLAVE_ERRMSG), NULLABLE),
Column("Skip_Counter", ULong(10), NOT_NULL),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sql_slave_skip_counter system variable is max UINT_MAX:

static Sys_var_multi_source_ulonglong Sys_slave_skip_counter(
       "sql_slave_skip_counter", "Skip the next N events from the master log",
       SESSION_VAR(slave_skip_counter), NO_CMD_LINE,
       &Master_info::get_slave_skip_counter,
       VALID_RANGE(0, UINT_MAX), DEFAULT(0), BLOCK_SIZE(1),
       ON_UPDATE(update_slave_skip_counter));

Perhaps the variable type itself should be fixed (Relay_log_info::slave_skip_counter and system_variables::slave_skip_counter). I don't think we'd need to do that in any GA release (better not to touch it if we don't need to), but I'd think we can do it here (11.7) or in the next release.

sql/sql_show.cc Show resolved Hide resolved
mysql-test/suite/funcs_1/r/is_tables_is.result Outdated Show resolved Hide resolved
sql/sql_show.cc Outdated Show resolved Hide resolved
sql/sql_show.cc Outdated
Comment on lines 10666 to 10668
Column("Gtid_IO_Pos", Longtext(), NOT_NULL),
Column("Replicate_Do_Domain_Ids", Longtext(), NOT_NULL),
Column("Replicate_Ignore_Domain_Ids", Longtext(), NOT_NULL),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all probably ok as-is. I wouldn't expect these do/ignore_domain_ids lists to ever be that long. Then for Gtid_IO_Pos, each GTID is %u-%u-%llu, so that'd be what, 10+1+10+1+20=42 max-len each? At which point we could show 23 max-len GTIDs (which would be rare), but say the more likely length of a GTID would be 20 chars long, leading to ~50 being shown. I haven't heard of such configurations, but I suppose if someone really knew what they were doing (and potentially using domain ids specifically to enhance replication performance via parallelization), they could hit that.. I guess I'm ok either way for Gtid_IO_Pos.

sql/sql_show.cc Outdated Show resolved Hide resolved
sql/sql_show.cc Outdated
Column("Retried_transactions", ULong(10), NOT_NULL),
Column("Max_relay_log_size", ULonglong(10), NOT_NULL),
Column("Executed_log_entries", ULong(10), NOT_NULL),
Column("Slave_received_heartbeats", ULong(10), NOT_NULL),
Column("Slave_heartbeat_period", Float(703), NOT_NULL), // 3 decimals
Column("Gtid_Slave_Pos", Varchar(FN_REFLEN), NOT_NULL),
Column("Gtid_Slave_Pos", Longtext(), NOT_NULL),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes here as for Gtid_IO_Pos

Copy link
Contributor Author

@ParadoxV5 ParadoxV5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you add a new test which fails before your patch to show the problems, which then succeeds with your patch?

I wondered at “§ How can this PR be tested?”, which fields should we test and how long the strings should we assert?

Then for the git commit message, for the title, SSS is only common for us casually speaking on the replication team. Can you spell it out?

😄 When I picked up MDEV-35304, I too questioned what its description meant by ‘SSS’.
I thought it was a common jargon!

As for the title… With ‘SHOW REPLICA STATUS’ it doesn’t fit the 50-char guideline, but I suppose it’s fine omitting MDEV-35693: from the count.
(It’s time to avoid calling things ‘SLAVE’s when we can.)

Column("Last_Errno", SLong(4), NOT_NULL),
Column("Last_Error", Varchar(20), NULLABLE),
Column("Last_Error", Varchar(MAX_SLAVE_ERRMSG), NULLABLE),
Column("Skip_Counter", ULong(10), NOT_NULL),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, another one similar to MDEV-35706 Shrink my_thread_id to uint32_t?

sql/sql_show.cc Outdated Show resolved Hide resolved
sql/sql_show.cc Outdated Show resolved Hide resolved
Comment on lines +145 to +148
// utf8mb3 Varchars longer than MAX_FIELD_VARCHARLENGTH/3 become Longtexts
DBUG_ASSERT(length * 3 <= MAX_FIELD_VARCHARLENGTH);
}
Varchar(): Type(&type_handler_varchar, MAX_FIELD_VARCHARLENGTH/3, false) {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://mariadb.com/kb/en/unicode/ says 11.8 wants to change utf8 to utf8mb4.
Assuming the default charset is simply ‘utf8’, I’ll let that work change these 3s to 4s, in hopes that it notices it creates unexpected #3772 (comment) changes too.

@bnestere
Copy link
Contributor

bnestere commented Jan 19, 2025

I wondered at “§ How can this PR be tested?”, which fields should we test and how long the strings should we assert?

master_last_event_time_row.test should provide a good example to use on how to query/assert the correctness of SHOW REPLICA STATUS columns. Then as far as how long the strings should be, with 64 being the old length where truncation would happen, maybe just double it? I don't necessarily think it needs to be MAX_FIELD_VARCHARLENGTH characters long.. But perhaps make it easy to change? E.g. a lot of our tests use REPEAT('a', 128) (or whatever string and number parameters to repeat); I'd suggest using that. And note you can parameterize the length with something like

--let $len= 64
--let $long_db_a= `SELECT REPEAT('a', $len)`
--let $long_db_b= `SELECT REPEAT('b', $len)`

And then you can use it in queries using eval:

--eval SET @@global.replicate_do_db="$long_db_a,$long_db_b"

Then to

As for the title… With ‘SHOW REPLICA STATUS’ it doesn’t fit the 50-char guideline, but I suppose it’s fine omitting MDEV-35693: from the count.

Hmm, I didn't realize we had a 50-char title maximum - I probably break that rule often 😅. I'd say not to omit the MDEV-35693 part, that's the most important part. Maybe try to summarize as best you can within 50 chars, and if you can't, then just use the JIRA title and truncate it after 50.

Also, in the git commit message, the high-level description of the problem was dropped since your last commit:

In particular, many text columns – namely Replicate_ ones – could overflow (read: truncate) if the data is absurdly (but validly) long.

Such descriptions are good to say up-front to summarize the issue for non-programmers, who often read through our commit messages (I also think we should be explicit about the length where the truncation happens, so I'd say to add that they'd truncate at 64 characters to your above message.

@ParadoxV5
Copy link
Contributor Author

ParadoxV5 commented Jan 21, 2025

I wondered at “§ How can this PR be tested?”, which fields should we test and how long the strings should we assert?

master_last_event_time_row.test should provide a good example to use on how to query/assert the correctness of SHOW REPLICA STATUS columns. Then as far as how long the strings should be, with 64 being the old length where truncation would happen, maybe just double it? I don't necessarily think it needs to be MAX_FIELD_VARCHARLENGTH characters long.. […]

So, just testing those was-Named() OG-Replicate_s is fine, since they came directly from user configs and should never truncate?
Hmm, then maybe the newer Replicate_s too.

Hmm, I didn't realize we had a 50-char title maximum - I probably break that rule often 😅.

rules are made to be broken

Also, in the git commit message, the high-level description of the problem was dropped since your last commit:

In particular, many text columns – namely Replicate_ ones – could overflow (read: truncate) if the data is absurdly (but validly) long.

Such descriptions are good to say up-front to summarize the issue for non-programmers, who often read through our commit messages (I also think we should be explicit about the length where the truncation happens, so I'd say to add that they'd truncate at 64 characters to your above message.

The mention is still there but yes, the emphasis on how it’s a problem was lost. 👍

Comment on lines +427 to +436
def information_schema SLAVE_STATUS Replicate_Do_DB 15 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
def information_schema SLAVE_STATUS Replicate_Do_Domain_Ids 47 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
def information_schema SLAVE_STATUS Replicate_Do_Table 17 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
def information_schema SLAVE_STATUS Replicate_Ignore_DB 16 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
def information_schema SLAVE_STATUS Replicate_Ignore_Domain_Ids 48 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
def information_schema SLAVE_STATUS Replicate_Ignore_Server_Ids 41 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
def information_schema SLAVE_STATUS Replicate_Ignore_Table 18 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
def information_schema SLAVE_STATUS Replicate_Rewrite_DB 56 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
def information_schema SLAVE_STATUS Replicate_Wild_Do_Table 19 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
def information_schema SLAVE_STATUS Replicate_Wild_Ignore_Table 20 NULL NO varchar 21844 65532 NULL NULL NULL utf8mb3 utf8mb3_general_ci varchar(21844) select NEVER NULL NO NO
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually @bnestere do these count as tests for this issue?

Resize SHOW REPLICA STATUS’s (technically `INFORMATION_SCHEMA.`
`SLAVE_STATUS`’s) columns to better match their possible values

In case of intentionally but absurdly long lists,
text columns that list an uncapped number of elements
have expanded to accept as many bytes as we could support.

Particularly, the first-gen `Replicate_` filters were
incorrectly typed as singlular `Name()`s during MDEV-33526.
Under `Name`s’ 64-char limit, they could overflow
(read: truncate) even before their lengths got absurd.

In response to `‘MAX_SLAVE_ERRMSG’ was not declared in this scope` in
Embedded builds, a new `#ifdef HAVE_REPLICATION` guard wraps
`slave_status_info` to skip this unused data in Replication-less builds.

Reviewed-by: Brandon Nesterenko <[email protected]>
Update all integer columns of SHOW REPLICA STATUS (technically
INFORMATION_SCHEMA.SLAVE_STATUS) to unsigned because, well, they are (:.

Some `uint32` ones were accidentally using the `Field::store(double nr)`
overload because they forgot the `, true` for
`Field::store(longlong nr, bool unsigned_val)`.
The mistake’s harmless, fortunately, as `double` supports over 15
significant decimal digits, well over `uint32`’s 9-and-a-half.
@ParadoxV5
Copy link
Contributor Author

I wondered at “§ How can this PR be tested?”, which fields should we test and how long the strings should we assert?

master_last_event_time_row.test should provide a good example to use on how to query/assert the correctness of SHOW REPLICA STATUS columns. Then as far as how long the strings should be, with 64 being the old length where truncation would happen, maybe just double it? I don't necessarily think it needs to be MAX_FIELD_VARCHARLENGTH characters long.. […]

So, just testing those was-Named() OG-Replicate_s is fine, since they came directly from user configs and should never truncate?
Hmm, then maybe the newer Replicate_s too.

Uhh… I don’t feel right testing only a subset. Reflecting from the Broken Windows Theory, we either test none of them or we should test all of them.
Concerningly, looks like overflow test are done for the underlying variables but not for SSS reports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants