sql: reduce memory usage when collecting stats on wide JSON #139381

mgartner · 2025-01-17T22:11:57Z

Consider the schema:

CREATE TABLE t (
  k UUID PRIMARY KEY,
  j JSON,
  v STRING AS (j->>'name') VIRTUAL,
  INDEX (v)
);

When collecting stats for this table, we'll sample rows with all 3 columns to build histograms from. For both k and v we'll build histograms with up to 200 buckets (by default). For j we'll only build a histogram with 2 buckets. If j contains wide values, retaining 10,000+ samples of the value to build only two buckets can consume a lot of memory, e.g., if the average size of j is 40kB, then we'll collect 400MB of samples. That's a lot of memory to consume to build a 2-bucket histogram.

Below are some concrete things we can do to reduce this memory usage. Note that they could apply to column types other than JSON, too.

1. Do not sample non-indexed JSON columns.

UPDATE: This is done. See #139766.

If a JSON column has an INVERTED INDEX then a histogram is calculated for the inverted index keys and used during query optimization. If a JSON column has a forward INDEX, then the collected histogram may be useful during optimization. But if a JSON column is not indexed, it is unlikely that the 2-bucket histogram that is collected will be useful during optimization. By default we should not generate a histogram in this case—this was actually the default behavior up until #99275 landed in 23.2 and I assume the change in behavior was unintentional. These JSON columns would not need to be sampled if no histogram is being made and memory usage during stats collection would be reduced, especially when the JSON columns are wide.

2. Truncate sampled JSON values

For the case when JSON columns have a forward INDEX, we should truncate values to limit memory usage both during stats collection and during query optimization when histograms are utilized. We already truncate some types here. Truncating JSON values is non trivial. We should consider sampled value or key-encoded JSON values which would then be trivial to truncate. Note that #139380 may also be required for JSON objects that share a lot of common key/value pairs which are encoded into the prefix of a key or value.

3. Do not retain underlying `jsonEncoded` values when sampling virtual computed columns

Consider a slightly different schema than above where v has the type JSON:

CREATE TABLE t (
  k UUID PRIMARY KEY,
  j JSON,
  v JSON AS (j->'name') VIRTUAL,
  INDEX (v)
);

Assuming we complete (1), stats collection will still use a lot of memory collecting samples of v when j is wide. As the sampler computes j->'name' for each row, the resulting datum will have the underlying type of jsonEncoded if the j datum is also jsonEncoded. Critically, the result jsonEncoded's value []byte will be a slice of j's value []byte. Therefore, even though the result datum will be copied here, the copy will still reference j's larger value []byte, preventing the underlying memory from being reclaimed by the GC. Also, our memory accounting will under-estimate the amount of memory used by samples because it only accounts for the result jsonEncoded's bytes, not j's.

The simplest solution is to unconditionally copy bytes withing jsonEncoded to a new slice when sampling. This will incur some overhead. Another option is to only copy the bytes if the original jsonEncoded bytes are larger than 400-bytes (the limit at which we truncate string-like types). We'd have to keep track of the total capacity of the underlying bytes of a jsonEncoded through operations like -> to do this.

Jira issue: CRDB-46617

The text was updated successfully, but these errors were encountered:

yuzefovich · 2025-01-18T21:58:25Z

#121749 is related.

Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.

139766: sql: do not collect histograms for non-indexed JSON columns r=mgartner a=mgartner Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`. 139797: dev: add extra startup flag to fix breakage on Darwin/ARM64 machines r=rail a=rickystewart When cross-building, Java can crash on Darwin/ARM64 machines due to an error that looks like this: ``` A fatal error has been detected by the Java Runtime Environment: SIGILL (0x4) at pc=0x0000ffffadf3fc1c, pid=39, tid=40 ``` This was addressed in #139670, which adds a startup flag to disable the use of SVE. Unfortunately this broke the build on x86-64 machines (#139756). So now we must apply the flag conditionally only where relevant. Closes: #139756 Epic: none Release note: None Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>

Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.

Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.

Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.

Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if sql.stats.non_indexed_json_histograms.enabled is set to false. This reduces memory usage during table statistics collection, for both automatic and manual collection via ANALYZE and CREATE STATISTICS.

Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.

Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.

Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.

Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to false. This reduces memory usage during table statistics collection, for both automatic and manual collection via ANALYZE and CREATE STATISTICS.

Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.

Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to false. This reduces memory usage during table statistics collection, for both automatic and manual collection via ANALYZE and CREATE STATISTICS.

mgartner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team labels Jan 17, 2025

mgartner added this to SQL Queries Jan 17, 2025

github-project-automation bot moved this to Triage in SQL Queries Jan 17, 2025

mgartner added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs A-sql-table-stats Table statistics (and their automatic refresh). labels Jan 17, 2025

mgartner mentioned this issue Jan 17, 2025

sql: better control over selecting columns to collect stats for #139382

Open

mgartner added the P-1 Issues/test failures with a fix SLA of 1 month label Jan 17, 2025

mgartner mentioned this issue Jan 24, 2025

sql: do not collect histograms for non-indexed JSON columns #139766

Merged

blathers-crl bot mentioned this issue Jan 27, 2025

release-24.3: sql: do not collect histograms for non-indexed JSON columns #139897

Merged

blathers-crl bot mentioned this issue Jan 27, 2025

release-25.1: sql: do not collect histograms for non-indexed JSON columns #139898

Merged

mgartner mentioned this issue Jan 30, 2025

release-24.2: sql: do not collect histograms for non-indexed JSON columns #140141

Merged

mgartner mentioned this issue Jan 30, 2025

release-24.1: sql: do not collect histograms for non-indexed JSON columns #140142

Merged

mgartner mentioned this issue Jan 30, 2025

release-23.2: sql: do not collect histograms for non-indexed JSON columns #140144

Merged

blathers-crl bot mentioned this issue Jan 31, 2025

release-24.3.5-rc: sql: do not collect histograms for non-indexed JSON columns #140265

Merged

blathers-crl bot mentioned this issue Jan 31, 2025

release-24.2.9-rc: sql: do not collect histograms for non-indexed JSON columns #140267

Merged

blathers-crl bot mentioned this issue Jan 31, 2025

release-24.1.12-rc: sql: do not collect histograms for non-indexed JSON columns #140268

Merged

blathers-crl bot mentioned this issue Jan 31, 2025

release-23.2.20-rc: sql: do not collect histograms for non-indexed JSON columns #140269

Merged

DrewKimball added P-3 Issues/test failures with no fix SLA and removed P-1 Issues/test failures with a fix SLA of 1 month labels Feb 4, 2025

DrewKimball moved this from Triage to Backlog in SQL Queries Feb 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: reduce memory usage when collecting stats on wide JSON #139381

sql: reduce memory usage when collecting stats on wide JSON #139381

mgartner commented Jan 17, 2025 •

edited

Loading

yuzefovich commented Jan 18, 2025

sql: reduce memory usage when collecting stats on wide JSON #139381

sql: reduce memory usage when collecting stats on wide JSON #139381

Comments

mgartner commented Jan 17, 2025 • edited Loading

1. Do not sample non-indexed JSON columns.

2. Truncate sampled JSON values

3. Do not retain underlying jsonEncoded values when sampling virtual computed columns

yuzefovich commented Jan 18, 2025

mgartner commented Jan 17, 2025 •

edited

Loading

3. Do not retain underlying `jsonEncoded` values when sampling virtual computed columns