Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: reduce memory usage when collecting stats on wide JSON #139381

Open
mgartner opened this issue Jan 17, 2025 · 1 comment
Open

sql: reduce memory usage when collecting stats on wide JSON #139381

mgartner opened this issue Jan 17, 2025 · 1 comment
Labels
A-sql-table-stats Table statistics (and their automatic refresh). C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-sql-queries SQL Queries Team

Comments

@mgartner
Copy link
Collaborator

mgartner commented Jan 17, 2025

Consider the schema:

CREATE TABLE t (
  k UUID PRIMARY KEY,
  j JSON,
  v STRING AS (j->>'name') VIRTUAL,
  INDEX (v)
);

When collecting stats for this table, we'll sample rows with all 3 columns to build histograms from. For both k and v we'll build histograms with up to 200 buckets (by default). For j we'll only build a histogram with 2 buckets. If j contains wide values, retaining 10,000+ samples of the value to build only two buckets can consume a lot of memory, e.g., if the average size of j is 40kB, then we'll collect 400MB of samples. That's a lot of memory to consume to build a 2-bucket histogram.

Below are some concrete things we can do to reduce this memory usage. Note that they could apply to column types other than JSON, too.

1. Do not sample non-indexed JSON columns.

UPDATE: This is done. See #139766.

If a JSON column has an INVERTED INDEX then a histogram is calculated for the inverted index keys and used during query optimization. If a JSON column has a forward INDEX, then the collected histogram may be useful during optimization. But if a JSON column is not indexed, it is unlikely that the 2-bucket histogram that is collected will be useful during optimization. By default we should not generate a histogram in this case—this was actually the default behavior up until #99275 landed in 23.2 and I assume the change in behavior was unintentional. These JSON columns would not need to be sampled if no histogram is being made and memory usage during stats collection would be reduced, especially when the JSON columns are wide.

2. Truncate sampled JSON values

For the case when JSON columns have a forward INDEX, we should truncate values to limit memory usage both during stats collection and during query optimization when histograms are utilized. We already truncate some types here. Truncating JSON values is non trivial. We should consider sampled value or key-encoded JSON values which would then be trivial to truncate. Note that #139380 may also be required for JSON objects that share a lot of common key/value pairs which are encoded into the prefix of a key or value.

3. Do not retain underlying jsonEncoded values when sampling virtual computed columns

Consider a slightly different schema than above where v has the type JSON:

CREATE TABLE t (
  k UUID PRIMARY KEY,
  j JSON,
  v JSON AS (j->'name') VIRTUAL,
  INDEX (v)
);

Assuming we complete (1), stats collection will still use a lot of memory collecting samples of v when j is wide. As the sampler computes j->'name' for each row, the resulting datum will have the underlying type of jsonEncoded if the j datum is also jsonEncoded. Critically, the result jsonEncoded's value []byte will be a slice of j's value []byte. Therefore, even though the result datum will be copied here, the copy will still reference j's larger value []byte, preventing the underlying memory from being reclaimed by the GC. Also, our memory accounting will under-estimate the amount of memory used by samples because it only accounts for the result jsonEncoded's bytes, not j's.

The simplest solution is to unconditionally copy bytes withing jsonEncoded to a new slice when sampling. This will incur some overhead. Another option is to only copy the bytes if the original jsonEncoded bytes are larger than 400-bytes (the limit at which we truncate string-like types). We'd have to keep track of the total capacity of the underlying bytes of a jsonEncoded through operations like -> to do this.

Jira issue: CRDB-46617

@mgartner mgartner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team labels Jan 17, 2025
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Jan 17, 2025
@mgartner mgartner added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs A-sql-table-stats Table statistics (and their automatic refresh). labels Jan 17, 2025
@mgartner mgartner added the P-1 Issues/test failures with a fix SLA of 1 month label Jan 17, 2025
@yuzefovich
Copy link
Member

#121749 is related.

mgartner added a commit to mgartner/cockroach that referenced this issue Jan 24, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the
cluster setting `sql.stats.non_indexed_json_histograms.enabled` to
`true`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 24, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the
cluster setting `sql.stats.non_indexed_json_histograms.enabled` to
`true`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 24, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the
cluster setting `sql.stats.non_indexed_json_histograms.enabled` to
`true`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 24, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the
cluster setting `sql.stats.non_indexed_json_histograms.enabled` to
`true`.
craig bot pushed a commit that referenced this issue Jan 27, 2025
139766: sql: do not collect histograms for non-indexed JSON columns r=mgartner a=mgartner

Informs #139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the
cluster setting `sql.stats.non_indexed_json_histograms.enabled` to
`true`.


139797: dev: add extra startup flag to fix breakage on Darwin/ARM64 machines r=rail a=rickystewart

When cross-building, Java can crash on Darwin/ARM64 machines due to an error that looks like this:

```
A fatal error has been detected by the Java Runtime Environment:
SIGILL (0x4) at pc=0x0000ffffadf3fc1c, pid=39, tid=40
```

This was addressed in #139670, which adds a startup flag to disable the use of SVE. Unfortunately this broke the build on x86-64 machines (#139756). So now we must apply the flag conditionally only where relevant.

Closes: #139756

Epic: none
Release note: None

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Jan 27, 2025
Informs #139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the
cluster setting `sql.stats.non_indexed_json_histograms.enabled` to
`true`.
blathers-crl bot pushed a commit that referenced this issue Jan 27, 2025
Informs #139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the
cluster setting `sql.stats.non_indexed_json_histograms.enabled` to
`true`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 30, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the
cluster setting `sql.stats.non_indexed_json_histograms.enabled` to
`true`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 30, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 30, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
mgartner added a commit that referenced this issue Jan 30, 2025
Informs #139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if sql.stats.non_indexed_json_histograms.enabled
is set to false. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
ANALYZE and CREATE STATISTICS.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 30, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 30, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 30, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
mgartner added a commit that referenced this issue Jan 30, 2025
Informs #139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 31, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 31, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 31, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to false. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
ANALYZE and CREATE STATISTICS.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 31, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to false. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
ANALYZE and CREATE STATISTICS.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 31, 2025
Informs cockroachdb#139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to false. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
ANALYZE and CREATE STATISTICS.
blathers-crl bot pushed a commit that referenced this issue Jan 31, 2025
Informs #139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
blathers-crl bot pushed a commit that referenced this issue Jan 31, 2025
Informs #139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
blathers-crl bot pushed a commit that referenced this issue Jan 31, 2025
Informs #139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to `false`. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
`ANALYZE` and `CREATE STATISTICS`.
blathers-crl bot pushed a commit that referenced this issue Jan 31, 2025
Informs #139381

Release note (sql change): Since v23.2 table statistics histograms have
been collected for non-indexed JSON columns. Histograms are no longer
collected for these columns if `sql.stats.non_indexed_json_histograms.enabled`
is set to false. This reduces memory usage during table
statistics collection, for both automatic and manual collection via
ANALYZE and CREATE STATISTICS.
@DrewKimball DrewKimball added P-3 Issues/test failures with no fix SLA and removed P-1 Issues/test failures with a fix SLA of 1 month labels Feb 4, 2025
@DrewKimball DrewKimball moved this from Triage to Backlog in SQL Queries Feb 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-table-stats Table statistics (and their automatic refresh). C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-sql-queries SQL Queries Team
Projects
Status: Backlog
Development

No branches or pull requests

3 participants