-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: reduce memory usage when collecting stats on wide JSON #139381
Labels
A-sql-table-stats
Table statistics (and their automatic refresh).
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
P-3
Issues/test failures with no fix SLA
T-sql-queries
SQL Queries Team
Comments
mgartner
added
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-sql-queries
SQL Queries Team
labels
Jan 17, 2025
mgartner
added
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
A-sql-table-stats
Table statistics (and their automatic refresh).
labels
Jan 17, 2025
#121749 is related. |
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 24, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 24, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 24, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 24, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.
craig bot
pushed a commit
that referenced
this issue
Jan 27, 2025
139766: sql: do not collect histograms for non-indexed JSON columns r=mgartner a=mgartner Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`. 139797: dev: add extra startup flag to fix breakage on Darwin/ARM64 machines r=rail a=rickystewart When cross-building, Java can crash on Darwin/ARM64 machines due to an error that looks like this: ``` A fatal error has been detected by the Java Runtime Environment: SIGILL (0x4) at pc=0x0000ffffadf3fc1c, pid=39, tid=40 ``` This was addressed in #139670, which adds a startup flag to disable the use of SVE. Unfortunately this broke the build on x86-64 machines (#139756). So now we must apply the flag conditionally only where relevant. Closes: #139756 Epic: none Release note: None Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Jan 27, 2025
Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.
blathers-crl bot
pushed a commit
that referenced
this issue
Jan 27, 2025
Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 30, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`. This can be reverted by setting the cluster setting `sql.stats.non_indexed_json_histograms.enabled` to `true`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 30, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 30, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
mgartner
added a commit
that referenced
this issue
Jan 30, 2025
Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if sql.stats.non_indexed_json_histograms.enabled is set to false. This reduces memory usage during table statistics collection, for both automatic and manual collection via ANALYZE and CREATE STATISTICS.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 30, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 30, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 30, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
mgartner
added a commit
that referenced
this issue
Jan 30, 2025
Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 31, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 31, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 31, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to false. This reduces memory usage during table statistics collection, for both automatic and manual collection via ANALYZE and CREATE STATISTICS.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 31, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to false. This reduces memory usage during table statistics collection, for both automatic and manual collection via ANALYZE and CREATE STATISTICS.
mgartner
added a commit
to mgartner/cockroach
that referenced
this issue
Jan 31, 2025
Informs cockroachdb#139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to false. This reduces memory usage during table statistics collection, for both automatic and manual collection via ANALYZE and CREATE STATISTICS.
blathers-crl bot
pushed a commit
that referenced
this issue
Jan 31, 2025
Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
blathers-crl bot
pushed a commit
that referenced
this issue
Jan 31, 2025
Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
blathers-crl bot
pushed a commit
that referenced
this issue
Jan 31, 2025
Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to `false`. This reduces memory usage during table statistics collection, for both automatic and manual collection via `ANALYZE` and `CREATE STATISTICS`.
blathers-crl bot
pushed a commit
that referenced
this issue
Jan 31, 2025
Informs #139381 Release note (sql change): Since v23.2 table statistics histograms have been collected for non-indexed JSON columns. Histograms are no longer collected for these columns if `sql.stats.non_indexed_json_histograms.enabled` is set to false. This reduces memory usage during table statistics collection, for both automatic and manual collection via ANALYZE and CREATE STATISTICS.
DrewKimball
added
P-3
Issues/test failures with no fix SLA
and removed
P-1
Issues/test failures with a fix SLA of 1 month
labels
Feb 4, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-sql-table-stats
Table statistics (and their automatic refresh).
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
P-3
Issues/test failures with no fix SLA
T-sql-queries
SQL Queries Team
Consider the schema:
When collecting stats for this table, we'll sample rows with all 3 columns to build histograms from. For both
k
andv
we'll build histograms with up to 200 buckets (by default). Forj
we'll only build a histogram with 2 buckets. Ifj
contains wide values, retaining 10,000+ samples of the value to build only two buckets can consume a lot of memory, e.g., if the average size ofj
is 40kB, then we'll collect 400MB of samples. That's a lot of memory to consume to build a 2-bucket histogram.Below are some concrete things we can do to reduce this memory usage. Note that they could apply to column types other than JSON, too.
1.
Do not sample non-indexed JSON columns.UPDATE: This is done. See #139766.
If a JSON column has anINVERTED INDEX
then a histogram is calculated for the inverted index keys and used during query optimization. If a JSON column has a forwardINDEX
, then the collected histogram may be useful during optimization. But if a JSON column is not indexed, it is unlikely that the 2-bucket histogram that is collected will be useful during optimization. By default we should not generate a histogram in this case—this was actually the default behavior up until #99275 landed in 23.2 and I assume the change in behavior was unintentional. These JSON columns would not need to be sampled if no histogram is being made and memory usage during stats collection would be reduced, especially when the JSON columns are wide.2. Truncate sampled JSON values
For the case when JSON columns have a forward
INDEX
, we should truncate values to limit memory usage both during stats collection and during query optimization when histograms are utilized. We already truncate some types here. Truncating JSON values is non trivial. We should consider sampled value or key-encoded JSON values which would then be trivial to truncate. Note that #139380 may also be required for JSON objects that share a lot of common key/value pairs which are encoded into the prefix of a key or value.3. Do not retain underlying
jsonEncoded
values when sampling virtual computed columnsConsider a slightly different schema than above where
v
has the typeJSON
:Assuming we complete (1), stats collection will still use a lot of memory collecting samples of
v
whenj
is wide. As the sampler computesj->'name'
for each row, the resulting datum will have the underlying type ofjsonEncoded
if thej
datum is alsojsonEncoded
. Critically, the resultjsonEncoded
'svalue []byte
will be a slice ofj
'svalue []byte
. Therefore, even though the result datum will be copied here, the copy will still referencej
's largervalue []byte
, preventing the underlying memory from being reclaimed by the GC. Also, our memory accounting will under-estimate the amount of memory used by samples because it only accounts for the resultjsonEncoded
's bytes, notj
's.The simplest solution is to unconditionally copy bytes withing
jsonEncoded
to a new slice when sampling. This will incur some overhead. Another option is to only copy the bytes if the originaljsonEncoded
bytes are larger than 400-bytes (the limit at which we truncate string-like types). We'd have to keep track of the total capacity of the underlying bytes of ajsonEncoded
through operations like->
to do this.Jira issue: CRDB-46617
The text was updated successfully, but these errors were encountered: