Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

persist: Add more testing around stats #27117

Merged
merged 4 commits into from
May 16, 2024

Conversation

ParkMyCar
Copy link
Member

This PR adds some more testing around our Part statistics. Specifically it adds two new tests:

  1. A proptest for correctness. We generate arbitrary ColumnTypes, and use that ColumnType to generate an arbitrary Vec<Row>, then we calculate stats on that collection of Rows and assert that every Row would be included in the stats.
  2. A test for stats stability. We use proptest with a constant seed to generate 1,000 instances of RelationDescs with at most 4 columns, then a collection of at most 8 Rows for these RelationDescs. We generate statistics for all 1,000 scenarios and then take a JSON snapshot of the stats. This test helps us track if any changes occur to our statistics generation.

I'm curious what folks thoughts are on the second test, I'm more than happy to not merge it and use it only to validate #27009, if we don't think it provides a ton of signal.

Motivation

Protect against stats breaking, e.g. in changes like #27009

Tips for reviewer

The PR is broken up into 2 commits:

  1. Proptest strategies to generate Datums from a ColumnType, and adding the first test.
  2. The snapshot test.

Checklist

@ParkMyCar ParkMyCar requested review from a team May 15, 2024 21:43
@ParkMyCar ParkMyCar force-pushed the persist/stats-testing branch from 5dfb8b4 to 8eb7cef Compare May 15, 2024 21:49
@ParkMyCar ParkMyCar requested review from bkirwi and danhhz May 15, 2024 21:53
@ggevay
Copy link
Contributor

ggevay commented May 16, 2024

arb_datum_for_column could also be used for scalar function property testing! For each scalar function, we could generate some random inputs of the appropriate type using arb_datum_for_column, call the function, and check whether propagates_nulls, is_monotone, and all the other things on LazyUnaryFunc are actually true about the results we got. I'll open a follow-up PR for this. I'd be willing to bet a non-trivial amount of money that this will reveal some bugs. E.g., just a few minutes ago we found some wrong inverse annotations on some exotic cast functions. (We already have test_smoketest_all_builtins, but that currently uses some hardcoded interesting_datums as inputs. Making it use random inputs should be more effective.)

@danhhz
Copy link
Contributor

danhhz commented May 16, 2024

Very cool! @bkirwi mind taking the review on this one?

@ParkMyCar
Copy link
Member Author

ParkMyCar commented May 16, 2024

@ggevay nice! Glad y'all can also find a use for the proptest impls here!

Copy link
Contributor

@bkirwi bkirwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

src/repr/src/scalar.rs Outdated Show resolved Hide resolved
@ParkMyCar ParkMyCar enabled auto-merge (squash) May 16, 2024 18:12
@ParkMyCar ParkMyCar merged commit eba1071 into MaterializeInc:main May 16, 2024
74 checks passed
@ParkMyCar ParkMyCar deleted the persist/stats-testing branch January 13, 2025 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants