-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
build: Parallelize 'run-tests' #1408
Open
ronh-rs
wants to merge
9
commits into
I0fced9dd8d763c87c33bca3931284ef13765ebca
Choose a base branch
from
I8211a10ba41179d2b0ed1d888588863011df5afc
base: I0fced9dd8d763c87c33bca3931284ef13765ebca
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
build: Parallelize 'run-tests' #1408
ronh-rs
wants to merge
9
commits into
I0fced9dd8d763c87c33bca3931284ef13765ebca
from
I8211a10ba41179d2b0ed1d888588863011df5afc
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fix issue with joining on textual vs numerical or temporal columns. We always use the left side column to scan and the right side column to do the lookup, hence when the right side column is text, then the non-textual column will be converted into text in order to perform the lookup, which might not be properly matched. Concider textual values: '100', '0100', '000100', which all should match numerical value 100, but when we convert number 100 to text, we will get '100', so it will find only one match. But if we select the textual column to scan, and the numerical one to do the lookup, the above example will find all the matches. Fixes: READ-4974 Change-Id: I5f956c9692e9d859de38fb783da749a02c1d191c Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8454 Tested-by: Buildkite CI Reviewed-by: Jason Brown <[email protected]>
During restart of the replicator, we need to find the correct position to start streaming events. We have two concept of positions, the schema position, which is updated after snapshot, and the per table position, which is updated as each table gets events, or until we update the min lower bound using ReplicationAction::LogPosition. Our restart happens in three steps, first we attempt to snapshot new tables, then we catch up with events between min applied position and the current upstream position, then we start streaming events from the current upstream position. In order to find the min position, we were using ReplicationOffset::max_offset, which has a clash between its name saying it returns the max offset and documentation saying it returns the min offset. The function is actually returning the max offset. When starting readyset, we were getting the max_offset, then running snapshot, which was updating the schema offset to current position, and then running catch-up. In case of an error during catch-up, we will restart the replicator, which causes us to fetch the max_offset again, and this time it will be equal to upstream current position, causing the catch-up to be skipped on the second time. This patch changes the logic to properly use the min_offset when starting replicator. Fixes: REA-5098 Change-Id: I79b28005ce53964508319222a2f45807a9cd0ac0 Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8541 Tested-by: Buildkite CI Reviewed-by: Johnathan Davis <[email protected]>
readysetbot
force-pushed
the
I8211a10ba41179d2b0ed1d888588863011df5afc
branch
2 times, most recently
from
January 3, 2025 16:53
e22980a
to
c37868c
Compare
Note that we actually will have multiple threads with the same name (e.g. tokio threads), but we can't use the thread ID (turning it into a `u64` is behind an unstable feature flag), so they get collected together into one counter for all threads with the same name. This mainly means that multiple adapter threads will be aggregated, and at our current stage I don't think we would get anything meaningful out of breaking that up anyhow (i.e. we couldn't identify which tokio task was responsible anyway). So this mainly enables separating out domains to be able to identify if the problem is in dataflow, since domains do have a 1-1 thread-name correspondence. Also note that although we drop the counters for threads that are no longer reported, the metrics crate has no facility for "unregistering" metrics, so those counters will be reset to 0 and then continuously reported. This will likely lead to clutter as multiple compaction threads, specifically, will continue to be reported after they have ended. Change-Id: Idf7a08b18fb85681e368c07c3d94ef5d4f4ac0bf Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8418 Reviewed-by: Jason Brown <[email protected]> Tested-by: Buildkite CI
This also adds a default `MALLOC_CONF` which enables profiling on startup, but leaves it inactive, with a default sample rate and interval that in my limited testing results in reasonable fidelity with about 3 dumps per minute during a small workload with lots of upqueries and evictions. Change-Id: I2e6b41e137e152ba21f2fda5cdf9e1b973127f7a Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8502 Tested-by: Buildkite CI Reviewed-by: Jason Brown <[email protected]>
Currently can only be enabled, disabled, and manually dumped; other options like sample rate and dumping to a file at an interval can still only be configured via the `MALLOC_CONF` environment variable (or `_RJEM_MALLOC_CONF` on macOS). Can be used like so: ```sh $ curl -X POST localhost:6033/jemalloc/profiling/activate Memory profiling activated $ curl localhost:6033/jemalloc/profiling/dump > jeprof.heap $ jeprof --svg target/release/readyset jeprof.heap > jeprof.heap.svg $ curl -X POST localhost:6033/jemalloc/profiling/deactivate Memory profiling deactivated ``` Change-Id: Ifc8efa3ccd67d01123b42e0588a45f2c26c947cd Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8503 Reviewed-by: Jason Brown <[email protected]> Tested-by: Buildkite CI
Not seeing a worthwhile way to deduplicate this other than putting http handling in the readyset-alloc crate which doesn't seem ideal, so this is just copy and pasted. As explained in the comment, having this in both places (port 6033 and 6034, effectively) might give the false impression that you can toggle profiling on or off just for the adapter and not the server, and vice versa, but that is not the case in standalone deployments where they are in the same process and share a jemalloc instance. Despite the duplication and the risk of confusion, I think it's worth having it in both places so that one day when we have a distributed deployment, we aren't asking ourselves why we need a restart to turn on profiling in the adapter but not the server. Change-Id: I89ecf1b284893148f67933cd3f12acbccdbf983d Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8504 Tested-by: Buildkite CI Reviewed-by: Jason Brown <[email protected]>
readysetbot
force-pushed
the
I8211a10ba41179d2b0ed1d888588863011df5afc
branch
from
January 3, 2025 17:34
c37868c
to
02e4941
Compare
`cargo test` only runs test in parallel within one test module. [`cargo nextest`] can execute tests in parallel across crates, but in doing so executes each test in its own process (supposedly for reduced flakiness). This means our existing in-process mutexes provided by the `serial_test` crate don't work when running under nextest. We write our own `#[serial]` procedural macro which, in addition to adding the same existing in-process mutex from the `serial_test` crate, also renames tests such that they can be identified by the nextest filterset DSL. (Although it has a limited ability to parse `cfg` attrs, there is no facility for inspecting arbitrary attrs attached to the test or anything like that; otherwise we could just use the existing `#[serial]` and somehow specify it in the filterset.) This allows creating [test groups] which don't allow parallel execution within them. We then get the benefits of parallel test execution across crates; so for the bulk of the time, 1 test in the mysql group is running and 1 test in the postgres group is running, getting us to (in a simple test on an M3 Max) 181% CPU usage and a commensurate 81% speedup in clock time. ```sh $ time TZ=UTC RUN_SLOW_TESTS=1 cargo nextest run \ -E 'not package(readyset-clustertest)' ... Summary [ 393.304s] 2667 tests run: 2667 passed, 84 skipped 620.70s user 207.41s system 181% cpu 7:35.10 total $ time TZ=UTC RUN_SLOW_TESTS=1 cargo test --all \ --exclude readyset-clustertest ... 710.37s user 154.68s system 104% cpu 13:44.92 total ``` [`cargo nextest`]: https://nexte.st/ [test groups]: https://nexte.st/docs/configuration/test-groups/ Change-Id: I75a6b762bb18b4ebe7927cfe60eb7cefea038ee8 Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8507 Tested-by: Buildkite CI Reviewed-by: Johnathan Davis <[email protected]>
This won't make a runtime huge difference in CI, but exercises the new nextest configuration. Change-Id: I0fced9dd8d763c87c33bca3931284ef13765ebca Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8544 Tested-by: Buildkite CI Reviewed-by: Ron Hough <[email protected]>
Because nextest does not support running doctests, we are parallelizing the run-tests.sh run in buildkite to run nextest in one job and doctests in another. This means spending a little more on CPU, but shortening the total wall clock duration of a build (plus gaining the features provided by nextest). Change-Id: I8211a10ba41179d2b0ed1d888588863011df5afc Reviewed-on: https://gerrit.readyset.name/c/readyset/+/8547 Tested-by: Buildkite CI Reviewed-by: Michael Zink <[email protected]>
readysetbot
force-pushed
the
I8211a10ba41179d2b0ed1d888588863011df5afc
branch
from
January 3, 2025 22:27
02e4941
to
57c4997
Compare
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Because nextest does not support running doctests, we are parallelizing
the run-tests.sh run in buildkite to run nextest in one job and doctests
in another. This means spending a little more on CPU, but shortening
the total wall clock duration of a build (plus gaining the features
provided by nextest).