persist: inline small writes directly in consensus State #19383
Conversation
There are still a few small bits of cleanup (noticed I missed some WIPs in src/persist), but gonna mark this ready for review to get the ball rolling.
Mitigations: Completing required mitigations increases Resilience Coverage.
Risk Summary: The pull request carries a high risk, with a score of 83, indicating a significant likelihood of introducing bugs: PRs with similar characteristics have historically been 124% more likely to cause bugs than the repository baseline. This risk is informed by predictors such as the average line count and executable lines within files, and three of the modified files are known hotspots for bugs. While the repository's observed bug trend is increasing, the predicted trend is decreasing. Note: the risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository; the attributes above were deemed the strongest predictors based on that history, and predictors and the score may change as the PR evolves in code, time, and review activity.
Bug Hotspots:
Nice! Overall this feels quite solid, though it is Big, and I'll need another pass to get through the details. (I can see it doesn't seem that easy to split into multiple PRs, except for maybe the testing-macro bits...)
Commit number three is pretty hard to split up, yeah, but the first two commits could pretty easily be peeled off if that makes anything easier! The first one would be missing a good bit of context (e.g. your feedback about
I would prefer this if it's not a pain!
I think I'm good with this one as-is, but thanks!
👍 I started on this optimistically, so it should be incoming shortly.
Okay, the prep PR is up in #26429! Note that I haven't yet addressed any of your initial comments here: as much as possible I kept the commits in that PR the same as they are here (modulo merge conflicts and a couple small tweaks to things like WIP comments). Will push updates as append-only commits to the other PR to make it easy to see the diffs from what you've already looked at.
Okay, I think I've addressed the first round of comments.
I'll rebase this on top of the prep PR at your convenience. (Not blindly doing it now in case you had some pending comments, which github reviews would thoroughly barf on.)
Scalability Framework: Some insights from the scalability run in builds/7220 that was run with
I noticed the same in a previous nightly run I kicked off a while ago, so it seems like a stable result. Adapter internally does batching of writes, so I wouldn't expect it to necessarily be better, but it sure is surprising that it's worse. No particular theories yet about what's going on there.
@bkirwi Went ahead and pushed the rebase so we could pick up the fix for my backpressure bug. As much as possible I tried to keep everything that's changed since your first round of comments in the fixup commit (though I had to resolve the merge conflicts with the other branch -- e.g. removing
@nrainer-materialize It's possible that both of those feature benchmark scenarios are dominated by the cost of doing a read-write transaction, which is pretty inefficient in mz. Something that does a blind write would probably be more impacted by this. Looks like
Thinking about:
- AFAIK this is the first time we've imposed backpressure on writes of any kind. (In practice I suspect writers should all be able to handle some latency during compare-and-append, but it'll be interesting to watch.) In theory I will want to think more about if/how this particular way of inflating state is worse than any other ones callers might invent...
- For very small batches, flushing the batch out to Persist may actually increase the size of state. (Do we want some threshold under which an inline part can be appended no matter what?)
src/persist-client/src/batch.rs
Outdated
"persist_inline_update_max_bytes", | ||
0, | ||
"\ | ||
The (inclusive) maximum total size of inline writes in metadata before \ |
Config nits:
- "Threshold" and "max" don't feel like they convey the distinction you're making here... maybe "single" and "total" or something?
- It's mildly surprising that one of these is inclusive and one is exclusive. (My usual bias is to exclusive upper bounds, but consistency either way seems good.)
"Threshold" and "max" don't feel like they convey the distinction you're making here... maybe "single" and "total" or something?
Yeah, put these in early as placeholders and figured we'd hash it out in review. I like single and total. I also realize they should probably be "inline_write(s)" instead of "inline_updates" for discoverability since I've been pretty consistently calling the feature the former.
Done!
It's mildly surprising that one of these is inclusive and one is exclusive. (My usual bias is to exclusive upper bounds, but consistency either way seems good.)
Fair! I like exclusive here for "threshold" because then we don't need a separate sentinel for "off". Turns out they were both exclusive anyway and the doc was just wrong
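A tiny sketch of that exclusive-bound point; the helper and parameter names below are illustrative, not the real dyncfg plumbing:

```rust
// Illustrative helper, not the real config code. Both limits discussed above
// are exclusive upper bounds, so a configured value of 0 means no part can
// ever qualify: 0 doubles as "inline writes disabled" with no extra sentinel.
fn qualifies_for_inline(encoded_len: usize, single_max_bytes: usize) -> bool {
    encoded_len < single_max_bytes
}

fn main() {
    assert!(!qualifies_for_inline(100, 0)); // 0 => feature is off
    assert!(qualifies_for_inline(100, 4096)); // strictly below the bound
    assert!(!qualifies_for_inline(4096, 4096)); // exclusive: equal does not qualify
}
```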
@@ -176,12 +176,18 @@ pub struct HandleDebugState {
#[serde(tag = "type")]
pub enum BatchPart<T> {
    Hollow(HollowBatchPart<T>),
    Inline {
        updates: LazyInlineBatchPart,
        key_lower: Vec<u8>,
I think it would probably be fine to omit the `key_lower`, in case that seems appealing! For tiny writes it might take up as much space as the write itself, and the main motivation for it (avoiding unnecessary fetches) doesn't really apply here.
Yeah, I'd love to remove this!
I remember adding it because something in iter.rs was failing an assert. That was early last week and a lot was in flux in the branch, so could be that I'm remembering incorrectly or that it's no longer an issue. I'll try it again and see what happens.
In the meantime, were you thinking that we hardcode `vec![]` as the key_lower for inline parts, or that we'd decode it on demand and return the real key lower? The first seems tricky because of the case where we have a run where the last part is inline (and it needs to have a greater key_lower than the previous ones, correct? Maybe this is the assert I was hitting). The second is a bit unfortunate because the bytes are in there, but prost won't let us get them out without allocs for key_offsets, val_offsets, ts, and diff in the decoded proto, and I wasn't sure how hot a path key_lower is.
In the meantime, were you thinking [...]
The former!
The first seems tricky because of the case where we have a run where the last part is inline [...]
I had not been thinking about this case! It seems a bit pathological to have a run with N>0 100MiB parts and then 1 inline part. We could probably ban it, though I appreciate that might be more work than just threading the lower around.
and it needs to have a greater key_lower than the previous ones, correct?
That is certainly the case now; I don't know if anything relies on it, though it seems plausible that we're asserting on it somewhere. If so we could probably relax such an assert, assuming that seems like the easiest path...
"Hardcode `vec![]` and disallow inline in a run if an earlier part is not inline" sounds reasonable to me. I'll try typing it up and let you know how it goes.
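A minimal sketch of the "hardcode `vec![]`" half, using stand-in types; the field set and method name are assumptions, not the exact code in the PR:

```rust
use bytes::Bytes;

// Stand-in types: the real BatchPart is generic over a timestamp and holds a
// LazyInlineBatchPart; a plain `Bytes` keeps the sketch self-contained.
pub struct HollowBatchPart {
    pub key: String,
    pub key_lower: Vec<u8>,
}

pub enum BatchPart {
    Hollow(HollowBatchPart),
    Inline { updates: Bytes },
}

impl BatchPart {
    /// For inline parts the empty key is not a tight lower bound, but it is
    /// always a valid one; a caller that wants a tighter bound has the data
    /// right there anyway.
    pub fn key_lower(&self) -> &[u8] {
        match self {
            BatchPart::Hollow(x) => x.key_lower.as_slice(),
            BatchPart::Inline { .. } => &[],
        }
    }
}
```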
Pushed it up (without "disallow inline in a run if an earlier part is not inline", since it seemed to be working in bin/envd without it), let's see what CI says.
CI and nightlies seem happy with it! \o/
That works! (And should be easy enough to do the other half of it later if we decide it's important.)
Left a few questions, but on reflection I don't think any of them are blocking. Exciting stuff!
AFAIK this is the first time we've imposed backpressure on writes of any kind. (In practice I suspect writers should all be able to handle some latency during compare-and-append, but it'll be interesting to watch.) In theory I will want to think more about if/how this particular way of inflating state is worse than any other ones callers might invent...
In my mental model, because this short-circuits the cmd, the "backpressure" case is the same as our current behavior on main wrt talking to crdb/s3. There is an extra wasted encode to the proto columnar records, but feels like that would be a rounding error compared to the network stuff.
I do think it's "worse" than other ways of inflating state because this is someone using persist totally to spec, whereas e.g. registering a bunch of readers/writers is something we'd push back on as out of spec.
For very small batches, flushing the batch out to Persist may actually increase the size of state. (Do we want some threshold under which an inline part can be appended no matter what?)
Yeah, I think this is totally reasonable. Otoh, given the above, I'm inclined toward what feels to me like the conservative option here and err on the side of keeping a hard limit on the total size of things inline in state for the initial rollout. We can always circle back if we notice this being an issue, so I left a TODO.
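A hedged sketch of that hard limit, with made-up names: when the part is too big on its own or inlining would blow past the total cap, the append falls back to the existing write-to-Blob path, which is the backpressure case described above:

```rust
// Names and shapes here are illustrative, not the real persist internals.
#[derive(Debug, PartialEq)]
enum AppendPlan {
    /// Put the encoded updates directly into consensus State.
    Inline,
    /// Write the part to Blob (s3) first and store only a pointer in State,
    /// exactly as on main today.
    FlushToBlob,
}

fn plan_append(
    encoded_len: usize,
    inline_bytes_in_state: usize,
    single_max_bytes: usize,
    total_max_bytes: usize,
) -> AppendPlan {
    let fits_single = encoded_len < single_max_bytes;
    let fits_total = inline_bytes_in_state + encoded_len < total_max_bytes;
    if fits_single && fits_total {
        AppendPlan::Inline
    } else {
        // The "backpressure" case: no new failure mode, just the pre-existing
        // crdb + s3 write path (plus one wasted columnar-records encode).
        AppendPlan::FlushToBlob
    }
}

fn main() {
    // A 1 KiB write inlines while state has headroom...
    assert_eq!(plan_append(1024, 0, 4096, 1 << 20), AppendPlan::Inline);
    // ...but falls back to Blob once the total cap would be exceeded.
    assert_eq!(plan_append(1024, 1 << 20, 4096, 1 << 20), AppendPlan::FlushToBlob);
}
```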
@@ -87,6 +87,8 @@
"enable_worker_core_affinity": "true",
"persist_batch_delete_enabled": "true",
"persist_fast_path_limit": "1000",
"persist_inline_update_max_bytes": "1048576",
Updated this to 4096 as well in an attempt to maximally stress backpressure in CI. This is not the config I'd expect us to run in prod, which is usually what we put here, but seems worth the discrepancy to me.
src/persist-client/src/iter.rs
Outdated
// Inline parts are only written directly by the user and so may
// be unconsolidated.
//
// WIP: This makes me realize that having the version of the
Oh @bkirwi thoughts here?
It's a good question! Quick thoughts:
- It's not useful for the specific thing that having the writer version has been useful for, or the planned future use of garbage collection. (Edit: realized the first part of this is not quite true, though in practice if we did have issues with the sorting on inline parts I'm not sure we'd want to deal with it in the same way.)
- The fact that these parts are meant to be short-lived makes it less important. (If we regret the encoding, it's pretty feasible to e.g. rewrite all of them instead of keeping compatibility.)
- Seems pretty easy to add later.
So I guess while I think it's a good idea and we may need it sooner or later, I'd be tempted to punt for now. But it's a good callout!
Yeah, on board with all of the above. I think what you have here is the right place to start, and we'll learn more from getting this rolled out!
Okay, modulo the unexpected regressions in the scalability benchmark at high concurrencies, I think this is otherwise good to go! I spent about half a day investigating that yesterday, and was able to verify that my mental model of the crdb and s3 calls happening was correct, but wasn't able to quite explain it. Given that this is a measurable improvement in Insert performance at low concurrencies (the overwhelmingly common case for prod) as well as a notable improvement in update latencies and DDL at all concurrencies, I'm not sure the high-concurrency Insert regression is a blocker. It would be very good to at least understand what's going on, but I don't want this branch to rot, so if I can't get to the bottom of it by mid-day-ish tomorrow, I think it might be time to merge anyway.
Aha! Aljoscha pointed me to the possibility that the writes in the scalability framework's InsertWorkload are all getting consolidated down, and indeed that's the case.
I've also run it with and without inline writes against my staging env so I could look at the persist graphs. It seems more and more like the high-concurrency InsertWorkload thing is an artifact of measurement, rather than a real result. Probably what's happening is something like getting worse batching in coord because the appends are faster.
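For context, a simplified sketch of the consolidation being referred to: updates with the same key, value, and timestamp collapse by summing their diffs, so a workload of identical inserts shrinks dramatically before it ever hits the write path. The types are stand-ins for persist's ((K, V), T, D) updates:

```rust
fn consolidate<'a>(
    mut updates: Vec<((&'a str, &'a str), u64, i64)>,
) -> Vec<((&'a str, &'a str), u64, i64)> {
    // Sort so that identical (key, value, time) triples are adjacent.
    updates.sort();
    let mut out: Vec<((&str, &str), u64, i64)> = Vec::new();
    for ((k, v), t, d) in updates {
        if let Some(((lk, lv), lt, ld)) = out.last_mut() {
            if *lk == k && *lv == v && *lt == t {
                // Same logical update: just add the diffs together.
                *ld += d;
                continue;
            }
        }
        out.push(((k, v), t, d));
    }
    // Updates whose diffs cancel out disappear entirely.
    out.retain(|(_, _, d)| *d != 0);
    out
}

fn main() {
    // Many inserts of the same row at the same timestamp collapse into a
    // single update with a larger diff; canceling pairs vanish.
    let updates = vec![
        (("a", "1"), 10, 1),
        (("a", "1"), 10, 1),
        (("b", "2"), 10, 1),
        (("b", "2"), 10, -1),
    ];
    assert_eq!(consolidate(updates), vec![(("a", "1"), 10, 2)]);
}
```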
//
// The empty key might not be a tight lower bound, but it is a valid
// lower bound. If a caller is interested in a tighter lower bound,
// the data is inline.
👌
Feel free to add an ignore entry for the scalability framework in
This allows us to efficiently use the `Bytes` that we get out of decoding a ProtoColumnarRecords (i.e. no copies).
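A sketch of the lazy-decoding pattern this enables, assuming the prost and bytes crates; the `Lazy` wrapper below is a stand-in for the real LazyProto/LazyInlineBatchPart types, not their actual API:

```rust
use std::marker::PhantomData;

use bytes::Bytes;
use prost::Message;

/// Stand-in for the real LazyProto: hold the still-encoded message as a
/// zero-copy `Bytes` slice of the enclosing state proto, and only pay for a
/// full prost decode when the part is actually "fetched".
pub struct Lazy<M> {
    encoded: Bytes,
    _type: PhantomData<fn() -> M>,
}

impl<M: Message + Default> Lazy<M> {
    /// Wrap already-encoded bytes without copying them.
    pub fn from_encoded(encoded: Bytes) -> Self {
        Lazy { encoded, _type: PhantomData }
    }

    /// Cheap: callers that only care about sizes (e.g. controllers on envd)
    /// never trigger a decode.
    pub fn encoded_len(&self) -> usize {
        self.encoded.len()
    }

    /// Expensive: the full prost decode, deferred until the data is needed.
    pub fn decode(&self) -> Result<M, prost::DecodeError> {
        M::decode(self.encoded.clone())
    }
}
```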
Previously, a compare_and_append operation always required writing the data to Blob (s3) and then only putting a pointer to it in State (stored in Consensus/crdb). The latencies of s3 puts are both considerably higher and have a much fatter tail than crdb latencies. So, inline small batch parts directly where we would have put the pointer.
- The threshold for "small" is made a LD param, so we can tune it.
- Skip keeping pushdown stats for these parts.
- Represent them using the new LazyProto abstraction developed for part stats. This keeps the inline parts as bytes::Bytes (which is in turn just a slice of the encoded ProtoStateRollup/ProtoStateDiff), only decoding them when they would have been "fetched". This considerably lessens the cost of deserializing these in places that aren't interested in the updates themselves (e.g. the controllers on envd).
Okay, CI and nightlies are happy, we have a plausible theory for the scalability bench results at high concurrencies, the scalability bench shows improvements for low-concurrency Insert, all-concurrency Update, and all-concurrency DDL, and the feature bench shows improvement in the new ManySmall{Insert,Update} scenarios. I'm going to go ahead and get this merged so we can try turning it on in staging next week. TFTRs everyone!!
🙌
Previously, a compare_and_append operation always required writing the
data to Blob (s3) and then only putting a pointer to it in State (stored
in Consensus/crdb). The latencies of s3 puts are both considerably
higher and have a much fatter tail than crdb latencies. So, inline small
batch parts directly where we would have put the pointer.
- The threshold for "small" is made a LD param, so we can tune it.
- Skip keeping pushdown stats for these parts.
- Represent them using the new LazyProto abstraction developed for part
  stats. This keeps the inline parts as bytes::Bytes (which is in turn
  just a slice of the encoded ProtoStateRollup/ProtoStateDiff), only
  decoding them when they would have been "fetched". This considerably
  lessens the cost of deserializing these in places that aren't
  interested in the updates themselves (e.g. the controllers on envd).
Touches MaterializeInc/database-issues#7413
Motivation
Tips for reviewer
Checklist
$T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.