
[design doc] attribution and profiling #29762

Merged
merged 3 commits into MaterializeInc:main from the design-doc-attribution-profiling branch
Oct 17, 2024

Conversation

mgree
Contributor

@mgree mgree commented Sep 26, 2024

How should we attribute profiling information?

Rendered markdown.

Motivation

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@sthm
Contributor

sthm commented Sep 27, 2024

There are several possible profiling metrics to track---current memory usage, peak memory usage, current number of rows, row throughput, worker skew, latency (average, P50, P99, max), and timestamps. Which should we do first? (I would propose some form of memory usage and some form of latency.)

From my experience, the most challenging things to debug today are memory usage and worker skew. Latency tends to be the result of a problem that can be identified looking at memory usage and worker skew. CPU usage is sometimes useful to optimize hydration time as well. So if we can only expose two metrics in the initial version, I would suggest memory usage and worker skew.

@mgree
Contributor Author

mgree commented Oct 2, 2024

... So if we can only expose two metrics in the initial version, I would suggest memory usage and worker skew.

I'm not 100% clear on how these metrics are stored for dataflows, but I suspect that the hardest part is mapping MIR nodes to dataflow IDs---after which finding any given metric ought to be merely a question of lookup.
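To illustrate the point, here is a minimal sketch (hypothetical names and shapes, not Materialize's actual API or catalog): once a mapping from MIR nodes to dataflow operator IDs exists, attributing any per-operator metric back to an MIR node reduces to a lookup and a sum.

```python
# Hypothetical sketch: attribute per-operator metrics back to MIR nodes.
# The mapping and metrics tables below are invented for illustration;
# in practice they would come from the optimizer and introspection sources.

# Assumed mapping produced during lowering: MIR node id -> dataflow operator ids
mir_to_operators = {
    "mir_join_0": [101, 102],
    "mir_reduce_1": [103],
}

# Assumed per-operator metrics, keyed by dataflow operator id
operator_metrics = {
    101: {"memory_bytes": 4096, "records": 120},
    102: {"memory_bytes": 2048, "records": 80},
    103: {"memory_bytes": 1024, "records": 40},
}

def attribute_metric(mir_node: str, metric: str) -> int:
    """Sum a metric over all dataflow operators attributed to an MIR node."""
    return sum(operator_metrics[op][metric]
               for op in mir_to_operators.get(mir_node, []))

print(attribute_metric("mir_join_0", "memory_bytes"))  # 6144
```

The hard part, as noted above, is constructing `mir_to_operators` reliably; once it exists, swapping in a different metric (worker skew, latency) only changes which table is consulted.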

@bosconi
Member

bosconi commented Oct 7, 2024

@sthm, to connect the dots between your and @mgree's last comments: from talking with Michael, I believe the plan is to ship memory stats in the first version.

In fact, memory is used as the sample stat in the new section on implementation strategy.

@mgree mgree merged commit dde6b6f into MaterializeInc:main Oct 17, 2024
9 checks passed
@mgree mgree deleted the design-doc-attribution-profiling branch October 17, 2024 13:51