
[design doc] attribution and profiling #29762

Merged
merged 3 commits into MaterializeInc:main from the design-doc-attribution-profiling branch
Oct 17, 2024

Conversation

mgree
Contributor

@mgree mgree commented Sep 26, 2024

How should we attribute profiling information?

Rendered markdown.

Motivation

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@sthm
Contributor

sthm commented Sep 27, 2024

There are several possible profiling metrics to track---current memory usage, peak memory usage, current number of rows, row throughput, worker skew, latency (average, P50, P99, max), and timestamps. Which should we do first? (I would propose some form of memory usage and some form of latency.)

From my experience, the most challenging things to debug today are memory usage and worker skew. Latency tends to be the result of a problem that can be identified looking at memory usage and worker skew. CPU usage is sometimes useful to optimize hydration time as well. So if we can only expose two metrics in the initial version, I would suggest memory usage and worker skew.

@mgree
Contributor Author

mgree commented Oct 2, 2024

... So if we can only expose two metrics in the initial version, I would suggest memory usage and worker skew.

I'm not 100% clear on how these metrics are stored for dataflows, but I suspect that the hardest part is mapping MIR nodes to dataflow IDs---after which finding any given metric ought to be merely a question of lookup.
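To illustrate the point, here is a minimal sketch (hypothetical names and shapes, not Materialize's actual API or catalog): once a mapping from MIR nodes to dataflow operator IDs exists, attributing any per-operator metric back to an MIR node reduces to a lookup and a sum.

```python
# Hypothetical sketch: attribute per-operator metrics back to MIR nodes.
# The mapping and metrics tables below are invented for illustration;
# in practice they would come from the optimizer and introspection sources.

# Assumed mapping produced during lowering: MIR node id -> dataflow operator ids
mir_to_operators = {
    "mir_join_0": [101, 102],
    "mir_reduce_1": [103],
}

# Assumed per-operator metrics, keyed by dataflow operator id
operator_metrics = {
    101: {"memory_bytes": 4096, "records": 120},
    102: {"memory_bytes": 2048, "records": 80},
    103: {"memory_bytes": 1024, "records": 40},
}

def attribute_metric(mir_node: str, metric: str) -> int:
    """Sum a metric over all dataflow operators attributed to an MIR node."""
    return sum(operator_metrics[op][metric]
               for op in mir_to_operators.get(mir_node, []))

print(attribute_metric("mir_join_0", "memory_bytes"))  # 6144
```

The hard part, as noted above, is constructing `mir_to_operators` reliably; once it exists, swapping in a different metric (worker skew, latency) only changes which table is consulted.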

@bosconi
Member

bosconi commented Oct 7, 2024

@sthm, to connect the dots between your and @mgree's last comments: from talking with Michael, I believe the plan is to ship memory stats in the first version.

In fact, memory is used as the sample stat in the new section on implementation strategy.

@mgree mgree merged commit dde6b6f into MaterializeInc:main Oct 17, 2024
9 checks passed
@mgree mgree deleted the design-doc-attribution-profiling branch October 17, 2024 13:51