Add chain node timing metrics #11011

Merged
merged 11 commits into from
Feb 21, 2025

Conversation

mhofman
Member

@mhofman mhofman commented Feb 18, 2025

closes: #10960

Description

Formalizes some chain timing measurements and reports them as explicit OpenTelemetry metrics.

Security Considerations

None

Scaling Considerations

Each histogram creates several raw metrics (depending on bucketing), and this reports 8 new histogram metrics, resulting in about 100 new raw values.
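As a rough illustration of where an estimate like "about 100" can come from, the sketch below counts raw series per histogram. It assumes a Prometheus-style exposition (one cumulative bucket per boundary, a `+Inf` bucket, and the `_sum` and `_count` series). The boundary list and the helper name `rawSeriesPerHistogram` are hypothetical and are not taken from this PR.

```javascript
// Hypothetical bucket boundaries in seconds (not the values used in this PR).
const exampleBoundaries = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20, 30];

/**
 * Count the raw series a Prometheus-style exposition carries for one
 * histogram: one cumulative bucket per boundary, one +Inf bucket, and the
 * _sum and _count series.
 *
 * @param {number[]} boundaries
 * @returns {number}
 */
const rawSeriesPerHistogram = boundaries => boundaries.length + 1 + 2;

const histogramCount = 8; // from the PR description
const totalRawSeries =
  histogramCount * rawSeriesPerHistogram(exampleBoundaries);

console.log(totalRawSeries); // 104 with these example boundaries
```

The exact total depends on the real bucket definitions and on how the exporter represents histograms, so this only shows the order of magnitude.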

Documentation Considerations

None external.

Testing Considerations

TBD

Upgrade Considerations

None of this affects consensus, but because it is part of the chain software, it requires deploying new chain software.

@mhofman mhofman requested a review from a team as a code owner February 18, 2025 08:21
@mhofman mhofman requested a review from AgoricTriage February 18, 2025 08:21
@mhofman mhofman self-assigned this Feb 18, 2025
@mhofman mhofman requested a review from gibson042 February 18, 2025 08:22
@mhofman mhofman removed their assignment Feb 18, 2025
@gibson042 gibson042 force-pushed the chain-node-timing-metrics branch from dd71f6a to d28eb12 Compare February 20, 2025 15:18
cloudflare-workers-and-pages bot commented Feb 20, 2025

Deploying agoric-sdk with Cloudflare Pages

Latest commit: a149714
Status: ✅ Deploy successful!
Preview URL: https://74a4058c.agoric-sdk.pages.dev
Branch Preview URL: https://chain-node-timing-metrics.agoric-sdk.pages.dev

@gibson042 gibson042 force-pushed the chain-node-timing-metrics branch from d28eb12 to 0f0d765 Compare February 20, 2025 16:54
@gibson042
Member

@mhofman I don't think you can approve a PR that started off as yours, so let's handle it less formally and I'll apply my approval as a proxy for yours (once granted).

Member Author

@mhofman mhofman left a comment

Looks good overall.

I agree that switching to performance.now() might be better, and I think you have made the necessary refactoring for that. Should we go all the way in this PR?

I'm not sure after all we should calculate block lag after the "after commit hangover". It might be better to just measure as close to the cosmos triggers as possible.

@@ -496,6 +568,12 @@ export async function launch({
initialQueueLengths,
});

const blockMetrics = Object.fromEntries(
Member Author

Do we have proper typing for this? fromEntries tends to erase key types, at least.

Member

Updated.
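
For readers unfamiliar with the typing concern in this thread: TypeScript's built-in declaration for `Object.fromEntries` returns an object with a plain `string` index signature, so a union of known keys is lost. One way to keep it is a thin wrapper with a JSDoc cast, sketched below. The helper `fromTypedEntries` and the metric names are hypothetical, and this is not necessarily the fix that was applied in the PR.

```javascript
// @ts-check

/**
 * Like Object.fromEntries, but keeps the key union in the result type.
 * Hypothetical helper for illustration; not taken from the PR.
 *
 * @template {string} K
 * @template V
 * @param {Iterable<readonly [K, V]>} entries
 * @returns {Record<K, V>}
 */
const fromTypedEntries = entries =>
  /** @type {Record<K, V>} */ (Object.fromEntries(entries));

// Hypothetical metric names, for illustration only.
const metricNames = /** @type {const} */ (['blockLagSeconds', 'runSeconds']);

/** @typedef {typeof metricNames[number]} MetricName */

const blockMetrics = fromTypedEntries(
  metricNames.map(
    name => /** @type {[MetricName, number[]]} */ ([name, []]),
  ),
);

// The checker now knows this key exists and holds a number[].
blockMetrics.blockLagSeconds.push(1.5);
```

The cast is unchecked, so the wrapper is only as trustworthy as the entries passed to it.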

Comment on lines 1278 to 1279
const blockLag = times.previousAfterCommitBlockPosix - blockTime * 1000;
times.previousAfterCommitBlockPosix = toPosix(hangoverTimestamp);
Member Author

The name here is somewhat confusing. I'm actually wondering if it wouldn't be better to calculate the lag as purely the cosmos lag, i.e. ignoring the afterCommitHangover time, and use the beginBlockTimestamp instead.

Member

Done.
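
A minimal sketch of the suggestion in this thread, assuming (from the `blockTime * 1000` in the snippet under review) that the block time arrives in seconds and local timestamps are in milliseconds since the epoch. The function name `computeBlockLagMs` and its parameter names are hypothetical, and the merged code may differ.

```javascript
/**
 * Lag between the consensus block time and the moment this node started
 * processing the block, ignoring any after-commit hangover.
 *
 * @param {number} beginBlockTimestampMs local wall-clock time (ms since epoch)
 *   captured when begin-block processing started (assumed name)
 * @param {number} blockTimeSeconds block time from the block header, in seconds
 * @returns {number} lag in milliseconds
 */
const computeBlockLagMs = (beginBlockTimestampMs, blockTimeSeconds) =>
  beginBlockTimestampMs - blockTimeSeconds * 1000;

// Example: the header says 1_740_000_000 s and the node began the block
// 2.35 s later by its own clock.
const lagMs = computeBlockLagMs(1_740_000_002_350, 1_740_000_000);
console.log(lagMs); // 2350
```

Because the two timestamps come from different clocks (the consensus time and the node's clock), the result also includes any clock skew between them.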

cloudflare-workers-and-pages bot commented Feb 20, 2025

Deploying agoric-sdk with Cloudflare Pages

Latest commit: 95fbd76
Status: ⚡️ Build in progress...

@gibson042
Member

I agree that switching to performance.now() might be better, and I think you have made the necessary refactoring for that. Should we go all the way in this PR?

Not just yet; Date.now() is good enough that I don't consider that to be urgent.
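
For context on the trade-off being discussed: `Date.now()` reads the wall clock, which can jump when the system time is adjusted, while `performance.now()` is monotonic and has sub-millisecond resolution. A sketch of a duration helper with an injectable clock, which makes switching between the two a one-line change, follows. The name `makeStopwatch` is hypothetical and is not from the PR.

```javascript
/**
 * Start a stopwatch on the given clock and return a function that reports
 * the elapsed milliseconds since it was started.
 *
 * @param {() => number} now clock returning milliseconds
 * @returns {() => number}
 */
const makeStopwatch = now => {
  const start = now();
  return () => now() - start;
};

// Wall-clock based: good enough for coarse timings, but can be skewed by
// system clock adjustments.
const wallElapsed = makeStopwatch(() => Date.now());

// Monotonic: never goes backwards (performance is a global in Node.js 16+
// and in browsers).
const monotonicElapsed = makeStopwatch(() => performance.now());

console.log(wallElapsed(), monotonicElapsed());
```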

Member Author

@mhofman mhofman left a comment

Approved

@gibson042 gibson042 force-pushed the chain-node-timing-metrics branch from 7dc75d4 to 95fbd76 Compare February 20, 2025 23:40
@gibson042 gibson042 added the automerge:rebase Automatically rebase updates, then merge label Feb 20, 2025
Deploying agoric-sdk with Cloudflare Pages

Latest commit: 95fbd76
Status: ✅ Deploy successful!
Preview URL: https://d28cbc07.agoric-sdk.pages.dev
Branch Preview URL: https://chain-node-timing-metrics.agoric-sdk.pages.dev

gibson042 and others added 11 commits February 20, 2025 19:52
await background tasks closer to attribution so we can measure them
…ails

The buckets have so much overlap that it makes sense to use a single
shared definition for all of them except blockLag (which now extends
that definition to include more boundaries beyond 30 seconds).
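
The shared-definition approach described in this commit message could look roughly like the sketch below. The boundary values and the names `sharedBucketBoundaries` and `blockLagBucketBoundaries` are hypothetical, since the real values are not shown in this conversation. Only the shape (one base list, extended past 30 seconds for blockLag) follows the message.

```javascript
// Hypothetical boundaries in seconds, shared by most timing histograms.
const sharedBucketBoundaries = Object.freeze([0.1, 0.5, 1, 2, 5, 10, 30]);

// blockLag reuses the shared list and adds boundaries beyond 30 seconds
// (these extra values are also hypothetical).
const blockLagBucketBoundaries = Object.freeze([
  ...sharedBucketBoundaries,
  60,
  120,
  300,
]);

console.log(blockLagBucketBoundaries.length); // 10
```

Deriving the longer list from the shorter one keeps the two in sync if the shared boundaries are ever tuned.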
@gibson042 gibson042 force-pushed the chain-node-timing-metrics branch from 95fbd76 to a913df3 Compare February 21, 2025 00:52
Deploying agoric-sdk with Cloudflare Pages

Latest commit: a913df3
Status: ⚡️ Build in progress...

Deploying agoric-sdk with Cloudflare Pages

Latest commit: a913df3
Status: ✅ Deploy successful!
Preview URL: https://340589de.agoric-sdk.pages.dev
Branch Preview URL: https://chain-node-timing-metrics.agoric-sdk.pages.dev

@mergify mergify bot merged commit 2a71f04 into master Feb 21, 2025
84 checks passed
@mergify mergify bot deleted the chain-node-timing-metrics branch February 21, 2025 01:36
Labels
automerge:rebase Automatically rebase updates, then merge
Development

Successfully merging this pull request may close these issues.

Improve block timings metrics
2 participants