
Understand current performance bottlenecks in signature generation #72

Open
jakmeier opened this issue Dec 19, 2024 · 6 comments

@jakmeier
Contributor

Description

Currently, when the dev network starts and generates triples and pre-signatures at full speed, we observe only around 40 messages per second on each node (see graph and comment here).

We need to understand what the limiting factors are.

In a perfect world, we are limited only by the CPU time it takes to perform the cryptographic work and by the network delay. We can achieve this if we ensure three things.

  1. Incoming messages are always immediately delivered to cait-sith through Protocol::message.
  2. We are always calling Protocol::poke until it tells us to wait.
  3. Any messages generated are immediately sent.

In theory, these three tasks can always run in parallel without blocking each other. Whenever one of them is not running, we potentially introduce overhead that could be avoided.

In practice, Protocol::poke and Protocol::message both require mutable access to the Protocol, so they cannot run in parallel on the same Protocol instance. But presumably Protocol::message does no actual work and only records the incoming message (to be checked), so the overhead of doing it serially should be minimal (see the sketch below).
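To make the interaction between the three tasks concrete, below is a minimal sketch of a single-threaded driver loop. The `Protocol`, `Action`, and `Participant` types here are local stand-ins mirroring my understanding of cait-sith's API (the real trait lives in the cait-sith crate and its poke returns a Result), so treat the exact signatures as assumptions.

```rust
use std::collections::VecDeque;

// Local stand-ins mirroring (my understanding of) cait-sith's Protocol API.
// The real types live in the cait-sith crate; exact signatures may differ.
pub struct Participant(pub u32);
pub enum Action<T> {
    Wait,                              // nothing to do until more messages arrive
    SendMany(Vec<u8>),                 // broadcast to all participants
    SendPrivate(Participant, Vec<u8>), // send to a single participant
    Return(T),                         // protocol finished with an output
}
pub trait Protocol {
    type Output;
    fn message(&mut self, from: Participant, data: Vec<u8>);
    fn poke(&mut self) -> Action<Self::Output>;
}

/// Drive a protocol instance: deliver buffered messages (task 1), poke until
/// `Wait` (task 2), and hand generated messages to `send` immediately (task 3).
pub fn drive<P: Protocol>(
    protocol: &mut P,
    incoming: &mut VecDeque<(Participant, Vec<u8>)>,
    mut send: impl FnMut(Option<Participant>, Vec<u8>),
) -> Option<P::Output> {
    // Task 1: incoming messages go straight into the protocol.
    while let Some((from, data)) = incoming.pop_front() {
        protocol.message(from, data);
    }
    // Tasks 2 and 3: keep poking and forward generated messages right away.
    loop {
        match protocol.poke() {
            Action::Wait => return None,
            Action::SendMany(data) => send(None, data),
            Action::SendPrivate(to, data) => send(Some(to), data),
            Action::Return(output) => return Some(output),
        }
    }
}
```

In this serial form, task 1 and tasks 2/3 share the same `&mut` protocol instance; running them truly in parallel would require a lock or a channel in front of it, which is exactly the constraint described above.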

In any case, we should try to find out if any of these three tasks is delayed significantly due to implementation inefficiencies.

Possible steps

  1. We can add specific metrics to help us understand how much time we spend on each task. (Created as a separate issue: Add more performance metrics #71)
  2. Look at general execution traces from tracing to see if anything looks suspicious. See here for how it's done in nearcore.
    • Note: I see we have some tracing tooling in the code base already, but I am not sure how much it is used and how well it is maintained. It might be worth investing some time here, adding appropriate spans and setting up good tooling to look at the timing of the traces (see the example after this list).
  3. Observe machines while they are under load with general tools like htop(1), iotop(8), perf(1), or more specialized tools like tokio-console if we can compile the nodes with tokio's unstable features (--cfg tokio_unstable) enabled.
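As a concrete starting point for step 2 and the note about spans, here is what adding spans with the macros already in use could look like. The function and span names are made up for illustration; only the macros themselves come from the tracing crate.

```rust
use tracing::{info_span, instrument};

// Hypothetical function names, for illustration only.
#[instrument(skip_all, level = "debug")]
fn handle_incoming_message(from: u32, data: &[u8]) {
    // ... deliver `data` from `from` to Protocol::message ...
    let _ = (from, data); // placeholder body
}

fn poke_until_wait() {
    // An explicit span around the poke loop; it stays open until the
    // guard is dropped at the end of this scope.
    let _span = info_span!("poke_until_wait").entered();
    // ... call Protocol::poke until it returns Wait, sending outputs immediately ...
}
```

Paired with a subscriber that records span timings (e.g. tracing-subscriber's fmt layer configured to log span close events, or an exporter as discussed in the comments below), this tells us how long each task takes per call.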
@jakmeier
Contributor Author

If anyone already has data to help understand the performance bottlenecks better, or knows a good way to find it out, please share it here. :)

@jakmeier
Contributor Author

Related issue with relevant analysis: #32

@volovyks
Contributor

Are you suggesting replacing Prometheus with Tracing? Or building these new metrics using it?
It looks promising, but it will require refactoring.

@jakmeier
Contributor Author

jakmeier commented Jan 2, 2025

No, I would still keep Prometheus for old and new metrics.

Tracing is an additional tool for deeper performance investigations. It can work more generally, including in places where we haven't added metrics. And it can potentially give more fine-grained information, telling you exactly how many microseconds are spent in each function. But it might require you to add more tracing instrumentation to the code to be useful.

I see that logs are already emitted with the tracing macros (e.g. tracing::warn!) and we even have some span info added with macros, too (e.g. tracing::info_span! and #[tracing::instrument]). This means you are already producing at least some tracing data.

This data is then consumed by a tracing subscriber. This code suggests you have this integrated with Google's Stackdriver. I'm not familiar with Stackdriver, but perhaps that's all you need to look at the execution traces in detail, and you could use it already today to investigate performance issues. Maybe you can find out which functions a node spends most of its time in.

In nearcore, Jaeger is used for presenting the traces. The tokio documentation chapter Next steps with Tracing has an example of how to set this up (roughly sketched below). But I wouldn't add anything new before understanding what you already have (Stackdriver).
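For reference, the setup from that tokio chapter boils down to roughly the following. The crates (tracing-subscriber, tracing-opentelemetry, opentelemetry-jaeger) are real, but their builder APIs have changed between versions and the service name is a placeholder, so this is a version-dependent sketch rather than drop-in code.

```rust
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;

fn init_tracing() -> Result<(), Box<dyn std::error::Error>> {
    // Export spans to a locally running Jaeger agent (API varies by crate version).
    let tracer = opentelemetry_jaeger::new_agent_pipeline()
        .with_service_name("mpc-node") // placeholder service name
        .install_simple()?;

    // Bridge tracing spans into OpenTelemetry and also keep console logging.
    tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .with(tracing_subscriber::fmt::layer())
        .try_init()?;
    Ok(())
}
```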

@volovyks
Contributor

volovyks commented Jan 2, 2025

Google provides Google Cloud Profiler and Tracing functionality. However, it is not turned on for our project, and it does not appear to support Rust projects natively. But I understand what you mean. Profiling and its flame graphs should give us many insights.
@auto-mausx have you worked with it? Have you seen it working for Rust? I have not found much information about it.

@auto-mausx
Contributor

As with any tracing profiler, we will need to send traces that make the visualization useful. I have used GCP Tracing before, albeit just for a simple functionality check. We can use Google or Grafana, as they both support tracing via OpenTelemetry.
