Understand current performance bottlenecks in signature generation #72
Comments
If anyone already has data to help understand the performance bottlenecks better, or knows a good way to find it out, please share it here. :)
Related issue with relevant analysis: #32
Are you suggesting replacing Prometheus with Tracing, or building these new metrics using it?
No, I would still keep Prometheus for old and new metrics. Tracing is an additional tool for deeper performance investigations. It works more generally, including in places where we haven't added metrics, and it can potentially give more fine-grained information, telling you exactly how many microseconds are spent in each function. But it might require adding more tracing instrumentation to be useful.

I see that logs are already emitted with the tracing macros (e.g. …). This data is then consumed by a tracing subscriber, and this code suggests you have that integrated with Google's Stackdriver. I'm not familiar with Stackdriver, but perhaps that's all you need to look at the execution traces in detail, and you could use it already today for investigating performance issues. Maybe you can find out which functions a node spends most of its time in.

In nearcore, Jaeger is used for presenting the traces. The tokio documentation chapter "Next steps with Tracing" has an example of how to set this up. But I wouldn't add something new before understanding what you already have (Stackdriver).
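For reference, a minimal sketch of what that instrumentation looks like, assuming the tracing and tracing-subscriber crates (the function name and fields here are illustrative, not the project's actual code):

```rust
use tracing::{info, instrument};

// `#[instrument]` opens a span for the duration of each call; the
// installed subscriber can then report how long the span stayed open.
#[instrument]
fn generate_triple(participants: usize) {
    info!("starting triple generation");
    // ... cryptographic work would happen here ...
}

fn main() {
    // A formatting subscriber that also prints an event when each span
    // closes, including the time spent inside it.
    tracing_subscriber::fmt()
        .with_span_events(tracing_subscriber::fmt::format::FmtSpan::CLOSE)
        .init();

    generate_triple(3);
}
```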
Google provides Google Cloud Profiler and Tracing functionality. However, neither is turned on for our project, and they do not appear to support Rust projects natively. But I understand what you mean: profiling and its flame graphs should give us many insights.
So, as with any tracing profiler, we will need to send traces that make the visualization useful. I have used GCP Tracing before, albeit only as a simple functionality check. We can use either Google or Grafana, as both support tracing via OpenTelemetry.
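A hypothetical sketch of that wiring, assuming the opentelemetry-otlp, opentelemetry_sdk, tracing-opentelemetry, and tracing-subscriber crates (the builder calls follow the older 0.x pipeline API and change between crate versions; the endpoint is a placeholder):

```rust
use opentelemetry_otlp::WithExportConfig;
use tracing_subscriber::prelude::*;

fn init_tracing() -> Result<(), Box<dyn std::error::Error>> {
    // Export spans over OTLP to a local OpenTelemetry collector, which
    // can forward them to Cloud Trace, Grafana Tempo, Jaeger, etc.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://localhost:4317"),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    // Bridge `tracing` spans into OpenTelemetry and install globally.
    tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();
    Ok(())
}
```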
Description
Currently, when the dev network starts and generates triples and pre-signatures at full speed, we only observe around 40 messages per second on each node (see graph and comment here).
We need to understand what the limiting factors are.
In a perfect world, we are only limited by the CPU time it takes to perform the cryptographic work and by network delay. We can achieve this if we ensure three things:
1. The protocol's cryptographic computation (Protocol::poke) runs whenever there is work to do.
2. Outgoing messages are sent over the network as soon as they are produced.
3. Incoming messages are handed to the protocol (Protocol::message) as soon as they arrive.
In theory, these three tasks can always run in parallel without blocking each other. Any time one of them is not running, we potentially introduce overhead that could be avoided.
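To make that shape concrete, here is a hypothetical sketch (all names are illustrative, not the project's actual types) of the three tasks running concurrently under tokio, connected by channels:

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
struct Message(Vec<u8>);

#[tokio::main]
async fn main() {
    // incoming: network -> protocol; outgoing: protocol -> network
    let (in_tx, mut in_rx) = mpsc::channel::<Message>(1024);
    let (out_tx, mut out_rx) = mpsc::channel::<Message>(1024);

    // Task 1: cryptographic work. Consumes incoming messages and
    // produces outgoing ones; ideally it never waits on the other two.
    let crypto = tokio::spawn(async move {
        while let Some(msg) = in_rx.recv().await {
            // ... feed msg to the protocol, poke it, collect output ...
            let _ = out_tx.send(msg).await; // placeholder: echo back out
        }
    });

    // Task 2: send outgoing messages as soon as they are produced.
    let sender = tokio::spawn(async move {
        while let Some(_msg) = out_rx.recv().await {
            // ... write to the network ...
        }
    });

    // Task 3: receive from the network and hand off immediately.
    let receiver = tokio::spawn(async move {
        // ... read from the network, then: ...
        let _ = in_tx.send(Message(vec![])).await;
    });

    let _ = tokio::join!(crypto, sender, receiver);
}
```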
In practice, Protocol::poke and Protocol::message both require mutable access to Protocol, so they will not be able to run in parallel on the same Protocol instance. But presumably Protocol::message doesn't do actual work and only records the incoming message (to be checked), so the overhead of doing it serially should be minimal. In any case, we should try to find out if any of these three tasks is delayed significantly due to implementation inefficiencies.
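A sketch of what that serialization looks like, assuming only the trait shape described above (both methods take &mut self); everything else is hypothetical:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical trait shape matching the description above: both
// methods need `&mut self`, so calls on one instance are serialized.
trait Protocol {
    fn poke(&mut self);
    fn message(&mut self, from: u32, data: Vec<u8>);
}

// Wrapping the protocol in a mutex serializes `poke` and `message`.
// If `message` only records the bytes, the lock is held briefly and
// `poke` (the expensive cryptographic step) dominates.
fn deliver<P: Protocol>(protocol: &Arc<Mutex<P>>, from: u32, data: Vec<u8>) {
    let mut guard = protocol.lock().unwrap();
    guard.message(from, data); // cheap: should only buffer the message
}

fn compute<P: Protocol>(protocol: &Arc<Mutex<P>>) {
    let mut guard = protocol.lock().unwrap();
    guard.poke(); // expensive: actual cryptographic work
}
```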
Possible steps
- Use tracing to see if anything looks suspicious. See here for how it's done in nearcore.
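For a quick first measurement before wiring up full tracing infrastructure, one could also time suspected hot spots directly; a minimal sketch (the helper and names are hypothetical, and a real change would hook into the project's existing metrics or spans):

```rust
use std::time::Instant;
use tracing::debug;

// Wrap any suspected hot spot and log how long it took; the fields end
// up in whatever tracing subscriber is installed.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    debug!(label, elapsed_us = start.elapsed().as_micros() as u64);
    result
}

fn main() {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();
    let _ = timed("poke", || {
        // stand-in for the actual Protocol::poke call
        std::thread::sleep(std::time::Duration::from_millis(5));
    });
}
```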