Add more performance metrics #71
Comments
These are just my suggestions; other ideas for useful metrics are, of course, welcome!
Thanks for the detailed answer @volovyks! I am actually more interested in every sender delay except the HTTP call itself. For example, how much time is lost inside the deque field of the MessageQueue is something we can actually optimize, and it could potentially be a big contributor to the overall latency. But it could also be that we spend more time waiting on a lock before we can even insert the message into the MessageQueue. Thus, I suggested measuring everything between "poke returns a message" and "message is now sent on the wire" as a separate metric.
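For illustration, here is a minimal sketch of how the time a message sits in the queue could be measured. The `InstrumentedQueue` and `TimedMessage` types are hypothetical stand-ins, not the actual MessageQueue in the codebase:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Hypothetical message type; the real one lives in the node's messaging code.
struct TimedMessage {
    payload: Vec<u8>,
    enqueued_at: Instant,
}

// Hypothetical queue wrapper: records when each message entered the deque so
// the time it spends waiting there can be observed on dequeue.
#[derive(Default)]
struct InstrumentedQueue {
    deque: VecDeque<TimedMessage>,
}

impl InstrumentedQueue {
    fn push(&mut self, payload: Vec<u8>) {
        self.deque.push_back(TimedMessage { payload, enqueued_at: Instant::now() });
    }

    /// Returns the next message together with how long it sat in the queue.
    fn pop(&mut self) -> Option<(Vec<u8>, Duration)> {
        self.deque
            .pop_front()
            .map(|msg| (msg.payload, msg.enqueued_at.elapsed()))
    }
}
```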
Probably, yes. But it seems we currently don't know this number, and we need it to understand whether our observed protocol latency is reasonable or not. Also, metrics that are not supposed to change can be valuable, as it is easy to spot when something goes wrong. For example, when nodes crash, maybe it takes extra rounds to resolve the issue.
Kind of. Instead of "Time spent sending messages" I would describe it as "Client overhead for sending messages". This is all the time spent on processing a send request by the cryptographic protocol, except the actual HTTP call. It's the overhead added by our client code that I want to know. The motivation is to find, at a high level, where most of the latency goes. For now, this will tell us if it is worth optimizing anything between "poke returns a message" and "message is on the wire". In the future, it will also be useful for triaging performance issues in production when they happen, to quickly find out if the message-sending component is the culprit.
No, my intention was to have a time per protocol invocation, not per message round. Maybe "network delay" is bad naming. I really mean the delay of everything that is not on the node collecting the metric. From a node's perspective in distributed systems, there are local delays and remote (or network) delays. That's what I meant. In summary, this diagram shows the different stages that I want to measure, locally:

```
Poke called <------------------------------
    |                                     ^
    | (poke_delay)                        |
    V                                     |
Poke returns Action::SendPrivate          |
    |                                     |
    | (client_send_delay)                 |
    V                                     |
HTTP Call starts                          | repeat until done
    |                                     | protocol_time = the total delay of all rounds
    | (HTTP delay)                        |
    V                                     |
HTTP response arrived                     |
    |                                     |
    --------------------------------------|
```
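As a minimal sketch of where these timers could go around a single round: the `Action` enum, `poke`, and `send_over_http` below are stand-ins for the real protocol types, not the actual API.

```rust
use std::time::Instant;

// Stand-ins for the real protocol types; only here to show where the timers go.
enum Action {
    SendPrivate(Vec<u8>),
    Wait,
}

fn poke() -> Action {
    Action::Wait // placeholder; the real poke drives the cryptographic protocol
}

fn send_over_http(_msg: &[u8]) {
    // placeholder for the actual HTTP client call
}

fn run_one_round() {
    // poke_delay: local CPU time spent inside poke() itself.
    let poke_start = Instant::now();
    let action = poke();
    let poke_delay = poke_start.elapsed();

    if let Action::SendPrivate(msg) = action {
        // client_send_delay: everything between "poke returned a message"
        // and "the HTTP call starts" (queueing, locks, encryption, ...).
        let send_start = Instant::now();
        // ... enqueue, encrypt, acquire locks ...
        let http_start = Instant::now();

        // HTTP delay: the call itself, until the response arrives.
        send_over_http(&msg);
        let http_delay = http_start.elapsed();

        let client_send_delay = http_start - send_start;
        println!("poke: {poke_delay:?}, client send: {client_send_delay:?}, http: {http_delay:?}");
    }
}
```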
These are things we can measure. But I am starting to think this is not a great idea. Maybe we should just keep a time measure for how much time we spend in the state where poke returns `Action::Wait`.
Using presignature as an example, I think what happens in our code is more like this:
Each protocol will go through an undetermined number of rounds of pokes (because there might be many `Action::Wait` returned at the end of a poke); the wall-clock time between generator creation and protocol completion is going to be the sum of these per-round delays.
Here, all delays except for the HTTP delay can potentially be improved. protocol_time is already instrumented. In terms of metric instrumentation, among all these delays, the easy ones to implement are:
And the harder ones to implement (this is after a first scan of the code, so it may change as I look closer):
Follow-up on the harder ones to implement (a rough bookkeeping sketch follows this list):
- client_send_delay: need to track, for each sent message's presignature id, the ending timestamp of the last poke
- http delay: need to attribute the sent messages to a presignature id
- between-poke delay: need to track, per presignature id, when the last message was sent (the timestamp when the response is received)
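A minimal sketch of that bookkeeping, assuming a numeric presignature id; the type and struct names here are illustrative, not the ones in the codebase:

```rust
use std::collections::HashMap;
use std::time::Instant;

// Hypothetical id type; the real presignature id lives in the protocol code.
type PresignatureId = u64;

// Rough per-presignature bookkeeping for the "harder" metrics above:
// remember when the last poke for an id ended and when its last message
// finished sending, so the gaps between those events can be attributed.
#[derive(Default)]
struct DelayTracker {
    last_poke_end: HashMap<PresignatureId, Instant>,
    last_message_sent: HashMap<PresignatureId, Instant>,
}

impl DelayTracker {
    fn poke_finished(&mut self, id: PresignatureId) {
        self.last_poke_end.insert(id, Instant::now());
    }

    fn message_sent(&mut self, id: PresignatureId) {
        let now = Instant::now();
        if let Some(poke_end) = self.last_poke_end.get(&id) {
            // Delay between the end of the poke that produced the message and
            // the moment its send completed (response received).
            let send_delay = now.duration_since(*poke_end);
            println!("presignature {id}: send delay {send_delay:?}");
        }
        self.last_message_sent.insert(id, now);
    }
}
```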
I reviewed that PR; yes, the metrics calculations will be affected, because the way I implement the metrics now assumes that:
I think in #93, the thing that breaks the two assumptions is that message sending will now happen not after each round of poking but in parallel at random times, and there's no guarantee of exhausting the messages. My question for @ChaoticTempest is:
Why are we calculating based on the batch of messages and not per message?
Potentially
For both of these, the cait-sith protocol has roughly 6 rounds; it should not advance if it does not receive the equivalent information needed to advance the rounds. However, I have not looked too deeply into it, so it could skip rounds and do all sorts of things, and having a message appear in the next round is a possibility. We shouldn't be making this assumption about our protocols either way, because that would mean knowing their internals and would potentially lead us to adding race conditions.
> Why are we calculating based on the batch of messages and not per message?

> would potentially lead us to adding race conditions.
If b) and c) cannot be assumed, then it is hard to conclude when each round ends. I'd suggest we take a closer look at the internals.
The above will confirm: 1) whether there are actually always 6 rounds; 2) if not, what it looks like instead; 3) the time distribution among rounds and the wait time before actually starting the protocol. And in the messaging layer PR, maybe @ChaoticTempest can add:
This does not perfectly match what we initially wanted (a per-round breakdown), but it will get us started and let us see if certain delays are obviously high.
Well, since messages are being sent async now, we don't really care so much about how long they take to be sent. So protocol time for presignature simply is now:
Because we're now going to be assuming the protocols will take exactly 6 rounds, when in the future they can take any number. For example, EdDSA will only have two rounds of communication. Hardcoding for specific rounds is not ideal and would be leaking the internals of how protocols work just for the sake of metrics, which would likely be broken later as we make further changes.
So, with what I said before, I don't recommend having a per-round breakdown at all. Just the per-protocol time and HTTP time would suffice.
I think with the async implementation now, I'll probably just add the following metrics:
So total presignature latency = presignature completion timestamp − generator creation timestamp, which should equal before-poke delay + accrued wait time. We can then check the difference between accrued wait time and HTTP delay to see if the time spent outside of HTTP is significant; that time could be comprised of:
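A minimal sketch of the latency identity above; all names and parameters here are illustrative, not the actual metric names:

```rust
use std::time::{Duration, Instant};

// Checks the identity described above:
// total_latency = completion - generator creation
//              ~= before-poke delay + accrued wait time
fn latency_breakdown(
    generator_created: Instant,
    completed: Instant,
    before_poke_delay: Duration,
    accrued_wait_time: Duration,
    total_http_delay: Duration,
) {
    let total_latency = completed.duration_since(generator_created);
    let accounted_for = before_poke_delay + accrued_wait_time;
    // A large gap between these two would mean some stage is not being measured.
    println!("total: {total_latency:?}, accounted for: {accounted_for:?}");

    // Wait time that is *not* explained by the HTTP calls themselves.
    let outside_http = accrued_wait_time.saturating_sub(total_http_delay);
    println!("wait time outside of HTTP: {outside_http:?}");
}
```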
Yeah, this sounds very reasonable. :) Per-protocol metrics should be good enough, since we only want to figure out how much time we spend on each stage; then we can narrow down where the bottleneck is. Per-round breakdowns could be nice in some cases but are hopefully not required. Also, the time spent in the
I will add the mentioned metrics!
I realized the HTTP delay I proposed in the issue earlier is hard to add up, because between the batches of messages that get sent out, one batch's sending timeline (between when it's ready to send and when it's finished sending) could overlap with another batch's, so it is hard to say how to sum the HTTP delay across the different batches of messages that belong to one protocol. So here I think we can look at SEND_ENCRYPTED_LATENCY (the per-partition send HTTP delay) and the count of pokes per protocol, and roughly estimate the total HTTP delay per protocol as SEND_ENCRYPTED_LATENCY * count of pokes per protocol. So the list of metrics I will add now:
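A tiny sketch of that rough estimate; the function name and the example values are made up:

```rust
use std::time::Duration;

// Rough estimate described above (not exact accounting): total HTTP delay per
// protocol ~= average SEND_ENCRYPTED_LATENCY x number of pokes for that protocol.
fn estimate_total_http_delay(avg_send_encrypted_latency: Duration, pokes: u32) -> Duration {
    avg_send_encrypted_latency * pokes
}

fn main() {
    // e.g. a 40 ms average send latency and 6 pokes ~= 240 ms of total HTTP delay
    let estimate = estimate_total_http_delay(Duration::from_millis(40), 6);
    println!("estimated total HTTP delay: {estimate:?}");
}
```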
Yes, I think I see what you mean with the HTTP delay. I like your solution. With the latest changes in #94, things are looking good for the listed metrics. One metric from my original list seems absent, though: the one about local CPU time spent on an actual poke() call.
I think it would make sense to add that as well, in a separate follow-up PR.
Description
Goal: Have Grafana metrics that tell us how much of the protocol invocation delay is due to CPU work, how much is due to network delays, and how much is due to unnecessary overhead.
Measuring each of these directly may be difficult. But we can measure a couple of things and get to the same high-level result.
We can have the following metrics, with tags per protocol type:

- `poke`: measure how much CPU time we spend on it and sum it up for all `poke` calls for the protocol from its creation until it is done. Submit the total poke time per protocol invocation to a histogram metric.
- Start a timer when `poke` returns `Action::SendMany` or `Action::SendPrivate`, and stop it once we are done sending the messages. Submit each time individually to a histogram metric.

Each of these data points is collected locally on each node individually but aggregated as a global histogram. Showing the histogram in Grafana directly can already be useful. Additionally, we can use the sum & count aggregations from the collected histogram to show more interesting statistics:
Ideally, CPU time and network delay should make up >90% of the total time, with the sending overhead being small. Also, the network delay should be close to the observed ping delay. Otherwise, we have to investigate.
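A sketch of how the two histograms above could be declared, assuming the `prometheus` crate and `once_cell`; the metric names, label, and `record_example` helper are invented for illustration and are not the ones in the codebase:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Illustrative metric names, labelled per protocol type.
static PROTOCOL_POKE_CPU_TIME: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "protocol_poke_cpu_time_seconds",
        "Total CPU time spent in poke() per protocol invocation",
        &["protocol"]
    )
    .unwrap()
});

static MESSAGE_SEND_OVERHEAD: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "message_send_overhead_seconds",
        "Time from poke() returning a message until it is sent on the wire",
        &["protocol"]
    )
    .unwrap()
});

fn record_example() {
    // The observed values would come from timers like the ones sketched in the comments above.
    PROTOCOL_POKE_CPU_TIME
        .with_label_values(&["presignature"])
        .observe(0.012);
    MESSAGE_SEND_OVERHEAD
        .with_label_values(&["presignature"])
        .observe(0.003);
}
```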
Edit 2025-01-02: adding the diagram
```
Poke called <------------------------------
    |                                     ^
    | (poke_delay)                        |
    V                                     |
Poke returns Action::SendPrivate          |
    |                                     |
    | (client_send_delay)                 |
    V                                     |
HTTP Call starts                          | repeat until done
    |                                     | protocol_time = the total delay of all rounds
    | (HTTP delay)                        |
    V                                     |
HTTP response arrived                     |
    |                                     |
    --------------------------------------|
```