
Add more performance metrics #71

Open · 3 tasks
jakmeier opened this issue Dec 19, 2024 · 15 comments · May be fixed by #94
Comments

@jakmeier
Contributor

jakmeier commented Dec 19, 2024

Description

Goal: Have Grafana metrics that tell us how much of the protocol invocation delay is due to CPU work, how much is due to network delays, and how much is due to unnecessary overhead.

Measuring each of these directly may be difficult. But we can measure a couple of things and get to the same high-level result.

We can have the following metrics, with tags per protocol type:

  • protocol_time: Measure the wall-clock time from initiating a protocol until it finishes. (Probably already exists as a metric in some form.)
  • poke_time: Every time we call poke, measure how much CPU time we spend on it and sum it up over all poke calls for the protocol, from its creation until it is done. Submit the total poke time per protocol invocation to a histogram metric.
  • client_send_delay: Start the clock anytime we process an Action::SendMany or Action::SendPrivate and stop it once we are done sending the messages. Submit each time individually to a histogram metric.

Each of these data points is collected locally on each node individually but aggregated as a global histogram. Showing the histogram in Grafana directly can already be useful. Additionally, we can use the sum & count aggregations from the collected histogram to show more interesting statistics:

  • Relative CPU time: Display ( poke_time_sum / protocol_time_sum ) to show what percentage of the time we are CPU-bound.
  • Message rounds per protocol invocation: Display (client_send_delay_count / protocol_time_count) to show how many messages we send (globally) per protocol invocation.
  • Relative sending overhead: (client_send_delay_sum / protocol_time_sum) to show the percentage of the time we are spending on sending messages out.
  • Average network delay: (protocol_time_sum - poke_time_sum - client_send_delay_sum) / protocol_time_count

Ideally, CPU time and network delay should make up >90% of the total time, with the sending overhead being small. Also, the network delay should be close to the observed ping delay. Otherwise, we have to investigate.
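To make the collection side concrete, here is a minimal sketch of how the three histograms could be registered, assuming the prometheus crate with once_cell for lazy statics; metric names and the "protocol" label are illustrative assumptions, not the node's actual metrics code:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Illustrative metric definitions; names and labels are assumptions.
static PROTOCOL_TIME: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "protocol_time_seconds",
        "Wall-clock time from protocol initiation until it finishes",
        &["protocol"] // e.g. "triple", "presignature", "signature"
    )
    .unwrap()
});

static POKE_TIME: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "poke_time_seconds",
        "Total poke() time per protocol invocation",
        &["protocol"]
    )
    .unwrap()
});

static CLIENT_SEND_DELAY: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "client_send_delay_seconds",
        "Time from Action::SendMany/SendPrivate until the messages are handed to the wire",
        &["protocol"]
    )
    .unwrap()
});
```

The derived statistics above then come straight from the exported _sum and _count series in Grafana, e.g. relative CPU time as poke_time_seconds_sum / protocol_time_seconds_sum.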

edit 2025-01-02, adding the diagram

         Poke called    <----------------------------------------
                |                                               ^ 
                |   (poke_delay)                                | 
                V                                               | 
  Poke returns Action::SendPrivate                              | 
                |                                               | 
                |   (client_send_delay)                         | 
                V                                               | 
      HTTP Call starts                                          | repeat until done
                |                                               | protocol_time = the total delay of all rounds
                |   (HTTP delay)                                | 
                V                                               | 
      HTTP response arrived                                     | 
                |                                               | 
                ------------------------------------------------| 
@jakmeier
Contributor Author

These are just my suggestions; other ideas for useful metrics are, of course, welcome!

@volovyks
Contributor

Yes, protocol time exists, but it may need some unification:
(screenshots omitted)

For client_send_delay we have this (screenshot omitted):
Is that what you are referring to?

Message rounds per protocol invocation - I expect this to be static. But you never know.

Can you elaborate on Relative sending overhead? Is it Time spent on sending messages/Time spent on protocol in total?

Average network delay - if there are multiple rounds of sending messages, do you want to know the average time it takes to perform this operation?

@jakmeier
Contributor Author

jakmeier commented Jan 2, 2025

Thanks for the detailed answer @volovyks !

I think the existing SEND_ENCRYPTED_LATENCY only covers the time for the actual HTTP call. This is an important number to know but also one that is hard to optimize, since it is mostly a result of the infrastructure connecting different nodes.

I am actually more interested in every sender delay except the HTTP call itself. For example, how much time is lost inside the deque field of the MessageQueue is something we can actually optimize, and it could potentially be a big contributor to the overall latency. But it could also be that we spend more time waiting on a lock before we can even insert the message into the MessageQueue. Thus, I suggested measuring everything between "poke returns a message" and "message is now sent on the wire" as a separate metric.
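To illustrate where that measurement would sit, a rough sketch (the QueuedMessage struct and the call site are hypothetical; CLIENT_SEND_DELAY is the histogram sketched in the issue description):

```rust
use std::time::Instant;

// Hypothetical: each queued message remembers when poke() produced it.
struct QueuedMessage {
    created_at: Instant,
    payload: Vec<u8>,
}

// Called by the sender right before the HTTP call for `msg` starts, so the
// observed value covers queue insertion, lock waits, and queue dwell time,
// but not the HTTP call itself.
fn observe_client_send_delay(protocol: &str, msg: &QueuedMessage) {
    CLIENT_SEND_DELAY
        .with_label_values(&[protocol])
        .observe(msg.created_at.elapsed().as_secs_f64());
}
```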

Message rounds per protocol invocation - I expect this to be static. But you never know.

Probably, yes. But it seems we currently don't know this number, and we need it to understand whether our observed protocol latency is reasonable or not.

Also, metrics that are not supposed to change can be valuable, as it is easy to spot when something goes wrong. For example, when nodes crash, maybe it takes extra rounds to resolve the issue.

Can you elaborate on Relative sending overhead? Is it Time spent on sending messages/Time spent on protocol in total?

Kind of. Instead of "Time spent sending messages" I would describe it as "Client overhead for sending messages". This is all the time spent processing a send request issued by the cryptographic protocol, except the actual HTTP call. It's the overhead added by our client code that I want to know.

The motivation is to find, on a high level, where most of the latency goes. For now, this will tell us if it is worth optimizing anything between "poke returns a message" and "message is on the wire". In the future, it will also be useful for triaging performance issues in production when they happen, to quickly find out if the message-sending component is the culprit.

Average network delay - if there are multiple rounds of sending messages, do you want to know the average time it takes to perform this operation?

No, my intention was to have a time per protocol invocation, not per message round. Maybe "network delay" is bad naming. I really mean the delay of everything that's not on the node collecting the metric. From a node's perspective in distributed systems, there are local delays and remote (or network) delays. That's what I meant.

In summary, this diagram shows the different stages that I want to measure, locally:

         Poke called    <----------------------------------------
                |                                               ^ 
                |   (poke_delay)                                | 
                V                                               | 
  Poke returns Action::SendPrivate                              | 
                |                                               | 
                |   (client_send_delay)                         | 
                V                                               | 
      HTTP Call starts                                          | repeat until done
                |                                               | protocol_time = the total delay of all rounds
                |   (HTTP delay)                                | 
                V                                               | 
      HTTP response arrived                                     | 
                |                                               | 
                ------------------------------------------------| 

These are things we can measure. I thought with Average network delay we can calculate the remote delay by subtracting all local delays from the total.

But I'm starting to think this is not a great idea. Maybe we should just measure how much time we spend in the state where poke returns Output::Wait until it returns something else again. Not entirely sure.

@ppca
Contributor

ppca commented Jan 10, 2025

Using presignature as an example, I think what happens in our code is more like this:

   generator created
                |
                |   (before first poke delay)
                V
         Poke called    <----------------------------------------
                |                                               ^
                |   (poke delay)                                |
                V                                               |
  Poke returns Action                                           |
                |                                               | (between poke delay)
                |   (client_send_delay)                         |
                V                                               |
      HTTP Call starts                                          | repeat until done
                |                                               | protocol_time = the total delay of all rounds + before first poke delay
                |   (HTTP delay)                                |
                V                                               |
      HTTP response arrived                                     |
                |                                               |
                ------------------------------------------------|

Each protocol will go through an undetermined number of rounds of pokes (because many pokes may return Action::Wait), so the wall-clock time between generator creation and protocol completion is going to be

protocol_time = before_poke_delay + sum(poke_delay) + sum(client_send_delay) + sum(http_delay) + sum(between_poke_delay)

Here, all delays except for http delay can potentially be improved.

protocol_time is already instrumented as PRESIGNATURE_LATENCY in the code, and its value = the time when poke() returns Action::Return - the time when the corresponding presignature generator was created.

In terms of metric instrumentation, among all these delays, the easy ones to implement are:

  • before_poke_delay
  • poke delay

And the harder ones to implement (this is after a first scan of the code, so it may change as I look closer):

  • client_send_delay: need to track, for each sent message's presignature id, the ending timestamp of the last poke; then client_send_delay = current time - last poke ending timestamp
  • http delay: need to attribute the sent messages to a presignature id
  • between poke delay: need to track, per presignature id, when the last message was sent (the timestamp when the response is received)

@ppca
Contributor

ppca commented Jan 10, 2025

follow up on the harder ones to implement:

client_send_delay: need to track down for each sent message's presignature id, what is the ending timestamp of the last poke

  • will use the latest message timestamp per presignature id in the message queue. A message's timestamp is assigned when the message is put into the message queue, which should be the last time the protocol was poked: in our current code (async fn progress<C: CryptographicCtx + Send + Sync>), we first poke all protocols (triple, presig, sig) until there is nothing left to poke, adding any Action::SendPrivate or Action::SendMany messages to the message queue with the timestamp of that moment, and only then send the messages.

http delay: need to attribute the sent messages to presignature id

  • each presignature message comes with its presignature id

between poke delay: need to track down per presignature id, when is the last message sent (timestamp when the response is received)

  • we can add a field in PresignatureManager to track the latest update timestamp of a protocol (poke and message sending included); then between poke delay = start of this poke - latest timestamp of this protocol (see the sketch below)
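A minimal sketch of that bookkeeping; the type and method names are hypothetical, not the actual PresignatureManager API:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type PresignatureId = u64; // assumed; substitute the node's real id type

// Hypothetical per-protocol bookkeeping for the "between poke" delay.
#[derive(Default)]
struct ProtocolTimings {
    last_activity: HashMap<PresignatureId, Instant>,
}

impl ProtocolTimings {
    /// Call at the start of a poke; returns the gap since the last recorded
    /// activity (previous poke end or last finished send) for this protocol.
    fn on_poke_start(&mut self, id: PresignatureId) -> Option<Duration> {
        let now = Instant::now();
        let gap = self
            .last_activity
            .get(&id)
            .map(|last| now.duration_since(*last));
        self.last_activity.insert(id, now);
        gap
    }

    /// Call whenever a message for `id` has finished sending.
    fn on_send_done(&mut self, id: PresignatureId) {
        self.last_activity.insert(id, Instant::now());
    }
}
```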

@ppca ppca linked a pull request Jan 10, 2025 that will close this issue
@volovyks
Contributor

volovyks commented Jan 13, 2025

The client_send_delay metric, which is probably the most important one, can be affected by #93
Now protocols emit messages without knowing when they are sent. Potentially, that increases between_poke_delay by client_send_delay + http_delay.
We can add metrics directly to our messaging layer, such as message_queue_time (the time a message spends in the queue). cc @ChaoticTempest

@ppca
Contributor

ppca commented Jan 13, 2025

I reviewed that PR; yes, the metrics calculations will be affected, because the way I implement the metrics now assumes that:
1) all messages that are generated by poking will be exhaustively sent right after poking;
2) messages generated by the same poking will be sent together in one batch
(according to the code in mpc/chain-signatures/node/src/protocol/cryptography.rs), so I could calculate:

  • client_send_delay = timestamp right before sending the batch of messages - earliest timestamp among all messages per presignature id
  • http_delay = timestamp right after finishing sending the batch of messages - timestamp right before sending the batch of messages
  • between_poke_delay = timestamp of the next poke start - timestamp of the last finished send

I think in #93, the thing that breaks the two assumptions is that message sending will no longer happen after each round of poking, but in parallel at arbitrary times, and there's no guarantee of exhausting the messages.
I could still track a client_send_delay per message = time right before sending - time of message creation (it's in the message struct), and still track http_delay per message, but it's going to be hard to calculate the delay per round.

My question for @ChaoticTempest is:
a) let's say after poking a protocol, the action is SendMany; is it possible in your PR that some of the messages that are part of SendMany are sent together, and some of them are sent the next time MessageExecutor::execute() runs?
b) if the scenario in a) happens and 4 of the participants received a message, but the rest of the participants did not receive a message as part of this SendMany from my node, and my node then does poke() on the corresponding protocol, will the action always be Wait?
c) let's say after poking a protocol, the action is SendPrivate; is it possible that before this message is sent, poke() on the corresponding protocol will return another Action that's not Wait?

@ChaoticTempest
Contributor

I reviewed that PR, yes the metrics calculations will be affected. Because the way I implement the metrics now assume that:
1)all messages that are generated by poking will be exhaustively sent right after poking;
2)messages generated by the same poking will be sent in a batch altogether
(according to code in mpc/chain-signatures/node/src/protocol/cryptography.rs, so I could calculate:
client_send_delay = timestamp right before sending batch messages - timestamp of the earliest timestamp in all messages per presignature id
http_delay = timestamp right after finish sending batch messages - timestamp right before sending batch messages
between_poke_delay = timestamp of next poke start - timestamp of last time finish sending

Why are we calculating based on the batch of messages and not per message?

a) let's say after poking a protocol, action is SendMany, is it possible in your PR that some of the messages as part of SendMany are sent together, and some of them are sent in the next time MessageExecutor::execute() runs?

Potentially

b) if the scenario in a) happens and 4 of the participants received a message, but the rest of the participants did not receive message as part of this SendMany from my node, then my node does poke() on the corresponding protocol, will the action always be Wait?
c) let's say after poking a protocol, action is SendPrivate, is it possible that before this message is sent, poke() on the corresponding protocol will return another Action that's not Wait?

For both of these: the cait-sith protocol has roughly 6 rounds, and it should not advance if it does not receive the information needed to advance a round. However, I have not looked too deeply into it, so it could skip rounds and do all sorts of things, and having a message appear in the next round is a possibility. We shouldn't be making this assumption about our protocols either way, because that would mean knowing their internals and would potentially lead us to adding race conditions.

@ppca
Contributor

ppca commented Jan 14, 2025

Why are we calculating based on the batch of messages and not per message?

  • because we want to track, from generator creation to finishing presig generation, how much time is spent on each part. Since we don't send messages one by one, if we calculate per message and add them all up, the total is going to be much larger than the actual time between poke and send.

would potentially lead us to adding race conditions.

if b) and c) cannot be assumed, then it is hard to conclude when each round ends. I'd suggest we take a closer look at the internals.
Meanwhile, I could add metrics that measure:

  • The time from generator creation to the first poke that returns SendMany/SendPrivate, per protocol. This is basically the delay before we actually start processing the protocol.
  • The time from one poke that returns SendMany/SendPrivate to the next one that returns SendMany/SendPrivate/Return. This is basically the total time taken per round.
  • The count of pokes that return SendMany/SendPrivate/Return, per protocol. This is roughly the number of rounds.

The above will confirm 1) whether there are actually always 6 rounds; 2) if not, what it looks like; 3) the time distribution among rounds and the wait time before actually starting the protocol.
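A sketch of the per-generator bookkeeping this implies; all names are illustrative, and submitting the values to histograms is left out:

```rust
use std::time::Instant;

// Illustrative per-generator state for the three measurements listed above.
struct RoundStats {
    created_at: Instant,
    first_send_at: Option<Instant>,
    last_advance_at: Option<Instant>,
    rounds: u64,
}

impl RoundStats {
    fn new() -> Self {
        Self {
            created_at: Instant::now(),
            first_send_at: None,
            last_advance_at: None,
            rounds: 0,
        }
    }

    /// Call after each poke; `advanced` is true when it returned
    /// SendMany, SendPrivate, or Return (rather than Wait).
    fn on_poke(&mut self, advanced: bool) {
        if !advanced {
            return;
        }
        let now = Instant::now();
        if self.first_send_at.is_none() {
            // now - created_at = delay before the protocol really starts.
            self.first_send_at = Some(now);
        }
        // now - last_advance_at = time taken by this round.
        self.last_advance_at = Some(now);
        self.rounds += 1; // roughly the number of rounds
    }
}
```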

And in messaging layer PR, maybe @ChaoticTempest can add:

  • client_send_delay metric that measures, per message, the delay between the message being put in the queue and it being ready to send
  • http_delay metric that measures http delay per message

This does not perfectly match what we initially wanted (a per-round breakdown), but it will get us started and let us see if certain delays are obviously high.

@ChaoticTempest
Contributor

because we want to track from generator creation to finishing presig generation, how much time are spent on each part. Since we don't send messages one by one, if we calculate per message and add them all up, then the time taken is going to be much larger than the actual time between poke and send.

Well, since messages are being sent async now, we don't really care so much about how long they take to be sent. So protocol time for presignature simply is now:

accrued_wait_time = sum(time between last Action::Wait till Action::SendMany or Action::SendPrivate)
presig_time = (finish - start) - accrued_wait_time
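A minimal sketch of that accounting, assuming we can observe the result of every poke; names are illustrative:

```rust
use std::time::{Duration, Instant};

// Hypothetical per-generator tracker implementing the formula above.
struct WaitTracker {
    started: Instant,
    waiting_since: Option<Instant>,
    accrued_wait: Duration,
}

impl WaitTracker {
    fn new() -> Self {
        Self {
            started: Instant::now(),
            waiting_since: None,
            accrued_wait: Duration::ZERO,
        }
    }

    /// Call after every poke; `advanced` is false when poke returned Action::Wait.
    fn on_poke(&mut self, advanced: bool) {
        if advanced {
            if let Some(since) = self.waiting_since.take() {
                self.accrued_wait += since.elapsed();
            }
        } else {
            self.waiting_since.get_or_insert_with(Instant::now);
        }
    }

    /// presig_time = (finish - start) - accrued_wait_time
    fn presig_time(&self) -> Duration {
        self.started.elapsed() - self.accrued_wait
    }
}
```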

why is that?
if b) and c) cannot be assumed, then it is hard to conclude when each round ends. I'd suggest we take a closer look at the internals.

Because we're now going to be assuming the protocols take exactly 6 rounds, when in the future they can take any number. For example, EDDSA will only have two rounds of communication. Hardcoding specific rounds is not ideal and would leak the internals of how the protocols work just for the sake of metrics, which would likely break later as we make further changes.

And in messaging layer PR, maybe @ChaoticTempest can add:
client_send_delay metric that measures delay between message being put in queue to being ready to sent per message
http_delay metric that measures http delay per message
This is not perfectly matching what we initially wanted: per round break down, but will get us started, and see if certain delays are obviously high.

So, with what I said before, I don't recommend us having a per-round breakdown at all. Just the per-protocol time and http time would suffice.

@ppca
Contributor

ppca commented Jan 16, 2025

I think with the async implementation now, I'll probably just add the following metrics:

  • http delay per protocol: add up the http delay for each batch of messages sent that included messages for this protocol;
  • accrued wait time per protocol: add up the wall clock time difference between pokes that return SendMany/SendPrivate/Return
  • before poke delay per protocol: time between generator creation and first poke

So total presignature latency = presig completion timestamp - generator creation time, which should equal before poke delay + accrued wait time. We can then check the difference between accrued wait time and http delay to see if the time spent outside of http is significant; that time could be comprised of:

  • inside the presig manager, the loop over generators means each protocol waits for its turn to do poke();
  • when the action is SendMany/SendPrivate and we send the message to the channel, there will be some time before that message actually gets sent over the wire by the outbox

@jakmeier
Contributor Author

I think with the async implementation now, I'll probably just add the following metrics:

  • http delay per protocol: add up the http delay for each batch of messages sent that included messages for this protocol;
  • accrued wait time per protocol: add up the wall clock time difference between pokes that return SendMany/SendPrivate/Return
  • before poke delay per protocol: time between generator creation and first poke

Yeah this sounds very reasonable. :)
Thanks also for adding the "before poke" stage that I missed!

Per-protocol metrics should be good enough, since we only want to figure out how much time we spent on each stage. Then we can narrow down where the bottleneck is. Per-round breakdowns could be nice in some cases but are hopefully not required.

Also, the time spent in client_send_delay stuck out to me as important when messages were still sent synchronously. With the async implementation, we have probably already improved this metric massively.
Nevertheless, we need the metric to confirm or deny that this is no longer a bottleneck (if it ever was one).

@ppca
Contributor

ppca commented Jan 23, 2025

I will add the mentioned metrics!

  • http delay per protocol: add up the http delay for each batch of messages sent that included messages for this protocol;
  • accrued wait time per protocol: add up the wall clock time difference between pokes that return SendMany/SendPrivate/Return
  • before poke delay per protocol: time between generator creation and first poke
  • client send delay: time between message creation and message being sent, this is per message

@ppca
Contributor

ppca commented Jan 24, 2025

I realized the http delay I proposed earlier won't work: between the batches of messages that get sent out, one batch's sending window (between when it's ready to send and when it finishes sending) can overlap with another batch's, so it's hard to say how to add up the http delay across the different batches of messages that belong to one protocol. So I think we can instead look at SEND_ENCRYPTED_LATENCY (the per-partition send http delay) and the count of pokes per protocol, and roughly estimate the total http delay per protocol as SEND_ENCRYPTED_LATENCY * count of pokes per protocol.

So the list of metrics I will add now:

  • accrued wait time per protocol: add up the wall clock time difference between pokes that return SendMany/SendPrivate/Return
  • before poke delay per protocol: time between generator creation and first poke
  • client send delay per message: time between message creation and message being sent, this is per message
  • total pokes per protocol: the total number of pokes that return SendMany/SendPrivate/Return

@jakmeier
Contributor Author

Yes I think I see what you mean with the HTTP delay. I like your solution.

With the latest changes in #94, things are looking good for the listed metrics.

One metric from my original list seems absent, though. The one about local CPU time spent on an actual poke() call.

poke_time: Every time we call poke, measure how much CPU time we spend on it and sum it up for all poke calls for the protocol from its creation until it is done. Submit the total poke time per protocol invocation to a histogram metric.

I think it would make sense to add that as well, in a separate follow-up PR.
Then we can compare how much of the delay is due to CPU processing of the cryptographic protocol versus how much is spent on communication between nodes.
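A sketch of what that follow-up could look like: wrap each poke() call with a timer and submit the accumulated total once the protocol returns. Wall-clock time around a synchronous poke() is used here as an approximation of CPU time; the loop, generator, and POKE_TIME histogram are simplified and hypothetical, since in the real node this accumulator would live alongside the generator across progress() calls:

```rust
use std::time::{Duration, Instant};

let mut total_poke_time = Duration::ZERO;
loop {
    let poke_start = Instant::now();
    let action = generator.poke()?; // cait-sith-style poke, assumed
    total_poke_time += poke_start.elapsed();
    match action {
        Action::Return(_) => {
            // Protocol invocation is done: submit the accumulated poke time.
            POKE_TIME
                .with_label_values(&["presignature"])
                .observe(total_poke_time.as_secs_f64());
            break;
        }
        Action::Wait => { /* wait for incoming messages, then poke again */ }
        _ => { /* queue SendMany / SendPrivate messages as usual */ }
    }
}
```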
