
Add more performance metrics #71

Open · 3 tasks
jakmeier opened this issue Dec 19, 2024 · 15 comments · May be fixed by #94
Comments

@jakmeier
Contributor

jakmeier commented Dec 19, 2024

Description

Goal: Have Grafana metrics that tell us how much of the protocol invocation delay is due to CPU work, how much is due to network delays, and how much is due to unnecessary overhead.

Measuring each of these directly may be difficult. But we can measure a couple of things and get to the same high-level result.

We can have the following metrics, with tags per protocol type:

  • protocol_time: Measure the wall-clock time from initiating a protocol until it finishes. (Probably already exists as a metric in some form.)
  • poke_time: Every time we call poke, measure how much CPU time we spend on it and sum it up over all poke calls for the protocol, from its creation until it is done. Submit the total poke time per protocol invocation to a histogram metric.
  • client_send_delay: Start the clock anytime we process an Action::SendMany or Action::SendPrivate and stop it once we are done sending the messages. Submit each time individually to a histogram metric.

Each of these data points is collected locally on each node individually but aggregated as a global histogram. Showing the histogram in Grafana directly can already be useful. Additionally, we can use the sum & count aggregations from the collected histogram to show more interesting statistics:

  • Relative CPU time: Display ( poke_time_sum / protocol_time_sum ) to show what percentage of the time we are CPU-bound.
  • Message rounds per protocol invocation: Display (client_send_delay_count / protocol_time_count) to show how many messages we send (globally) per protocol invocation.
  • Relative sending overhead: (client_send_delay_sum / protocol_time_sum) to show the percentage of the time we are spending on sending messages out.
  • Average network delay: (protocol_time_sum - poke_time_sum - client_send_delay_sum) / protocol_time_count

Ideally, CPU time and network delay should make up >90% of the total time, with the sending overhead being small. Also, the network delay should be close to the observed ping delay. Otherwise, we have to investigate.
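To make the collection side concrete, here is a minimal sketch of how the three histograms could be registered, assuming the prometheus crate with once_cell for lazy statics; metric names and the "protocol" label are illustrative assumptions, not the node's actual metrics code:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Illustrative metric definitions; names and labels are assumptions.
static PROTOCOL_TIME: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "protocol_time_seconds",
        "Wall-clock time from protocol initiation until it finishes",
        &["protocol"] // e.g. "triple", "presignature", "signature"
    )
    .unwrap()
});

static POKE_TIME: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "poke_time_seconds",
        "Total poke() time per protocol invocation",
        &["protocol"]
    )
    .unwrap()
});

static CLIENT_SEND_DELAY: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "client_send_delay_seconds",
        "Time from Action::SendMany/SendPrivate until the messages are handed to the wire",
        &["protocol"]
    )
    .unwrap()
});
```

The derived statistics above then come straight from the exported _sum and _count series in Grafana, e.g. relative CPU time as poke_time_seconds_sum / protocol_time_seconds_sum.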

edit 2025-01-02, adding the diagram

         Poke called    <----------------------------------------
                |                                               ^ 
                |   (poke_delay)                                | 
                V                                               | 
  Poke returns Action::SendPrivate                              | 
                |                                               | 
                |   (client_send_delay)                         | 
                V                                               | 
      HTTP Call starts                                          | repeat until done
                |                                               | protocol_time = the total delay of all rounds
                |   (HTTP delay)                                | 
                V                                               | 
      HTTP response arrived                                     | 
                |                                               | 
                ------------------------------------------------| 
@jakmeier
Contributor Author

These are just my suggestions; other ideas for useful metrics are, of course, welcome!

@volovyks
Contributor

Yes, protocol time exists, but it may need some unification:
(screenshots omitted)

For client_send_delay we have this (screenshot omitted):
Is that what you are referring to?

Message rounds per protocol invocation - I expect this to be static. But you never know.

Can you elaborate on Relative sending overhead? Is it Time spent on sending messages/Time spent on protocol in total?

Average network delay - if there are multiple rounds of sending messages, do you want to know the average time it takes to perform this operation?

@jakmeier
Contributor Author

jakmeier commented Jan 2, 2025

Thanks for the detailed answer @volovyks !

I think the existing SEND_ENCRYPTED_LATENCY only covers the time for the actual HTTP call. This is an important number to know but also one that is hard to optimize, since it is mostly a result of the infrastructure connecting different nodes.

I am actually more interested in every sender delay except the HTTP call itself. For example, how much time is lost inside the deque field of the MessageQueue is something we can actually optimize, and it could potentially be a big contributor to the overall latency. But it could also be that we spend more time waiting on a lock before we can even insert the message into the MessageQueue. Thus, I suggested measuring everything between "poke returns a message" and "message is now sent on the wire" as a separate metric.
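To illustrate where that measurement would sit, a rough sketch (the QueuedMessage struct and the call site are hypothetical; CLIENT_SEND_DELAY is the histogram sketched in the issue description):

```rust
use std::time::Instant;

// Hypothetical: each queued message remembers when poke() produced it.
struct QueuedMessage {
    created_at: Instant,
    payload: Vec<u8>,
}

// Called by the sender right before the HTTP call for `msg` starts, so the
// observed value covers queue insertion, lock waits, and queue dwell time,
// but not the HTTP call itself.
fn observe_client_send_delay(protocol: &str, msg: &QueuedMessage) {
    CLIENT_SEND_DELAY
        .with_label_values(&[protocol])
        .observe(msg.created_at.elapsed().as_secs_f64());
}
```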

Message rounds per protocol invocation - I expect this to be static. But you never know.

Probably, yes. But it seems we currently don't know this number, and we need it to understand whether our observed protocol latency is reasonable or not.

Also, metrics that are not supposed to change can be valuable, as it is easy to spot when something goes wrong. For example, when nodes crash, maybe it takes extra rounds to resolve the issue.

Can you elaborate on Relative sending overhead? Is it Time spent on sending messages/Time spent on protocol in total?

Kind of. Instead of "Time spent sending messages" I would describe it as "Client overhead for sending messages". This is all the time spent processing a send request issued by the cryptographic protocol, except the actual HTTP call. It's the overhead added by our client code that I want to know.

The motivation is to find, on a high level, where most of the latency goes. For now, this will tell us if it is worth optimizing anything between "poke returns a message" and "message is on the wire". In the future, it will also be useful for triaging performance issues in production when they happen, to quickly find out if the message-sending component is the culprit.

Average network delay - if there are multiple rounds of sending messages, do you want to know the average time it takes to perform this operation?

No, my intention was to have a time per protocol invocation, not per message round. Maybe "network delay" is bad naming. I really mean the delay of everything that's not on the node collecting the metric. From a node's perspective in distributed systems, there are local delays and remote (or network) delays. That's what I meant.

In summary, this diagram shows the different stages that I want to measure, locally:

         Poke called    <----------------------------------------
                |                                               ^ 
                |   (poke_delay)                                | 
                V                                               | 
  Poke returns Action::SendPrivate                              | 
                |                                               | 
                |   (client_send_delay)                         | 
                V                                               | 
      HTTP Call starts                                          | repeat until done
                |                                               | protocol_time = the total delay of all rounds
                |   (HTTP delay)                                | 
                V                                               | 
      HTTP response arrived                                     | 
                |                                               | 
                ------------------------------------------------| 

These are things we can measure. I thought with Average network delay we can calculate the remote delay by subtracting all local delays from the total.

But I'm starting to think this is not a great idea. Maybe we should just measure how much time we spend in the state where poke returns Output::Wait until it returns something else again. Not entirely sure.

@ppca
Contributor

ppca commented Jan 10, 2025

Using presignature as an example, I think what happens in our code is more like this:

   generator created
                |
                |   (before first poke delay)
                V
         Poke called    <----------------------------------------
                |                                               ^
                |   (poke delay)                                |
                V                                               |
  Poke returns Action                                           |
                |                                               | (between poke delay)
                |   (client_send_delay)                         |
                V                                               |
      HTTP Call starts                                          | repeat until done
                |                                               | protocol_time = the total delay of all rounds + before first poke delay
                |   (HTTP delay)                                |
                V                                               |
      HTTP response arrived                                     |
                |                                               |
                ------------------------------------------------|

Each protocol will go through an undetermined number of rounds of pokes (because many pokes may return Action::Wait), so the wall-clock time between generator creation and protocol completion is going to be

protocol_time = before_poke_delay + sum(poke_delay) + sum(client_send_delay) + sum(http_delay) + sum(between_poke_delay)

Here, all delays except for http delay can potentially be improved.

protocol_time is already instrumented as PRESIGNATURE_LATENCY in the code, and its value = the time when poke() returns Action::Return - the time when the corresponding presignature generator was created.

In terms of metric instrumentation, among all these delays, the easy ones to implement are:

  • before_poke_delay
  • poke delay

And the harder ones to implement (this is after a first scan of the code, so it may change as I look closer):

  • client_send_delay: need to track, for each sent message's presignature id, the ending timestamp of the last poke; then client_send_delay = current time - last poke ending timestamp
  • http delay: need to attribute the sent messages to a presignature id
  • between poke delay: need to track, per presignature id, when the last message was sent (the timestamp when the response is received)

@ppca
Contributor

ppca commented Jan 10, 2025

follow up on the harder ones to implement:

client_send_delay: need to track down for each sent message's presignature id, what is the ending timestamp of the last poke

  • will use the latest message timestamp per presignature id in the message queue. A message's timestamp is assigned when the message is put into the message queue, which should be the last time the protocol was poked: in our current code (async fn progress<C: CryptographicCtx + Send + Sync>), we first poke all protocols (triple, presig, sig) until there is nothing left to poke, adding any Action::SendPrivate or Action::SendMany messages to the message queue with the timestamp of that moment, and only then send the messages.

http delay: need to attribute the sent messages to presignature id

  • each presignature message comes with its presignature id

between poke delay: need to track down per presignature id, when is the last message sent (timestamp when the response is received)

  • we can add a field in PresignatureManager to track the latest update timestamp of a protocol (poke and message sending included); then between poke delay = start of this poke - latest timestamp of this protocol (see the sketch below)
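A minimal sketch of that bookkeeping; the type and method names are hypothetical, not the actual PresignatureManager API:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type PresignatureId = u64; // assumed; substitute the node's real id type

// Hypothetical per-protocol bookkeeping for the "between poke" delay.
#[derive(Default)]
struct ProtocolTimings {
    last_activity: HashMap<PresignatureId, Instant>,
}

impl ProtocolTimings {
    /// Call at the start of a poke; returns the gap since the last recorded
    /// activity (previous poke end or last finished send) for this protocol.
    fn on_poke_start(&mut self, id: PresignatureId) -> Option<Duration> {
        let now = Instant::now();
        let gap = self
            .last_activity
            .get(&id)
            .map(|last| now.duration_since(*last));
        self.last_activity.insert(id, now);
        gap
    }

    /// Call whenever a message for `id` has finished sending.
    fn on_send_done(&mut self, id: PresignatureId) {
        self.last_activity.insert(id, Instant::now());
    }
}
```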

@ppca ppca linked a pull request Jan 10, 2025 that will close this issue
@volovyks
Contributor

volovyks commented Jan 13, 2025

The client_send_delay metric, which is probably the most important one, can be affected by #93
Now protocols emit messages without knowing when they are sent. Potentially, that increases between_poke_delay by client_send_delay + http_delay.
We can add metrics directly to our messaging layer, such as message_queue_time (the time a message spends in the queue). cc @ChaoticTempest

@ppca
Contributor

ppca commented Jan 13, 2025

I reviewed that PR; yes, the metrics calculations will be affected, because the way I implement the metrics now assumes that:
1) all messages that are generated by poking will be exhaustively sent right after poking;
2) messages generated by the same poking will be sent together in one batch
(according to the code in mpc/chain-signatures/node/src/protocol/cryptography.rs), so I could calculate:

  • client_send_delay = timestamp right before sending the batch of messages - earliest timestamp among all messages per presignature id
  • http_delay = timestamp right after finishing sending the batch of messages - timestamp right before sending the batch of messages
  • between_poke_delay = timestamp of the next poke start - timestamp of the last finished send

I think in #93, the thing that breaks the two assumptions is that message sending will no longer happen after each round of poking, but in parallel at arbitrary times, and there's no guarantee of exhausting the messages.
I could still track a client_send_delay per message = time right before sending - time of message creation (it's in the message struct), and still track http_delay per message, but it's going to be hard to calculate the delay per round.

My question for @ChaoticTempest is:
a) let's say after poking a protocol, the action is SendMany; is it possible in your PR that some of the messages that are part of SendMany are sent together, and some of them are sent the next time MessageExecutor::execute() runs?
b) if the scenario in a) happens and 4 of the participants received a message, but the rest of the participants did not receive a message as part of this SendMany from my node, and my node then does poke() on the corresponding protocol, will the action always be Wait?
c) let's say after poking a protocol, the action is SendPrivate; is it possible that before this message is sent, poke() on the corresponding protocol will return another Action that's not Wait?

@ChaoticTempest
Contributor

I reviewed that PR, yes the metrics calculations will be affected. Because the way I implement the metrics now assume that:
1)all messages that are generated by poking will be exhaustively sent right after poking;
2)messages generated by the same poking will be sent in a batch altogether
(according to code in mpc/chain-signatures/node/src/protocol/cryptography.rs, so I could calculate:
client_send_delay = timestamp right before sending batch messages - timestamp of the earliest timestamp in all messages per presignature id
http_delay = timestamp right after finish sending batch messages - timestamp right before sending batch messages
between_poke_delay = timestamp of next poke start - timestamp of last time finish sending

Why are we calculating based on the batch of messages and not per message?

a) let's say after poking a protocol, action is SendMany, is it possible in your PR that some of the messages as part of SendMany are sent together, and some of them are sent in the next time MessageExecutor::execute() runs?

Potentially

b) if the scenario in a) happens and 4 of the participants received a message, but the rest of the participants did not receive message as part of this SendMany from my node, then my node does poke() on the corresponding protocol, will the action always be Wait?
c) let's say after poking a protocol, action is SendPrivate, is it possible that before this message is sent, poke() on the corresponding protocol will return another Action that's not Wait?

For both of these: the cait-sith protocol has roughly 6 rounds, and it should not advance if it does not receive the information needed to advance a round. However, I have not looked too deeply into it, so it could skip rounds and do all sorts of things, and having a message appear in the next round is a possibility. We shouldn't be making this assumption about our protocols either way, because that would mean knowing their internals and would potentially lead us to adding race conditions.

@ppca
Contributor

ppca commented Jan 14, 2025

Why are we calculating based on the batch of messages and not per message?

  • because we want to track, from generator creation to finishing presig generation, how much time is spent on each part. Since we don't send messages one by one, if we calculate per message and add them all up, the total is going to be much larger than the actual time between poke and send.

would potentially lead us to adding race conditions.

if b) and c) cannot be assumed, then it is hard to conclude when each round ends. I'd suggest we take a closer look at the internals.
Meanwhile, I could add metrics that measure:

  • The time from generator creation to the first poke that returns SendMany/SendPrivate, per protocol. This is basically the delay before we actually start processing the protocol.
  • The time from one poke that returns SendMany/SendPrivate to the next one that returns SendMany/SendPrivate/Return. This is basically the total time taken per round.
  • The count of pokes that return SendMany/SendPrivate/Return, per protocol. This is roughly the number of rounds.

The above will confirm 1) whether there are actually always 6 rounds; 2) if not, what it looks like; 3) the time distribution among rounds and the wait time before actually starting the protocol.
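A sketch of the per-generator bookkeeping this implies; all names are illustrative, and submitting the values to histograms is left out:

```rust
use std::time::Instant;

// Illustrative per-generator state for the three measurements listed above.
struct RoundStats {
    created_at: Instant,
    first_send_at: Option<Instant>,
    last_advance_at: Option<Instant>,
    rounds: u64,
}

impl RoundStats {
    fn new() -> Self {
        Self {
            created_at: Instant::now(),
            first_send_at: None,
            last_advance_at: None,
            rounds: 0,
        }
    }

    /// Call after each poke; `advanced` is true when it returned
    /// SendMany, SendPrivate, or Return (rather than Wait).
    fn on_poke(&mut self, advanced: bool) {
        if !advanced {
            return;
        }
        let now = Instant::now();
        if self.first_send_at.is_none() {
            // now - created_at = delay before the protocol really starts.
            self.first_send_at = Some(now);
        }
        // now - last_advance_at = time taken by this round.
        self.last_advance_at = Some(now);
        self.rounds += 1; // roughly the number of rounds
    }
}
```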

And in messaging layer PR, maybe @ChaoticTempest can add:

  • client_send_delay metric that measures, per message, the delay between the message being put in the queue and it being ready to send
  • http_delay metric that measures http delay per message

This does not perfectly match what we initially wanted (a per-round breakdown), but it will get us started and let us see if certain delays are obviously high.

@ChaoticTempest
Contributor

because we want to track from generator creation to finishing presig generation, how much time are spent on each part. Since we don't send messages one by one, if we calculate per message and add them all up, then the time taken is going to be much larger than the actual time between poke and send.

Well, since messages are being sent async now, we don't really care so much about how long they take to be sent. So protocol time for presignature simply is now:

accrued_wait_time = sum(time between last Action::Wait till Action::SendMany or Action::SendPrivate)
presig_time = (finish - start) - accrued_wait_time
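A minimal sketch of that accounting, assuming we can observe the result of every poke; names are illustrative:

```rust
use std::time::{Duration, Instant};

// Hypothetical per-generator tracker implementing the formula above.
struct WaitTracker {
    started: Instant,
    waiting_since: Option<Instant>,
    accrued_wait: Duration,
}

impl WaitTracker {
    fn new() -> Self {
        Self {
            started: Instant::now(),
            waiting_since: None,
            accrued_wait: Duration::ZERO,
        }
    }

    /// Call after every poke; `advanced` is false when poke returned Action::Wait.
    fn on_poke(&mut self, advanced: bool) {
        if advanced {
            if let Some(since) = self.waiting_since.take() {
                self.accrued_wait += since.elapsed();
            }
        } else {
            self.waiting_since.get_or_insert_with(Instant::now);
        }
    }

    /// presig_time = (finish - start) - accrued_wait_time
    fn presig_time(&self) -> Duration {
        self.started.elapsed() - self.accrued_wait
    }
}
```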

why is that?
if b) and c) cannot be assumed, then it is hard to conclude when each round ends. I'd suggest we take a closer look at the internals.

Because we're now going to be assuming the protocols take exactly 6 rounds, when in the future they can take any number. For example, EDDSA will only have two rounds of communication. Hardcoding specific rounds is not ideal and would leak the internals of how the protocols work just for the sake of metrics, which would likely break later as we make further changes.

And in messaging layer PR, maybe @ChaoticTempest can add:
client_send_delay metric that measures delay between message being put in queue to being ready to sent per message
http_delay metric that measures http delay per message
This is not perfectly matching what we initially wanted: per round break down, but will get us started, and see if certain delays are obviously high.

So, with what I said before, I don't recommend us having a per-round breakdown at all. Just the per-protocol time and http time would suffice.

@ppca
Contributor

ppca commented Jan 16, 2025

I think with the async implementation now, I'll probably just add the following metrics:

  • http delay per protocol: add up the http delay for each batch of messages sent that included messages for this protocol;
  • accrued wait time per protocol: add up the wall clock time difference between pokes that return SendMany/SendPrivate/Return
  • before poke delay per protocol: time between generator creation and first poke

So total presignature latency = presig completion timestamp - generator creation time, which should equal before poke delay + accrued wait time. We can then check the difference between accrued wait time and http delay to see if the time spent outside of http is significant; that time could be comprised of:

  • inside the presig manager, the loop over generators means each protocol waits for its turn to do poke();
  • when the action is SendMany/SendPrivate and we send the message to the channel, there will be some time before that message actually gets sent over the wire by the outbox

@jakmeier
Contributor Author

I think with the async implementation now, I'll probably just add the following metrics:

  • http delay per protocol: add up the http delay for each batch of messages sent that included messages for this protocol;
  • accrued wait time per protocol: add up the wall clock time difference between pokes that return SendMany/SendPrivate/Return
  • before poke delay per protocol: time between generator creation and first poke

Yeah this sounds very reasonable. :)
Thanks also for adding the "before poke" stage that I missed!

Per-protocol metrics should be good enough, since we only want to figure out how much time we spent on each stage. Then we can narrow down where the bottleneck is. Per-round breakdowns could be nice in some cases but are hopefully not required.

Also, the time spent in client_send_delay stuck out to me as important when messages were still sent synchronously. With the async implementation, we have probably already improved this metric massively.
Nevertheless, we need the metric to confirm or deny that this is no longer a bottleneck (if it ever was one).

@ppca
Contributor

ppca commented Jan 23, 2025

I will add the mentioned metrics!

  • http delay per protocol: add up the http delay for each batch of messages sent that included messages for this protocol;
  • accrued wait time per protocol: add up the wall clock time difference between pokes that return SendMany/SendPrivate/Return
  • before poke delay per protocol: time between generator creation and first poke
  • client send delay: time between message creation and message being sent, this is per message

@ppca
Contributor

ppca commented Jan 24, 2025

I realized the http delay I proposed earlier won't work: between the batches of messages that get sent out, one batch's sending window (between when it's ready to send and when it finishes sending) can overlap with another batch's, so it's hard to say how to add up the http delay across the different batches of messages that belong to one protocol. So I think we can instead look at SEND_ENCRYPTED_LATENCY (the per-partition send http delay) and the count of pokes per protocol, and roughly estimate the total http delay per protocol as SEND_ENCRYPTED_LATENCY * count of pokes per protocol.

So the list of metrics I will add now:

  • accrued wait time per protocol: add up the wall clock time difference between pokes that return SendMany/SendPrivate/Return
  • before poke delay per protocol: time between generator creation and first poke
  • client send delay per message: time between message creation and message being sent, this is per message
  • total pokes per protocol: the total number of pokes that return SendMany/SendPrivate/Return

@jakmeier
Contributor Author

Yes I think I see what you mean with the HTTP delay. I like your solution.

With the latest changes in #94, things are looking good for the listed metrics.

One metric from my original list seems absent, though. The one about local CPU time spent on an actual poke() call.

poke_time: Every time we call poke, measure how much CPU time we spend on it and sum it up for all poke calls for the protocol from its creation until it is done. Submit the total poke time per protocol invocation to a histogram metric.

I think it would make sense to add that as well, in a separate follow-up PR.
Then we can compare how much of the delay is due to CPU processing of the cryptographic protocol versus how much is spent on communication between nodes.
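A sketch of what that follow-up could look like: wrap each poke() call with a timer and submit the accumulated total once the protocol returns. Wall-clock time around a synchronous poke() is used here as an approximation of CPU time; the loop, generator, and POKE_TIME histogram are simplified and hypothetical, since in the real node this accumulator would live alongside the generator across progress() calls:

```rust
use std::time::{Duration, Instant};

let mut total_poke_time = Duration::ZERO;
loop {
    let poke_start = Instant::now();
    let action = generator.poke()?; // cait-sith-style poke, assumed
    total_poke_time += poke_start.elapsed();
    match action {
        Action::Return(_) => {
            // Protocol invocation is done: submit the accumulated poke time.
            POKE_TIME
                .with_label_values(&["presignature"])
                .observe(total_poke_time.as_secs_f64());
            break;
        }
        Action::Wait => { /* wait for incoming messages, then poke again */ }
        _ => { /* queue SendMany / SendPrivate messages as usual */ }
    }
}
```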
