feat: add MessageChannel for async protocol message receiving and sending #93
Conversation
Nice design for message management, and good refactoring.
```rust
mesh_state: MeshState,
_ctx: C,
_cfg: Config,
_mesh_state: MeshState,
```
Should we drop this and other unused parameters?
Some state actually needs them, so we can't just drop them right now. We should instead simplify the interface so that we can go back to how we did `ctx.mesh_state()` instead of passing them around. This will be left for another PR if anyone wants to take it up.
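For illustration, a minimal sketch of that interface direction, assuming a hypothetical `NodeContext` trait and placeholder `Config`/`MeshState` types (none of this is the repo's actual code):

```rust
// Hypothetical sketch: a context trait that exposes shared state so handlers
// don't need `_cfg` / `_mesh_state` threaded through as extra parameters.
struct Config;
struct MeshState;

trait NodeContext {
    fn cfg(&self) -> &Config;
    fn mesh_state(&self) -> &MeshState;
}

// State transitions could then take just the context:
fn progress<C: NodeContext>(ctx: &C) {
    let _mesh = ctx.mesh_state(); // back to the old `ctx.mesh_state()` style
}
```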
```rust
let resp = self
    .http
    .get(url)
    .timeout(Duration::from_millis(self.options.state_timeout))
```
I tried this approach before, but calling `.timeout()` directly used to cause inconsistent timeouts: I would set the timeout to 100ms, but it actually took much longer to time out. That's why I switched to `tokio::timeout()`, which was much more consistent.
This seems to be a common observation, but I forgot the reason why. I'm not particularly good at multi-task or multi-threading stuff, so it's worth checking this issue online and seeing whether it would still be a problem for us now.
Interesting, let me check it myself then.
teach me why after you find out 🙏
So I made a test for this, and there were barely any inconsistent timeouts that I could notice. Inside one of reqwest's `request_timeout` tests, adding an elapsed-time check consistently yields 101ms with a 100ms timeout, so this shouldn't be an issue:
```rust
#[tokio::test]
async fn request_timeout() {
    let _ = env_logger::try_init();
    let server = server::http(move |_req| {
        async {
            // delay returning the response
            tokio::time::sleep(Duration::from_millis(300)).await;
            http::Response::default()
        }
    });

    let client = reqwest::Client::builder().no_proxy().build().unwrap();
    let url = format!("http://{}/slow", server.addr());

    let time = std::time::Instant::now();
    let res = client
        .get(&url)
        .timeout(Duration::from_millis(100))
        .send()
        .await;
    println!("elapsed: {:?}", time.elapsed());

    let err = res.unwrap_err();
    if cfg!(not(target_arch = "wasm32")) {
        assert!(err.is_timeout() && !err.is_connect());
    } else {
        assert!(err.is_timeout());
    }
    assert_eq!(err.url().map(|u| u.as_str()), Some(url.as_str()));
}
```
This was the PR where I changed to use `tokio::timeout`: near/mpc#889. The description there:

> The previous implementation caused trouble: 1) a 1s timeout was already getting the protocol stuck; 2) a 500ms timeout still saw many timeouts on the /state endpoint. I am guessing the `.timeout()` that comes with the reqwest package cannot reliably time out an async call; the time accounting may not consider task/thread switching. So I switched to using `tokio::time::timeout` and the problem was solved.

It might be useful to verify on dev by taking one node offline and seeing whether the 1s timeout here gets the protocol stuck. If not, then I guess our recent code optimizations have made this earlier issue disappear.
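For reference, a minimal sketch of the `tokio::time::timeout` approach described above, wrapping a reqwest call; the function name, URL argument, 500ms budget, and use of `anyhow` are placeholders, not this repo's actual code:

```rust
use std::time::Duration;

// Sketch only: the outer deadline is owned by tokio, independent of reqwest's
// own per-request timeout mechanism.
async fn fetch_state(
    client: &reqwest::Client,
    url: &str,
) -> anyhow::Result<reqwest::Response> {
    let resp = tokio::time::timeout(Duration::from_millis(500), client.get(url).send())
        .await? // outer Err: tokio's Elapsed (deadline hit)
        ?; // inner Err: reqwest::Error from the request itself
    Ok(resp)
}
```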
```rust
    .send()
    .await?;

let status = resp.status();
```
This function would need a timeout too, otherwise it might get stuck in an unlucky situation like the Aurora mainnet incident, where their /msg endpoint was unavailable.
this one already has the default timeout when the Client gets built
Do you know what the default timeout is? If I remember correctly, that was also an inconsistent timeout :( likely similar to the other one I mentioned.
It's in the `new` function, where the builder is created.
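As an illustration, a client-wide default timeout can be set where the builder is created; the 2-second value and function name below are placeholders, not necessarily what this PR uses:

```rust
use std::time::Duration;

// Sketch only: `ClientBuilder::timeout` sets a total timeout that applies to
// every request made by this client unless a per-request timeout overrides it.
fn new_http_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .timeout(Duration::from_secs(2))
        .build()
}
```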
```rust
impl MessageExecutor {
    pub async fn execute(mut self) {
        let mut interval = tokio::time::interval(Duration::from_millis(100));
```
Does this mean execute() will not clear the inbox if the time taken exceeds 100ms?
I'm not sure what you mean by clear here. The inbox never gets cleared unless the protocols themselves take the specific messages associated with that protocol on `state.recv`.
As for the interval, when the amount of work exceeds 100ms, the next `interval.tick()` runs immediately without sleeping, so we don't sleep at all.
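A small standalone demo of that tick behavior (not the PR's loop): tokio's default missed-tick behavior is `MissedTickBehavior::Burst`, so when the work overruns the period, the next `tick()` resolves immediately instead of sleeping.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut interval = tokio::time::interval(Duration::from_millis(100));
    for i in 0..3 {
        let start = std::time::Instant::now();
        interval.tick().await;
        // After an iteration with 250ms of "work", this prints roughly 0ms waited.
        println!("tick {i} waited {:?}", start.elapsed());
        // Simulate work that overruns the 100ms period.
        tokio::time::sleep(Duration::from_millis(250)).await;
    }
}
```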
I meant exhausting all messages, which is what used to happen. But I get it now: this channel runs its `execute()` in parallel and there's no definite time that it runs, so having a cap on how much time it consumes is important; otherwise it keeps receiving messages and will always be running.
Hmm, good idea. I'll add it in a subsequent PR, or maybe this PR if the timeout needs to be adjusted, but this loop is what was in the protocol loop originally, so it's very unlikely to block us.
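One possible shape for that cap, as a hypothetical sketch (the `Message` type, `inbox` receiver, and budget value are placeholders, not this PR's code): drain whatever is already queued, but stop once a time budget is spent.

```rust
use std::time::{Duration, Instant};
use tokio::sync::mpsc;

struct Message; // placeholder

// Drain messages that are already queued, but give up once the budget is spent
// so this task can yield back to the rest of the loop.
fn drain_with_budget(inbox: &mut mpsc::Receiver<Message>, budget: Duration) -> Vec<Message> {
    let deadline = Instant::now() + budget;
    let mut drained = Vec::new();
    while Instant::now() < deadline {
        match inbox.try_recv() {
            Ok(msg) => drained.push(msg),
            Err(_) => break, // empty (or closed): nothing more to take right now
        }
    }
    drained
}
```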
Also, I appreciate your answer in #71, as I am adding metrics to measure the delay of each step per protocol, and this change will affect how my definition and implementation should look. Greatly appreciated! 🙏
I'll merge this in and see what dev does overnight with our metrics.
Lots of code here. The crux of what this PR does (a rough sketch of the shape follows the list):
- Adds a `MessageChannel` that processes inbound and outbound messages in a background task.
- Messages arriving on `/msg` go into the `MessageChannel` to be sorted and finally be received by protocols through the `MessageReceiver` interface.
- Protocols use `MessageChannel::send` to send messages outbound, so they no longer block when doing this operation.
- Adds a `NodeClient` to call the `msg` and `state` endpoints.
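A rough sketch of the shape described above; every type and signature here is an assumption based only on the names in this summary, not the PR's actual code:

```rust
use tokio::sync::mpsc;

struct ProtocolMessage; // placeholder for whatever the protocols exchange
struct Participant(u32); // placeholder peer id

// Outbound half: protocols enqueue messages without blocking on network I/O;
// a background task drains the queue and performs the HTTP call to /msg.
#[derive(Clone)]
struct MessageChannel {
    outbox: mpsc::Sender<(Participant, ProtocolMessage)>,
}

impl MessageChannel {
    async fn send(&self, to: Participant, msg: ProtocolMessage) {
        // Only queues the message; delivery happens in the background task.
        let _ = self.outbox.send((to, msg)).await;
    }
}

// Inbound half: the background task sorts incoming messages and protocols
// receive the ones addressed to them through this interface.
trait MessageReceiver {
    fn recv(&mut self) -> Option<ProtocolMessage>;
}
```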