Profiling and benchmarking of sync message passing #19

flxo · 2024-04-12T13:52:30Z


Hello,

Just want to share this. I was curious about the overhead of kameo compared to "plain" channels and a task.

Profiling

Let's see where kameo spends its CPU time. The following minimal program, built in release mode (with debug = true in the profile) and executed on macOS on a 2.4 GHz 8-core Intel Core i9,

use kameo::{
    message::{Context, Message},
    Actor,
};

#[derive(Default)]
pub struct MyActor;

impl Actor for MyActor {}

impl Message<()> for MyActor {
    type Reply = ();

    async fn handle(&mut self, _: (), _ctx: Context<'_, Self, Self::Reply>) -> Self::Reply {}
}

#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let my_actor_ref = kameo::spawn(MyActor::default());
    loop {
        my_actor_ref.send(()).await?;
    }
}

produces this profile, captured with samply (❤️).

You can see the two big blocks in the flame graph, kameo::actor::actor_ref::ActorRef::send::{{closure}} and <kameo::actor_kind::SyncActor as kameo::actor_kind::ActorState>::handle_message::{{closure}}, besides the usual Tokio runtime stuff. Interestingly, Box::new accounts for ~5.4% overhead in the tx path, and boxing accounts for ~8.3% (plus ~4% in free) on the rx side.

Comparison to a plain channel + Tokio task

I used criterion to compare a simple echo actor processing sync calls with a plain Tokio task that sends each reply on a oneshot Sender received with the request. I think this gives an impression of the overhead of the convenience that kameo brings.

use criterion::Criterion;
use criterion::{criterion_group, criterion_main};
use kameo::{
    message::{Context, Message},
    Actor,
};
use tokio::sync::mpsc;
use tokio::sync::oneshot;
use tokio::task;

fn actor(c: &mut Criterion) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .build()
        .unwrap();
    let _guard = rt.enter();

    struct BenchActor;

    impl Actor for BenchActor {}

    impl Message<u32> for BenchActor {
        type Reply = u32;

        async fn handle(&mut self, msg: u32, _ctx: Context<'_, Self, Self::Reply>) -> Self::Reply {
            msg
        }
    }
    let actor_ref = kameo::actor::spawn(BenchActor {});

    c.bench_function("actor_sync_messages", |b| {
        b.to_async(&rt).iter(|| async {
            actor_ref.send(0).await.unwrap();
        });
    });
}

fn plain(c: &mut Criterion) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .build()
        .unwrap();
    let _guard = rt.enter();

    // Echo task (unbounded channel): the plain counterpart to the actor.
    let (tx, mut rx) = mpsc::unbounded_channel::<(u32, oneshot::Sender<u32>)>();
    task::spawn(async move {
        while let Some((msg, tx)) = rx.recv().await {
            tx.send(msg).unwrap();
        }
    });

    c.bench_function("plain_sync_messages_unbounded", |b| {
        b.to_async(&rt).iter(|| async {
            let (reply_tx, reply_rx) = oneshot::channel();
            tx.send((0, reply_tx)).unwrap();
            reply_rx.await.unwrap();
        });
    });

    // Echo task (bounded channel): the plain counterpart to the actor.
    let (tx, mut rx) = mpsc::channel::<(u32, oneshot::Sender<u32>)>(10);
    task::spawn(async move {
        while let Some((msg, tx)) = rx.recv().await {
            tx.send(msg).unwrap();
        }
    });
    
    c.bench_function("plain_sync_messages_bounded", |b| {
        b.to_async(&rt).iter(|| async {
            let (reply_tx, reply_rx) = oneshot::channel();
            tx.send((0, reply_tx)).await.unwrap();
            reply_rx.await.unwrap();
        });
    });
}

criterion_group! {
    name = benches;
    config = Criterion::default();
    targets = actor, plain
}

criterion_main!(benches);

Question: Did I do something wrong here?

Output:

actor_sync_messages     time:   [750.69 ns 756.38 ns 763.39 ns]
                        change: [+0.1536% +2.0887% +4.0079%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low mild
  6 (6.00%) high mild
  5 (5.00%) high severe

plain_sync_messages_unbounded
                        time:   [270.36 ns 271.76 ns 273.30 ns]
                        change: [+4.3599% +5.3530% +6.4111%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low mild
  10 (10.00%) high mild
  2 (2.00%) high severe

plain_sync_messages_bounded
                        time:   [309.61 ns 312.05 ns 314.86 ns]
                        change: [-3.6089% -1.9707% -0.2564%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

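The difference is quite large: the actor is about 2.8x slower than the unbounded channel and about 2.4x slower than the bounded one. Clearly, everything between you and the raw channel costs performance, but I wouldn't have expected that much. I didn't examine async messages.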

I'm evaluating all of this because we're thinking about using kameo in a networking application, and throughput matters.

Let me know if I missed something or there's a systematic error.

cheers,

@flxo

tqwewe (Maintainer) · 2024-04-12T14:59:15Z

Hey @flxo, thank you for gathering this detailed profiling! Really awesome to see.

It's quite useful to see the overhead kameo adds over plain channels. I've copied your benchmark into kameo and added one more test that uses plain channels but boxes the request and reply (similar to what kameo does). I got the following results:

actor_sync_messages     time:   [563.88 ns 573.52 ns 583.76 ns]
                        change: [-35.262% -33.188% -31.351%] (p = 0.00 < 0.05)
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

plain_sync_messages_unbounded
                        time:   [252.02 ns 256.48 ns 261.66 ns]
                        change: [-22.238% -20.815% -19.114%] (p = 0.00 < 0.05)
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

plain_sync_messages_bounded
                        time:   [289.33 ns 293.97 ns 299.12 ns]
                        change: [-22.451% -21.392% -20.015%] (p = 0.00 < 0.05)
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

plain_sync_messages_unbounded_boxed
                        time:   [343.77 ns 348.36 ns 353.45 ns]
                        change: [-34.223% -33.044% -31.742%] (p = 0.00 < 0.05)
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Based on this, the important numbers are:

Unbounded Plain: 256.48 ns
Unbounded Plain Boxed: 348.36 ns
Actor with Kameo: 573.52 ns
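
For reference, the split implied by these three medians can be worked out as below. This is only a rough sketch: it assumes the boxed plain-channel variant boxes about as much as kameo does and that the costs simply add up. The constant and function names are made up for illustration.

```rust
/// Median times in nanoseconds, copied from the benchmark output above.
const PLAIN_UNBOUNDED_NS: f64 = 256.48;
const PLAIN_UNBOUNDED_BOXED_NS: f64 = 348.36;
const KAMEO_ACTOR_NS: f64 = 573.52;

/// Splits the actor's extra cost over the plain unbounded channel into
/// (cost attributed to boxing, everything else).
/// Assumption: the two parts are additive, which is a simplification.
fn overhead_breakdown() -> (f64, f64) {
    let boxing = PLAIN_UNBOUNDED_BOXED_NS - PLAIN_UNBOUNDED_NS;
    let other = KAMEO_ACTOR_NS - PLAIN_UNBOUNDED_BOXED_NS;
    (boxing, other)
}

fn main() {
    let (boxing, other) = overhead_breakdown();
    println!("boxing: {boxing:.2} ns, other: {other:.2} ns");
    println!("actor vs plain: {:.2}x", KAMEO_ACTOR_NS / PLAIN_UNBOUNDED_NS);
}
```

Under those assumptions, roughly 92 ns of the gap would be attributable to boxing and roughly 225 ns to everything else.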

So it seems that, besides boxing, kameo adds some other overhead too, probably due to the use of tokio::select!, catch_unwind and other things.

Question: Did I do something wrong here?

Nope, the bench looks correct, I think what we're seeing here is really just the overhead of everything kameo provides.

I'll dig into this more, since it's quite interesting to see where the overhead comes from. I also ran the same benchmark with the latest version of actix and got 5.6822 µs (about 10x slower than kameo). I remember actix being at the top of the benchmarks comparing web libraries across languages, so I'm quite sure ~573.52 ns per message in kameo should be enough performance for almost all use cases.

I'm evaluating all of this because we're thinking about using kameo in a networking application, and throughput matters.

If you'd like to squeeze out the performance, it might be worth going with raw channels in this case at the expense of developer experience.

Sadly, boxing the messages seemed to be the only way I could get actors working without using a big message enum in Rust.
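
To illustrate the trade-off, here is a minimal sketch of how boxing lets a single mailbox carry several message types without an enum. It is not kameo's actual implementation: it uses std threads and std channels instead of Tokio so that it runs without dependencies, and every name in it (Counter, Add, Get, Handler, DynMessage, Envelope, spawn_counter, ask) is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical actor state.
struct Counter {
    total: u64,
}

/// Typed handler: one impl per message type the actor accepts.
trait Handler<M> {
    type Reply: Send + 'static;
    fn handle(&mut self, msg: M) -> Self::Reply;
}

/// Type-erased message: knows how to apply itself to actor `A`.
trait DynMessage<A>: Send {
    fn handle_dyn(self: Box<Self>, actor: &mut A);
}

/// A message paired with the channel its reply is sent back on.
struct Envelope<M, R> {
    msg: M,
    reply: mpsc::Sender<R>,
}

impl<A, M> DynMessage<A> for Envelope<M, A::Reply>
where
    A: Handler<M>,
    M: Send + 'static,
{
    fn handle_dyn(self: Box<Self>, actor: &mut A) {
        let Envelope { msg, reply } = *self;
        // Ignore the error if the caller stopped waiting for the reply.
        let _ = reply.send(actor.handle(msg));
    }
}

struct Add(u64);
struct Get;

impl Handler<Add> for Counter {
    type Reply = ();
    fn handle(&mut self, msg: Add) -> Self::Reply {
        self.total += msg.0;
    }
}

impl Handler<Get> for Counter {
    type Reply = u64;
    fn handle(&mut self, _msg: Get) -> Self::Reply {
        self.total
    }
}

/// The mailbox holds boxed trait objects, so no message enum is needed.
type Mailbox = mpsc::Sender<Box<dyn DynMessage<Counter>>>;

/// Starts the actor loop on its own thread and returns its mailbox.
fn spawn_counter() -> Mailbox {
    let (tx, rx) = mpsc::channel::<Box<dyn DynMessage<Counter>>>();
    thread::spawn(move || {
        let mut actor = Counter { total: 0 };
        // The loop ends once every mailbox sender has been dropped.
        for msg in rx {
            msg.handle_dyn(&mut actor);
        }
    });
    tx
}

/// Sends a message and blocks until the reply arrives.
/// Costs one heap allocation (the Box) per message.
fn ask<M>(mailbox: &Mailbox, msg: M) -> <Counter as Handler<M>>::Reply
where
    Counter: Handler<M>,
    M: Send + 'static,
{
    let (reply, rx) = mpsc::channel::<<Counter as Handler<M>>::Reply>();
    mailbox
        .send(Box::new(Envelope { msg, reply }))
        .expect("actor stopped");
    rx.recv().expect("actor dropped the reply")
}

fn main() {
    let mailbox = spawn_counter();
    ask(&mailbox, Add(2));
    ask(&mailbox, Add(40));
    println!("total = {}", ask(&mailbox, Get));
}
```

The enum alternative would avoid the allocation and the dynamic dispatch, but every new message type would then require touching a central enum and the match in the actor loop.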
