
#3392: Consolidate flushes, leading to reduced syscalls and higher performance #3393

Open · Janmm14 wants to merge 1 commit into master from flush-consolidation

Conversation

@Janmm14 (Contributor) commented Sep 15, 2022

Based on Netty FlushConsolidationHandler.

I hope I did the license stuff correctly.

Fixes #3392

@Janmm14 force-pushed the flush-consolidation branch from 04ce471 to 541cb69 on September 15, 2022 at 19:16
@MrIvanPlays (Contributor)

This is now out for testing in IvanCord, along with many more additions by Janmm14 and others (myself included), available at https://ci.mrivanplays.com/

@andreasdc

Can the delay of packets be benchmarked?

@Outfluencer (Collaborator)

The delay should be minimal, as it flushes on every channelReadComplete and on any state change.

@andreasdc

> The delay should be minimal, as it flushes on every channelReadComplete and on any state change.

Sounds cool, I'm testing it and I see around 30-50% less CPU usage. 🚀

@Outfluencer (Collaborator) commented Sep 26, 2022

And also, the client and server are ticking at a very low frequency of 50ms, so a delay in packet handling should not be noticeable.

@andreasdc

> And also, the client and server are ticking at a very high frequency of 50ms, so a delay in packet handling should not be noticeable.

Well, 50ms is not such a high frequency, but I see it as a hard limiter that keeps the delay from exceeding 50ms. :D

@andreasdc

Do I understand this patch correctly? When writes happen on the Spigot side, the writes happen on Bungee's side, and the same with flushes, more or less?

@Janmm14 (Contributor, Author) commented Sep 27, 2022

> Do I understand this patch correctly? When writes happen on the Spigot side, the writes happen on Bungee's side, and the same with flushes, more or less?

That's roughly the idea of this patch.
It detects when data starts and ends and disables flushes during the read process. It currently does "safety" flushes every 20 flush attempts (i.e. packets); Netty flushes equal system writes.
When no read is in progress, flushes still happen instantly.
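For illustration, here is a minimal sketch of that consolidation logic in the style of Netty's FlushConsolidationHandler (simplified, with the read detection folded into the same handler; the class name and method bodies are illustrative, not the PR's exact code):

    import io.netty.channel.ChannelDuplexHandler;
    import io.netty.channel.ChannelHandlerContext;

    // Simplified sketch: while a read loop is in progress, swallow flush() calls
    // and only let every 20th one through; once the read loop completes, perform
    // one flush for the whole batch of packets.
    public class FlushConsolidationSketch extends ChannelDuplexHandler
    {
        private static final int EXPLICIT_FLUSH_AFTER_FLUSHES = 20; // the "safety" flush
        private boolean readInProgress;
        private int pendingFlushes;

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg)
        {
            readInProgress = true;
            ctx.fireChannelRead( msg );
        }

        @Override
        public void channelReadComplete(ChannelHandlerContext ctx)
        {
            readInProgress = false;
            if ( pendingFlushes > 0 )
            {
                pendingFlushes = 0;
                ctx.flush(); // one system write for everything forwarded during the read
            }
            ctx.fireChannelReadComplete();
        }

        @Override
        public void flush(ChannelHandlerContext ctx)
        {
            if ( readInProgress && ++pendingFlushes < EXPLICIT_FLUSH_AFTER_FLUSHES )
            {
                return; // consolidated: data stays queued until read-complete
            }
            pendingFlushes = 0;
            ctx.flush();
        }
    }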

@andreasdc commented Sep 27, 2022

> Do I understand this patch correctly? When writes happen on the Spigot side, the writes happen on Bungee's side, and the same with flushes, more or less?

> That's roughly the idea of this patch. It detects when data starts and ends and disables flushes during the read process. It currently does "safety" flushes every 20 flush attempts (i.e. packets); Netty flushes equal system writes. When no read is in progress, flushes still happen instantly.

That's really cool. I think it would be a good idea to increase the 20-packet limit; road to 1% CPU usage :D For example, if you have 20 players each sending 20 packets per second, you have 400 packets per second but still 20 flushes instead of a more optimized amount, but of course it's way better than flushing on every packet. I will increase the amount and see what we get. :)
P.S. Spigot already flushes every tick for safety. 🚀
By the way, DEFAULT_EXPLICIT_FLUSH_AFTER_FLUSHES is not being used right now?

@andreasdc

Also, this is really important: when Spigot flushes, Bungee flushes, is that correct?

@Janmm14 (Contributor, Author) commented Sep 27, 2022

> Do I understand this patch correctly? When writes happen on the Spigot side, the writes happen on Bungee's side, and the same with flushes, more or less?

> That's roughly the idea of this patch. It detects when data starts and ends and disables flushes during the read process. It currently does "safety" flushes every 20 flush attempts (i.e. packets); Netty flushes equal system writes. When no read is in progress, flushes still happen instantly.

> That's really cool. I think it would be a good idea to increase the 20-packet limit; road to 1% CPU usage :D For example, if you have 20 players each sending 20 packets per second, you have 400 packets per second but still 20 flushes instead of a more optimized amount, but of course it's way better than flushing on every packet. I will increase the amount and see what we get. :) P.S. Spigot already flushes every second for safety. 🚀 By the way, DEFAULT_EXPLICIT_FLUSH_AFTER_FLUSHES is not being used right now?

These packet counts are per-player. And the constant DEFAULT_EXPLICIT_FLUSH_AFTER_FLUSHES is just there because the file was copied from Netty and I made as few changes to it as possible.

> Also, this is really important: when Spigot flushes, Bungee flushes, is that correct?

Short answer: yes.

Long answer: yes, kinda. When Spigot flushes, the data starts to be sent to Bungee, leading to a read start on Bungee. A read start prevents flushes for the given player. Bungee reads all the incoming data; the flushes that usually happen after each forwarded packet are ignored. When Bungee has finished reading the current batch of received bytes (no more data waiting in the OS TCP buffer to be read), flushes are allowed again and Bungee does flush.

We don't need to flush every second for safety, as our default behaviour is still to flush; our safety is the flush every 20 packets.
Also, there are more safety flushes (just copied from the Netty impl) which ignore the flush consolidation, e.g. when Netty says the channel is no longer writable (likely a full output buffer, which could maybe happen with big chunk packets), when the connection is about to be closed by Bungee, etc.

@andreasdc

Yes, I know they are per-player, but my example was being surrounded by players, each one sending 20 movement packets per second plus some other packets, like arm swings.

All right, so when I have a batch of 400 packets I will have 20 flushes, and when I have a single packet I will get a single flush? If so, that's really cool, and I will test tomorrow with a 200-packet limit. Another great performance boost from you; you are really doing a good job. Another one from my issue; you are a superhero! 🦸

@Janmm14 (Contributor, Author) commented Sep 27, 2022

If the number is set really high, at some point the "limiting" factor will be that the current read operation has finished, etc. So the difference between 20 and 200 will not be that big, as you won't have 200 packets per tick.

And thank you.

@andreasdc

> If the number is set really high, at some point the "limiting" factor will be that the current read operation has finished, etc.

Shouldn't that be better? Still fewer flushes, and flushes happening when Spigot wants them, right?
Right now I see I have 1600 packets per second; why do you say 200 packets per tick? With 10 players running you easily have 200 packets, but per second.
I would increase it even further to get far fewer flushes; it's easy to have many incoming packets.

@Janmm14 (Contributor, Author) commented Sep 27, 2022

> If the number is set really high, at some point the "limiting" factor will be that the current read operation has finished, etc.

> Shouldn't that be better? Still fewer flushes, and flushes happening when Spigot wants them, right? Right now I see I have 1600 packets per second; why do you say 200 packets per tick? With 10 players running you easily have 200 packets, but per second. I would increase it even further to get far fewer flushes; it's easy to have many incoming packets.

It shouldn't hurt to set it higher.

The only measure that matters is per tick, as Spigot flushes per tick and we want to send the player data per tick. And if Spigot flushes once per tick, Bungee will usually do so too.

@andreasdc

All right, a flush being called should trigger readComplete on Bungee, right? Are there edge cases where that wouldn't happen?

@Janmm14 (Contributor, Author) commented Sep 27, 2022

Only if the data transfer were very, very slow and the whole process of receiving data took at least 50ms.
That should never happen, though.

@Outfluencer (Collaborator)

Does Spigot really flush once per tick?
As far as I know, normal Spigot flushes every single packet.

@andreasdc

> Does Spigot really flush once per tick? As far as I know, normal Spigot flushes every single packet.

I meant that it does an extra flush every tick no matter what; normal Spigot flushes every packet, that's right.

@Janmm14 (Contributor, Author) commented Sep 27, 2022

> Does Spigot really flush once per tick? As far as I know, normal Spigot flushes every single packet.

I was talking more about Paper here, @Outfluencer.
This might still help a little bit with normal Spigot.

@Outfluencer (Collaborator)

Is it well tested now?

@andreasdc

I think so, yea 😎

@Outfluencer (Collaborator) commented Oct 21, 2022

Why do you think so? Did you test it on your network?

@andreasdc

Yes

@md-5 (Member) commented Nov 12, 2022

What are the changes from Netty and why?

@Janmm14 (Contributor, Author) commented Nov 12, 2022

@md-5 commented:
> What are the changes from Netty and why?

Netty's FlushConsolidationHandler is created for the following use case:

A sends multiple requests to B. B's behaviour is to handle each request synchronously and send the result back to A on the same connection/same Netty pipeline.

A -> B -> A
A -> B -> A
A -> B -> A

Bungee, however, is not a server which processes data; it is a proxy which forwards data to another connection/Netty pipeline.

C -> B -> S
*up to 50ms delay*
S -> B -> C
*up to 50ms delay*
C -> B -> S

So while we are reading "incoming from server" we are writing to "outgoing to client" (the original FlushConsolidationHandler would be useful if we wrote e.g. "outgoing to client" for every "incoming from client", but we don't do that).
The change is that the FlushConsolidationHandler is split into two parts:

  1. The logic which checks whether more data is currently waiting in the buffers (the FlushSignalingHandler signals whether Netty is in a read loop, via the channelRead and channelReadComplete events) needs to be on the receiving side of the traffic.
  2. The logic which cancels some flush() calls (BungeeFlushConsolidationHandler) needs to be on the other connection.

Detect an ongoing read loop at "incoming from client" -> block flushes while reading at "outgoing to server".
Detect an ongoing read loop at "incoming from server" -> block flushes while reading at "outgoing to client".

In detail:

  • New class FlushSignalingHandler.
  • The methods channelRead and channelReadComplete were moved to FlushSignalingHandler.
  • The methods exceptionCaught, disconnect, close and handlerRemoved were copied to FlushSignalingHandler.
  • Visibility changes: private -> package-private for ctx, readInProgress and resetReadAndFlushIfNeeded; public -> private for all constructors.
  • A static builder method newInstance was added.
  • And the obvious rename from FlushConsolidationHandler to BungeeFlushConsolidationHandler.

The FlushConsolidationHandler is not changed often in Netty: after its introduction in 2016, only one change was made in 2018 and one in 2020.

(Due to how we set up the server's connection (same event loop as the client connection), we don't need to handle race conditions.)
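As a rough sketch of that split (the class names, the readInProgress field and the resetReadAndFlushIfNeeded method are the ones listed above; the method bodies are illustrative, not the PR's exact code):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    // Sits on the receiving pipeline and toggles the BungeeFlushConsolidationHandler
    // installed on the opposite connection. Both connections share one event loop,
    // so plain field access is safe (no race conditions, as noted above).
    public class FlushSignalingHandler extends ChannelInboundHandlerAdapter
    {
        private final BungeeFlushConsolidationHandler target; // handler on the other connection

        public FlushSignalingHandler(BungeeFlushConsolidationHandler target)
        {
            this.target = target;
        }

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg)
        {
            target.readInProgress = true; // block flushes on the outgoing side
            ctx.fireChannelRead( msg );
        }

        @Override
        public void channelReadComplete(ChannelHandlerContext ctx)
        {
            // batch fully read: allow flushes again and flush anything pending
            target.resetReadAndFlushIfNeeded();
            ctx.fireChannelReadComplete();
        }
    }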

@caoli5288 (Contributor) commented Dec 22, 2022

I see that you change the player's flush target in the ServerConnector init; is this too early? What happens if the connector fails?

Maybe a better time to change it is when the DownstreamBridge is set.

@Janmm14 (Contributor, Author) commented Dec 22, 2022

@caoli5288
I'll have to check whether that is actually a problem, or whether the fact that this code is executed means the player will definitely get into the world loading screen, so that a reconnect to the old server is necessary anyway.

@caoli5288 (Contributor) commented Dec 22, 2022

> @caoli5288 I'll have to check whether that is actually a problem, or whether the fact that this code is executed means the player will definitely get into the world loading screen, so that a reconnect to the old server is necessary anyway.

Moving it to https://github.com/SpigotMC/BungeeCord/blob/master/proxy/src/main/java/net/md_5/bungee/ServerConnector.java#L328 is fine.

There you have both ChannelWrappers and can set each as the other's target; the code looks more beautiful.

        user.setServer( server );
        ch.getHandle().pipeline().get( HandlerBoss.class ).setHandler( new DownstreamBridge( bungee, user, server ) );

+       ch.setFlushSignalingTarget(user.getCh().getFlushConsolidationHandler(false));
+       user.getCh().setFlushSignalingTarget(ch.getFlushConsolidationHandler(false));

        bungee.getPluginManager().callEvent( new ServerSwitchEvent( user, from ) );

        thisState = State.FINISHED;

        throw CancelSendSignal.INSTANCE;

@caoli5288 (Contributor)

> @andreasdc It seems your bungee does not use bungee's native AES implementation, but Java's standard one. Is that intentional?
>
> Can you send me the link somewhere so I can take a closer look?
>
> There is a replacement system-level API for epoll called io_uring; however, that's still experimental in Netty. Maybe a fork could experiment with it (it doesn't improve "write speed", though).
>
> I do have an idea for improving speed by adding a varint21 size prefix to each packet, getting rid of one buffer copy, but that would probably be a breaking change for plugins poking at Bungee internals, like ViaVersion.
>
> If your fork is based on Waterfall rather than IvanCord, make sure to have this patch; it also avoids a buffer copy.

Netty's io_uring impl is still slower than epoll's.

Further, one could add an uncompressed packet id to the header of each packet on the Spigot side, which can reduce unnecessary decompression.

@Janmm14 (Contributor, Author) commented Mar 20, 2023

> Netty's io_uring impl is still slower than epoll's.

Thanks for letting me know.

> Further, one could add an uncompressed packet id to the header of each packet on the Spigot side, which can reduce unnecessary decompression.

This would require changes in Spigot and all forks.
I doubt these 1-3 bytes out of at least 256 compressed bytes in total make a statistically significant speed difference for compression (I'd personally increase the minimum size to apply compression (the compression threshold) to at least 512/1024 or so).
Removing one buffer copy for every packet seems like it could make a significant difference, though.

@andreasdc

> @andreasdc It seems your bungee does not use bungee's native AES implementation, but Java's standard one. Is that intentional?
>
> Can you send me the link somewhere so I can take a closer look?
>
> There is a replacement system-level API for epoll called io_uring; however, that's still experimental in Netty. Maybe a fork could experiment with it (it doesn't improve "write speed", though).
>
> I do have an idea for improving speed by adding a varint21 size prefix to each packet, getting rid of one buffer copy, but that would probably be a breaking change for plugins poking at Bungee internals, like ViaVersion.
>
> If your fork is based on Waterfall rather than IvanCord, make sure to have this patch; it also avoids a buffer copy.

I don't think that's intentional. I use https://github.com/2lstudios-mc/FlameCord with some changes, including avoiding creating a buf copy. It gave a boost, as I remember.

@Leymooo commented Mar 20, 2023

> This would require changes in Spigot and all forks.

But it will give a big performance boost, because it will allow zero-copy packet passthrough from server to client.

If you know the packet id before decompressing the data, you can decide whether you should decompress the data or not (is the packet id in the protocol registry?). If not, you can simply pass the ByteBuf through from the server (with the VarInt packet-length data) to the client and avoid any VarInt or compression handlers.
You could say: wait, the client will be unable to decode these packets because they contain an unexpected packet id. But with some magic you can expand the uncompressed-data-length VarInt by one byte, which will overwrite this packet id.

With flush consolidation and this change I was able to handle a real 8000 players on a Ryzen 5950X at 90% CPU usage (most of the usage was by the Linux kernel network threads, and only ~15-20% by BungeeCord itself).

The screenshot shows the load at 7000 online (I did not capture one at 8000); green is BungeeCord user-space CPU usage, red is kernel network-processing usage. Spark also showed that the epoll worker threads' utilization was only 20-25% (epollReady()), and the other 75-80% was waiting for data (epollWait()).
[screenshot: BungeeCord user-space vs. kernel network CPU usage]

UPD:

> it will allow zero-copy packet passthrough from server to client.

Forgot that BungeeCord still has the stupid entity id rewrite, which makes this less effective.

@andreasdc

> This would require changes in Spigot and all forks.
>
> But it will give a big performance boost, because it will allow zero-copy packet passthrough from server to client.
>
> If you know the packet id before decompressing the data, you can decide whether you should decompress the data or not (is the packet id in the protocol registry?). If not, you can simply pass the ByteBuf through from the server (with the VarInt packet-length data) to the client and avoid any VarInt or compression handlers. You could say: wait, the client will be unable to decode these packets because they contain an unexpected packet id. But with some magic you can expand the uncompressed-data-length VarInt by one byte, which will overwrite this packet id.
>
> With flush consolidation and this change I was able to handle a real 8000 players on a Ryzen 5950X at 90% CPU usage (most of the usage was by the Linux kernel network threads, and only ~15-20% by BungeeCord itself).
>
> The screenshot shows the load at 7000 online (I did not capture one at 8000); green is BungeeCord user-space CPU usage, red is kernel network-processing usage. Spark also showed that the epoll worker threads' utilization was only 20-25% (epollReady()), and the other 75-80% was waiting for data (epollWait()).
>
> UPD:
>
> > it will allow zero-copy packet passthrough from server to client.
>
> Forgot that BungeeCord still has the stupid entity id rewrite, which makes this less effective.

How is that possible?

@Janmm14 (Contributor, Author) commented Mar 21, 2023

@Leymooo
Don't be off-topic. We're talking about optimizations useful for BungeeCord, not about a system of custom client, bungee and server where a change similar to yours would be an obvious choice.

@Leymooo commented Mar 21, 2023

> @Leymooo Don't be off-topic. We're talking about optimizations useful for BungeeCord, not about a system of custom client, bungee and server where a change similar to yours would be an obvious choice.

There is nothing about a custom client here, only about a little protocol change between BungeeCord and the server which gives something like a 500-700% performance boost. Currently we are switching from one bungee instance to multiple, and a BungeeCord process with 1200 players on a Ryzen 3600 uses only 270% in total (22.5% per CPU thread).

You can write 0 as a 2-, 3-, 4- or 5-byte VarInt; the Minecraft protocol allows it. So if the server can include the packet id before the compressed data, followed by any VarInt (e.g. the compressed data size), you can always remove this extra packet id from the ByteBuf without any buffer copy, and the vanilla client will still be able to decode it.
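A minimal sketch of that padding trick (writePaddedVarInt is a hypothetical helper, not an existing BungeeCord or Netty API): the vanilla reader accepts up to 5 VarInt bytes, so redundant zero groups with the continuation bit set leave the decoded value unchanged.

    import io.netty.buffer.ByteBuf;

    // Writes value as a VarInt padded to exactly width bytes (1-5). For example,
    // writePaddedVarInt( buf, 1, 5 ) emits 0x81 0x80 0x80 0x80 0x00, which a
    // vanilla VarInt reader still decodes as 1.
    static void writePaddedVarInt(ByteBuf buf, int value, int width)
    {
        for ( int i = 0; i < width; i++ )
        {
            int part = value & 0x7F;
            value >>>= 7;
            if ( i < width - 1 )
            {
                part |= 0x80; // continuation bit: more bytes follow
            }
            buf.writeByte( part );
        }
        if ( value != 0 )
        {
            throw new IllegalArgumentException( "value does not fit in " + width + " VarInt bytes" );
        }
    }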

@xtrafrancyz

I've already done partial decompression of packets in my fork master...VimeWorld:BungeeCord:master ("Improved decompressor", followed by a few commits with fixes).

It decompresses the first 8192 bytes of the packet and checks the packet id. No protocol changes are required, and it saves nearly 50% of CPU. The only change needed is to network-compression-threshold: it must be equal on Bungee and all backend servers.

@andreasdc

Is it something different from this patch? #3240

@xtrafrancyz

> Is it something different from this patch? #3240

The idea is the same; the only difference is that #3240 decompresses full packets, like huge chunk packets, while my fork only does this for the first 8192 bytes. Decompression is aborted when the packet id is not defined in Bungee.
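A minimal sketch of that peek idea using java.util.zip (illustrative only, not the fork's actual code; peekPacketId is a hypothetical helper): inflate just enough bytes to decode the leading VarInt packet id, then decide whether full decompression is worthwhile.

    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    // Decompresses only the first few bytes of a compressed packet body and decodes
    // the leading VarInt packet id; the remainder stays compressed.
    static int peekPacketId(byte[] compressed) throws DataFormatException
    {
        Inflater inflater = new Inflater();
        try
        {
            inflater.setInput( compressed );
            byte[] head = new byte[ 8 ]; // more than enough for a 5-byte VarInt id
            int produced = inflater.inflate( head );
            int id = 0;
            for ( int i = 0; i < produced && i < 5; i++ )
            {
                id |= ( head[i] & 0x7F ) << ( 7 * i );
                if ( ( head[i] & 0x80 ) == 0 )
                {
                    return id; // packet id recovered without full decompression
                }
            }
            throw new DataFormatException( "truncated VarInt packet id" );
        }
        finally
        {
            inflater.end();
        }
    }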

@Leymooo commented Mar 21, 2023

> It decompresses the first 8192 bytes of the packet and checks the packet id.

> The idea is the same; the only difference is that #3240 decompresses full packets, like huge chunk packets, while my fork only does this for the first 8192 bytes. Decompression is aborted when the packet id is not defined in Bungee.

Decompression itself is a really fast task, especially in RAM. With 8192 bytes you still decompress something like 85% of all packets from the server and 99% from the client, so this 50% CPU saving comes from eliminating recompression rather than from decompressing only 8192 bytes.

BungeeCord's decompressor may need to make multiple native (JNI) calls to decompress huge packets, which in my opinion does more harm than decompressing a whole huge packet in one native call.

In conclusion: eliminating recompression is a really good start for a performance boost, but with only 4 lines of code in the backend server it's possible to get much more performance.

@andreasdc

> Is it something different from this patch? #3240

@Leymooo Recompression is already eliminated in this patch.

@Janmm14 (Contributor, Author) commented Mar 21, 2023

> Is it something different from this patch? #3240

> @Leymooo Recompression is already eliminated in this patch.

His comment was an evaluation of the PR; it points out that his idea is different and goes even further.

@Janmm14 (Contributor, Author) commented Mar 22, 2023

> Is it something different from this patch? #3240

> @Leymooo Recompression is already eliminated in this patch.

He didn't write anything different.

@Janmm14 (Contributor, Author) commented Apr 3, 2023

@md-5
Putting aside the off-topic chat in this PR, I did explain the changes to Netty's FlushConsolidationHandler in a comment above.

@Outfluencer (Collaborator)

Vanilla Minecraft 1.20.2 has now also added this. To get the best performance, we should add it too.

[screenshot: IMG_1647]

@Janmm14 force-pushed the flush-consolidation branch from d6ca426 to f652014 on September 22, 2023 at 22:21
@Janmm14 (Contributor, Author) commented Sep 22, 2023

Rebased to resolve conflicts. (Edit: missed an unused import)

@Janmm14 force-pushed the flush-consolidation branch from f652014 to 22ef574 on September 22, 2023 at 22:25
@andreasdc

DEFAULT_EXPLICIT_FLUSH_AFTER_FLUSHES is still not used. By the way, does that mean that since 1.20.2 the client sends packets with flush consolidation?

@Janmm14 (Contributor, Author) commented Sep 23, 2023

> DEFAULT_EXPLICIT_FLUSH_AFTER_FLUSHES is still not used.

I just did not remove that part of the code; it's still there because it was inside Netty's code.

> By the way, does that mean that since 1.20.2 the client sends packets with flush consolidation?

The client likely does not use the "flush consolidation" mechanism Bungee uses (and needs); instead it stops calling writeAndFlush() for every packet and changes that to write() for every packet plus one call to flush() at the end of each tick. The result is the same, though.
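Illustrative only (not decompiled client code), the difference between the two patterns looks like this:

    import io.netty.channel.Channel;

    // Pre-1.20.2 pattern: every packet triggers its own flush (one syscall each).
    void sendPacketOld(Channel ch, Object packet)
    {
        ch.writeAndFlush( packet );
    }

    // 1.20.2+ pattern as described above: packets are only queued...
    void sendPacketNew(Channel ch, Object packet)
    {
        ch.write( packet );
    }

    // ...and a single flush at the end of the tick writes them all out at once.
    void onTickEnd(Channel ch)
    {
        ch.flush();
    }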

@andreasdc

What are the changes described in "network optimizations" from the screenshot above?

Commit …rmance: "Based on Netty FlushConsolidationHandler"
@Janmm14 force-pushed the flush-consolidation branch from 22ef574 to 0376eac on April 22, 2024 at 15:59
@Janmm14 (Contributor, Author) commented Apr 22, 2024

Since vanilla Minecraft added flushing once per tick instead of every packet a short while ago, it would be the perfect time to think about doing something similar in BungeeCord. @md-5

@Outfluencer (Collaborator) commented Feb 11, 2025

> @Leymooo Don't be off-topic. We're talking about optimizations useful for BungeeCord, not about a system of custom client, bungee and server where a change similar to yours would be an obvious choice.

> There is nothing about a custom client here, only about a little protocol change between BungeeCord and the server which gives something like a 500-700% performance boost. Currently we are switching from one bungee instance to multiple, and a BungeeCord process with 1200 players on a Ryzen 3600 uses only 270% in total (22.5% per CPU thread).

> You can write 0 as a 2-, 3-, 4- or 5-byte VarInt; the Minecraft protocol allows it. So if the server can include the packet id before the compressed data, followed by any VarInt (e.g. the compressed data size), you can always remove this extra packet id from the ByteBuf without any buffer copy, and the vanilla client will still be able to decode it.

@Leymooo
If so, could you send me a byte array containing 5 bytes that will be decoded as the number 1 by vanilla networking?

Edit: Never mind, I already have it.

@Janmm14 (Contributor, Author) commented Feb 12, 2025

In a private fork I made, I just duplicated the packet id before the compressed packet length for the "server -> bungee" connection.

@Janmm14 mentioned this pull request on Feb 14, 2025