Dendrite keeps dropping events or hangs their sending #3484
Comments
I regularly find that my server will hang completely for around 5 minutes, during which time no messages will send and no sync will be responded to. Resource usage is not heightened when this happens. |
im not even reporting how often i get it, because its like at least 3 times an hour, its constant hanging to the point of being unusable. is there like a mutex thats not being unlocked or something?? |
for me its not as constant, it happens every like 3 or 4 hours for like 10 minutes so even though its not as bad its still very annoying 😭 |
I do not use Matrix that much but pretty much I get it like every few times I use DMs pretty annoying |
the reason im wondering if its a mutex issue or not, is i'm in a lot more public rooms with a lot more traffic than ari, and i get the issue more frequently, and resource usage doesnt seem abnormal when its hung. its usually at a high resource state (again seems to happen with frequent federation traffic) but neither cpu nor ram get pushed to the maximum by dendrite or postgres |
if it was a deadlock the whole homeserver would be frozen but it clearly goes back to normal after some time, however, you do raise an interesting point, because traffic may be related |
thats why im thinking its a mutex, perhaps specific to client api. go has decent enough concurrency it probably wouldnt hold up the whole thing, and it would explain the lack of change in resource usage |
hmm, i get what u mean now |
i should also note that for me it seems to happen more frequently when messaging in large rooms. like i reliably get it sending any message in #python:matrix.org. but it does happen in smaller rooms obviously and not all large rooms seem to trigger it. |
Relevant logs:
|
i dont notice mine getting canceled, but that might be nheko retrying continuously until it goes through |
okay dendrite is basically unusable for me at this point. its hanging for over 20 minutes at a time now, constantly. restarting dendrite doesnt help |
Wondering if this is related to #3447 and fetching auth events. 😕 |
i mean i kinda had this issue even before then, but it didnt get real bad until 0.13.8+97706ff (i was on that pr version before then) so i dont think its specifically that pr. just to try something, i left the genshin impact room (!AGeUOyHpLMMrLYAkXW:matrix.org, top of the room list) and so far it seems immediately better, but its only been 5 minutes. my suspicion is that whatever is happening (i.e. a state fetch) isnt happening async and is holding up the entire room, which locks it up for basically 20 minutes at a time. again, ive kinda always had this issue of it locking up for a few minutes, its just locking up longer and more often now |
i personally only began having this issue ever since i upgraded dendrite from the matrix-org to the element-hq version, before that it was fine for both me and the users, so maybe its not directly related to #3447 but rather something in the element-hq version that caused event sendability to be affected |
i am increasingly sure that evacuating the genshin impact room has mostly fixed the problem. idk whats up with that room making everything completely lock up. surely if its state is being processed ill still be able to sync events from other rooms right? |
my point though is shouldnt it not lock up other rooms? because in its current state it seems to lock up all rooms |
its not just one room, its all rooms, as in, the whole server just refuses to send events for a period of time |
and in my case it appears to stop responding to sync, however clients dont seem to go offline so its also possible that its just getting empty syncs. are you sure that the roomserver doesnt block the entire server for each event in each room? |
Pretty sure, every room has its own goroutine and own actor. Wondering if it may be the updated internal NATS server blocking. But from the looks of it, we'd not only return When the server locks up, what does the metric |
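A rough sketch of the "every room has its own goroutine" model, to show why one stuck room should not stall the others in principle. All names here (`roomWorkers`, `submit`, `shutdown`, the room IDs) are hypothetical and are not Dendrite's real types:

```go
package main

import (
	"fmt"
	"sync"
)

// roomWorkers is a hypothetical per-room actor registry.
type roomWorkers struct {
	mu    sync.Mutex
	inbox map[string]chan func()
	wg    sync.WaitGroup
}

func newRoomWorkers() *roomWorkers {
	return &roomWorkers{inbox: make(map[string]chan func())}
}

// submit queues task on the room's own goroutine, starting it on first use.
func (w *roomWorkers) submit(roomID string, task func()) {
	w.mu.Lock()
	ch, ok := w.inbox[roomID]
	if !ok {
		ch = make(chan func(), 16)
		w.inbox[roomID] = ch
		w.wg.Add(1)
		go func() {
			defer w.wg.Done()
			for t := range ch { // tasks for one room run strictly in order
				t()
			}
		}()
	}
	w.mu.Unlock()
	ch <- task
}

// shutdown closes every inbox and waits for the workers to drain.
func (w *roomWorkers) shutdown() {
	w.mu.Lock()
	for _, ch := range w.inbox {
		close(ch)
	}
	w.mu.Unlock()
	w.wg.Wait()
}

func main() {
	w := newRoomWorkers()
	release := make(chan struct{})
	fastDone := make(chan struct{})

	w.submit("!slow:example.org", func() { <-release }) // a room stuck on one event
	w.submit("!fast:example.org", func() { close(fastDone) })

	<-fastDone // completes even though the slow room is still stuck
	fmt.Println("fast room processed while slow room is stuck")

	close(release)
	w.shutdown()
}
```

With this shape, a server-wide stall would point at something the workers share (a queue, a database connection pool, a common lock) rather than at the per-room goroutines themselves.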
once the server locks up again i can check, would you be so kind to provide instructions of how to check it ? |
how do i check this? i dont have any kinda monitoring software set up 😅 |
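For anyone without monitoring set up: a small sketch of reading one metric out of Prometheus text-format output, without a full Prometheus install. It assumes metrics are enabled in the Dendrite config and that you have fetched the metrics page over HTTP (the exact path and credentials depend on your config, so treat those as things to verify). `metricSum` is a hypothetical helper, and the sample text and its `room_id` label are made up for the demo. The metric name is the one mentioned further down this thread.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// metricSum adds up every sample of the named metric in Prometheus
// text-format output. The bool is false if no sample was found.
func metricSum(body, name string) (float64, bool) {
	var sum float64
	found := false
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		rest, ok := strings.CutPrefix(line, name)
		if !ok || rest == "" || (rest[0] != ' ' && rest[0] != '{') {
			continue // a different metric, or a longer name sharing the prefix
		}
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:] // drop the label set
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		sum += v
		found = true
	}
	return sum, found
}

func main() {
	// Made-up sample text, not real Dendrite output.
	sample := `# TYPE dendrite_roomserver_input_backpressure gauge
dendrite_roomserver_input_backpressure{room_id="!a:example.org"} 3
dendrite_roomserver_input_backpressure{room_id="!b:example.org"} 1
some_other_metric 7
`
	fmt.Println(metricSum(sample, "dendrite_roomserver_input_backpressure"))
}
```

Running it once while the server is healthy and once while it is hung gives two numbers to compare, which is enough to answer the question without graphs.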
Also seeing this a few times every hour, only related logs at that time
|
dendrite/roomserver/internal/input/input.go Lines 457 to 471 in add73ec
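Is the relevant code here. I added some more information to the errors and it is timing out on NextMsgWithContext, which basically means the roomserver is processing a different event for the room in question. Note that this can also cause dupe messages, as i.e. the client attempts to resend the message while still processing the first message. We could make message sending async, but that probably confuses clients (and definitely will confuse Sytest) and might cause some bad UX. (i.e. you think the message got sent, but it actually was not yet sent) |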
is it possible to at least make it not prevent sync of messages in other rooms?
…On Sun, Jan 12, 2025 at 8:51 PM Till ***@***.***> wrote:
https://github.com/element-hq/dendrite/blob/add73ec8661b2a156f6d217594fe3471951cfb08/roomserver/internal/input/input.go#L457-L471
Is the relevant code here. I added some more information to the errors and
it is timing out on NextMsgWithContext, which basically means the
roomserver is processing a different event for the room in question. Note
that this can also cause dupe messages, as i.e. the client attempts to
resend the message while still processing the first message.
We *could* make message sending async, but that probably confuses clients
(and definitely will confuse Sytest) and might cause some bad UX. (i.e. you
think the message got sent, but it actually was not yet sent)
|
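@S7evinK looks like dendrite_roomserver_input_backpressure goes up and all SendEvents/ProcessRoomEvent stop entirely. I'm not sure if my graphs/data is perfect, but it lines up with what you were asking. |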
Also I just noticed that when the server is hung media is unable to load. I don't see how that's being obstructed
…On Sun, Jan 12, 2025 at 8:57 PM Daniel Mason ***@***.***> wrote:
@S7evinK <https://github.com/S7evinK> looks like
dendrite_roomserver_input_backpressure goes up and all
SendEvents/ProcessRoomEvent stop entirely. I'm not sure if my graphs/data
is perfect, but it lines up with what you were asking.
image.png (view on web)
<https://github.com/user-attachments/assets/335ad237-729b-4aed-9c35-8efaf22f031e>
image.png (view on web)
<https://github.com/user-attachments/assets/8d4ce0db-e8fc-484b-8762-2403421ba751>
|
Yeah I suspect that's just a symptom of it locking up and the message request erroring out. |
Background information
go version: go version go1.23.0 linux/amd64
Description
Steps to reproduce