Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dendrite keeps dropping events or hangs their sending #3484

Open
TruncatedDinoSour opened this issue Dec 26, 2024 · 31 comments
Open

Dendrite keeps dropping events or hangs their sending #3484

TruncatedDinoSour opened this issue Dec 26, 2024 · 31 comments

Comments

@TruncatedDinoSour
Copy link

TruncatedDinoSour commented Dec 26, 2024

Background information

  • Dendrite version or git SHA: 0.14.0+add73ec
  • SQLite3 or Postgres?: Postgres
  • Running in Docker?: Unprivileged user.
  • go version: go version go1.23.0 linux/amd64
  • Client used (if applicable): Schildichat, Element, Hydrogen, or CinnyChat is what most people use I believe on the HS

Description

  • What is the problem: When sending messages, sometimes out of random, Dendrite will hang and not send the messages. In encrypted rooms (or DMs, same thing), Dendrite will send the message on retry followed by a 'Unable to decrypt message' event (see attached image) and in non-encrypted rooms Dendrite will just send the message twice. Choosing to 'delete' does not send the events (??? or does it ??? it seems partial ???).
  • Who is affected: All users sending any kind of events. Whether it is a message, file upload, or message removal.
  • How is this bug manifesting: Impacts the usage of the homeserver, appearing as hanging events in your client, which require to be resent.
  • When did this first appear: Recently, maybe after upgrading to 0.14.0. I cannot recall whether this was back when we were on 0.13.x when we pulled dendrite master (element-hq version this time, since before when we were on matrix-org it was fine), or after we upgraded. Regardless, the search room is essentially 0.13.8 (pre- element-hq, 79b87c7) to 0.14.0 :)

image

Steps to reproduce

  • Open a Matrix client
  • Go to a room (encrypted or not)
  • Talk in the room
  • Talk until after some time dendrite decides that it's not worth sending the event and just hangs until you resend the event.
@jjj333-p
Copy link
Contributor

I regularly find that my server will hang completely for around 5 minutes, during which time no messages will send and no sync will be responded to. resource usage is not hightened when this happens

@TruncatedDinoSour
Copy link
Author

happened once again

me :
image

person :
image

@jjj333-p
Copy link
Contributor

im not even reporting how often i get it, because its like at least 3 times an hour, its constant hanging to the point of being unusable. is there like a mutex thats not being unlocked or something??

@TruncatedDinoSour
Copy link
Author

im not even reporting how often i get it, because its like at least 3 times an hour, its constant hanging to the point of being unusable. is there like a mutex thats not being unlocked or something??

for me its not as constant, it happens every like 3 or 4 hours for like 10 minutes so even though its not as bad its still very annoying 😭

@PurpleBored
Copy link

PurpleBored commented Dec 28, 2024

im not even reporting how often i get it, because its like at least 3 times an hour, its constant hanging to the point of being unusable. is there like a mutex thats not being unlocked or something??

I do not use Matrix that much but pretty much I get it like every few times I use DMs pretty annoying

@jjj333-p
Copy link
Contributor

the reason im wondering if its a mutex issue or not, is i'm in a lot more public rooms with a lot more traffic than ari, and i get the issue more frequently, and resource usage doesnt seem abnormal when its hung. its usually at a high resource state (again seems to happen with frequent federation traffic) but neither cpu nor ram get pushed to the maximum by dendrite or postgres

@TruncatedDinoSour
Copy link
Author

the reason im wondering if its a mutex issue or not, is i'm in a lot more public rooms with a lot more traffic than ari, and i get the issue more frequently, and resource usage doesnt seem abnormal when its hung. its usually at a high resource state (again seems to happen with frequent federation traffic) but neither cpu nor ram get pushed to the maximum by dendrite or postgres

if it was a deadlock the whole homeserver would be frozen but it clearly goes back to normal after some time, however, you do raise an interesting point, because traffic may be related

@jjj333-p
Copy link
Contributor

thats why im thinking its a mutex, perhaps specific to client api. go has decent enough concurrency it probably wouldnt hold up the whole thing, and it would explain the lack of change in resource usage

@TruncatedDinoSour
Copy link
Author

TruncatedDinoSour commented Dec 28, 2024

hmm, i get what u mean now

@jjj333-p
Copy link
Contributor

i should also note that for me it seems to happen more frequently when messaging in large rooms. like i reliably get it sending any message in #python:matrix.org. but it does happen in smaller rooms obviously and not all large rooms seem to trigger it.

@TruncatedDinoSour
Copy link
Author

Relevant logs:

time="..." level=error msg="SendEvents failed" error="InputRoomEventsResponse: context canceled" req.id=... req.method=PUT req.path="/_matrix/client/r0/rooms/!.../send/m.room.encrypted/..." user_id="@ari:ari.lt"
http: superfluous response.WriteHeader call from github.com/element-hq/dendrite/mediaapi/routing.(*downloadRequest).jsonErrorResponse (download.go:210)

@jjj333-p
Copy link
Contributor

i dont notice mine getting canceled, but that might be nheko retrying continously until it goes through

@jjj333-p
Copy link
Contributor

okay dendrite is basically unusable for me at this point. its hanging for over 20 minutes at a time now, constantly. restarting dendrite doesnt help

@S7evinK
Copy link
Contributor

S7evinK commented Dec 31, 2024

Wondering if this is related to #3447 and fetching auth events. 😕

@jjj333-p
Copy link
Contributor

i mean i kinda had this issue even before then, but it didnt get real bad until 0.13.8+97706ff (i was on that pr version before then) so i dont think its specifically that pr. i just to try something left the genshin impact room (!AGeUOyHpLMMrLYAkXW:matrix.org, top of the room list) and so far seems immediately better but like its been 5 minutes.

my suspision is whatever is happening (i.e. a state fetch) isnt happening in async and is holding up the entire room which locks it up for basically 20 minutes at a time

again ive kinda aways had this issue of it locking up for a few minutes, its just locking up longer and more often

@TruncatedDinoSour
Copy link
Author

i personally only began having this issue ever since i upgraded dendrite from matrix-org to element-hq version,, before that it was fine for both i and the users, so maybe its not directly related to #3447 but rather something in the element-hq version that had caused event sendability to be affected by it

@jjj333-p
Copy link
Contributor

i am increasingly sure that evacuating the genshin impact room has mostly fixed the problem. idk whats up with that room making everything completely lock up. surely if its state is being processed ill still be able to sync events from other rooms right?

@S7evinK
Copy link
Contributor

S7evinK commented Dec 31, 2024

The roomserver by its nature only processes one event at a time for a room, so yea, it can lock up for one room while others work fine.

There weren't any changes that could explain hanging except for #3447

@jjj333-p
Copy link
Contributor

my point though is shouldnt it not lock up other rooms? because in its current state it seems to lock up all rooms

@TruncatedDinoSour
Copy link
Author

The roomserver by its nature only processes one event at a time for a room, so yea, it can lock up for one room while others work fine.

There weren't any changes that could explain hanging except for #3447

its not just one room, its all rooms, as in, the whole server just refuses to send events for a period of time

@jjj333-p
Copy link
Contributor

and in my case it appears to stop responding to sync, however clients dont seem to go offline so its also possible that its just getting empty syncs. are you sure that the roomserver doesnt block the entire server for each event in each room?

@S7evinK
Copy link
Contributor

S7evinK commented Dec 31, 2024

Pretty sure, every room has it's own goroutine and own actor. Wondering if it may be the updated internal NATS server blocking. But from the looks of it, we'd not only return context canceled. (only on mobile, so may have missed something)

When the server locks up, what does the metric dendrite_roomserver_input_backpressure look like?

@TruncatedDinoSour
Copy link
Author

Pretty sure, every room has it's own goroutine and own actor. Wondering if it may be the updated internal NATS server blocking. But from the looks of it, we'd not only return context canceled. (only on mobile, so may have missed something)

When the server locks up, what does the metric dendrite_roomserver_input_backpressure look like?

once the server locks up again i can check, would you be so kind to provide instructions of how to check it ?

@jjj333-p
Copy link
Contributor

When the server locks up, what does the metric dendrite_roomserver_input_backpressure look like?

how do i check this? i dont have any kinda monitoring software set up 😅

@idanoo
Copy link

idanoo commented Jan 13, 2025

Also seeing this a few times every hour, only related logs at that time

Jan 12 21:22:58 XX dendrite[230286]: time="2025-01-12T21:22:58.855774245Z" level=error msg="SendEvents failed" error="InputRoomEventsResponse: context canceled" req.id=XXX req.method=PUT req.path="/_matrix/client/v3/rooms/!lXXX:XXX.com/send/m.room.encrypted/mXXXX.XX" user_id="@XXX:XXX.com"

@S7evinK
Copy link
Contributor

S7evinK commented Jan 13, 2025

// Otherwise, we'll want to sit and wait for the responses
// from the roomserver. There will be one response for every
// input we submitted. The last error value we receive will
// be the one returned as the error string.
defer replySub.Drain() // nolint:errcheck
for i := 0; i < len(request.InputRoomEvents); i++ {
msg, err := replySub.NextMsgWithContext(ctx)
if err != nil {
response.ErrMsg = err.Error()
return
}
if len(msg.Data) > 0 {
response.ErrMsg = string(msg.Data)
}
}

Is the relevant code here. I added some more information to the errors and it is timing out on NextMsgWithContext, which basically means the roomserver is processing a different event for the room in question. Note that this can also cause dupe messages, as i.e. the client attempts to resend the message while still processing the first message.

We could make message sending async, but that probably confuses clients (and definitely will confuse Sytest) and might cause some bad UX. (i.e. you think the message got sent, but it actually was not yet sent)

@jjj333-p
Copy link
Contributor

jjj333-p commented Jan 13, 2025 via email

@idanoo
Copy link

idanoo commented Jan 13, 2025

@S7evinK looks like dendrite_roomserver_input_backpressure goes up and all SendEvents/ProcessRoomEvent stop entirely. I'm not sure if my graphs are perfect, but it lines up with what you were asking. Ignore the titles/duplicate labels, I threw this together quickly to get some data. (Sorry for deleting/reposting message, wanted to confirm data)

First
image

Second
image

An error from a client when this happens
image

@jjj333-p
Copy link
Contributor

jjj333-p commented Jan 14, 2025 via email

@TruncatedDinoSour
Copy link
Author

TruncatedDinoSour commented Jan 14, 2025

An error from a client when this happens

Image

its actually different in my case ?

nowever makes sense since i resent the event since og failed

@idanoo
Copy link

idanoo commented Jan 14, 2025

Yeah I suspect that's just a symptom of it locking up and the message request erroring out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants