Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicated messages in mam_message table #4316

Open
rthouvenin opened this issue Jul 4, 2024 · 16 comments
Open

Duplicated messages in mam_message table #4316

rthouvenin opened this issue Jul 4, 2024 · 16 comments

Comments

@rthouvenin
Copy link

rthouvenin commented Jul 4, 2024

MongooseIM version: 6.2.1
Installed from: Docker image
Erlang/OTP version: provided by the docker image

In case this is relevant: I am using JWT authentication

I am essentially facing the problem described in #4128.

On some occasions, all the messages of a conversation received by a user get duplicated in database in the mam_message table. This does not happen consistently (was not able to describe a reproducing scenario yet), and happens only when stream management is enabled and I think this happens when one of the users requests the conversation history. This makes me think there is a bug when there is a problem in delivering the history response that triggers a duplication of the history in DB.

When this happens for a user, it is likely to happen again soon, often resulting in dozens of copies of all received messages in the conversation. All the messages (same origin_id) are identical except for the timestamp, which seems to be the time when the duplication occured.
So for example if a user received a single message in a conversation, and the bug occurs 3 times for that conversation, you would see something like the below in the mam_message table:

id |  origin_id   | direction | timestamp (extracted from mongooseimctl)
1  | message-id-1 | I         | 2024-07-04 14:53:42
2  | message-id-1 | I         | 2024-07-04 14:55:12
3  | message-id-1 | I         | 2024-07-04 15:03:27
4  | message-id-1 | I         | 2024-07-04 15:03:27
5  | message-id-1 | I         | 2024-07-04 18:25:51
6  | message-id-1 | I         | 2024-07-04 18:25:51
7  | message-id-1 | I         | 2024-07-04 18:25:51
8  | message-id-1 | I         | 2024-07-04 18:25:51
@rthouvenin
Copy link
Author

I have not been able to find a reproducing scenario, but I was able to reproduce a few times by abruptly killing connections from the client side and creating new ones immediately after. This does not trigger the problem consistently though.

During these attempts I gathered as much traces and logs as I could to try to get clues about what's going on. Here is what I found that might be relevant:

  • Packet content: I enabled mam_message_xml as the db_message_format so that I could see better directly from the DB, and I noticed that the duplicated messages have the same id as the original one but
    • Have a from attribute that the original does not have
    • Have a stanza-id element that the original does not have.

Here is an example:

<!-- ORIGINAL -->
<message xmlns='jabber:client' 
         id='6d104193-5155-42c6-aaa2-e558c4079bc6' 
         to='c884302f-7797-4ca4-b68c-3a431a7d9024@localhost' 
         type='chat'>
    <origin-id xmlns='urn:xmpp:sid:0' id='6d104193-5155-42c6-aaa2-e558c4079bc6'/>
    <body>THIS IS MESSAGE 2!</body>
    <markable xmlns='urn:xmpp:chat-markers:0'/>
</message>

<!-- DUPLICATE -->
<message from='b4958e4b-fa52-46db-972e-9ad288776e62@localhost/1720-704391-9018-d7c41bad90715457' 
         to='c884302f-7797-4ca4-b68c-3a431a7d9024@localhost' 
         xmlns='jabber:client' 
         id='6d104193-5155-42c6-aaa2-e558c4079bc6' 
         type='chat'>
    <origin-id xmlns='urn:xmpp:sid:0' id='6d104193-5155-42c6-aaa2-e558c4079bc6'/>
    <body>THIS IS MESSAGE 2!</body>
    <markable xmlns='urn:xmpp:chat-markers:0'/>
    <stanza-id by='c884302f-7797-4ca4-b68c-3a431a7d9024@localhost' id='C77OREVP3K01' xmlns='urn:xmpp:sid:0'/>
</message>

The added stanza id is not the same for all duplicated messages. For instance, the message of my above example got duplicated 98 times, but there are 34 different stanza ids in the duplicate messages, each being present from 2 to 9 times. From what I can see the same stanza id is NOT used for the duplicates of two different original messages.

  • When the duplication occurs: in my original post I mentioned that the duplication seems to occur when querying the mam history, but now I have witnessed that this is NOT true. Rather, this seems to occur when a new client connects after the previous one disappearing abruptly. In the server logs, I can see entries with what=cc_send_copies, trying to send copies to resources that were killed. I also see entries with what=mam_user_receive_packet for packets that were sent much earlier and already received and stored in mam_message. These log entries do show the extra stanza id element.

  • Which messages are duplicated: Only the mam side of the user who disconnects abruptly gets duplicated. So for example, if user 1 sends messages to user 2 and the disconnection-reconnection that triggers the bug is from user 2, then only the mam_message records of user 2 get duplicated. The duplication seems to be for messages in both directions: in my test user 2 sent received and displayed markers to user 1, and these markers got duplicated as well (but again only for user 2).
    Note though that NOT all messages of user 2 got duplicated : in my test, user1 sent 5 messages to user2 before I could reproduce, and only messages 2 to 5 got duplicated (and their associated markers as well).

  • Chat states: Even though chat states are not stored by the mam module, I can see that they are also re-sent to the newly connected clients, I imagine this is part of the standard carbon copy process.

All of the above make me say: it looks like mongooseim keeps somewhere in memory or some transient storage a queue of stanzas even after they were sent and acknowledged (as mentioned in my original post, this seems to occur only when stream management is enabled), and these stanzas get routed as carbon copies to new clients of the same user. If they don't get acked by the new client, they get stored to the mam table again. This is of course only a rough idea of what I understand, in the hope that it may help someone with better knowledge of MongooseIM internals to understand the problem.

Let me know if there is more information that I can provide or experiments that I can do (bear in mind that I still don't know how to reproduce consistently). In my latest example, I have server logs (debug level), CSV dump of the mam_message table, websocket traces of clients.

@JanuszJakubiec
Copy link
Contributor

Hi, thank you for bringing this issue to our attention. We are currently investigating it and will keep you updated on our progress.

@rthouvenin
Copy link
Author

Thanks for looking at it. Something happened to me yesterday that might help the investigation:

I wanted to switch the storage format of the mam table to XML for a test instance, so I truncated the tables mam_message and inbox before changing the configuration. I noticed that some old messages re-appeared in the mam_message table, shortly after I emptied the table. The messages were all copies of a single old message (sent in July), which belonged to a user that had connected and disconnected just before I emptied the table. The disconnection was not clean (the user killed the client because of a bug).

I wanted to share this because it gives more evidence towards some of my guesses I made above : happens when a client abruptly disappears, and the messages are copied from a transient source (they could not be copied from DB since the table was empty). Hope this helps.

@gustawlippa
Copy link
Contributor

Hi @rthouvenin , I'm looking into this issue now. I have some comments, and a few questions which may help me find the reason, as I'm trying to reproduce the scenario.

  1. I think the cc_send_copies you see is not an issue. Carbon Copies would wrap the message in a forwarded element which I don't see in either my tests or your report. No CC is being actually send, and it's just an (unfortunately phrased) log CC message. If the targets= entry has no value, no CC will be sent, regardless of what's the value of resources=. Could you confirm that empty targets= in the logs is the case for you as well?
  2. I have found that when resending messages that were unacked from the stream management perspective, for some reason MAM thinks that each outgoing message is a new one, and gets a new ID. The stanza-id that you see in the message, is added by MAM, for users to be able to retrieve them later. Could you confirm, that the stanzas before disconnecting have indeed been acknowledged successfully? If they have, I'm not sure what's going on, and have not been able to reproduce this behaviour for now. I have however reproduced that MIM will try to resend an unacked message stanza to a quickly reconnected client, even if it does not reenable stream management. No delay or any indication that it is a retransmission is added as well. I believe this is a root cause, together with the fact, that MAM saves those retransmitted messages.
  3. I have confirmed that this does not happen without SM, and happens regardless of Carbon Copies and other modules being enabled/disabled.

@rthouvenin
Copy link
Author

Thank you @gustawlippa for looking into this problem. I spent some more time trying to understand the reproducing scenario, below are my answers:

  1. Yes I confirm that targets is always empty in the cc_send_copies log records.
  2. Your idea sounded promising, but unfortunately as far as I can see, all the stanzas have been acked from a SM perspective when the disconnection happens.

To reproduce and look at the ACK exchanges, here is what I did:

  1. Start from a MIM database with empty mam_message and inbox tables, and a container freshly restarted
  2. Connect user1 and send a few messages from user1 to user2 (who is offline), disconnect gracefully user1
  3. Connect user2, display messages from user1, disconnect gracefully
  4. Connect user1, sent 2 more messages to user2 and disconnect gracefully
  5. Connect user2 without displaying the 2 new messages (see below for the reason why), and refresh the page a few times at short (< 2s) intervals (this is a web client from a browser)

The web client I am using has the following behavior: when a user connects, it opens a session with the server and then:

  • retrieve the blocklist of the user
  • retrieve the inbox of the user
  • if an inbox conversation has unread messages, mark the last message (provided in the inbox response) as received
  • if an inbox conversation has unread messages, retrieve the messages history for that conversation (this is to count the number of actual unread messages, because the MIM inbox counts retraction messages as readable messages)

The fact that the conversation has several messages and the fact that it has unread messages means that the client initialization takes longer, and I thought it would give me enough time to kill the client before the initialization is complete. I did notice that I was not able to reproduce when the conversation is read, but interestingly when I did manage to reproduce, the client initialization actually completed... So I'm not sure what to think...

Anyway, I am attaching the output of mongooseimctl stanza getLastMessages after duplication, so you can see which messages got duplicated and when. I am also attaching a HAR file of all the XMPP sessions from user2 from step 5. You might want to search-and-replace these outputs with more user-friendly ids and jids for an easier investigation.

Some things worth mentioning that I found:

  • The first duplicated messaged happened at 2024-09-12T06:51:16.847077Z, which is during session 7.
  • This duplicated message is not sent to the client during session 7
  • But it is sent during session 8, before the client requests any mam message, and together with a few duplicate client state messages from user1 (which is offline at this stage).
  • For this duplicated message, mongooseimctl reports a stanza id C8F6H228RP81, but the XML body has a stanza-id element with value C8F6FGK2B0G1, which is the stanza id of the original message.
  • The message sent to the client during session 8 has a stanza-id element with value C8F6H228RP81.
  • The duplicated message is not sent to the client during session 8 as part of the mam response
  • It is only during session 9 that the client receives the duplicated message in the mam response, with stanza id C8F6FGK2B0G1

Let me know if there is any more info I can provide!

@gustawlippa
Copy link
Contributor

Thank you for such a detailed response @rthouvenin, it really helps. Your effort is greatly appreciated!

I have to ask just to clarify, because I see that you use Stream Management, Chat Markers and Inbox. When you say that you mark messages as received, I assume you mean in the Chat Markers sense, that is sending a stanza with a displayed element? I'm afraid this does not affect Stream Management, and may be one of the reasons you see the issue (other being a bug in MIM).

In the testing that you've performed, does the client respond to the Stream Management request <r xmlns='urn:xmpp:sm:3'/> stanzas that MIM is sending? A response would look like this: <a xmlns='urn:xmpp:sm:3' h='1'/> - h value may be different. This is a different way of handling message acknowledgements, and does not really relate to Inbox/Chat Markers. I can't see those stanzas in the logs you've provided as those are not messages.

I've found an issue which can occur when reconnecting, and before a client managed to ack the stanza in the Stream Management sense. I think it's similar to what you are describing (same MAM doubled messages) and can reproduce it. I'm close to fixing it, and hopefully this is the same issue that you are experiencing. If you need a solution quickly, or can experiment a bit more, I've found that setting the https://esl.github.io/MongooseDocs/latest/modules/mod_mam/#modulesmod_mampmsame_mam_id_for_peers should remedy the issue I found, but I will post a real fix as soon as I find it.

@rthouvenin
Copy link
Author

When you say that you mark messages as received, I assume you mean in the Chat Markers sense, that is sending a stanza with a displayed element?

That's right, more exactly: sending a stanza with a received element. But I described this behavior only to give a bit more context and help you make sense of what's happening if you looked at the HAR file, I am aware of the difference with Stream Management. I did check that all the SM requests sent by the server were replied with the correct h value, and I could not see any unacked stanza. You can double-check this in the HAR file. Sorry if providing more context turned out more confusing than anything!

Just in case you are not familiar with HAR files, this is notably the format used by browsers to export network logs. You can open it by going to the network tab of Chrome (I believe it works with other browsers) and find the button "Import HAR file". You can also open it with a text editor of your choice, this is just a plain JSON file, and look at the _webSocketMessages arrays.

Thank you for the possible workaround, I will try it later this week to confirm that this is indeed the same issue.

@rthouvenin
Copy link
Author

Hello again, something interesting happened to me just now.

After reproducing the problem this morning as described above, some messages got duplicated a number of times, and the total number of messages in mam_messages was 84.
A few minutes ago, I reconnected with user2, and displayed the conversation with user1. Within a few seconds, I noticed that the server was sending the same duplicate messages again, in large quantities. I killed the client, and when I looked in the mam_message table, I saw more than 16000 thousands messages. The number of records kept increasing in the table for some time (less than a minute) after the client had disconnected, stopping at about 20000 messages in the table.

Unfortunately I had not foreseen this and did not capture any trace or log, but I thought this might give some more clues about the issue.

@gustawlippa
Copy link
Contributor

Hi @rthouvenin, thanks for the additional info, and tips for the HAR files!
We've pushed a fix to the issue that I've found. It should be already available on the DockerHub under the latest tag. The explosion of MAM messages is very worrying, but perhaps the issues you are seeing all have the same underlying cause that should be gone. Could you please try to reproduce (any of) the bugs you've seen, or confirm that they are gone using the latest Docker image?
Thank you!

@rthouvenin
Copy link
Author

@gustawlippa Thank you for working on this and pushing a fix. With the latest docker image that I pulled just a few minutes ago, I was still able to reproduce the bug of duplicated messages. This time again, the number of MAM messages exploded to several thousands :(

Could it be that another image was pushed with tag latest and without your fix? If that is the case, feel free to let me know a more precise tag or a digest that I should use, and I'll try again. The digest of the image that I used is sha256:b55a7f9426b4807c2824a6653f048f581ea4939c51c9acc3db54463f17662cd9.

@gustawlippa
Copy link
Contributor

Hello @rthouvenin,
sorry for late response, but currently I don't have much availability.

You have used the correct Docker image. My fix didn't work, but we have the task to revisit this bug our internal backlog. I believe the bug wasn't fixed, because of how websockets connections are handled in MIM. I focused on the XMPP connections logic, and I thought this would be enough. I hope we will find time to fix it until the end of the year.

@rthouvenin
Copy link
Author

Thank you for the update

@gustawlippa
Copy link
Contributor

Hello @rthouvenin,
I'm back trying to find the cause for this issue :)
Thanks for all the data, I'm trying to reproduce the scenario you've described. So far no luck with seeing doubled messages. Could you please provide the TOML configuration file (mongooseim.toml) that you use? I just want to make sure I enable the same modules when trying to reproduce the bug.

@rthouvenin
Copy link
Author

Hi @gustawlippa, sure thing: mongooseim.toml.txt (I renamed to .txt because Github would refuse it otherwise).

This is the file I use for my local tests (in a Docker container, using the official image), the only changes I've made before uploading here are the JWT secret and the RDBMS connection settings. For transport, I use only websockets initiated over HTTP (port 5280).

I hope this helps

@gustawlippa
Copy link
Contributor

Hi, for now I'm not able to reproduce the issue by trying to emulate your client's behaviour, using the same set of Mongoose modules enabled.

If you are still able to reproduce the issue, could you please provide debug logs collected while it happens? This would be of much help to us. To enable debug logs, please set the loglevel = "debug" option in mongooseim.toml. The logs are stored in the /usr/lib/mongooseim/mongooseim.log.* files.

As an additional debugging step, which could be a bit more involved, but also very useful, we could use the tracing application included in MongooseIM. If you were willing, the trace can be captured by entering the Erlang shell by running mongooseimctl debug and then executing: tr:start(#{limit => 100000}), tr:trace_app(mongooseim). This starts tracing, and if the issue is reproducible in a relatively short time after this command is run, full view of what Mongoose was doing will be captured. After that, the trace can be saved to a file by running (still in the Erlang shell) tr:stop_tracing(), tr:dump("trace.ets"). The trace is then saved to the /usr/lib/mongooseim/trace.ets file. The number of traced function calls is limited to 100000, because the trace files can get big, and may affect performance. If the issue is not so easily reproducable, the limit may have to be increased. If you would be so kind to collect and share such a trace with us, we would greatly appreciate it.

As for fixes to the underlying issue, I can't yet see a reason why an unacked stanza would be resent by stream management, but maybe with more info we'll figure it out. So if you'd be able to provide either the logs, the trace, or both to us, we would be grateful.

@rthouvenin
Copy link
Author

Hi, I meant to come back to you earlier but I've been quite busy with other tasks, sorry for that.
I did manage today to reproduce the problem while capturing the trace. I had to put the calls limit to 10,000,000 because with 1,000,000 the tracing aborted while the replication started to show up, so I was not sure you would get all the information you need.

What I did is I tried to play the scenario that most often allowed me to reproduce, and on the side I was looking at the count of rows in mam_message, refreshed every second. When I saw the count increased abnormally, I stopped and dumped the tracing.
The count of mam_messages exploded to very high number like I mentioned earlier, but this happened after I stopped the tracing.

Here is the trace dump that I uploaded on my personal server (N.B.: it is 2.5GB, I hope this is usable!): https://next.thouvenin.cloud/s/dKNMYYaR9kEbQRS
And here is the log file with debug level:
mim.debug.log

Here are some approximate timestamps when I interruped abrutly the client connection and I saw the count of mam_messages increase (I am not sure about the first two):

  • 2024-12-17 16:10:16
  • 2024-12-17 16:10:24
  • 2024-12-17 16:10:36
  • 2024-12-17 16:10:44
  • 2024-12-17 16:11:00
  • 2024-12-17 16:11:15

Let me know if I can provide more info!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants