feat: improve replication reliability with transaction acknowledgment #74
base: master
Conversation
Add proper error handling and transaction acknowledgment in the WAL replication system. Changes include:

- Propagate errors through the event processing chain
- Only acknowledge transactions after successful processing
- Track and update WAL positions correctly
- Add debug logging for WAL position updates

This improves system reliability by ensuring consistent transaction processing and maintaining accurate WAL positions.
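The "only acknowledge after successful processing" idea from the description could be sketched like this (module, function, and field names here are illustrative, not walex's actual API):

```elixir
# Hypothetical sketch: only advance and acknowledge the WAL position
# when the handler reports success; otherwise keep the last good LSN.
defmodule AckAfterProcess do
  def handle_transaction(txn, process_fun, state) do
    case process_fun.(txn) do
      :ok ->
        # Processing succeeded: advance to the transaction's end LSN
        # so it gets acknowledged to Postgres.
        {:ack, %{state | wal_position: txn.end_lsn}}

      {:error, reason} ->
        # Do not acknowledge; the position stays at the last
        # successfully processed LSN.
        {:noack, reason, state}
    end
  end
end
```

On restart, Postgres redelivers everything after the last acknowledged position, so an unacknowledged failed transaction is seen again.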
Spicy decisions
I think as written this would be a breaking change. People might have handlers returning stuff like nil, and then it goes boom. If that's a problem I could try to build opt-in configs, or we could just bump the major version. I still want to add test cases for this behavior before coming out of draft.
Awesome, much appreciated contribution. I'll spend some time testing it.
I have a test case locally, but Postgres doesn't seem to behave like I expected. When I send back the old WAL position, it's like: okay, cool. But it never sends the messages again. I'm not sure what to do when there is WAL drift. Maybe disconnect? Also, when should I disconnect? If it's on the next keepalive, then more messages could have been processed and now we are in a weird state.
That's a good question. Any opinion @chasers @DaemonSnake?
lib/walex/replication/server.ex (Outdated)

```elixir
[
  <<?r, state.wal_position::64, state.wal_position::64, state.wal_position::64,
```
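For context, the `<<?r, ...>>` binary above is a Standby Status Update message from the streaming replication protocol: the byte `r`, then the written/flushed/applied LSNs as three 64-bit integers, a timestamp in microseconds since 2000-01-01, and a reply-requested flag. A self-contained sketch of building one (the module name is hypothetical):

```elixir
# Hypothetical helper for building a Standby Status Update ('r') message,
# following the shape of the snippet above.
defmodule StandbyStatus do
  # Postgres timestamps in this message count microseconds since 2000-01-01.
  @pg_epoch ~U[2000-01-01 00:00:00Z]

  # `wal_position` is the LSN being acknowledged, as a 64-bit integer.
  # Sending the same LSN for written/flushed/applied tells the primary
  # everything up to that point has been fully processed.
  def build(wal_position, reply_requested? \\ false) do
    timestamp = DateTime.diff(DateTime.utc_now(), @pg_epoch, :microsecond)
    reply = if reply_requested?, do: 1, else: 0

    <<?r, wal_position::64, wal_position::64, wal_position::64,
      timestamp::64, reply::8>>
  end
end
```

This is why only acknowledging after successful processing matters: whatever LSN goes into this message is what the primary considers safely applied.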
My instinct here, having no experience with Postgres replication 😁, is to immediately disconnect and reconnect on error from middleware.

Pro:
Con:
Other options would be:
FYI, I tried setting up something similar in the past but abandoned the idea as we found it to be somewhat buggy. There are a lot of unknowns for me on the impact of blocking Postgres replication and how, in turn, it treats the replica.
- ack the LSN for the end of the transaction, not the beginning
- adding a test case (currently failing)
Drift how exactly? If the replication slot is temporary, it's dropped on disconnect and you can't catch up afterwards. If the slot is permanent, Postgres retains the WAL and you can reconnect and catch up from the last acknowledged position.

There may be valid reasons for both setups. E.g. if you're just busting caches a temporary slot is probably fine, but if you're replicating a table and you want guarantees then you'll want to be able to catch up.
jfyi, pg_logical_slot_peek_changes() could be useful for testing. You have to call it from the replication slot connection.
This makes sense. Two different setups have different error handling needs. Most of the changes in this PR are just threading error states back to server.ex so we can make decisions about what to do when an event handler errors. I hit a wall because I'm not actually sure what to do when a handler errors. In your two examples I can speculate, maybe:

Cache busting:

Replication:
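The two policies being discussed could be sketched as a simple dispatch on a mode (the `mode` atom and the handler result shapes are assumptions, not the library's actual API):

```elixir
# Hypothetical sketch of per-setup error policies for handler results.
defmodule ErrorPolicy do
  # Cache busting: losing an event is acceptable, so ack anyway
  # and move on to the next transaction.
  def handle_result({:error, _reason}, :cache_busting, wal_position),
    do: {:ack, wal_position}

  # Replication: do not ack. Disconnect so that on reconnect the slot
  # redelivers everything after the last confirmed position.
  def handle_result({:error, reason}, :replication, _wal_position),
    do: {:disconnect, reason}

  # Success: acknowledge in either mode.
  def handle_result(:ok, _mode, wal_position),
    do: {:ack, wal_position}
end
```

This keeps the "what to do on error" decision in one place, which matches the PR's approach of threading error states back to server.ex.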
Reconnecting to the same replication slot seems like a valid way to do it. Perhaps there is an official way to ask a slot to try again.
Your handler may not have this behavior though, and perhaps it's better to reconnect, or to force the replication slot to be the thing that triggers the error. Unsure if it will start again.
ChatGPT says:
But its knowledge of Postgres details is very suspect, so take it with a grain of salt.
The wal_sender_timeout config seems to confirm this: if you do not ack a WAL record within 60 seconds (the default), the primary will disconnect the subscriber.
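For reference, that setting lives in postgresql.conf; the sketch below shows the default (not a value from this repo):

```
# postgresql.conf
# Terminate replication connections that are inactive longer than this.
# Default is 60s; 0 disables the timeout entirely.
wal_sender_timeout = 60s
```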
closes #73