feat: improve replication reliability with transaction acknowledgment #74

Draft
wants to merge 2 commits into master

Conversation

trashhalo

Add proper error handling and transaction acknowledgment in the WAL replication system. Changes include:

  • Propagate errors through event processing chain
  • Only acknowledge transactions after successful processing
  • Track and update WAL positions correctly
  • Add debug logging for WAL position updates

This improves system reliability by ensuring consistent transaction processing and maintaining accurate WAL positions.

closes #73

@trashhalo
Author

Spicy decisions

  • handlers must return :ok or {:ok, term}, otherwise it's considered an error scenario (rough sketch below)
  • once a handler returns an error state, the transaction stops; no other handlers are processed for that tx. Partially processed WAL logs seem like a bad idea.
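
A rough sketch of what that contract could look like, purely for illustration (these module and function names are made up, not the library's actual internals):

```elixir
# Illustrative only: treat anything other than :ok or {:ok, term} as an
# error, and stop the remaining handlers for the transaction on the first
# error.
defmodule HandlerContractSketch do
  def normalize(:ok), do: :ok
  def normalize({:ok, _term}), do: :ok
  def normalize({:error, _reason} = error), do: error
  def normalize(other), do: {:error, {:unexpected_return, other}}

  def run_handlers(handlers, event) do
    Enum.reduce_while(handlers, :ok, fn handler, :ok ->
      case normalize(handler.(event)) do
        :ok -> {:cont, :ok}
        {:error, _} = error -> {:halt, error}
      end
    end)
  end
end
```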

I think as written this would be a breaking change. People might have handlers returning stuff like nil, and then it would go boom. If that's a problem I could try to build opt-in configs, or we could just bump the major version.

I still want to add test cases for this behavior before coming out of draft.

@cpursley
Owner

Awesome, much appreciated contribution. I'll spend some time testing it.

@trashhalo
Author

I have a test case locally, but Postgres doesn't seem to behave like I expected. When I send back the old WAL position it's like "okay cool", but it never sends the messages again.

I'm not sure what to do when there is WAL drift. Maybe disconnect? Also, when should I disconnect? If it's on the next keepalive, then more messages could have been processed and now we are in a weird state.

@cpursley
Owner

I'm not sure what to do when there is WAL drift. Maybe disconnect? Also, when should I disconnect? If it's on the next keepalive, then more messages could have been processed and now we are in a weird state.

That's a good question. Any opinion @chasers @DaemonSnake ?

Review comment on the line:
<<?r, state.wal_position::64, state.wal_position::64, state.wal_position::64,
Contributor


This needed to be wal_position + 1, I think.
[Screenshot attached: 2024-12-30 at 16:08:25]
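
For reference, a minimal sketch of the ack with the +1 applied. The field layout follows the streaming replication protocol's Standby Status Update message (the reported positions are "last WAL byte + 1"); the module and function names are made up:

```elixir
defmodule AckSketch do
  # microseconds between the Unix epoch and Postgres' 2000-01-01 epoch
  @pg_epoch_offset 946_684_800_000_000

  # Build a Standby Status Update ('r') acknowledging WAL through `lsn`:
  # written, flushed, and applied positions, a timestamp, and a reply flag.
  def standby_status_update(lsn) when is_integer(lsn) do
    ack = lsn + 1
    <<?r, ack::64, ack::64, ack::64, pg_now()::64, 0>>
  end

  defp pg_now do
    # Postgres expects microseconds since 2000-01-01 00:00:00 UTC
    System.os_time(:microsecond) - @pg_epoch_offset
  end
end
```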

@trashhalo
Author

I'm not sure what to do when there is WAL drift. Maybe disconnect? Also, when should I disconnect? If it's on the next keepalive, then more messages could have been processed and now we are in a weird state.

That's a good question. Any opinion @chasers @DaemonSnake ?

My instinct here, having no experience with Postgres replication 😁, is to immediately disconnect and reconnect on error from middleware.

Pro

  • Doing that gives me the option to specify the WAL location I failed at, allowing me to try again

Con

  • Depending on the error, it could just happen again, putting you in a loop
  • I'm not confident that temporary WAL slots keep old logs around to give you the message again, so the message may not be resent

Other options would be

  • some sort of persisted buffer I maintain in publisher.ex and use for retries
  • don't try to recover, just crash the process (rough sketch below)
  • only allow this behavior on durable slots
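
To make the crash option concrete, a rough sketch of the shape it could take (none of these names come from the library; the point is just to fail with the last acknowledged position so a supervisor restart can re-subscribe from there):

```elixir
defmodule ReconnectSketch do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, %{wal_position: Keyword.get(opts, :restart_lsn, 0)}}

  @impl true
  def handle_info({:wal_event, lsn, event}, state) do
    case run_handlers(event) do
      :ok ->
        {:noreply, %{state | wal_position: lsn}}

      {:error, reason} ->
        # Crash; the supervisor restarts the process, which can then
        # reconnect starting from the last acknowledged wal_position.
        {:stop, {:handler_error, reason, state.wal_position}, state}
    end
  end

  # stand-in for the real event-handler dispatch
  defp run_handlers(_event), do: :ok
end
```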

@DaemonSnake
Contributor

FYI, I tried setting up something similar in the past but abandoned the idea as we found it to be somewhat buggy.
I looked back at our old implementation and we also forgot the +1, so maybe it's just that.

There are a lot of unknowns for me on the impact of blocking Postgres replication and how, in turn, it treats the replica.

- ack the LSN for the end of the transaction, not the beginning
- adding a test case (currently failing)
@chasers

chasers commented Dec 31, 2024

I'm not sure what to do when there is WAL drift. Maybe disconnect?

Drift how exactly?

If the replication slot is temporary, then disconnecting and reconnecting will reset the replication state on Postgres, and the WAL for that slot will be deleted.

If the slot is permanent then you will still have to catch up.

There may be valid reasons for both setups, e.g. if you're just busting caches a temporary slot is probably fine, but if you're replicating a table and you want guarantees then you'll want to be able to catch up.
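
For anyone following along, the two setups correspond to temporary vs. permanent logical replication slots. A hedged illustration with made-up slot and database names (in practice the replication connection usually creates its own slot, and a temporary slot created over plain SQL only lives as long as that session):

```elixir
{:ok, conn} = Postgrex.start_link(database: "my_app_dev")

# Temporary slot: state is dropped when the creating session ends, so there
# is nothing to catch up on after a disconnect. Probably fine for cache busting.
Postgrex.query!(
  conn,
  "SELECT pg_create_logical_replication_slot('cache_bust_slot', 'pgoutput', true)",
  []
)

# Permanent slot: Postgres retains WAL until the subscriber acks it, so a
# reconnecting subscriber can catch up.
Postgrex.query!(
  conn,
  "SELECT pg_create_logical_replication_slot('replication_slot', 'pgoutput')",
  []
)
```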

@chasers

chasers commented Dec 31, 2024

JFYI, pg_logical_slot_peek_changes() could be useful for testing.

You have to call it from the replication slot connection.
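
A rough example of what that could look like in a test (slot and database names are made up; for a pgoutput slot you would need pg_logical_slot_peek_binary_changes with the publication options instead, since pgoutput emits binary data, and if the slot is actively held you may need to issue this from the connection that holds it, per the note above):

```elixir
{:ok, conn} = Postgrex.start_link(database: "my_app_dev")

# Peek at pending changes without consuming them; NULL/NULL means no LSN or
# row-count limit. Works as-is for a test_decoding slot.
%Postgrex.Result{rows: rows} =
  Postgrex.query!(
    conn,
    "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes($1, NULL, NULL)",
    ["my_test_slot"]
  )

IO.inspect(rows, label: "pending wal changes")
```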

@trashhalo
Author

trashhalo commented Jan 1, 2025

I'm not sure what to do when there is WAL drift. Maybe disconnect?

Drift how exactly?

If the replication slot is temporary, then disconnecting and reconnecting will reset the replication state on Postgres, and the WAL for that slot will be deleted.

If the slot is permanent then you will still have to catch up.

There may be valid reasons for both setups, e.g. if you're just busting caches a temporary slot is probably fine, but if you're replicating a table and you want guarantees then you'll want to be able to catch up.

This makes sense. Two different setups have different error handling needs.

Most of the changes in this PR are just threading error states back to server.ex so we can make decisions on what to do when an event handler errors.

I hit a wall because I'm not actually sure what to do when a handler errors.

In your two examples, I can speculate:

Cache busting:
Probably want to just keep processing messages despite the errors, but hopefully the user has Sentry or something and gets woken up.

Replication:
Not sure. This is the part I'm kind of spinning on. Staying connected, I noticed Postgres does not send the WAL again.

@chasers

chasers commented Jan 6, 2025

Staying connected, I noticed Postgres does not send the WAL again.

Reconnecting to the same replication slot seems like a valid way to do it. Perhaps there is an official way to ask a slot to try again.

keep processing messages despite the errors

Your handler may not have this behavior though, and perhaps it's better to reconnect, or to force the replication slot to be the thing that triggers the error. Unsure if it will start again.

@chasers

chasers commented Jan 6, 2025

ChatGPT says:

The master does not actively resend unacknowledged records in the same session but retains them until the subscriber reconnects or requests them again. This ensures reliable delivery without duplication. Proper configuration of replication slots and WAL retention is essential to prevent data loss for lagging subscribers.

But its knowledge of Postgres details is very suspect, so take it with a grain of salt.

@chasers

chasers commented Jan 6, 2025

The wal_sender_timeout config seems to confirm this.

If you do not ack a WAL record within 60 seconds (the default), the primary will disconnect the subscriber.
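
For reference, a quick way to check the configured value on a given server (60 seconds is only the default; the database name here is hypothetical):

```elixir
{:ok, conn} = Postgrex.start_link(database: "my_app_dev")

# wal_sender_timeout terminates replication connections that stay inactive
# longer than this.
%Postgrex.Result{rows: [[timeout]]} = Postgrex.query!(conn, "SHOW wal_sender_timeout", [])
IO.puts("wal_sender_timeout: #{timeout}")
```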

Successfully merging this pull request may close these issues.

Need a way to ack the wal