feat: improve replication reliability with transaction acknowledgment #74

Draft
wants to merge 2 commits into master

Conversation

trashhalo

Add proper error handling and transaction acknowledgment in the WAL replication system. Changes include:

  • Propagate errors through event processing chain
  • Only acknowledge transactions after successful processing
  • Track and update WAL positions correctly
  • Add debug logging for WAL position updates

This improves system reliability by ensuring consistent transaction processing and maintaining accurate WAL positions.

closes #73

@trashhalo
Author

Spicy decisions

  • handlers must return :ok or {:ok, term}, otherwise it's considered an error scenario (rough sketch below)
  • once a handler returns an error state, the transaction stops; no other handlers are processed for that tx. Partially processed WAL logs seem like a bad idea.
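
A rough sketch of what that contract could look like, purely for illustration (these module and function names are made up, not the library's actual internals):

```elixir
# Illustrative only: treat anything other than :ok or {:ok, term} as an
# error, and stop the remaining handlers for the transaction on the first
# error.
defmodule HandlerContractSketch do
  def normalize(:ok), do: :ok
  def normalize({:ok, _term}), do: :ok
  def normalize({:error, _reason} = error), do: error
  def normalize(other), do: {:error, {:unexpected_return, other}}

  def run_handlers(handlers, event) do
    Enum.reduce_while(handlers, :ok, fn handler, :ok ->
      case normalize(handler.(event)) do
        :ok -> {:cont, :ok}
        {:error, _} = error -> {:halt, error}
      end
    end)
  end
end
```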

I think as written this would be a breaking change. People might have handlers returning stuff like nil, and then it would go boom. If that's a problem I could try to build opt-in configs, or we could just bump the major version.

I still want to add test cases for this behavior before coming out of draft.

@cpursley
Owner

Awesome, much appreciated contribution. I'll spend some time testing it.

@trashhalo
Author

I have a test case locally, but Postgres doesn't seem to behave like I expected. When I send back the old WAL position it's like "okay cool", but it never sends the messages again.

I'm not sure what to do when there is WAL drift. Maybe disconnect? Also, when should I disconnect? If it's on the next keepalive, then more messages could have been processed and now we are in a weird state.

@cpursley
Owner

I'm not sure what to do when there is WAL drift. Maybe disconnect? Also, when should I disconnect? If it's on the next keepalive, then more messages could have been processed and now we are in a weird state.

That's a good question. Any opinion @chasers @DaemonSnake ?

Review comment on the line:
<<?r, state.wal_position::64, state.wal_position::64, state.wal_position::64,
Contributor


This needed to be wal_position + 1, I think.
[Screenshot attached: 2024-12-30 at 16:08:25]
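
For reference, a minimal sketch of the ack with the +1 applied. The field layout follows the streaming replication protocol's Standby Status Update message (the reported positions are "last WAL byte + 1"); the module and function names are made up:

```elixir
defmodule AckSketch do
  # microseconds between the Unix epoch and Postgres' 2000-01-01 epoch
  @pg_epoch_offset 946_684_800_000_000

  # Build a Standby Status Update ('r') acknowledging WAL through `lsn`:
  # written, flushed, and applied positions, a timestamp, and a reply flag.
  def standby_status_update(lsn) when is_integer(lsn) do
    ack = lsn + 1
    <<?r, ack::64, ack::64, ack::64, pg_now()::64, 0>>
  end

  defp pg_now do
    # Postgres expects microseconds since 2000-01-01 00:00:00 UTC
    System.os_time(:microsecond) - @pg_epoch_offset
  end
end
```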

@trashhalo
Author

I'm not sure what to do when there is WAL drift. Maybe disconnect? Also, when should I disconnect? If it's on the next keepalive, then more messages could have been processed and now we are in a weird state.

That's a good question. Any opinion @chasers @DaemonSnake ?

My instinct here, having no experience with Postgres replication 😁, is to immediately disconnect and reconnect on error from middleware.

Pro

  • Doing that gives me the option to specify the WAL location I failed at, allowing me to try again

Con

  • Depending on the error, it could just happen again, putting you in a loop
  • I'm not confident that temporary WAL slots keep old logs around to give you the message again, so the message may not be resent

Other options would be

  • some sort of persisted buffer I maintain in publisher.ex and use for retries
  • don't try to recover, just crash the process (rough sketch below)
  • only allow this behavior on durable slots
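
To make the crash option concrete, a rough sketch of the shape it could take (none of these names come from the library; the point is just to fail with the last acknowledged position so a supervisor restart can re-subscribe from there):

```elixir
defmodule ReconnectSketch do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, %{wal_position: Keyword.get(opts, :restart_lsn, 0)}}

  @impl true
  def handle_info({:wal_event, lsn, event}, state) do
    case run_handlers(event) do
      :ok ->
        {:noreply, %{state | wal_position: lsn}}

      {:error, reason} ->
        # Crash; the supervisor restarts the process, which can then
        # reconnect starting from the last acknowledged wal_position.
        {:stop, {:handler_error, reason, state.wal_position}, state}
    end
  end

  # stand-in for the real event-handler dispatch
  defp run_handlers(_event), do: :ok
end
```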

@DaemonSnake
Contributor

FYI, I tried setting up something similar in the past but abandoned the idea as we found it to be somewhat buggy.
I looked back at our old implementation and we also forgot the +1, so maybe it's just that.

There are a lot of unknowns for me on the impact of blocking Postgres replication and how, in turn, it treats the replica.

- ack the LSN for the end of the transaction, not the beginning
- adding a test case (currently failing)
@chasers

chasers commented Dec 31, 2024

I'm not sure what to do when there is WAL drift. Maybe disconnect?

Drift how exactly?

If the replication slot is temporary, then disconnecting and reconnecting will reset the replication state on Postgres, and the WAL for that slot will be deleted.

If the slot is permanent then you will still have to catch up.

There may be valid reasons for both setups, e.g. if you're just busting caches a temporary slot is probably fine, but if you're replicating a table and you want guarantees then you'll want to be able to catch up.
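
For anyone following along, the two setups correspond to temporary vs. permanent logical replication slots. A hedged illustration with made-up slot and database names (in practice the replication connection usually creates its own slot, and a temporary slot created over plain SQL only lives as long as that session):

```elixir
{:ok, conn} = Postgrex.start_link(database: "my_app_dev")

# Temporary slot: state is dropped when the creating session ends, so there
# is nothing to catch up on after a disconnect. Probably fine for cache busting.
Postgrex.query!(
  conn,
  "SELECT pg_create_logical_replication_slot('cache_bust_slot', 'pgoutput', true)",
  []
)

# Permanent slot: Postgres retains WAL until the subscriber acks it, so a
# reconnecting subscriber can catch up.
Postgrex.query!(
  conn,
  "SELECT pg_create_logical_replication_slot('replication_slot', 'pgoutput')",
  []
)
```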

@chasers

chasers commented Dec 31, 2024

JFYI, pg_logical_slot_peek_changes() could be useful for testing.

You have to call it from the replication slot connection.
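
A rough example of what that could look like in a test (slot and database names are made up; for a pgoutput slot you would need pg_logical_slot_peek_binary_changes with the publication options instead, since pgoutput emits binary data, and if the slot is actively held you may need to issue this from the connection that holds it, per the note above):

```elixir
{:ok, conn} = Postgrex.start_link(database: "my_app_dev")

# Peek at pending changes without consuming them; NULL/NULL means no LSN or
# row-count limit. Works as-is for a test_decoding slot.
%Postgrex.Result{rows: rows} =
  Postgrex.query!(
    conn,
    "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes($1, NULL, NULL)",
    ["my_test_slot"]
  )

IO.inspect(rows, label: "pending wal changes")
```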

@trashhalo
Author

trashhalo commented Jan 1, 2025

I'm not sure what to do when there is WAL drift. Maybe disconnect?

Drift how exactly?

If the replication slot is temporary, then disconnecting and reconnecting will reset the replication state on Postgres, and the WAL for that slot will be deleted.

If the slot is permanent then you will still have to catch up.

There may be valid reasons for both setups, e.g. if you're just busting caches a temporary slot is probably fine, but if you're replicating a table and you want guarantees then you'll want to be able to catch up.

This makes sense. Two different setups have different error handling needs.

Most of the changes in this PR are just threading error states back to server.ex so we can make decisions on what to do when an event handler errors.

I hit a wall because I'm not actually sure what to do when a handler errors.

In your two examples, I can speculate:

Cache busting:
Probably want to just keep processing messages despite the errors, but hopefully the user has Sentry or something and gets woken up.

Replication:
Not sure. This is the part I'm kind of spinning on. Staying connected, I noticed Postgres does not send the WAL again.

@chasers

chasers commented Jan 6, 2025

Staying connected, I noticed Postgres does not send the WAL again.

Reconnecting to the same replication slot seems like a valid way to do it. Perhaps there is an official way to ask a slot to try again.

keep processing messages despite the errors

Your handler may not have this behavior though, and perhaps it's better to reconnect, or to force the replication slot to be the thing that triggers the error. Unsure if it will start again.

@chasers

chasers commented Jan 6, 2025

ChatGPT says:

The master does not actively resend unacknowledged records in the same session but retains them until the subscriber reconnects or requests them again. This ensures reliable delivery without duplication. Proper configuration of replication slots and WAL retention is essential to prevent data loss for lagging subscribers.

But its knowledge of Postgres details is very suspect, so take it with a grain of salt.

@chasers

chasers commented Jan 6, 2025

The wal_sender_timeout config seems to confirm this.

If you do not ack a WAL record within 60 seconds (the default), the primary will disconnect the subscriber.
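
For reference, a quick way to check the configured value on a given server (60 seconds is only the default; the database name here is hypothetical):

```elixir
{:ok, conn} = Postgrex.start_link(database: "my_app_dev")

# wal_sender_timeout terminates replication connections that stay inactive
# longer than this.
%Postgrex.Result{rows: [[timeout]]} = Postgrex.query!(conn, "SHOW wal_sender_timeout", [])
IO.puts("wal_sender_timeout: #{timeout}")
```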

Successfully merging this pull request may close these issues.

Need a way to ack the wal