
Call Celery task with transaction.on_commit #73

Closed · 1 commit

Conversation

@Viicos (Contributor) commented Nov 8, 2023:

Fixes #72.

```diff
@@ -254,7 +255,7 @@ def enqueue(self, log_message=None):
         self.mark_as(self.STATUS_QUEUED, log_message)

         try:
-            tasks.send_email.delay(self.pk)
+            transaction.on_commit(partial(tasks.send_email.delay, message_pk=self.pk))
             return True
         except Exception as e:
```
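The change above defers the task dispatch until the surrounding database transaction commits. A minimal stand-alone sketch of that deferral semantics, where `FakeTransaction` is a hypothetical stand-in for Django's real transaction machinery (not django-yubin code):

```python
from functools import partial

# FakeTransaction is a hypothetical stand-in for Django's transaction
# machinery, just to show the on_commit deferral semantics.
class FakeTransaction:
    def __init__(self):
        self._callbacks = []

    def on_commit(self, func):
        # Remember the callback instead of running it now.
        self._callbacks.append(func)

    def commit(self):
        # Run deferred callbacks only once the transaction commits.
        for callback in self._callbacks:
            callback()

sent = []

def send_email_delay(message_pk):
    # Stand-in for tasks.send_email.delay.
    sent.append(message_pk)

tx = FakeTransaction()
tx.on_commit(partial(send_email_delay, message_pk=42))
assert sent == []    # nothing dispatched before commit
tx.commit()
assert sent == [42]  # dispatched only after the commit
```

This is why the race condition disappears: the worker can only pick up the task once the message row is guaranteed to be visible in the database.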
@Viicos (Author) replied:

I'm not sure having an `except` block makes sense here now that we have `transaction.on_commit`. Also, `send_email` calls `send_db_message`, which is already in an atomic block.

Contributor replied:

@sastred I've discussed this with my colleague a bit: the entire flow of message queuing and status updates is built on the assumption that `send_email` runs in an autocommit block, but this is really a bad assumption.

Especially when sending confirmation emails (e.g. for a webshop), you absolutely do not want to send out a confirmation email if the actual order placement fails for whatever reason. Typically you protect against this by marking the calling code with `@transaction.atomic`. In the current version this crashes in the Celery task because the email message doesn't exist yet, which clutters our (Sentry) error monitoring with errors that aren't actually errors but intended behaviour, hurting the signal-to-noise ratio of error monitoring.

It would probably make more sense to break up the individual steps in a celery chain:

  1. persist the message in the database (could be part of your normal @transaction.atomic)
  2. async mark the message as enqueued (to get logging information even if something goes wrong)
  3. async send the message & log any results of that

Obviously all of this is out of scope for this bugfix, but it'd be good to take a critical look at this implementation.
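The three-step split described above could be sketched as follows. This is a pure-stdlib approximation: real code would use `celery.chain`, and the task names here are hypothetical, not django-yubin's actual API:

```python
# Pure-stdlib approximation of the three-step split; real code would use
# celery.chain, and these task names are illustrative, not django-yubin's API.
log = []

def persist_message(message_pk):
    # Step 1: synchronous, inside the caller's @transaction.atomic.
    log.append(("persisted", message_pk))
    return message_pk

def mark_enqueued(message_pk):
    # Step 2: async in real life; records that the message hit the queue.
    log.append(("enqueued", message_pk))
    return message_pk

def send_message(message_pk):
    # Step 3: async in real life; sends and logs the result.
    log.append(("sent", message_pk))
    return message_pk

# celery.chain(mark_enqueued.s(pk), send_message.s()) approximated in-process:
pk = persist_message(42)
for step in (mark_enqueued, send_message):
    pk = step(pk)

assert [event for event, _ in log] == ["persisted", "enqueued", "sent"]
```

Each step passes the primary key forward, so a failure in any link leaves an audit trail of how far the message got.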

@sastred (Member) replied:

Hi @Viicos and @sergei-maertens

Thanks for the PR. I agree with the solution, but it breaks 2 tests: https://github.com/APSL/django-yubin/actions/runs/6796969337/job/18927395933

[screenshot: failing test output]

Could you fix them? You will probably need to patch `transaction.on_commit` instead of `django_yubin.tasks.send_email.delay`.
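Patching `transaction.on_commit` so the deferred callback runs immediately might look like the sketch below. It uses hypothetical stand-ins for the real `django.db.transaction` module and `django_yubin.tasks.send_email.delay` so it runs standalone:

```python
from functools import partial
from unittest import mock

# Hypothetical stand-in so the patching pattern runs standalone; the real
# object would be django.db.transaction.
class transaction:
    @staticmethod
    def on_commit(func):
        raise RuntimeError("real on_commit needs an open Django connection")

sent = []

def send_email_delay(message_pk):
    # Stand-in for django_yubin.tasks.send_email.delay.
    sent.append(message_pk)

def enqueue(message_pk):
    # Code under test: defers the task until commit.
    transaction.on_commit(partial(send_email_delay, message_pk=message_pk))

# In a test, patch on_commit to execute callbacks immediately:
with mock.patch.object(transaction, "on_commit", side_effect=lambda f: f()):
    enqueue(7)

assert sent == [7]  # the patched on_commit ran the callback right away
```

In a real Django test suite the target string would be the module path where `on_commit` is looked up, e.g. `mock.patch("django_yubin.models.transaction.on_commit", ...)`, depending on how it is imported.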

Contributor replied:

Wouldn't this be resolved with a `with self.captureOnCommitCallbacks(execute=True):` block in the tests?

@sastred (Member) replied:

I have tried to fix the broken tests, but it's not enough; more changes are needed in the code. I have thought again about your comment, and I am not sure I understand your problem sending emails 100%. It looks like you are trying to process orders and send emails in the same database transaction. If that's the case, why not finish the order transaction first and, if it succeeds, send the emails in a second step?

@sergei-maertens (Contributor) replied Nov 24, 2023:

> It looks like you are trying to process orders and send emails in the same database transaction.

Not necessarily. If the database itself is slow and the Celery job is picked up by a worker before the transaction is committed, the email-sending task fails, and it does so silently. Whether this happens may be well outside our own control or simply unknown, since Django itself has code that runs in atomic transactions, like the built-in password reset views!

There is no retry specified on the task definitions either, so this does not self-heal.

Additionally, we do not want to commit a transaction that, for unknown reasons, crashes during email scheduling, so moving every `send_email` call into a `transaction.on_commit` callback is not desired either. We also require the email scheduling itself to be part of the transaction for audit-trail reasons: every email-sending call also creates audit trail records to prove and pinpoint for which object an email was scheduled.

I get your perspective of breaking these things out, but unfortunately reality is more complex than that, and the current implementation is simply not guaranteed to be resistant to race conditions, even with autocommit.

One additional thought that arises while typing this out: what if your own code at some point requires `@transaction.atomic`? Say, creating the email record and creating an immediate log record that the mail was scheduled?

@sastred (Member) replied:

Hi @sergei-maertens and @Viicos ,

Sorry for taking so long, I have been quite busy these past weeks.

I have been thinking about the simplest solution. I think it is to send the email `on_commit()`, as you pointed out, and move all the database operations to the Celery task (`engine.send_db_message()`). This way, emails are sent only after all database operations finish successfully, and there shouldn't be race conditions.

I have implemented it in the branch https://github.com/APSL/django-yubin/tree/db_transactions-celery-race_condition and it looks like it works well at small scale (unit tests and manual QA).

Do you have a bigger test scenario, like the one you described before, to test it at a more realistic scale?
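The restructuring described might be sketched like this, with illustrative names and an in-memory stand-in for the database and for `transaction.on_commit` (the real implementation lives in the linked branch):

```python
from functools import partial

# In-memory stand-ins: `db` plays the message table, `on_commit_queue`
# plays Django's transaction.on_commit; names are illustrative.
db = {}
on_commit_queue = []

def enqueue(message_pk):
    # No status writes here any more: just defer the task to commit time.
    on_commit_queue.append(partial(send_email_task, message_pk))

def send_email_task(message_pk):
    # All database operations now happen inside the task, after commit.
    db[message_pk] = "QUEUED"
    # ... actually send the email here ...
    db[message_pk] = "SENT"

def commit():
    for callback in on_commit_queue:
        callback()

enqueue(1)
assert 1 not in db      # no DB writes before commit
commit()
assert db[1] == "SENT"  # task ran after commit and did the DB work
```

Because `enqueue` no longer touches the database itself, the task can never observe a half-committed status row.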

@sergei-maertens (Contributor) commented Dec 22, 2023 via email.

@sastred (Member) commented Jan 9, 2024:

Closing this PR. #72 has been fixed in #74 based on this PR.

Thanks for your help. I have added @Viicos to humans.txt.

New version 2.0.3 is already available on PyPI.

@sastred sastred closed this Jan 9, 2024
@Viicos Viicos deleted the enqueue-on-commit branch May 3, 2024 10:04
Successfully merging this pull request may close: "message enqueuing and sending is susceptible to a race condition" (#72).

3 participants