Retry cert fetching from the top #225

Open

bengerman13 opened this issue Dec 2, 2021 · 2 comments

@bengerman13
Contributor

In order to reduce manual intervention when DNS is flaky, we want to retry failed certificate operations from the start a fixed number of times, keeping that number under half the certificate rate limit. (A rough sketch of the retry check follows the acceptance criteria below.)

Acceptance Criteria

  • GIVEN an update operation
    AND apparently-valid DNS configuration
    WHEN the retrieve certificate step fails
    THEN we should check the number of retries
    AND retry certificate provisioning with a new certificate order
    AND increment the number of retries
  • GIVEN a provision operation
    AND apparently-valid DNS configuration
    WHEN the retrieve certificate step fails
    THEN we should check the number of retries
    AND retry certificate provisioning with a new certificate order
    AND increment the number of retries
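
A minimal sketch of the retry bookkeeping these criteria describe, assuming a hypothetical `Operation` record with a new retry-count column; the `MAX_CERT_RETRIES` value, the field names, and the `restart_pipeline` callback are illustrative, not existing broker code:

```python
from dataclasses import dataclass

# Keep the cap under half the certificate rate limit, per the description above.
MAX_CERT_RETRIES = 2


@dataclass
class Operation:
    """Stand-in for the broker's operation record, with a new retry-count column."""
    cert_retries: int = 0
    state: str = "in progress"


def handle_retrieve_certificate_failure(operation, restart_pipeline):
    """On a failed retrieve-certificate step: check the count, retry, increment."""
    if operation.cert_retries >= MAX_CERT_RETRIES:
        operation.state = "failed"       # out of retries; needs manual intervention
        return
    operation.cert_retries += 1          # increment the number of retries
    restart_pipeline(operation)          # new certificate order, start from the top
```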

Security considerations

No changes

Implementation sketch

This probably requires one or both of:

  • manage the retry loop outside of huey
  • rethink how we break up the cert tasks

Maybe special logic in the failed-task handler, something like the following (a rough sketch appears after this list):

  1. did the task fail on one of the Let's Encrypt steps?
  2. has it failed N or more times? (track this in a new column, most likely, or maybe calculate it based on the number of challenges or orders?)
  3. if so, kick off a copy of the previous provision/update pipeline? or maybe clean up the models from this pipeline and let the restarter pick it up?
  4. set the failure count
    It would be good to do this without doing a bunch of task-level retries first, so we don't burn all the time CAPI/the migrator will wait for upgrades/provisioning.

Once done, we need to make sure that the new total length of time (retries included) still fits within what CAPI/the migrator will wait.
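
A rough sketch of that failed-task-handler idea, using huey's error signal. The `huey` app import path, the `is_lets_encrypt_step` / `operation_for_task` / `requeue_from_start` helpers, and the `cert_retries` column are all assumptions for illustration, not existing broker code:

```python
from huey.signals import SIGNAL_ERROR

from broker.tasks.huey import huey  # assumed location of the huey app instance

MAX_CERT_RETRIES = 2  # keep under half the certificate rate limit


@huey.signal(SIGNAL_ERROR)
def restart_cert_pipeline_on_failure(signal, task, exc=None):
    # 1. did the task fail on one of the Let's Encrypt steps?
    if not is_lets_encrypt_step(task):
        return

    operation = operation_for_task(task)  # hypothetical lookup helper

    # 2. has it failed N or more times? (tracked in a new column)
    if operation.cert_retries >= MAX_CERT_RETRIES:
        return  # fall through to normal failure handling

    # 4. set the failure count before re-queuing
    operation.cert_retries += 1

    # 3. clean up the models from this attempt and kick off the
    #    provision/update pipeline again from the top
    operation.discard_order_and_challenges()
    requeue_from_start(operation)
```

Handling this in a signal handler keeps the retry loop outside of huey's own task retries, which lines up with the first bullet in the implementation sketch and avoids burning the time CAPI/the migrator will wait.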

@markdboyd
Contributor

@bengerman13 Is this ticket still relevant or important for us to address?

@bengerman13
Contributor Author

This is probably still interesting, but I've been out of the loop on how much this drives customer issues lately.
