Release Shepherd Rotation

Ryan Oaks edited this page Aug 14, 2023 · 22 revisions

The Release Shepherd is a 50% rotation responsible for making the Terraform Provider Google (TPG) releases (https://github.com/hashicorp/terraform-provider-google/wiki/Release-Process) and for maintaining the test environments, so that merging contributions and making releases stay as low-friction as possible.

The current schedule can be viewed in PagerDuty.

Once-per-week responsibilities

Monday morning

  • Check the release history for any patch releases and confirm they have been cherrypicked or handled
    • Rarely, changes will already be present in the release branch (e.g. due to early patch releases) or won't make sense to migrate forward. In these cases, confirm explicitly with last week's oncall that they've been handled and that the new release will not regress.
  • Run the release based on the release cut made by last week's shepherd: https://github.com/hashicorp/terraform-provider-google/wiki/Release-Process#on-monday

Tuesday midday

  • Check recently filed bugs for issues coming from the new release and flag them with the oncall. Evaluate whether a patch release is needed as discussed in the incident response policy.
    • If the oncall has not picked up the issue, you should consider attempting to resolve it.

Wednesday midday

Daily responsibilities

  • Check nightly test runs for new issues that would block the next release or the currently cut release, and resolve them
    • Nightly tests* can be found here: GA, Beta
      • NOTE: In preparation for 5.0.0, we test the feature branches for the major release on some days of the week (GA on Thursday, Beta on Friday). This means that the nightly tests against the main branch will not run on those days.
    • For failures that will block cutting the next release, resolve them on main by fixing forward or rolling back changes as appropriate
    • For failures that will block the currently cut release from going out, evaluate cherrypicking them
  • For failures in the nightly test results, resolve them (for services without service/ labels) or label them (for services with service/ labels)
    • If you're unable to resolve an issue, file a test-failure issue.
  • Check 2-3 recent PRs for unrelated or recurrent VCR failures and resolve them, filing a test-failure issue if you are unable to.
  • If other responsibilities have been addressed, find old bug or persistent-bug issues and resolve them.
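
The weekday note in the list above (feature branches tested on GA on Thursday and Beta on Friday) can be sketched as a small lookup. This is an illustrative sketch only: the function name `nightly_target`, the project keys, and the returned labels are hypothetical and do not correspond to any actual TeamCity configuration.

```python
import datetime

# Hypothetical mapping taken from the note above: during 5.0.0 preparation,
# the GA nightly tests the major-release feature branch on Thursdays and the
# Beta nightly does so on Fridays. Numbers follow date.weekday() (Monday == 0).
FEATURE_BRANCH_DAY = {"GA": 3, "Beta": 4}


def nightly_target(project: str, day: datetime.date) -> str:
    """Return which branch a project's nightly run is assumed to test on `day`."""
    if project not in FEATURE_BRANCH_DAY:
        raise ValueError(f"unknown project: {project!r}")
    if day.weekday() == FEATURE_BRANCH_DAY[project]:
        return "feature branch"
    return "main"
```

Under these assumptions, a missing main-branch run for GA on a Thursday or for Beta on a Friday is expected and does not by itself indicate a problem.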

NOTE: The history for a given test will be available in the old TC projects (GA, Beta) until October. By then, new test history data will have accumulated in the new TeamCity projects and the old projects' data will be outdated.

FAQ

How long should I spend on this rotation?

As a 50% rotation, this should take around half of your time-at-desk. If you'll be unable to spend at least 8 hours on the responsibilities listed above, consider trading shifts with someone who will be able to. Conversely, if you're required to spend 16+ hours working on these responsibilities (outside of exceptional events like weeks where multiple patch releases are required), flag that with the team so that we can bring the time commitment back within expectations.
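
As a rough illustration, the two thresholds above can be written as a simple check. The function name, parameter names, and return strings are hypothetical; only the 8-hour and 16-hour figures and the exception for unusual weeks come from the text.

```python
MIN_HOURS = 8   # below this, consider trading shifts
MAX_HOURS = 16  # at or above this, flag with the team (outside exceptional weeks)


def assess_shepherd_hours(hours: float, exceptional_week: bool = False) -> str:
    """Classify a week's expected or actual shepherd hours against the guidance."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    if hours < MIN_HOURS:
        return "consider trading shifts"
    if hours >= MAX_HOURS and not exceptional_week:
        return "flag with the team"
    return "within expectations"
```

For example, `assess_shepherd_hours(18, exceptional_week=True)` returns `"within expectations"`, since a week with multiple patch releases is treated as an exceptional event.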

Who should handle patch releases for GCP outages?

Patch releases for current GCP outages are handled by the Google Oncall as defined in the incident response policy. However, if they determine that additional help is required, they may enlist the release shepherd to drive the patch.

On the other hand, cherrypicks are generally handled by the release shepherd, so that the shepherd remains the owner of the weekly minor release branch:

  • If the oncall determines that a change doesn't need a patch but we will want to cherrypick it (for example, if there's a major outage on a Friday), the release shepherd will own cherrypicking it.
  • If the oncall makes a patch release, they'll work with the release shepherd to ensure that it's included in the next minor release.

Should I block the next minor release on an upcoming patch release?

Regressions and incidents should not generally block the next minor release. There are two exceptions. If rolling out a new release without resolving a regression introduced in the last release would break additional users, we should cancel the release. Regressions that break the entire provider (such as provider initialization issues that impact many users) will likely prompt a freeze until they are resolved ASAP through a patch. In case of ambiguity, discuss with the oncall and mutually agree on a resolution plan, then communicate that plan in chat. If the oncall is unavailable, substitute a TL.

Note: In general, patch releases cut after midday Thursday are rare; fixes needed that late should be cherrypicked into the minor release rather than shipped as a patch.
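
The guidance in this answer can be summarized as a small decision helper. This is a sketch under stated assumptions: the `Action` names, the function signature, and the order in which the conditions are checked are hypothetical, and real decisions should still be agreed with the oncall.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed with the minor release"
    CANCEL = "cancel the release"
    FREEZE = "freeze until a patch resolves the regression"
    DISCUSS = "agree on a plan with the oncall (or a TL) and share it in chat"


def minor_release_action(
    breaks_entire_provider: bool,
    breaks_additional_users: bool,
    ambiguous: bool = False,
) -> Action:
    """Suggest how to treat the next minor release given a known regression."""
    # Assumed ordering: ambiguity is escalated before any automatic decision.
    if ambiguous:
        return Action.DISCUSS
    if breaks_entire_provider:
        return Action.FREEZE
    if breaks_additional_users:
        return Action.CANCEL
    # Default from the text: regressions do not generally block the release.
    return Action.PROCEED
```

With no blocking condition set, `minor_release_action(False, False)` returns `Action.PROCEED`, matching the default that regressions and incidents should not generally block the next minor release.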
