
RDS minor version upgrades fail when upgrading primary and replica together #22107

Open
steveteahan opened this issue Dec 8, 2021 · 6 comments
Labels
service/rds Issues and PRs that pertain to the rds service. upstream-terraform Addresses functionality related to the Terraform core binary.

Comments

@steveteahan

steveteahan commented Dec 8, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

$ terraform -v
Terraform v1.0.11
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v3.68.0

Affected Resource(s)

  • aws_db_instance

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

provider "aws" {
  profile = "lab"
  region  = "us-east-2"
}

variable "postgres_engine_version" {
  type    = string
  default = "13.2"
}

resource "aws_db_instance" "dev" {
  allocated_storage       = 10
  engine                  = "postgres"
  engine_version          = var.postgres_engine_version
  identifier              = "dev"
  instance_class          = "db.t3.micro"
  username                = "postgres"
  password                = "password"
  skip_final_snapshot     = true
  backup_retention_period = 1
  apply_immediately       = true
}

resource "aws_db_instance" "dev-replica" {
  allocated_storage   = 10
  engine              = "postgres"
  engine_version      = var.postgres_engine_version
  identifier          = "dev-replica"
  instance_class      = "db.t3.micro"
  replicate_source_db = "dev"
  apply_immediately   = true
}

Debug Output

Panic Output

Expected Behavior

The replica should be upgraded first and then the primary should be upgraded, without a failure.

Actual Behavior

The replica is upgraded first, but the primary is never upgraded because a DBUpgradeDependencyFailure error is thrown. A second apply must be executed to complete the upgrade.

It is worth mentioning that, as a possibly related issue, during the creation of the resources I also saw:

╷
│ Error: Error creating DB Instance: DBInstanceNotFound: The source instance could not be found: dev
│ 	status code: 404, request id: 8ba0b8a6-f7dd-42c5-b611-e2e93bb4597f
│ 
│   with aws_db_instance.dev-replica,
│   on main.tf line 24, in resource "aws_db_instance" "dev-replica":
│   24: resource "aws_db_instance" "dev-replica" {
│ 
╵

I don't want to start down the path of considering separate issues in a single bug report, but I do wonder if this is related to the same root cause. Perhaps this is a matter of eventual consistency from the AWS API and the provider could wait longer, or retry after a short period of time to see if there is a different answer?

Steps to Reproduce

  1. terraform apply
  2. terraform apply -var postgres_engine_version=13.3

Important Factoids

Nothing atypical for this account.

References

@github-actions github-actions bot added needs-triage Waiting for first response or review from a maintainer. service/rds Issues and PRs that pertain to the rds service. labels Dec 8, 2021
@ewbankkit
Contributor

ewbankkit commented Dec 8, 2021

Relates: #20514.
Relates: hashicorp/terraform#4149.

@ewbankkit ewbankkit added waiting-response Maintainers are waiting on response from community or contributor. and removed needs-triage Waiting for first response or review from a maintainer. labels Dec 8, 2021
@ewbankkit
Contributor

@steveteahan Thanks for raising this issue 👏 .
This is an interesting one in how your workflow

  • Create Primary, create Replica - so Replica depends on Primary creation
  • Update Replica version, update Primary version - so Primary depends on Replica update

interacts with the Terraform dependency graph.
In configuration, the correct way to capture the resource create/delete dependency is:

resource "aws_db_instance" "dev" {
  ...
}

resource "aws_db_instance" "dev-replica" {
  ...
  replicate_source_db = aws_db_instance.dev.id
}

=> dev created before dev-replica, dev-replica deleted before dev

Updating dev's engine_version requires dev-replica to be updated first, but a plain terraform apply will update dev first.

You could try using Terraform resource targeting (see the example below), although that is not recommended as a long-term solution.
Alternatively you could split dev and dev-replica into separate modules and apply the update to dev-replica's module first. This is not ideal because, as discussed above, there IS a dependency between the two that you would like to capture in code.
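For example, the targeted approach would look something like this, using the resource names and version variable from the reproduction config above:

terraform apply -target=aws_db_instance.dev-replica -var postgres_engine_version=13.3
terraform apply -var postgres_engine_version=13.3

The first, targeted apply upgrades only the replica; the second, full apply then upgrades the primary.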

This workflow does not have a simple solution in Terraform today. See hashicorp/terraform#4149 for a discussion of the thinking on additional configuration change modes.

@ewbankkit ewbankkit added upstream-terraform Addresses functionality related to the Terraform core binary. and removed waiting-response Maintainers are waiting on response from community or contributor. labels Dec 8, 2021
@steveteahan
Author

Thank you for the quick response, @ewbankkit! I do see now that I should have used replicate_source_db = aws_db_instance.dev.id, but it sounds like this change isn't expected to fix this scenario (at least the upgrade).

I believed from the logs that Terraform was applying the upgrade to the replica first, but on a closer look I see that I had missed the modify call for the primary. Thank you for the detailed explanation and the additional resources. I'll take a more careful look through everything and decide on a path forward.
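For anyone who finds this later, the corrected replica block from my reproduction config would look something like:

resource "aws_db_instance" "dev-replica" {
  allocated_storage   = 10
  engine              = "postgres"
  engine_version      = var.postgres_engine_version
  identifier          = "dev-replica"
  instance_class      = "db.t3.micro"
  replicate_source_db = aws_db_instance.dev.id
  apply_immediately   = true
}

As discussed above, this captures the create/delete ordering correctly but doesn't change the update ordering, so the engine upgrade still needs the targeted or two-apply approach.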

@SushanSuresh

SushanSuresh commented Nov 10, 2022

+1, facing the same issue while trying to update the replica:

  1. Can't specify the engine version for the replica
  2. Can't upgrade the master without applying the change to the replica first

@take-five

We have implemented a workaround using a null_resource with a local-exec provisioner. The idea is to run a script that updates the replicas' engine version before the primary's.

# Get current region and IAM role first
data "aws_region" "current" {}

data "aws_caller_identity" "current" {}

data "aws_iam_session_context" "current" {
  arn = data.aws_caller_identity.current.arn
}

locals {
  db_instance_id   = "xyz"
  postgres_version = "14.4"
}

# RDS requires that read replicas engine versions must be updated
# before updating primary DB instance engine version. Since it's not really possible
# with vanilla Terraform (DB replicas depend on the primary, so the primary
# is always changed first), we implement a "hack":
# - This null_resource is "created/updated" first and will update
#   read replicas engine version
# - Primary DB instance resource depends on this null resource to make sure
#   "update-db-replicas.py" script runs before that.
resource "null_resource" "update_replicas_before_primary" {
  triggers = {
    engine_version = local.postgres_version
  }

  provisioner "local-exec" {
    command = "${path.module}/update-db-replicas.py"

    environment = {
      DB_INSTANCE_ID = local.db_instance_id
      ENGINE_VERSION = local.postgres_version
      AWS_ROLE_ARN   = data.aws_iam_session_context.current.issuer_arn
      AWS_REGION     = data.aws_region.current.name
    }
  }
}

resource "aws_db_instance" "primary" {
  identifier = local.db_instance_id

  engine         = "postgres"
  engine_version = null_resource.update_replicas_before_primary.triggers.engine_version

  # ...snip...
}

And here's the update-db-replicas.py script:

#!/usr/bin/env python3

"""
This script update engine version for read replicas of a one particular DB instance.
It's a workaround for Terraform AWS provider behavior which doesn't allow updating
read replicas engine versions separately since version 4.

The script is supposed to be run only by Terraform and strictly before updating engine version
of the primary DB instance.
"""

import boto3
import os
import random
import sys  # needed for sys.exit() below when the source instance is missing
import logging

logging.basicConfig(level=logging.INFO,
                    format='[%(asctime)s] %(levelname)s %(message)s',
                    handlers=[logging.StreamHandler()])

logger = logging.getLogger()

db_instance_id = os.environ['DB_INSTANCE_ID']
role_arn = os.environ['AWS_ROLE_ARN']
engine_version = os.environ['ENGINE_VERSION']
region = os.environ['AWS_REGION']

# Assume IAM role
sts_client = boto3.client('sts', region_name=region)
assumed_role = sts_client.assume_role(
  RoleArn=role_arn,
  RoleSessionName=f"terraform-update-db-replicas-{random.randint(1, 10000)}"
)
credentials = assumed_role['Credentials']

# Find the source DB instance
rds = boto3.client(
    'rds',
    region_name=region,
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken']
)

response = rds.describe_db_instances(Filters=[{'Name': 'db-instance-id', 'Values': [db_instance_id]}])

if len(response['DBInstances']) == 0:
    # If there is no source instance, we're probably running the script before the DB instance is created.
    logging.warning("Source DB instance %s not found, ignoring", db_instance_id)
    sys.exit(0)

db_instance = response['DBInstances'][0]

for replica_id in db_instance['ReadReplicaDBInstanceIdentifiers']:
    logging.info("Updating DB replica %s engine version to %s" % (replica_id, engine_version))

    waiter = rds.get_waiter('db_instance_available')
    waiter.wait(DBInstanceIdentifier=replica_id)

    rds.modify_db_instance(
        DBInstanceIdentifier=replica_id,
        EngineVersion=engine_version,
        ApplyImmediately=False
    )

logging.info("Done")

It's probably not the cleanest solution (e.g. the script only works with IAM roles, not with static AWS credentials), but it works for us.

@rlee-arx

This issue is still open, though I see it has been discussed elsewhere (#24887); it seems the resolution to that ticket was to re-enable versioning of a read replica. Is the accepted way to handle this the "two applies" method?
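If so, that would amount to re-running the same apply after the first run fails on the primary with DBUpgradeDependencyFailure, something like (using the variable from the original reproduction config):

terraform apply -var postgres_engine_version=13.3   # upgrades the replica, then fails on the primary
terraform apply -var postgres_engine_version=13.3   # second run upgrades the primary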
