Fleeting implementation with spot instances failing when instance is reclaimed #1217

nestorFigliuolo · 2024-12-17T11:26:52Z

Describe the bug

We are trying to use the fleeting plugin with the docker-autoscaler runner worker, multiple instance types, and a multi-AZ deployment. Sometimes, when AWS reclaims an instance and we restart the job, the job tries to connect to the terminated instance and does not try any other. The only workaround we have found is to re-trigger the entire pipeline.

To Reproduce

This is the configuration we were using (we had to revert it):

module "runner-spot-xlarge" {
  source      = "cattle-ops/gitlab-runner/aws"
  version     = "8.1.0"
  environment = "xlarge"

  vpc_id    = module.vpc.vpc_id
  subnet_id = element(module.vpc.public_subnets, 0)

  runner_instance = {
    name                        = "${var.runner_name}-xlarge"
    monitoring                  = true
    collect_autoscaling_metrics = ["GroupDesiredCapacity", "GroupInServiceInstances"]
    private_address_only        = false 
    type                        = "t3.small"
  }

  runner_install = {
    post_install_script = local.post-user-data
  }

  runner_networking = {
    security_groups_ids = [module.vpc.default_security_group_id]
    subnet_ids          = module.vpc.public_subnets
  }

  runner_manager = {
    maximum_concurrent_jobs = 50
  }

  runner_cloudwatch = {
    log_group_name = "gitlab_runners_xlarge"
  }
  
  runner_gitlab = {
    runner_version                                = "17.4.0"
    url                                           = "https://gitlab.com"
    preregistered_runner_token_ssm_parameter_name = "xlarge-runner-registration-token"
  }

  runner_gitlab_registration_config = {
    tag_list           = "xlarge"
    description        = "Gitlab Docker Spot Runner"
    locked_to_project  = "false"
    run_untagged       = "false"
    maximum_timeout    = "3600"
  }

  runner_worker = {
    environment_variables = ["REGISTRY_URL=docker.io/"]
    output_limit          = 409600
    type                  = "docker-autoscaler"
    ssm_access            = true
  }

  runner_worker_docker_autoscaler = {
    fleeting_plugin_version = "1.0.0"
    max_use_count           = 50
  }

  runner_worker_docker_autoscaler_ami_owners = ["amazon"]
  runner_worker_docker_autoscaler_ami_filter = {
    name = ["al2023-ami-ecs-hvm-*-kernel-6.1-x86_64"]
  }

  runner_worker_docker_autoscaler_instance = {
    root_size            = 16
    monitoring           = true
    private_address_only = false
    start_script         = file("scripts/worker_start_script.sh")
  }

  runner_worker_docker_autoscaler_asg = {
    subnet_ids                               = module.vpc.public_subnets
    types                                    = ["m4.xlarge", "m5.xlarge", "m5a.xlarge"]
    enable_mixed_instances_policy            = true
    on_demand_percentage_above_base_capacity = 0
    spot_instance_pools                      = 2 
  }

  runner_worker_docker_autoscaler_autoscaling_options = [
    {
      periods      = ["* * * * *"]
      idle_count   = 0
      idle_time    = "30m"
      scale_factor = 1
    }
  ]

  runner_worker_docker_options = {
    privileged = true,
    volumes    = ["/cache", "/certs/client", "/var/run/docker.sock:/var/run/docker.sock"]
  }

  tags = {
    "tf-aws-gitlab-runner:example"           = "runner-default"
    "tf-aws-gitlab-runner:instancelifecycle" = "spot:yes"
  }

}

Expected behavior

We expect the job to be reassigned to another spot instance when the one it tries to use has been reclaimed by AWS.

abeluck · 2024-12-20T13:40:45Z

I believe this is an upstream bug: https://gitlab.com/gitlab-org/fleeting/plugins/aws/-/issues/52

Also a duplicate of #1200.
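
Until the upstream issue is resolved, one possible stopgap is to take spot reclamation out of the picture by running the workers on-demand. This is an assumption, not something confirmed in this thread or upstream. The sketch below reuses only the `runner_worker_docker_autoscaler_asg` keys from the configuration in the report and changes the on-demand percentage from 0 to 100. It has not been tested against module version 8.1.0.

```hcl
# Hypothetical stopgap (untested, not confirmed in this thread):
# run all docker-autoscaler workers on-demand so AWS cannot reclaim them.
runner_worker_docker_autoscaler_asg = {
  subnet_ids                    = module.vpc.public_subnets
  types                         = ["m4.xlarge", "m5.xlarge", "m5a.xlarge"]
  enable_mixed_instances_policy = true

  # Was 0 (all spot). 100 means every instance above the base
  # capacity is launched on-demand.
  on_demand_percentage_above_base_capacity = 100

  # Kept from the original report; it has no effect while no spot
  # instances are requested.
  spot_instance_pools = 2
}
```

This trades the spot discount for stability, so it probably only makes sense as a temporary measure. The `instancelifecycle` tag in the original configuration would also no longer describe the workers accurately.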

nestorFigliuolo · 2024-12-20T14:40:28Z

I see, thanks for the info!

kayman-mk · 2025-01-16T08:48:25Z

Closed as duplicate.

kayman-mk closed this as completed Jan 16, 2025
