Fleeting implementation with spot instances failing when instance is reclaimed #1217

Closed
nestorFigliuolo opened this issue Dec 17, 2024 · 3 comments

@nestorFigliuolo

Describe the bug

We are trying to use the fleeting plugin with the docker-autoscaler runner worker, multiple instance types, and a multi-AZ deployment. Sometimes, when AWS reclaims a spot instance and we restart the affected job, the job tries to connect to the terminated instance instead of trying another one. The only workaround we have found is to re-trigger the entire pipeline.

To Reproduce

This is the configuration we were using (we had to revert):

module "runner-spot-xlarge" {
  source      = "cattle-ops/gitlab-runner/aws"
  version     = "8.1.0"
  environment = "xlarge"

  vpc_id    = module.vpc.vpc_id
  subnet_id = element(module.vpc.public_subnets, 0)

  runner_instance = {
    name                        = "${var.runner_name}-xlarge"
    monitoring                  = true
    collect_autoscaling_metrics = ["GroupDesiredCapacity", "GroupInServiceInstances"]
    private_address_only        = false 
    type                        = "t3.small"
  }

  runner_install = {
    post_install_script = local.post-user-data
  }

  runner_networking = {
    security_groups_ids = [module.vpc.default_security_group_id]
    subnet_ids          = module.vpc.public_subnets
  }

  runner_manager = {
    maximum_concurrent_jobs = 50
  }

  runner_cloudwatch = {
    log_group_name = "gitlab_runners_xlarge"
  }
  
  runner_gitlab = {
    runner_version                                = "17.4.0"
    url                                           = "https://gitlab.com"
    preregistered_runner_token_ssm_parameter_name = "xlarge-runner-registration-token"
  }

  runner_gitlab_registration_config = {
    tag_list           = "xlarge"
    description        = "Gitlab Docker Spot Runner"
    locked_to_project  = "false"
    run_untagged       = "false"
    maximum_timeout    = "3600"
  }

  runner_worker = {
    environment_variables = ["REGISTRY_URL=docker.io/"]
    output_limit          = 409600
    type                  = "docker-autoscaler"
    ssm_access            = true
  }

  runner_worker_docker_autoscaler = {
    fleeting_plugin_version = "1.0.0"
    max_use_count           = 50
  }

  runner_worker_docker_autoscaler_ami_owners = ["amazon"]
  runner_worker_docker_autoscaler_ami_filter = {
    name = ["al2023-ami-ecs-hvm-*-kernel-6.1-x86_64"]
  }

  runner_worker_docker_autoscaler_instance = {
    root_size            = 16
    monitoring           = true
    private_address_only = false
    start_script         = file("scripts/worker_start_script.sh")
  }

  runner_worker_docker_autoscaler_asg = {
    subnet_ids                               = module.vpc.public_subnets
    types                                    = ["m4.xlarge", "m5.xlarge", "m5a.xlarge"]
    enable_mixed_instances_policy            = true
    on_demand_percentage_above_base_capacity = 0
    spot_instance_pools                      = 2 
  }

  runner_worker_docker_autoscaler_autoscaling_options = [
    {
      periods      = ["* * * * *"]
      idle_count   = 0
      idle_time    = "30m"
      scale_factor = 1
    }
  ]

  runner_worker_docker_options = {
    privileged = true,
    volumes    = ["/cache", "/certs/client", "/var/run/docker.sock:/var/run/docker.sock"]
  }

  tags = {
    "tf-aws-gitlab-runner:example"           = "runner-default"
    "tf-aws-gitlab-runner:instancelifecycle" = "spot:yes"
  }

}

Expected behavior

We expect a restarted job to be assigned to another spot instance if the one it tries to use has been reclaimed by AWS.

@abeluck

abeluck commented Dec 20, 2024

I believe this is an upstream bug: https://gitlab.com/gitlab-org/fleeting/plugins/aws/-/issues/52

This is also a duplicate of #1200.

@nestorFigliuolo
Author

I see, thanks for the info!

@kayman-mk
Collaborator

Closed as duplicate.
