Deployment updating hangs when containers fail to start / become ready #28

ajwootto opened this issue Jun 25, 2018 · 5 comments

This may be an intentional choice, but currently, if a deployment is updated through Terraform and the updated pods fail to start for any reason (image pull failure, crash on startup, failing health checks, etc.), the Terraform process just hangs waiting for the pods to become ready, which will never happen. It eventually times out, but until then it is seemingly waiting around for no reason. In my opinion it makes more sense for the provider to either report a successful operation once the deployment definition has been updated and let Kubernetes worry about the details, or fail earlier if an error like the above is detected in any of the updated containers.
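
For example, a config roughly like this (resource names and the broken image tag are purely illustrative, and the exact schema depends on the provider version) will sit in terraform apply until the provider's timeout lapses, because the pods can never become ready:

    resource "kubernetes_deployment" "example" {
      metadata {
        name = "example"
      }

      spec {
        replicas = 2

        template {
          metadata {
            labels {
              app = "example"
            }
          }

          spec {
            container {
              name = "app"

              # Non-existent tag: every pod ends up in ImagePullBackOff and
              # never becomes ready, so apply just waits.
              image = "nginx:this-tag-does-not-exist"
            }
          }
        }
      }
    }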

stigok commented Jun 26, 2018

The way it works with StatefulSet is in fact that the operation is deemed successful when the resource definition has been created/updated, and Terraform doesn't care about its children's state. I feel this is what we want here too. A Deployment creates a child ReplicaSet, and I don't think Terraform even needs to care about its state.

That said, do I feel the same way about creating a pod with e.g. a volume claim that fails and hangs forever? I'm not sure.
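
To make that case concrete with a made-up example: a claim that references a storage class the cluster doesn't have stays Pending forever, and anything waiting on it hangs along with it. Roughly (attribute names follow the upstream provider docs; this fork's schema may differ):

    resource "kubernetes_persistent_volume_claim" "example" {
      metadata {
        name = "data"
      }

      spec {
        access_modes = ["ReadWriteOnce"]

        # No provisioner will ever satisfy this class, so the claim stays
        # Pending indefinitely and any pod mounting it never starts.
        storage_class_name = "does-not-exist"

        resources {
          requests {
            storage = "1Gi"
          }
        }
      }
    }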

stigok commented Jun 26, 2018

I see now that StatefulSet creation actually waits until all the pods in the StatefulSet have been initiated. That means that if the StatefulSet's replica count is set to 3, it will wait until the StatefulSet contains 3 pods, but it does not care what state those pods are in.

sl1pm4t commented Jun 26, 2018

Thanks for the discussion on this point. I've recently been pondering this issue myself, and I tend to agree that the current behaviour isn't helpful.

While the hang is a useful indicator that the config is wrong and something needs to be fixed, the tool I would use at that point to fix the config is Terraform itself, which is now hung waiting for the timeout to lapse.

I could see a couple of possible changes that may help:

  1. add a flag to each relevant resource schema to indicate whether it should wait for resources to schedule
  2. drop the default timeout to a much lower value and / or make the timeout configurable

Option 2 with a low default timeout (e.g. 30s) might be the most useful middle ground. That way the operator gets a warning that the resource may not be healthy, but Terraform isn't blocked.
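
To sketch option 1 from the config side, it could be a per-resource attribute along these lines (the attribute name is hypothetical; nothing like it exists in the provider today):

    resource "kubernetes_deployment" "example" {
      metadata {
        name = "example"
      }

      # spec omitted; same shape as any other deployment

      # Hypothetical attribute (does not exist): when false, apply returns as
      # soon as the API server accepts the definition instead of waiting for
      # the rollout to finish.
      wait_for_rollout = false
    }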

stigok commented Jun 27, 2018

I like option 2 as well, but lowering the default might be problematic. I have had successful deployments that take a long time. This is related to the same issue:

  1. The statefulset is waiting for the replicaset
  2. The replicaset is waiting for the pods
  3. The pods are erroring about unbound volumes
  4. The persistent storage can take minutes to fulfill the claims (Azure)
  5. The pods' retry back-off looks to be something like 1, 5, 30, 120 seconds. 1 + 5 + 30 = 36 seconds, so the next retry is two minutes out. By then the volume claims are satisfied, but we are still waiting for the pods to come out of their back-off state.
  6. If there is more than one replica, steps 3 to 5 repeat for each one.
  7. Eventually Terraform is happy and the state is successfully saved

If Terraform needs the completed state of the child resources of e.g. Deployments and StatefulSets it should wait. If not, it should be satisfied when the cluster accepts the resource definition.

I think something like terraform apply --timeout 10 would help a lot, but it would also mean we lose the state of the resource when the apply is cancelled, right?
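
For what it's worth, the closest existing mechanism I can think of is Terraform's per-resource timeouts block rather than a CLI flag, though whether this provider actually registers configurable timeouts for these resources is an assumption on my part:

    resource "kubernetes_deployment" "example" {
      metadata {
        name = "example"
      }

      # spec omitted for brevity

      # Standard Terraform resource timeouts block; it only takes effect if
      # the provider declares these operations as having configurable
      # timeouts, which is assumed here.
      timeouts {
        create = "2m"
        update = "2m"
      }
    }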

ajwootto commented Jun 27, 2018

Is it generally a good idea for Terraform to wait for Kubernetes to satisfy all its resources, or should its only job be to make sure that the Kubernetes resource is initially created on the cluster? Whether or not that resource is actually fully satisfied by Kubernetes might not be Terraform's concern, since it's now up to Kubernetes to handle the resource correctly. This does mean you lose any error feedback from resources that might be incorrectly configured though.

If there's a good reason for the provider to monitor Kubernetes' progress towards fulfilling a resource, I'd love to hear it.
