Deployment updating hangs when containers fail to start / become ready #28

ajwootto opened this issue Jun 25, 2018 · 5 comments

This may be an intentional choice, but currently, if a deployment is updated through Terraform and the updated pods fail to start for any reason (image pull failure, crash on startup, failing health checks, etc.), the Terraform process just hangs waiting for the pods to become ready, which will never happen. It eventually times out, but until then it is seemingly waiting around for no reason. In my opinion it makes more sense for the provider to either report a successful operation once the deployment definition has been updated and let Kubernetes worry about the details, or fail earlier if an error like the above is detected in any of the updated containers.
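
For example, a config roughly like this (resource names and the broken image tag are purely illustrative, and the exact schema depends on the provider version) will sit in terraform apply until the provider's timeout lapses, because the pods can never become ready:

    resource "kubernetes_deployment" "example" {
      metadata {
        name = "example"
      }

      spec {
        replicas = 2

        template {
          metadata {
            labels {
              app = "example"
            }
          }

          spec {
            container {
              name = "app"

              # Non-existent tag: every pod ends up in ImagePullBackOff and
              # never becomes ready, so apply just waits.
              image = "nginx:this-tag-does-not-exist"
            }
          }
        }
      }
    }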

stigok commented Jun 26, 2018

The way it works with StatefulSet is in fact that the operation is deemed successful when the resource definition has been created/updated, and Terraform doesn't care about its children's state. I feel this is what we want here too. A Deployment creates a child ReplicaSet, and I don't think Terraform even needs to care about its state.

That said, do I feel the same way about creating a pod with e.g. a volume claim that fails and hangs forever? I'm not sure.
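
To make that case concrete with a made-up example: a claim that references a storage class the cluster doesn't have stays Pending forever, and anything waiting on it hangs along with it. Roughly (attribute names follow the upstream provider docs; this fork's schema may differ):

    resource "kubernetes_persistent_volume_claim" "example" {
      metadata {
        name = "data"
      }

      spec {
        access_modes = ["ReadWriteOnce"]

        # No provisioner will ever satisfy this class, so the claim stays
        # Pending indefinitely and any pod mounting it never starts.
        storage_class_name = "does-not-exist"

        resources {
          requests {
            storage = "1Gi"
          }
        }
      }
    }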

stigok commented Jun 26, 2018

I see now that StatefulSet creation actually waits until all the pods in the StatefulSet have been initiated. That means that if the StatefulSet's replica count is set to 3, it will wait until the StatefulSet contains 3 pods, but it does not care what state those pods are in.

sl1pm4t commented Jun 26, 2018

Thanks for the discussion on this point. I've recently been pondering this issue myself, and I tend to agree that the current behaviour isn't helpful.

While the hang is a useful indicator that the config is wrong and something needs to be fixed, the tool I would use at that point to fix the config is Terraform itself, which is now hung waiting for the timeout to lapse.

I could see a couple of possible changes that may help:

  1. add a flag to each relevant resource schema to indicate whether it should wait for resources to schedule
  2. drop the default timeout to a much lower value and / or make the timeout configurable

Option 2 with a low default timeout (e.g. 30s) might be the most useful middle ground. That way the operator gets a warning that the resource may not be healthy, but Terraform isn't blocked.
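
To sketch option 1 from the config side, it could be a per-resource attribute along these lines (the attribute name is hypothetical; nothing like it exists in the provider today):

    resource "kubernetes_deployment" "example" {
      metadata {
        name = "example"
      }

      # spec omitted; same shape as any other deployment

      # Hypothetical attribute (does not exist): when false, apply returns as
      # soon as the API server accepts the definition instead of waiting for
      # the rollout to finish.
      wait_for_rollout = false
    }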

stigok commented Jun 27, 2018

I like option 2 as well, but lowering the default might be problematic. I have had successful deployments that take a long time. This is related to the same issue:

  1. The statefulset is waiting for the replicaset
  2. The replicaset is waiting for the pods
  3. The pods are erroring about unbound volumes
  4. The persistent storage can take minutes to fulfill the claims (Azure)
  5. The pods' retry back-off looks to be something like 1, 5, 30, 120 seconds. 1 + 5 + 30 = 36 seconds, so the next retry is two minutes out. By then the volume claims are satisfied, but we are still waiting for the pods to come out of their back-off state.
  6. If there is more than one replica, steps 3 to 5 repeat for each one.
  7. Eventually Terraform is happy and the state is successfully saved

If Terraform needs the completed state of the child resources of e.g. Deployments and StatefulSets it should wait. If not, it should be satisfied when the cluster accepts the resource definition.

I think something like terraform apply --timeout 10 would help a lot, but it would also mean we lose the state of the resource when the apply is cancelled, right?
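
For what it's worth, the closest existing mechanism I can think of is Terraform's per-resource timeouts block rather than a CLI flag, though whether this provider actually registers configurable timeouts for these resources is an assumption on my part:

    resource "kubernetes_deployment" "example" {
      metadata {
        name = "example"
      }

      # spec omitted for brevity

      # Standard Terraform resource timeouts block; it only takes effect if
      # the provider declares these operations as having configurable
      # timeouts, which is assumed here.
      timeouts {
        create = "2m"
        update = "2m"
      }
    }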

ajwootto commented Jun 27, 2018

Is it generally a good idea for Terraform to wait for Kubernetes to satisfy all its resources, or should its only job be to make sure that the Kubernetes resource is initially created on the cluster? Whether or not that resource is actually fully satisfied by Kubernetes might not be Terraform's concern, since it's now up to Kubernetes to handle the resource correctly. This does mean you lose any error feedback from resources that might be incorrectly configured though.

If there's a good reason for the provider to monitor Kubernetes' progress towards fulfilling a resource, I'd love to hear it.
