Deployment updating hangs when containers fail to start / become ready #28
Comments
The way it works with StatefulSet is in fact that the operation is deemed successful once the resource definition has been created or updated, and Terraform then doesn't care about the state of its children. I feel this is what we want here too. A Deployment creates a child ReplicaSet whose state I don't think Terraform even needs to care about. That said, do I feel the same way about creating a pod with e.g. a volume claim that fails and hangs forever? I'm not sure.
I see now that StatefulSet creation actually waits until all the pods in the StatefulSet have been initiated. That means if the StatefulSet's replica count is set to 3, it will wait until the StatefulSet contains 3 pods, but it does not care what state those pods are in.
Thanks for the discussion on this point. I've recently been pondering this issue myself, and I tend to agree that the current behaviour isn't helpful. While the hang is a useful indicator that the config is wrong and something needs to be fixed, the tool I would ideally use at that point to fix the config is Terraform itself, which is now hung waiting for the timeout to lapse. I could see a couple of possible changes that may help:
Option 2 with a low default timeout (e.g. 30s) might be the most useful middle ground. That way the operator gets a warning that the resource may not be healthy, but Terraform isn't blocked.
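For illustration, a short, operator-controlled timeout along the lines of option 2 might look like the sketch below. This assumes the `kubernetes_deployment` resource supports the usual `timeouts` block; the resource name and image are placeholders only.

```hcl
resource "kubernetes_deployment" "example" {
  metadata {
    name = "example"
  }

  spec {
    replicas = 3

    selector {
      match_labels = {
        app = "example"
      }
    }

    template {
      metadata {
        labels = {
          app = "example"
        }
      }

      spec {
        container {
          name  = "app"
          image = "example/app:1.0.0" # placeholder image
        }
      }
    }
  }

  # Give up after 30s instead of the provider's long default, so a failed
  # rollout surfaces quickly instead of blocking the whole apply.
  timeouts {
    create = "30s"
    update = "30s"
  }
}
```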
I like option 2 as well, but re-adjusting the default might be problematic. I have had successful deployments that take a long time. This is related to the same issue:
If Terraform needs the completed state of the child resources of e.g. Deployments and StatefulSets, it should wait. If not, it should be satisfied when the cluster accepts the resource definition. I think it would help a lot to have something like
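As a rough sketch of what such an opt-out could look like at the config level, assuming the provider exposes an argument along the lines of `wait_for_rollout` (the resource name and image below are placeholders):

```hcl
resource "kubernetes_deployment" "fire_and_forget" {
  metadata {
    name = "fire-and-forget"
  }

  spec {
    replicas = 1

    selector {
      match_labels = {
        app = "fire-and-forget"
      }
    }

    template {
      metadata {
        labels = {
          app = "fire-and-forget"
        }
      }

      spec {
        container {
          name  = "app"
          image = "example/app:1.0.0" # placeholder image
        }
      }
    }
  }

  # Don't block the apply on pod readiness; Terraform is done as soon as
  # the cluster has accepted the updated Deployment definition.
  wait_for_rollout = false
}
```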
Is it generally a good idea for Terraform to wait for Kubernetes to satisfy all its resources, or should its only job be to make sure that the Kubernetes resource is initially created on the cluster? Whether or not that resource is actually fully satisfied by Kubernetes might not be Terraform's concern, since it's now up to Kubernetes to handle the resource correctly. This does mean you lose any error feedback from resources that might be incorrectly configured, though. If there's a good reason for the provider to monitor Kubernetes' progress towards fulfilling a resource, I'd love to hear it.
This may be an intentional choice, but currently if a deployment is updated through Terraform and the updated pods fail to start for any reason (image pull failures, crashes on startup, failing health checks, etc.), the Terraform process just hangs waiting for the pods to become ready, which will never happen. Eventually it times out, but until then it is seemingly waiting around for no reason. In my opinion it makes more sense for the provider to either call the operation successful once the deployment's definition is updated and let Kubernetes worry about the details, or fail earlier if an error like the above is detected in any of the updated containers.
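For anyone wanting to reproduce the hang, a sketch using a deliberately unpullable image tag (all names here are placeholders): applying an update like this leaves `terraform apply` blocked until the update timeout expires, because the new pod can never become ready.

```hcl
resource "kubernetes_deployment" "broken" {
  metadata {
    name = "broken"
  }

  spec {
    replicas = 1

    selector {
      match_labels = {
        app = "broken"
      }
    }

    template {
      metadata {
        labels = {
          app = "broken"
        }
      }

      spec {
        container {
          name = "app"
          # Tag that does not exist: the pod sits in ImagePullBackOff and
          # never becomes ready, so the provider waits out the full timeout.
          image = "example/app:does-not-exist"
        }
      }
    }
  }
}
```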