
Pod IP not removed from Service EndPoint when ReadinessProbe failed #3725

Closed
bob2204 opened this issue Aug 30, 2024 · 24 comments
@bob2204
bob2204 commented Aug 30, 2024

Hello

With kind 0.24 and kindest/node 1.31.0, the Pod IP is not removed from the Service endpoints when the readiness probe fails, even though the address is shown under NotReadyAddresses in the Endpoints object!

This worked fine with kind 0.23 and kindest/node 1.30.2.

Is this normal?

Best Regards

@bob2204 bob2204 changed the title Node IP not removed from Service EndPoint when ReadunessProbe failed Node IP not removed from Service EndPoint when ReadinessProbe failed Aug 30, 2024
@aojea
Contributor

aojea commented Aug 30, 2024

You need to add more details and a reproducer; it's not easy to tell from the comments what might be failing.

@bob2204
Author

bob2204 commented Aug 30, 2024

I apologize; what I meant is that the Pod IP was not removed from the Service endpoints.

I'm using an nginx Deployment with a readiness probe on this container:

containers:
- image: nginx:1.26
  name: nginx
  readinessProbe:
    httpGet:
      path: /livez
      port: 80
    periodSeconds: 3
    failureThreshold: 2

and a service like :

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer

When this readiness probe fails, the Pod IP is shown under NotReadyAddresses in the Endpoints object:

kubectl describe endpoints nginx 
Name:         lemp
Namespace:    default
Labels:       app=nginx
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2024-08-30T15:41:43Z
Subsets:
  Addresses:          <none>
  NotReadyAddresses:  10.32.204.60
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    <unset>  80    TCP

Events:  <none>

BUT the Pod IP 10.32.204.60 was not removed from the Service endpoints:

kubectl describe svc nginx 
Name:                     nginx
Namespace:                default
Labels:                   app=nginx
Annotations:              <none>
Selector:                 app=nginx
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.16.42.218
IPs:                      172.16.42.218
LoadBalancer Ingress:     172.18.0.9 (Proxy)
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  31693/TCP
Endpoints:                10.32.204.60:80
Session Affinity:         None
External Traffic Policy:  Cluster
Internal Traffic Policy:  Cluster
Events:                   <none>

With kind 0.23 and kindest/node:1.30.2 everything is OK: the Pod IP is removed from the Service endpoints when the readiness probe fails.
AND with a K8s cluster of 3 VMs running 1.31.0, everything is OK too!

Is my English clear?
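The two describe views above can also be compared mechanically instead of by eye. The following is an illustrative sketch, not part of the original report: the sample Endpoints YAML is inlined so the snippet runs without a cluster; against a live cluster you would pipe `kubectl get endpoints nginx -o yaml` into the grep instead.

```shell
# Sample Endpoints YAML, inlined here so this runs without a cluster.
cat > /tmp/ep.yaml <<'EOF'
subsets:
- notReadyAddresses:
  - ip: 10.32.204.60
  ports:
  - port: 80
    protocol: TCP
EOF

# List any IPs under notReadyAddresses; these should NOT also show up
# in the Service's Endpoints field once the readiness probe fails.
grep -A1 'notReadyAddresses' /tmp/ep.yaml | grep -o '[0-9][0-9.]*[0-9]'
```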

@bob2204 bob2204 changed the title Node IP not removed from Service EndPoint when ReadinessProbe failed Pod IP not removed from Service EndPoint when ReadinessProbe failed Aug 30, 2024
@aojea
Contributor

aojea commented Aug 31, 2024

Just to understand: this works in Kubernetes versions 1.30 and 1.31, and only fails with kindest/node 1.31.0?

@bob2204
Author

bob2204 commented Aug 31, 2024

After further investigation, I found that regardless of the Kubernetes version, the problem seems to be the VirtualBox environment.
I have two identical kind installations -- kind 0.24.0, kindest/node:1.31.0, Calico 3.28.0 -- one on a physical machine and one on a VirtualBox VM.

Any explanation?

@aojea
Contributor

aojea commented Aug 31, 2024

Is kubectl the same version?

What difference should it make whether kind runs on VirtualBox or another hypervisor? It just uses Docker containers.

Are you doing something out of the ordinary, such as adding custom nodes or a different kind configuration?

@bob2204
Author

bob2204 commented Aug 31, 2024

kubectl is the same version.
The two installs are identical.
Both have the same Calico CNI version, 3.28.
Both installs use Docker.
The only difference is physical machine vs. virtual machine.

@BenTheElder
Member

Do you observe this without Calico? We don't really provide support for third-party CNIs (installing one is supported, but we're not tracking down bugs in all of them).

@bob2204
Author

bob2204 commented Sep 3, 2024

With Calico/Cilium/kindnet I see the same behavior.
With VirtualBox/VMware/KVM, the same.
With Killercoda everything is fine! For me it serves as a control case.

I've tried a simple

kind create cluster --config=config.yml

with one control-plane node and three workers.

@aojea
Contributor

aojea commented Sep 3, 2024

Can you upload a tarball with the logs of the cluster that has the issue (kind export logs) and indicate the name of the Service and, roughly, the time when the problem happens?

@bob2204
Author

bob2204 commented Sep 3, 2024

full-logs.tar.gz
Service name: nginx
UTC Time: 2024-09-03T18:43:56Z

@bob2204
Author

bob2204 commented Sep 3, 2024

Manifest used

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        ports:
          - name: http
            containerPort: 80
        readinessProbe:
          httpGet:
            path: /healthz
            port: http
          periodSeconds: 2
          failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: http
  selector:
    app: nginx

I then create/delete /usr/share/nginx/html/healthz to make the readiness probe pass or fail.
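Spelled out as commands, that toggle looks like this (a sketch only; it assumes the Deployment above is named nginx and targets it via `kubectl exec deploy/nginx`, which picks one of its pods):

```shell
# Make the readiness probe fail by removing the file it checks.
kubectl exec deploy/nginx -- rm -f /usr/share/nginx/html/healthz

# Watch the Endpoints object: the pod IP should move to NotReadyAddresses
# and, in a healthy cluster, drop out of `kubectl describe svc nginx`.
kubectl get endpoints nginx --watch

# Make the probe pass again.
kubectl exec deploy/nginx -- sh -c 'echo ok > /usr/share/nginx/html/healthz'
```

These commands require a running cluster with the manifest above applied, so they are shown for reproduction rather than as a standalone script.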

@aojea
Contributor

aojea commented Sep 5, 2024

full-logs.tar.gz Service name: nginx UTC Time: 2024-09-03T18:43:56Z

That does not add up; the nginx container starts at 18:44:

Sep 03 18:44:01 stage-worker2 containerd[185]: time="2024-09-03T18:44:01.642418279Z" level=info msg="StartContainer for "0f8fa2821ddca5ce36b9ee686d36e60cf6ffa18b665585c663fe9f4baef699d0" returns successfully"

and there are no more logs after that. You have period 2 and threshold 2, so it should start failing at 18:44:05, but there are no logs there.

I noticed that your environment has only 2 GB of RAM in the VM; it would not be surprising if the problem is that your VMs are constrained and everything is slower in that environment.

@bob2204
Author

bob2204 commented Sep 5, 2024

I'm sorry to waste your time, but the problem remains the same with 8 GB!
Here is the new dump:
full-log-2.tar.gz

The time was around 11:40/11:50 UTC.

k describe ep,svc nginx 
Name:         nginx
Namespace:    default
Labels:       <none>
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2024-09-05T11:48:23Z
Subsets:
  Addresses:          <none>
  NotReadyAddresses:  10.244.2.3       <<<< This shows that the IP is not Ready 
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    <unset>  80    TCP

Events:  <none>


Name:                     nginx
Namespace:                default
Labels:                   <none>
Annotations:              <none>
Selector:                 app=nginx
Type:                     ClusterIP
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.96.88.94
IPs:                      10.96.88.94
Port:                     <unset>  80/TCP
TargetPort:               http/TCP
Endpoints:                10.244.2.3:80         <<<< Should NOT be here because the IP is not Ready
Session Affinity:         None
Internal Traffic Policy:  Cluster
Events:                   <none>

@aojea
Contributor

aojea commented Sep 5, 2024

@bob2204 it looks like the kubelet is continuously restarting... if you have the cluster running, can you verify that?

@bob2204
Author

bob2204 commented Sep 6, 2024

None of the three kubelets is continuously restarting.
This is the output of systemctl status kubelet on one node; the others are the same:

root@stage-worker2:/# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf, 11-kind.conf
     Active: active (running) since Thu 2024-09-05 11:44:26 UTC; 14h ago
       Docs: http://kubernetes.io/docs/
    Process: 197 ExecStartPre=/bin/sh -euc if [ -f /sys/fs/cgroup/cgroup.controllers ]; then /kind/bin/create-kubelet-cgroup-v2.sh; fi (code=exited, status=0/SUCCESS)
    Process: 198 ExecStartPre=/bin/sh -euc if [ ! -f /sys/fs/cgroup/cgroup.controllers ] && [ ! -d /sys/fs/cgroup/systemd/kubelet ]; then mkdir -p /sys/fs/cgroup/systemd/kubelet; fi (code=exited, status=0/SUCCESS)
   Main PID: 199 (kubelet)
      Tasks: 12 (limit: 9425)
     Memory: 43.2M
        CPU: 7min 5.993s
     CGroup: /kubelet.slice/kubelet.service
             └─199 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///run/containerd/containerd.sock --node-ip=172.18.0.3 --node-labels= --pod-infra-container-image=registry.k8s.io/pause:3.10 --provider-id=kind://docker/stage/stage-worker2 --runtime-cgroups=/system.slice/containerd.service

@faisalkamilansari

@bob2204

I am also having the same problem. Is yours solved?

Kubernetes version: v1.31.2

@chsakell

I have the same problem: even though the pod is not ready, its IP address is added to the Service endpoints.
The files the probes check don't even exist in the pod.

Client Version: v1.31.5
Kustomize Version: v5.4.2
Server Version: v1.32.0
kind v0.26.0 go1.23.4 linux/amd64

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: dev-cluster
nodes:
- role: control-plane
- role: worker
- role: worker

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  terminationGracePeriodSeconds: 1
  containers:
  - name: probe-demo
    image: nginx
    startupProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 1
      failureThreshold: 30
    livenessProbe:
      httpGet:
        path: /live.html
        port: 80
      periodSeconds: 10
      failureThreshold: 30   
    readinessProbe:
      httpGet:
        path: /ready.html
        port: 80
      periodSeconds: 10
      failureThreshold: 20       
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 80      

[screenshot of kubectl output]

@aojea
Contributor

aojea commented Jan 27, 2025

I think this issue is becoming a magnet for symptoms that do not necessarily share the same root cause. kind does not do anything exceptional to Kubernetes components, so all of this should be opened on the Kubernetes repo, and I will likely be the one triaging it there anyway; I will make an exception for this last one.

@chsakell you can run kind export logs to dump all the logs and upload a tarball so we can see the component logs.
Also, it is better not to paste screenshots of command output; markdown format is fine.

@chsakell

Here are the logs, exported with the following commands:

kind export logs --name dev-cluster
tar zcvf kind-logs.tar.gz .

kind-logs.tar.gz

@BenTheElder
Member

I have the same problem, even if the pod is not ready, its IP address is being added to the service endpoints.

This would be a bug in the main Kubernetes project; service endpoints and pods are not implemented here.

We implement cluster bootstrapping, a default PV driver, and the pod network / NetworkPolicy (the network bridges and node-to-node pod IP routing), NOT endpoints/services.

github.com/kubernetes/kubernetes

I don't mind discussing here but there's a better chance of finding the root issue if it's reported to the project.

Also, please aim for a minimal reproducer to help contributors find the cause quickly. E.g., does it still happen with a single node? If so, use that.

Aside: more generally, unless you're testing distributed behaviors that require multiple nodes, I highly recommend single-node clusters, for simplicity, reduced overhead, and to avoid over-reporting the host's resources, which are ultimately shared by the nodes.
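For example, a minimal single-node config (the cluster name here is just a placeholder) could be:

```yaml
# Minimal single-node kind cluster for a reproducer
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: repro
nodes:
- role: control-plane
```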

@bob2204
Author

bob2204 commented Jan 28, 2025

@faisalkamilansari

Like BenTheElder, I think it's a bug/feature of K8s itself. I see the same behavior with vanilla 1.31 and 1.32 clusters. kind is not guilty ;-)

@faisalkamilansari

@bob2204

Is there any issue with traffic routing because of this, or will traffic be routed only to ready pods?

@aojea
Contributor

aojea commented Jan 28, 2025

@chsakell something is odd with your environment; I see pods like kindnet restarting, and the probe pod you run also fails probes and is restarted... Let's not complicate this issue further. kind is not doing anything special with endpoints or pod readiness, so if you have a repro, please open an issue in kubernetes/kubernetes with all the details and exact steps and tag me there.

/close

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

@chsakell something is odd with your environment; I see pods like kindnet restarting, and the probe pod you run also fails probes and is restarted... Let's not complicate this issue further. kind is not doing anything special with endpoints or pod readiness, so if you have a repro, please open an issue in kubernetes/kubernetes with all the details and exact steps and tag me there.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
