
Pod IP not removed from Service EndPoint when ReadinessProbe failed #3725

Closed
bob2204 opened this issue Aug 30, 2024 · 24 comments
@bob2204
bob2204 commented Aug 30, 2024

Hello

With kind 0.24 and kindest/node 1.31.0, the Pod IP is not removed from the Service endpoints when the readiness probe fails, even though the address is shown under NotReadyAddresses in the Endpoints object!

This worked fine with kind 0.23 and kindest/node 1.30.2.

Is this normal?

Best Regards

@bob2204 bob2204 changed the title Node IP not removed from Service EndPoint when ReadunessProbe failed Node IP not removed from Service EndPoint when ReadinessProbe failed Aug 30, 2024
@aojea
Contributor

aojea commented Aug 30, 2024

You need to add more details and a reproducer; it's not easy to tell from the comments what might be failing.

@bob2204
Author

bob2204 commented Aug 30, 2024

I apologize; what I meant is that the Pod IP was not removed from the Service endpoints.

I'm using an nginx Deployment with a readiness probe on this container:

containers:
- image: nginx:1.26
  name: nginx
  readinessProbe:
    httpGet:
      path: /livez
      port: 80
    periodSeconds: 3
    failureThreshold: 2

and a service like :

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer

When this readiness probe fails, the Pod IP is shown under NotReadyAddresses in the Endpoints object:

kubectl describe endpoints nginx 
Name:         lemp
Namespace:    default
Labels:       app=nginx
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2024-08-30T15:41:43Z
Subsets:
  Addresses:          <none>
  NotReadyAddresses:  10.32.204.60
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    <unset>  80    TCP

Events:  <none>

BUT the Pod IP 10.32.204.60 was not removed from the Service endpoints:

kubectl describe svc nginx 
Name:                     nginx
Namespace:                default
Labels:                   app=nginx
Annotations:              <none>
Selector:                 app=nginx
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.16.42.218
IPs:                      172.16.42.218
LoadBalancer Ingress:     172.18.0.9 (Proxy)
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  31693/TCP
Endpoints:                10.32.204.60:80
Session Affinity:         None
External Traffic Policy:  Cluster
Internal Traffic Policy:  Cluster
Events:                   <none>

With kind 0.23 and kindest/node:1.30.2 everything is OK: the Pod IP is removed from the Service endpoints when the readiness probe fails.
AND with a K8s cluster of 3 VMs running 1.31.0, everything is OK too!

Is my English clear?
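The two describe views above can also be compared mechanically instead of by eye. The following is an illustrative sketch, not part of the original report: the sample Endpoints YAML is inlined so the snippet runs without a cluster; against a live cluster you would pipe `kubectl get endpoints nginx -o yaml` into the grep instead.

```shell
# Sample Endpoints YAML, inlined here so this runs without a cluster.
cat > /tmp/ep.yaml <<'EOF'
subsets:
- notReadyAddresses:
  - ip: 10.32.204.60
  ports:
  - port: 80
    protocol: TCP
EOF

# List any IPs under notReadyAddresses; these should NOT also show up
# in the Service's Endpoints field once the readiness probe fails.
grep -A1 'notReadyAddresses' /tmp/ep.yaml | grep -o '[0-9][0-9.]*[0-9]'
```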

@bob2204 bob2204 changed the title Node IP not removed from Service EndPoint when ReadinessProbe failed Pod IP not removed from Service EndPoint when ReadinessProbe failed Aug 30, 2024
@aojea
Contributor

aojea commented Aug 31, 2024

Just to understand: this works in Kubernetes versions 1.30 and 1.31, and only fails with kindest/node 1.31.0?

@bob2204
Author

bob2204 commented Aug 31, 2024

After further investigation, I found that regardless of the Kubernetes version, the problem seems to be the VirtualBox environment.
I have two identical kind installations -- kind 0.24.0, kindest/node:1.31.0, Calico 3.28.0 -- one on a physical machine and one on a VirtualBox VM.

Any explanation?

@aojea
Contributor

aojea commented Aug 31, 2024

Is kubectl the same version?

What difference should it make whether kind runs on VirtualBox or another hypervisor? It just uses Docker containers.

Are you doing something out of the ordinary, such as adding custom nodes or a different kind configuration?

@bob2204
Author

bob2204 commented Aug 31, 2024

kubectl is the same version.
The two installs are identical.
Both have the same Calico CNI version, 3.28.
Both installs use Docker.
The only difference is physical machine vs. virtual machine.

@BenTheElder
Member

Do you observe this without Calico? We don't really provide support for third-party CNIs (installing one is supported, but we're not tracking down bugs in all of them).

@bob2204
Author

bob2204 commented Sep 3, 2024

With Calico/Cilium/kindnet I see the same behavior.
With VirtualBox/VMware/KVM, the same.
With Killercoda everything is fine! For me it serves as a control case.

I've tried a simple

kind create cluster --config=config.yml

with one control-plane node and three workers.

@aojea
Contributor

aojea commented Sep 3, 2024

Can you upload a tarball with the logs of the cluster that has the issue (kind export logs) and indicate the name of the Service and, roughly, the time when the problem happens?

@bob2204
Author

bob2204 commented Sep 3, 2024

full-logs.tar.gz
Service name: nginx
UTC Time: 2024-09-03T18:43:56Z

@bob2204
Author

bob2204 commented Sep 3, 2024

Manifest used

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        ports:
          - name: http
            containerPort: 80
        readinessProbe:
          httpGet:
            path: /healthz
            port: http
          periodSeconds: 2
          failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: http
  selector:
    app: nginx

I then create/delete /usr/share/nginx/html/healthz to make the readiness probe pass or fail.
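Spelled out as commands, that toggle looks like this (a sketch only; it assumes the Deployment above is named nginx and targets it via `kubectl exec deploy/nginx`, which picks one of its pods):

```shell
# Make the readiness probe fail by removing the file it checks.
kubectl exec deploy/nginx -- rm -f /usr/share/nginx/html/healthz

# Watch the Endpoints object: the pod IP should move to NotReadyAddresses
# and, in a healthy cluster, drop out of `kubectl describe svc nginx`.
kubectl get endpoints nginx --watch

# Make the probe pass again.
kubectl exec deploy/nginx -- sh -c 'echo ok > /usr/share/nginx/html/healthz'
```

These commands require a running cluster with the manifest above applied, so they are shown for reproduction rather than as a standalone script.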

@aojea
Contributor

aojea commented Sep 5, 2024

full-logs.tar.gz Service name: nginx UTC Time: 2024-09-03T18:43:56Z

That does not add up; the nginx container starts at 18:44:

Sep 03 18:44:01 stage-worker2 containerd[185]: time="2024-09-03T18:44:01.642418279Z" level=info msg="StartContainer for "0f8fa2821ddca5ce36b9ee686d36e60cf6ffa18b665585c663fe9f4baef699d0" returns successfully"

and there are no more logs after that. You have period 2 and threshold 2, so it should start failing at 18:44:05, but there are no logs there.

I noticed that your environment has only 2 GB of RAM in the VM; it would not be surprising if the problem is that your VMs are constrained and everything is slower in that environment.

@bob2204
Author

bob2204 commented Sep 5, 2024

I'm sorry to waste your time, but the problem remains the same with 8 GB!
Here is the new dump:
full-log-2.tar.gz

The time was around 11:40/11:50 UTC.

k describe ep,svc nginx 
Name:         nginx
Namespace:    default
Labels:       <none>
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2024-09-05T11:48:23Z
Subsets:
  Addresses:          <none>
  NotReadyAddresses:  10.244.2.3       <<<< This shows that the IP is not Ready 
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    <unset>  80    TCP

Events:  <none>


Name:                     nginx
Namespace:                default
Labels:                   <none>
Annotations:              <none>
Selector:                 app=nginx
Type:                     ClusterIP
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.96.88.94
IPs:                      10.96.88.94
Port:                     <unset>  80/TCP
TargetPort:               http/TCP
Endpoints:                10.244.2.3:80         <<<< Should NOT be here because the IP is not Ready
Session Affinity:         None
Internal Traffic Policy:  Cluster
Events:                   <none>

@aojea
Contributor

aojea commented Sep 5, 2024

@bob2204 it looks like the kubelet is continuously restarting... if you have the cluster running, can you verify that?

@bob2204
Author

bob2204 commented Sep 6, 2024

None of the three kubelets is continuously restarting.
This is the output of systemctl status kubelet on one node; the others are the same:

root@stage-worker2:/# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf, 11-kind.conf
     Active: active (running) since Thu 2024-09-05 11:44:26 UTC; 14h ago
       Docs: http://kubernetes.io/docs/
    Process: 197 ExecStartPre=/bin/sh -euc if [ -f /sys/fs/cgroup/cgroup.controllers ]; then /kind/bin/create-kubelet-cgroup-v2.sh; fi (code=exited, status=0/SUCCESS)
    Process: 198 ExecStartPre=/bin/sh -euc if [ ! -f /sys/fs/cgroup/cgroup.controllers ] && [ ! -d /sys/fs/cgroup/systemd/kubelet ]; then mkdir -p /sys/fs/cgroup/systemd/kubelet; fi (code=exited, status=0/SUCCESS)
   Main PID: 199 (kubelet)
      Tasks: 12 (limit: 9425)
     Memory: 43.2M
        CPU: 7min 5.993s
     CGroup: /kubelet.slice/kubelet.service
             └─199 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///run/containerd/containerd.sock --node-ip=172.18.0.3 --node-labels= --pod-infra-container-image=registry.k8s.io/pause:3.10 --provider-id=kind://docker/stage/stage-worker2 --runtime-cgroups=/system.slice/containerd.service

@faisalkamilansari

@bob2204

I am also having the same problem. Is yours solved?

Kubernetes version: v1.31.2

@chsakell

I have the same problem: even though the pod is not ready, its IP address is added to the Service endpoints.
The files the probes check don't even exist in the pod.

Client Version: v1.31.5
Kustomize Version: v5.4.2
Server Version: v1.32.0
kind v0.26.0 go1.23.4 linux/amd64

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: dev-cluster
nodes:
- role: control-plane
- role: worker
- role: worker

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  terminationGracePeriodSeconds: 1
  containers:
  - name: probe-demo
    image: nginx
    startupProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 1
      failureThreshold: 30
    livenessProbe:
      httpGet:
        path: /live.html
        port: 80
      periodSeconds: 10
      failureThreshold: 30   
    readinessProbe:
      httpGet:
        path: /ready.html
        port: 80
      periodSeconds: 10
      failureThreshold: 20       
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 80      

[screenshot of kubectl output]

@aojea
Contributor

aojea commented Jan 27, 2025

I think this issue is becoming a magnet for symptoms that do not necessarily share the same root cause. kind does not do anything exceptional to Kubernetes components, so all of this should be opened on the Kubernetes repo, and I will likely be the one triaging it there anyway; I will make an exception for this last one.

@chsakell you can run kind export logs to dump all the logs and upload a tarball so we can see the component logs.
Also, it is better not to paste screenshots of command output; markdown format is fine.

@chsakell

Here are the logs, exported with the following commands:

kind export logs --name dev-cluster
tar zcvf kind-logs.tar.gz .

kind-logs.tar.gz

@BenTheElder
Member

I have the same problem, even if the pod is not ready, its IP address is being added to the service endpoints.

This would be a bug in the main Kubernetes project; service endpoints and pods are not implemented here.

We implement cluster bootstrapping, a default PV driver, and the pod network / NetworkPolicy (the network bridges and node-to-node pod IP routing), NOT endpoints/services.

github.com/kubernetes/kubernetes

I don't mind discussing here but there's a better chance of finding the root issue if it's reported to the project.

Also, please aim for a minimal reproducer to help contributors find the cause quickly. E.g., does it still happen with a single node? If so, use that.

Aside: more generally, unless you're testing distributed behaviors that require multiple nodes, I highly recommend single-node clusters, for simplicity, reduced overhead, and to avoid over-reporting the host's resources, which are ultimately shared by the nodes.
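For example, a minimal single-node config (the cluster name here is just a placeholder) could be:

```yaml
# Minimal single-node kind cluster for a reproducer
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: repro
nodes:
- role: control-plane
```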

@bob2204
Author

bob2204 commented Jan 28, 2025

@faisalkamilansari

Like BenTheElder, I think it's a bug/feature of K8s itself. I see the same behavior with vanilla 1.31 and 1.32 clusters. kind is not guilty ;-)

@faisalkamilansari

@bob2204

Is there any issue with traffic routing because of this, or will traffic be routed only to ready pods?

@aojea
Contributor

aojea commented Jan 28, 2025

@chsakell something is odd with your environment; I see pods like kindnet restarting, and the probe pod you run also fails probes and is restarted... Let's not complicate this issue further. kind is not doing anything special with endpoints or pod readiness, so if you have a repro, please open an issue in kubernetes/kubernetes with all the details and exact steps and tag me there.

/close

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

@chsakell something is odd with your environment; I see pods like kindnet restarting, and the probe pod you run also fails probes and is restarted... Let's not complicate this issue further. kind is not doing anything special with endpoints or pod readiness, so if you have a repro, please open an issue in kubernetes/kubernetes with all the details and exact steps and tag me there.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
