Kubernetes Loop

I’ve been diving deep into systems architecture lately, specifically Kubernetes

Strip away the UIs, the YAML, and the ceremony, and Kubernetes boils down to:

A very stubborn event driven collection of control loops

aka the reconciliation (Control) loop, and everything I read is calling this the “gold standard” for distributed control planes.

Because it decomposes the control plane into many small, independent loops, each continuously correcting drift rather than trying to execute perfect one-shot workflows. these loops are triggered by events or state changes, but what they do is determined by the the spec. vs observed state (status)

Now we have both:

  • spec: desired state
  • status: observed state

Kubernetes lives in that gap.

When spec and status match, everything’s quiet. When they don’t, something wakes up to ensure current state matches the declared state.

The Architecture of Trust

In Kubernetes, they don’t coordinate via direct peer-to-peer orchestration; They coordinate by writing to and watching one shared “state.”

That state lives behind the API server, and the API server validates it and persists it into etcd.

Role of the API server

The API server is the front door to the cluster’s shared truth: it’s the only place that can accept, validate, and persist declared intent as Kubernetes API objects (metadata/spec/status).

When you install a CRD, you’re extending the API itself with a new type (a new endpoint) or a schema the API server can validate against

When we use kubectl apply (or any client) to submit YAML/JSON to the API server, the API server validates it (built-in rules, CRD OpenAPI v3 schema / CEL rules, and potentially admission webhooks) and rejects invalid objects before they’re stored.

If the request passes validation, the API server persists the object into etcd (the whole API object, not just “intent”), and controllers/operators then watch that stored state and do the reconciliation work to make reality match it.

Once stored, controllers/operators (loops) watch those objects and run reconciliation to push the real world toward what’s declared.

it turns out In practice, most controllers don’t act directly on raw watch events, they consume changes through informer caches and queue work onto a rate-limited workqueue. They also often watch related/owned resources (secondary watches), not just the primary object, to stay convergent.

spec is often user-authored as discussed above, but it isn’t exclusively human-written, the scheduler and some controllers also update parts of it (e.g., scheduling decisions/bindings and defaulting).

Role of etcd cluster

etcd is the control plane’s durable record of “the authoritative reference for what the cluster believes that should exist and what it currently reports.”

If an intent (an API object) isn’t in etcd, controllers can’t converge on it—because there’s nothing recorded to reconcile toward

This makes the system inherently self-healing because it trusts the declared state and keeps trying to morph the world to match until those two align.

One tidbit worth noting:

In production, Nodes, runtimes, cloud load balancers can drift independently. Controllers treat those systems as observed state, and they keep measuring reality against what the API says should exist.

How the Loop Actually Works

 Kubernetes isn’t one loop. It’s a bunch of loops(controllers) that all behave the same way:

  • read desired state (what the API says should exist)
  • observe actual state (what’s really happening)
  • calculate the diff
  • push reality toward the spec

 

As an example, let’s look at a simple nginx workload deployment

1) Intent (Desired State)

To Deploy the Nginx workload. You run:

kubectl apply -f nginx.yaml

 

The API server validates the object (and its schema, if it’s a CRD-backed type) and writes it into etcd.

At that point, Kubernetes has only recorded your intent. Nothing has “deployed” yet in the physical sense. The cluster has simply accepted:

“This is what the world should look like.”

2) Watch (The Trigger)

Controllers and schedulers aren’t polling the cluster like a bash script with a sleep 10.

They watch the API server.

When desired state changes, the loop responsible for it wakes up, runs through its logic, and acts:

“New desired state: someone wants an Nginx Pod.”

watches aren’t gospel. Events can arrive twice, late, or never, and your controller still has to converge. Controllers use list+watch patterns with periodic resync as a safety net. The point isn’t perfect signals it’s building a loop that stays correct under imperfect signals.

Controllers also don’t spin constantly they queue work. Events enqueue object keys; workers dequeue and reconcile; failures requeue with backoff. This keeps one bad object from melting the control plane.

3) Reconcile (Close the Gap)

Here’s the mental map that made sense to me:

Kubernetes is a set of level-triggered control loops. You declare desired state in the API, and independent loops keep working until the real world matches what you asked for.

  • Controllers (Deployment/ReplicaSet/etc.) watch the API for desired state and write more desired state.
    • Example: a Deployment creates/updates a ReplicaSet; a ReplicaSet creates/updates Pods.
  • The scheduler finds Pods with no node assigned and picks a node.
    • It considers resource requests, node capacity, taints/tolerations, node selectors, (anti)affinity, topology spread, and other constraints.
    • It records its decision by setting spec.nodeName on the Pod.
  • The kubelet on the chosen node notices “a Pod is assigned to me” and makes it real.
    • pulls images (if needed) via the container runtime (CRI)
    • sets up volumes/mounts (often via CSI)
    • triggers networking setup (CNI plugins do the actual wiring)
    • starts/monitors containers and reports status back to the API

Each component writes its state back into the API, and the next loop uses that as input. No single component “runs the whole workflow.”

One property makes this survivable: reconcile must be safe to repeat (idempotent). The loop might run once or a hundred times (retries, resyncs, restarts, duplicate/missed watch events), and it should still converge to the same end result.

if the desired state is already satisfied, reconcile should do nothing; if something is missing, it should fill the gap, without creating duplicates or making things worse.

When concurrent updates happen (two controllers might try to update the same object at the same time)

Kubernetes handles this with optimistic concurrency. Every object has a resourceVersion (what version of this object did you read?”). If you try to write an update using an older version, the API server rejects it (often as a conflict).

Then the flow is: re-fetch the latest object, apply your change again, and retry.

4) Status (Report Back)

Once the pod is actually running, status flows back into the API.

The Loop Doesn’t Protect You From Yourself

What if the declared state says to delete something critical like kube-proxy or a CNI component? The loop doesn’t have opinions. It just does what the spec says.

A few things keep this from being a constant disaster:

  • Control plane components are special. The API server, etcd, scheduler, controller-manager these usually run as static pods managed directly by kubelet, not through the API. The reconciliation loop can’t easily delete the thing running the reconciliation loop as long as its manifest exists on disk.
  • DaemonSets recreate pods. Delete a kube-proxy pod and the DaemonSet controller sees “desired: 1, actual: 0” and spins up a new one. You’d have to delete the DaemonSet itself.
  • RBAC limits who can do what. Most users can’t touch kube-system resources.
  • Admission controllers can reject bad changes before they hit etcd.

But at the end, if your source of truth says “delete this,” the system will try. The model assumes your declared state is correct. Garbage in, garbage out.

Why This Pattern Matters Outside Kubernetes

This pattern shows up anywhere you manage state over time.

Scripts are fine until they aren’t:

  • they assume the world didn’t change since last run
  • they fail halfway and leave junk behind
  • they encode “steps” instead of “truth”

A loop is simpler:

  • define the desired state
  • store it somewhere authoritative
  • continuously reconcile reality back to it

Ref

NFS Provisioner Setup and Testing Guide for Rancher RKE2/Kubernetes

This guide covers how to add an NFS StorageClass and a dynamic provisioner to Kubernetes using the nfs-subdir-external-provisioner Helm chart. This enables us to mount NFS shares dynamically for PersistentVolumeClaims (PVCs) used by workloads.

Example use cases:

  • Database migrations
  • Apache Kafka clusters
  • Data processing pipelines

Requirements:

  • An accessible NFS share exported with: rw,sync,no_subtree_check,no_root_squash
  • NFSv3 or NFSv4 protocol
  • Kubernetes v1.31.7+ or RKE2 with rke2r1 or later

 

lets get to it


1. NFS Server Export Setup

Ensure your NFS server exports the shared directory correctly:

# /etc/exports
/rke-pv-storage  worker-node-ips(rw,sync,no_subtree_check,no_root_squash)

 

  • Replace worker-node-ips with actual IPs or CIDR blocks of your worker nodes.
  • Run sudo exportfs -r to reload the export table.

2. Install NFS Subdir External Provisioner

Add the Helm repo and install the provisioned:

helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update

helm install nfs-client-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --namespace kube-system \
  --set nfs.server=192.168.162.100 \
  --set nfs.path=/rke-pv-storage \
  --set storageClass.name=nfs-client \
  --set storageClass.defaultClass=false

Notes:

  • If you want this to be the default storage class, change storageClass.defaultClass=true.
  • nfs.server should point to the IP of your NFS server.
  • nfs.path must be a valid exported directory from that NFS server.
  • storageClass.name can be referenced in your PersistentVolumeClaim YAMLs using storageClassName: nfs-client.

3. PVC and Pod Test

Create a test PVC and pod using the following YAML:

# test-nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-nfs-pod
spec:
  containers:
  - name: shell
    image: busybox
    command: [ "sh", "-c", "sleep 3600" ]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-nfs-pvc

 

Apply it:

kubectl apply -f test-nfs-pvc.yaml
kubectl get pvc test-nfs-pvc -w

 

Expected output:

NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-nfs-pvc   Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   1Gi        RWX            nfs-client     30s

 


4. Troubleshooting

If the PVC remains in Pending, follow these steps:

Check the provisioner pod status:

kubectl get pods -n kube-system | grep nfs-client-provisioner

 

Inspect the provisioner pod:

kubectl describe pod -n kube-system <pod-name>
kubectl logs -n kube-system <pod-name>

 

Common Issues:

  • Broken State: Bad NFS mount
    mount.nfs: access denied by server while mounting 192.168.162.100:/pl-elt-kakfka

     

    • This usually means the NFS path is misspelled or not exported properly.
  • Broken State: root_squash enabled
    failed to provision volume with StorageClass "nfs-client": unable to create directory to provision new pv: mkdir /persistentvolumes/…: permission denied

     

    • Fix by changing the export to use no_root_squash or chown the directory to nobody:nogroup.
  • ImagePullBackOff
    • Ensure nodes have internet access and can reach registry.k8s.io.
  • RBAC errors
    • Make sure the ServiceAccount used by the provisioner has permissions to watch PVCs and create PVs.

5. Healthy State Example

kubectl get pods -n kube-system | grep nfs-client-provisioner-nfs-subdir-external-provisioner
nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m   1/1     Running     0          3m39s

 

kubectl describe pod -n kube-system nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m
# Output shows pod is Running with Ready=True

 

kubectl logs -n kube-system nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m
...
I0512 21:46:03.752701       1 controller.go:1420] provision "default/test-nfs-pvc" class "nfs-client": volume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d" provisioned
I0512 21:46:03.752763       1 volume_store.go:212] Trying to save persistentvolume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d"
I0512 21:46:03.772301       1 volume_store.go:219] persistentvolume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d" saved
I0512 21:46:03.772353       1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Name:"test-nfs-pvc"}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-73481f45-3055-4b4b-80f4-e68ffe83802d
...

 

Once test-nfs-pvc is bound and the pod starts successfully, your setup is working. You can now safely use storageClass: nfs-client in other workloads (e.g., Strimzi KafkaNodePool).


Solution – RKE Cluster MetalLB provides Services with IP Addresses but doesn’t ARP for the address

I ran in to the the same issue detailed here working with a RKE cluster

https://github.com/metallb/metallb/issues/1154

After looking around for a few hours digging in to the logs i figured out the issue, hopefully this helps some one else our there in the situation save some time.

Make sure the IPVS mode is enabled on the cluster configuration

If you are using :

RKE2 – edit the cluster.yaml file

RKE1 – Edit the cluster configuration from the rancher UI > Cluster management > Select the cluster > edit configuration > edit as YAML

Locate the services field under rancher_kubernetes_engine_config and add the following options to enable IPVS

    kubeproxy:
      extra_args:
        ipvs-scheduler: lc
        proxy-mode: ipvs

https://www.suse.com/support/kb/doc/?id=000020035

Default

After changes

Make sure the Kernel modules are enabled on the nodes running control planes

Background

Example Rancher – RKE1 cluster

sudo docker ps | grep proxy # find the container ID for kubproxy

sudo docker logs ####containerID###

0313 21:44:08.315888  108645 feature_gate.go:245] feature gates: &{map[]}
I0313 21:44:08.346872  108645 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="nf_conntrack_ipv4"
E0313 21:44:08.347024  108645 server_others.go:107] "Can't use the IPVS proxier" err="IPVS proxier will not be used because the following required kernel modules are not loaded: [ip_vs_lc]"

Kubproxy is trying to load the needed kernel modules and failing to enable IPVS

Lets enable the kernel modules

sudo nano /etc/modules-load.d/ipvs.conf

ip_vs_lc
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack_ipv4

Install ipvsadm to confirm the changes

sudo dnf install ipvsadm -y

Reboot the VM or the Baremetal server

use the sudo ipvsadm to confirm ipvs is enabled

sudo ipvsadm

Testing

kubectl get svc -n #namespace | grep load
arping -I ens192 192.168.94.140
ARPING 192.168.94.140 from 192.168.94.65 ens192
Unicast reply from 192.168.94.140 [00:50:56:96:E3:1D] 1.117ms
Unicast reply from 192.168.94.140 [00:50:56:96:E3:1D] 0.737ms
Unicast reply from 192.168.94.140 [00:50:56:96:E3:1D] 0.845ms
Unicast reply from 192.168.94.140 [00:50:56:96:E3:1D] 0.668ms
Sent 4 probes (1 broadcast(s))
Received 4 response(s)

If you have the service type load balancer on a deployment now you should be able to reach it if the container is responding on the service

helpful Links

https://metallb.universe.tf/configuration/troubleshooting/

https://github.com/metallb/metallb/issues/1154

https://github.com/rancher/rke2/issues/3710