Kubernetes Loop

I’ve been diving deep into systems architecture lately, specifically Kubernetes.

Strip away the UIs, the YAML, and the ceremony, and Kubernetes boils down to:

A very stubborn, event-driven collection of control loops

This is the reconciliation (control) loop, and everything I read calls it the “gold standard” for distributed control planes.

Why? Because it decomposes the control plane into many small, independent loops, each continuously correcting drift rather than trying to execute perfect one-shot workflows. These loops are triggered by events or state changes, but what they do is determined by the gap between the desired state (spec) and the observed state (status).

Now we have both:

  • spec: desired state
  • status: observed state

Kubernetes lives in that gap.

When spec and status match, everything’s quiet. When they don’t, something wakes up to ensure current state matches the declared state.
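To make that gap concrete, here’s a minimal Go sketch using the typed client from client-go. The clientset, the “default” namespace, and the “nginx” Deployment name are assumptions for illustration, not anything special in Kubernetes itself:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reportDrift compares what was asked for (spec) with what is reported back (status).
func reportDrift(ctx context.Context, clientset kubernetes.Interface) error {
	dep, err := clientset.AppsV1().Deployments("default").Get(ctx, "nginx", metav1.GetOptions{})
	if err != nil {
		return err
	}
	desired := int32(1)
	if dep.Spec.Replicas != nil {
		desired = *dep.Spec.Replicas // spec: desired state
	}
	observed := dep.Status.ReadyReplicas // status: observed state
	if desired == observed {
		fmt.Println("spec and status match; nothing to do")
	} else {
		fmt.Printf("drift: want %d replicas, have %d ready\n", desired, observed)
	}
	return nil
}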

The Architecture of Trust

In Kubernetes, components don’t coordinate via direct peer-to-peer orchestration; they coordinate by writing to and watching one shared “state.”

That state lives behind the API server, which validates it and persists it into etcd.

Role of the API server

The API server is the front door to the cluster’s shared truth: it’s the only place that can accept, validate, and persist declared intent as Kubernetes API objects (metadata/spec/status).

When you install a CRD, you’re extending the API itself with a new type (a new endpoint) and a schema the API server can validate against.

When you use kubectl apply (or any client) to submit YAML/JSON, the API server validates it (built-in rules, CRD OpenAPI v3 schemas / CEL rules, and potentially admission webhooks) and rejects invalid objects before they’re stored.

If the request passes validation, the API server persists the object into etcd (the whole API object, not just “intent”).

Once stored, controllers/operators (the loops) watch those objects and run reconciliation to push the real world toward what’s declared.

It turns out that, in practice, most controllers don’t act directly on raw watch events; they consume changes through informer caches and queue work onto a rate-limited workqueue. They also often watch related/owned resources (secondary watches), not just the primary object, to stay convergent.
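A rough client-go sketch of that wiring, as a fragment rather than a full controller. It assumes an existing clientset and a surrounding function with a ctx; the 30-second resync period and variable names are illustrative. It leans on k8s.io/client-go/informers, tools/cache, and util/workqueue:

// Events land in an informer cache; only object keys go onto a rate-limited
// workqueue, and a periodic resync acts as a safety net for missed events.
factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
podInformer := factory.Core().V1().Pods().Informer()

queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key) // enqueue "namespace/name", not the raw event
		}
	},
	UpdateFunc: func(_, newObj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
			queue.Add(key)
		}
	},
})

factory.Start(ctx.Done()) // begin the list+watch machinery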

The spec is often user-authored, as discussed above, but it isn’t exclusively human-written: the scheduler and some controllers also update parts of it (e.g., scheduling decisions via bindings), and the API server fills in defaults.

Role of the etcd cluster

etcd is the control plane’s durable record: the authoritative reference for what the cluster believes should exist and what it currently reports.

If an intent (an API object) isn’t in etcd, controllers can’t converge on it, because there’s nothing recorded to reconcile toward.

This makes the system inherently self-healing: it trusts the declared state and keeps trying to morph the world to match it until the declared and observed states align.

One tidbit worth noting:

In production, nodes, runtimes, and cloud load balancers can drift independently. Controllers treat those systems as observed state and keep measuring reality against what the API says should exist.

How the Loop Actually Works

Kubernetes isn’t one loop. It’s a collection of loops (controllers) that all behave the same way:

  • read desired state (what the API says should exist)
  • observe actual state (what’s really happening)
  • calculate the diff
  • push reality toward the spec
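In Go-flavoured pseudocode, every one of those loops has roughly this shape. readDesired, observeActual, computeDiff, and apply are placeholders for whatever the specific controller does, not real APIs:

// One pass of a generic reconciliation loop. It is re-run on every event,
// resync, or retry, and must be safe to repeat.
func reconcileOnce(ctx context.Context, key string) error {
	desired, err := readDesired(ctx, key) // what the API says should exist
	if err != nil {
		return err
	}
	observed, err := observeActual(ctx, key) // what is really happening
	if err != nil {
		return err
	}
	diff := computeDiff(desired, observed) // calculate the gap
	if diff.Empty() {
		return nil // already converged: do nothing
	}
	return apply(ctx, diff) // push reality toward the spec
}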

 

As an example, let’s walk through a simple nginx workload deployment.

1) Intent (Desired State)

To deploy the nginx workload, you run:

kubectl apply -f nginx.yaml

 

The API server validates the object (and its schema, if it’s a CRD-backed type) and writes it into etcd.

At that point, Kubernetes has only recorded your intent. Nothing has “deployed” yet in the physical sense. The cluster has simply accepted:

“This is what the world should look like.”

2) Watch (The Trigger)

Controllers and schedulers aren’t polling the cluster like a bash script with a sleep 10.

They watch the API server.

When desired state changes, the loop responsible for it wakes up, runs through its logic, and acts:

“New desired state: someone wants an Nginx Pod.”

Watches aren’t gospel. Events can arrive twice, late, or never, and your controller still has to converge. Controllers use list+watch patterns with periodic resync as a safety net. The point isn’t perfect signals; it’s building a loop that stays correct under imperfect signals.

Controllers also don’t spin constantly; they queue work. Events enqueue object keys; workers dequeue and reconcile; failures requeue with backoff. This keeps one bad object from melting the control plane.
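A sketch of that worker side, using client-go’s workqueue package (a fragment; reconcileOnce is the placeholder loop body from earlier, and the string type assertion assumes we enqueued “namespace/name” keys):

// Workers drain the queue: failures requeue with backoff, successes reset
// the rate limiter, so one bad object can't melt the control plane.
func runWorker(ctx context.Context, queue workqueue.RateLimitingInterface) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := reconcileOnce(ctx, key); err != nil {
			queue.AddRateLimited(key) // retry later, with increasing backoff
		} else {
			queue.Forget(key) // clear the backoff history on success
		}
		queue.Done(key)
	}
}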

3) Reconcile (Close the Gap)

Here’s the mental map that made sense to me:

Kubernetes is a set of level-triggered control loops. You declare desired state in the API, and independent loops keep working until the real world matches what you asked for.

  • Controllers (Deployment/ReplicaSet/etc.) watch the API for desired state and write more desired state.
    • Example: a Deployment creates/updates a ReplicaSet; a ReplicaSet creates/updates Pods.
  • The scheduler finds Pods with no node assigned and picks a node.
    • It considers resource requests, node capacity, taints/tolerations, node selectors, (anti)affinity, topology spread, and other constraints.
    • It records its decision by setting spec.nodeName on the Pod (via the binding subresource; a toy sketch follows this list).
  • The kubelet on the chosen node notices “a Pod is assigned to me” and makes it real.
    • pulls images (if needed) via the container runtime (CRI)
    • sets up volumes/mounts (often via CSI)
    • triggers networking setup (CNI plugins do the actual wiring)
    • starts/monitors containers and reports status back to the API

Each component writes its state back into the API, and the next loop uses that as input. No single component “runs the whole workflow.”

One property makes this survivable: reconcile must be safe to repeat (idempotent). The loop might run once or a hundred times (retries, resyncs, restarts, duplicate/missed watch events), and it should still converge to the same end result.

If the desired state is already satisfied, reconcile should do nothing; if something is missing, it should fill the gap without creating duplicates or making things worse.
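A minimal sketch of that “fill the gap without duplicating” shape, assuming a desired *appsv1.Deployment built elsewhere, an existing clientset, and the apierrors helpers from k8s.io/apimachinery/pkg/api/errors:

// ensureDeployment is idempotent: running it once or a hundred times
// converges to the same end state and never creates duplicates.
func ensureDeployment(ctx context.Context, clientset kubernetes.Interface, desired *appsv1.Deployment) error {
	_, err := clientset.AppsV1().Deployments(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
	if err == nil {
		return nil // already there: do nothing
	}
	if !apierrors.IsNotFound(err) {
		return err // a real error, not just "missing"
	}
	_, err = clientset.AppsV1().Deployments(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
	return err
}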

What happens when concurrent updates collide, i.e., two controllers try to update the same object at the same time?

Kubernetes handles this with optimistic concurrency. Every object has a resourceVersion (“what version of this object did you read?”). If you try to write an update using an older version, the API server rejects it with a conflict error.

Then the flow is: re-fetch the latest object, apply your change again, and retry.
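client-go ships a helper for exactly that get-mutate-retry flow. A sketch (a fragment; the Deployment name and the replica change are illustrative, and it uses k8s.io/client-go/util/retry):

// retry.RetryOnConflict re-runs the closure whenever the API server answers
// with a conflict, so every attempt mutates the freshest copy of the object.
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	dep, err := clientset.AppsV1().Deployments("default").Get(ctx, "nginx", metav1.GetOptions{})
	if err != nil {
		return err
	}
	replicas := int32(3)
	dep.Spec.Replicas = &replicas // apply the change to the latest resourceVersion
	_, err = clientset.AppsV1().Deployments("default").Update(ctx, dep, metav1.UpdateOptions{})
	return err // a conflict here triggers another round of get + mutate + update
})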

4) Status (Report Back)

Once the pod is actually running, status flows back into the API.
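Reading that reported state back is just another API call. A small sketch checking the Ready condition on a Pod (the Pod name and namespace are assumed; uses k8s.io/api/core/v1 as corev1):

// The kubelet writes Pod status into the API; any client can read it back.
pod, err := clientset.CoreV1().Pods("default").Get(ctx, "nginx", metav1.GetOptions{})
if err != nil {
	return err
}
for _, cond := range pod.Status.Conditions {
	if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
		fmt.Printf("pod is Ready, running on %s with IP %s\n", pod.Spec.NodeName, pod.Status.PodIP)
	}
}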

The Loop Doesn’t Protect You From Yourself

What if the declared state says to delete something critical like kube-proxy or a CNI component? The loop doesn’t have opinions. It just does what the spec says.

A few things keep this from being a constant disaster:

  • Control plane components are special. The API server, etcd, the scheduler, and the controller-manager usually run as static pods managed directly by the kubelet, not through the API. The reconciliation loop can’t easily delete the thing running the reconciliation loop as long as its manifest exists on disk.
  • DaemonSets recreate pods. Delete a kube-proxy pod and the DaemonSet controller sees “desired: 1, actual: 0” and spins up a new one. You’d have to delete the DaemonSet itself.
  • RBAC limits who can do what. Most users can’t touch kube-system resources.
  • Admission controllers can reject bad changes before they hit etcd.

But in the end, if your source of truth says “delete this,” the system will try. The model assumes your declared state is correct. Garbage in, garbage out.

Why This Pattern Matters Outside Kubernetes

This pattern shows up anywhere you manage state over time.

Scripts are fine until they aren’t:

  • they assume the world didn’t change since last run
  • they fail halfway and leave junk behind
  • they encode “steps” instead of “truth”

A loop is simpler:

  • define the desired state
  • store it somewhere authoritative
  • continuously reconcile reality back to it
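The same shape, stripped of anything Kubernetes-specific, fits in a few lines of Go. ReadDesired, ReadActual, and Converge are placeholders for whatever your authoritative store and your “reality” happen to be:

package main

import (
	"context"
	"log"
	"time"
)

// A generic reconciliation loop: it never assumes the last run succeeded,
// it just keeps measuring and correcting drift until told to stop.
func runLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			desired := ReadDesired() // from the authoritative store
			actual := ReadActual()   // from the real world
			if err := Converge(desired, actual); err != nil {
				log.Printf("reconcile failed, will retry next tick: %v", err)
			}
		}
	}
}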

ArubaOS CX Virtual Switching Extension – VSX Stacking Guide

What is VSX?

VSX is a cluster technology that allows two VSX-capable switches to run with independent control planes (OSPF/BGP) and present themselves as different routers in the network. In the data path, however, they function as a single router and support active-active forwarding.

VSX allows you to mitigate the inherent issues of the shared control plane that comes with traditional stacking, while maintaining all the benefits:

  • Control plane: Inter-Switch-Link and Keepalive
  • Data plane L2: MCLAGs
  • Data plane L3: Active gateway

This is a very similar technology to Dell VLT stacking with Dell OS10.

Basic Feature Comparison with Dell VLT Stacking

Feature                                                Dell VLT Stacking        Aruba VSX
Supports multi-chassis LAG                             ✅                        ✅
Independent control planes                             ✅                        ✅
All-active gateway configuration (L3 load balancing)   ✅ (VLT Peer routing)     ✅ (VSX Active forwarding)
Layer 3 packet load balancing                          ✅                        ✅
Can participate in spanning tree (MST/RSTP)            ✅                        ✅
Gateway IP redundancy                                  ✅ (VRRP)                 ✅ (VSX Active Gateway or VRRP)

Setup Guide

What do you need?

  • A 10/25/40/100GE port for the inter-switch link
  • A VSX-supported switch; VSX is only supported on switch models above the CX 6300 SKU (see the table below)
Switch Series          VSX support
CX 6200 series         X
CX 6300 series         X
CX 6400 series         ✅
CX 8200 series         ✅
CX 8320/8325 series    ✅
CX 8360 series         ✅
(Updated 2020-Dec)

For this guide I’m using an 8325 series switch.

Dry run

  • Setup LAG interface for the inter-switch link (ISL)
  • Create the VSX cluster
  • Setup a keepalive link and a new VRF for the keepalive traffic

Setup LAG interface for the inter-switch link (ISL)

In order to form the VSX cluster, we need a LAG interface for the inter-switch communication.

Naturally, I pick the fastest ports on the switch (10/25/40/100GE) to create this.

Depending on what switch you have, the ISL bandwidth can be a limitation/bottleneck. Account for this factor when designing a VSX-based solution, and utilize VSX active forwarding or active gateways to mitigate it.

Create the LAG interface

This is a regular port channel with no special configuration; you need to create it on both switches.

  • Native VLAN needs to be the default VLAN
  • Trunk port with all VLANs allowed

CORE01#

interface lag 256
no shutdown
description VSX-LAG
no routing
vlan trunk native 1 tag
vlan trunk allowed all
lacp mode active
exit


-------------------------------

CORE02#

interface lag 256
no shutdown
description VSX-LAG
no routing
vlan trunk native 1 tag
vlan trunk allowed all
lacp mode active
exit

Add/Assign the physical ports to the LAG interface

I’m using two 100GE ports for the ISL LAG

CORE01#

interface 1/1/55
no shutdown
lag 256
exit
interface 1/1/56
no shutdown
lag 256
exit

-------------------------------

CORE02#

interface 1/1/55
no shutdown
lag 256
exit
interface 1/1/56
no shutdown
lag 256
exit

Do the same configuration on the VSX Peer switch (Second Switch)

Connect the cables via DAC/optics and confirm the port-channel health:

CORE01# sh lag 256
System-ID       : b8:d4:e7:d5:36:00
System-priority : 65534

Aggregate lag256 is up
 Admin state is up
 Description : VSX-LAG
 Type                        : normal
 MAC Address                 : b8:d4:e7:d5:36:00
 Aggregated-interfaces       : 1/1/55 1/1/56
 Aggregation-key             : 256
 Aggregate mode              : active
 Hash                        : l3-src-dst
 LACP rate                   : slow
 Speed                       : 200000 Mb/s
 Mode                        : trunk


-------------------------------------------------------------------

CORE02# sh lag 256
System-ID       : b8:d4:e7:d5:f3:00
System-priority : 65534

Aggregate lag256 is up
 Admin state is up
 Description : VSX-LAG
 Type                        : normal
 MAC Address                 : b8:d4:e7:d5:f3:00
 Aggregated-interfaces       : 1/1/55 1/1/56
 Aggregation-key             : 256
 Aggregate mode              : active
 Hash                        : l3-src-dst
 LACP rate                   : slow
 Speed                       : 200000 Mb/s
 Mode                        : trunk


Form the VSX Cluster

Under configuration mode, go into the VSX context by entering “vsx” and issue the following commands on both switches:

CORE01#

vsx
    inter-switch-link lag 256
    role primary
    linkup-delay-timer 30

-------------------------------

CORE02#

vsx
    inter-switch-link lag 256
    role secondary
    linkup-delay-timer 30

Check the VSX Status

CORE01# sh vsx status
VSX Operational State
---------------------
  ISL channel             : In-Sync
  ISL mgmt channel        : operational
  Config Sync Status      : In-Sync
  NAE                     : peer_reachable
  HTTPS Server            : peer_reachable

Attribute           Local               Peer
------------        --------            --------
ISL link            lag256              lag256
ISL version         2                   2
System MAC          b8:d4:e7:d5:36:00   b8:d4:e7:d5:f3:00
Platform            8325                8325
Software Version    GL.10.06.0001       GL.10.06.0001
Device Role         primary             secondary

----------------------------------------

CORE02# sh vsx status
VSX Operational State
---------------------
  ISL channel             : In-Sync
  ISL mgmt channel        : operational
  Config Sync Status      : In-Sync
  NAE                     : peer_reachable
  HTTPS Server            : peer_reachable

Attribute           Local               Peer
------------        --------            --------
ISL link            lag256              lag256
ISL version         2                   2
System MAC          b8:d4:e7:d5:f3:00   b8:d4:e7:d5:36:00
Platform            8325                8325
Software Version    GL.10.06.0001       GL.10.06.0001
Device Role         secondary           primary

Setup the Keepalive Link

It’s recommended to set up a keepalive link to avoid split-brain scenarios if the ISL goes down. We are trying to prevent both switches from thinking they are the active device, which would create STP loops and other issues on the network.

This is not a must-have, but it’s nice to have. As of AOS-CX 10.06.x you need to sacrifice one of your data ports for this.

Dell OS10 VLT achieves this via the OOBM network ports; supposedly, keepalive over OOBM is something Aruba is working on for future releases.

A few things to note:

  • It’s recommended to use a routed port in a separate VRF for the keepalive link
  • You can use a 1Gbps link for this if needed

Provision the port and VRF

CORE01#

vrf KEEPALIVE

interface 1/1/48
no shutdown
vrf attach KEEPALIVE
description VSX-keepalive-Link
ip address 192.168.168.1/24
exit

-----------------------------------------

CORE02#

vrf KEEPALIVE

interface 1/1/48
no shutdown
vrf attach KEEPALIVE
description VSX-keepalive-Link
ip address 192.168.168.2/24
exit


Define the Keepalive link

Note – Remember to define the VRF in the keepalive statement

Thanks /u/illumynite for pointing that out

CORE01#

vsx
    inter-switch-link lag 256
    role primary
    keepalive peer 192.168.168.2 source 192.168.168.1 vrf KEEPALIVE
    linkup-delay-timer 30

-----------------------------------------

CORE02#

vsx
    inter-switch-link lag 256
    role secondary
    keepalive peer 192.168.168.1 source 192.168.168.2 vrf KEEPALIVE
    linkup-delay-timer 30

Next up…

  • VSX MC-LAG
  • VSX Active forwarding
  • VSX Active gateway

References

AOS-CX 10.06 Virtual Switching Extension (VSX) Guide

As always, if you notice any mistakes, please do let me know in the comments.