Kubernetes Loop

I’ve been diving deep into systems architecture lately, specifically Kubernetes.

Strip away the UIs, the YAML, and the ceremony, and Kubernetes boils down to:

A very stubborn, event-driven collection of control loops

aka the reconciliation (control) loop, and everything I read calls this the “gold standard” for distributed control planes.

Because it decomposes the control plane into many small, independent loops, each continuously correcting drift rather than trying to execute perfect one-shot workflows. These loops are triggered by events or state changes, but what they do is determined by the gap between the spec (desired state) and the observed state (status).

Now we have both:

  • spec: desired state
  • status: observed state

Kubernetes lives in that gap.

When spec and status match, everything’s quiet. When they don’t, something wakes up to ensure the current state matches the declared state.
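
You can read both sides of that gap off any object. For example, with a Deployment (the nginx example used later in this post):

kubectl get deployment nginx -o jsonpath='{.spec.replicas} {.status.readyReplicas}{"\n"}'

If the two numbers differ, a controller somewhere has work to do.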

The Architecture of Trust

In Kubernetes, components don’t coordinate via direct peer-to-peer orchestration; they coordinate by writing to and watching one shared “state.”

That state lives behind the API server, and the API server validates it and persists it into etcd.

Role of the API server

The API server is the front door to the cluster’s shared truth: it’s the only place that can accept, validate, and persist declared intent as Kubernetes API objects (metadata/spec/status).

When you install a CRD, you’re extending the API itself with a new type (a new endpoint) and a schema the API server can validate against.
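
As a sketch, here’s a minimal CRD for a made-up Widget type; the openAPIV3Schema is what lets the API server reject a bad spec before it’s ever stored:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer    # a replicas of "two" is rejected at validation
                minimum: 1       # so is a replicas of 0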

When we use kubectl apply (or any client) to submit YAML/JSON to the API server, the API server validates it (built-in rules, CRD OpenAPI v3 schema / CEL rules, and potentially admission webhooks) and rejects invalid objects before they’re stored.

If the request passes validation, the API server persists the object into etcd (the whole API object, not just “intent”).

Once stored, controllers/operators (loops) watch those objects and run reconciliation to push the real world toward what’s declared.

It turns out that in practice, most controllers don’t act directly on raw watch events; they consume changes through informer caches and queue work onto a rate-limited workqueue. They also often watch related/owned resources (secondary watches), not just the primary object, to stay convergent.

spec is often user-authored as discussed above, but it isn’t exclusively human-written; the scheduler and some controllers also update parts of it (e.g., scheduling decisions/bindings and defaulting).

Role of the etcd cluster

etcd is the control plane’s durable record: the authoritative reference for what the cluster believes should exist and what it currently reports.

If an intent (an API object) isn’t in etcd, controllers can’t converge on it, because there’s nothing recorded to reconcile toward.

This makes the system inherently self-healing because it trusts the declared state and keeps trying to morph the world to match until those two align.

One tidbit worth noting:

In production, nodes, runtimes, and cloud load balancers can drift independently. Controllers treat those systems as observed state, and they keep measuring reality against what the API says should exist.

How the Loop Actually Works

Kubernetes isn’t one loop. It’s a bunch of loops (controllers) that all behave the same way:

  • read desired state (what the API says should exist)
  • observe actual state (what’s really happening)
  • calculate the diff
  • push reality toward the spec

 

As an example, let’s look at a simple nginx workload deployment.

1) Intent (Desired State)

To deploy the Nginx workload, you run:

kubectl apply -f nginx.yaml
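
For reference, the nginx.yaml here could be as minimal as this (a sketch, not a canonical manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2               # desired state: two replicas
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.27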

 

The API server validates the object (and its schema, if it’s a CRD-backed type) and writes it into etcd.

At that point, Kubernetes has only recorded your intent. Nothing has “deployed” yet in the physical sense. The cluster has simply accepted:

“This is what the world should look like.”

2) Watch (The Trigger)

Controllers and schedulers aren’t polling the cluster like a bash script with a sleep 10.

They watch the API server.

When desired state changes, the loop responsible for it wakes up, runs through its logic, and acts:

“New desired state: someone wants an Nginx Pod.”

Watches aren’t gospel. Events can arrive twice, late, or never, and your controller still has to converge. Controllers use list+watch patterns with periodic resync as a safety net. The point isn’t perfect signals; it’s building a loop that stays correct under imperfect signals.

Controllers also don’t spin constantly; they queue work. Events enqueue object keys; workers dequeue and reconcile; failures requeue with backoff. This keeps one bad object from melting the control plane.

3) Reconcile (Close the Gap)

Here’s the mental map that made sense to me:

Kubernetes is a set of level-triggered control loops. You declare desired state in the API, and independent loops keep working until the real world matches what you asked for.

  • Controllers (Deployment/ReplicaSet/etc.) watch the API for desired state and write more desired state.
    • Example: a Deployment creates/updates a ReplicaSet; a ReplicaSet creates/updates Pods.
  • The scheduler finds Pods with no node assigned and picks a node.
    • It considers resource requests, node capacity, taints/tolerations, node selectors, (anti)affinity, topology spread, and other constraints.
    • It records its decision by setting spec.nodeName on the Pod.
  • The kubelet on the chosen node notices “a Pod is assigned to me” and makes it real.
    • pulls images (if needed) via the container runtime (CRI)
    • sets up volumes/mounts (often via CSI)
    • triggers networking setup (CNI plugins do the actual wiring)
    • starts/monitors containers and reports status back to the API

Each component writes its state back into the API, and the next loop uses that as input. No single component “runs the whole workflow.”
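
You can watch that relay from the outside. A few checks against the nginx example (assuming the app=nginx label from the sketch above):

# the ReplicaSet is owned by the Deployment
kubectl get replicaset -l app=nginx -o jsonpath='{.items[0].metadata.ownerReferences[0].kind}{"\n"}'
# spec.nodeName is written by the scheduler
kubectl get pod -l app=nginx -o jsonpath='{.items[0].spec.nodeName}{"\n"}'
# status.phase is reported by the kubelet
kubectl get pod -l app=nginx -o jsonpath='{.items[0].status.phase}{"\n"}'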

One property makes this survivable: reconcile must be safe to repeat (idempotent). The loop might run once or a hundred times (retries, resyncs, restarts, duplicate/missed watch events), and it should still converge to the same end result.

If the desired state is already satisfied, reconcile should do nothing; if something is missing, it should fill the gap, without creating duplicates or making things worse.
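
Real controllers do this in Go with informers and workqueues, but the shape of an idempotent reconcile fits in a few lines of shell (purely illustrative, reusing the nginx Deployment):

desired=$(kubectl get deployment nginx -o jsonpath='{.spec.replicas}')
actual=$(kubectl get deployment nginx -o jsonpath='{.status.readyReplicas}')
if [ "${actual:-0}" -ne "$desired" ]; then
  echo "drift: want $desired, have ${actual:-0}"   # act only on the diff
else
  echo "converged: nothing to do"                  # already satisfied: no-op
fi

Running it once or a hundred times produces the same end state, which is the property the loop depends on.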

What about concurrent updates, when two controllers try to update the same object at the same time?

Kubernetes handles this with optimistic concurrency. Every object has a resourceVersion (“what version of this object did you read?”). If you try to write an update using an older version, the API server rejects it (often as a conflict).

Then the flow is: re-fetch the latest object, apply your change again, and retry.
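
You can provoke the conflict from the CLI: a patch that carries a resourceVersion makes the API server run the same optimistic-concurrency check (the error line is illustrative):

# read the current version
RV=$(kubectl get deployment nginx -o jsonpath='{.metadata.resourceVersion}')
# ...another client updates the object in the meantime...
# a write still carrying the stale version is rejected:
kubectl patch deployment nginx --type=merge \
  -p "{\"metadata\":{\"resourceVersion\":\"$RV\"},\"spec\":{\"replicas\":3}}"
# Error from server (Conflict): Operation cannot be fulfilled on deployments.apps "nginx" ...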

4) Status (Report Back)

Once the pod is actually running, status flows back into the API.

The Loop Doesn’t Protect You From Yourself

What if the declared state says to delete something critical like kube-proxy or a CNI component? The loop doesn’t have opinions. It just does what the spec says.

A few things keep this from being a constant disaster:

  • Control plane components are special. The API server, etcd, scheduler, and controller-manager usually run as static pods managed directly by the kubelet, not through the API. The reconciliation loop can’t easily delete the thing running the reconciliation loop as long as its manifest exists on disk.
  • DaemonSets recreate pods. Delete a kube-proxy pod and the DaemonSet controller sees “desired: 1, actual: 0” and spins up a new one. You’d have to delete the DaemonSet itself.
  • RBAC limits who can do what. Most users can’t touch kube-system resources.
  • Admission controllers can reject bad changes before they hit etcd.

But in the end, if your source of truth says “delete this,” the system will try. The model assumes your declared state is correct. Garbage in, garbage out.

Why This Pattern Matters Outside Kubernetes

This pattern shows up anywhere you manage state over time.

Scripts are fine until they aren’t:

  • they assume the world didn’t change since last run
  • they fail halfway and leave junk behind
  • they encode “steps” instead of “truth”

A loop is simpler:

  • define the desired state
  • store it somewhere authoritative
  • continuously reconcile reality back to it


Stop Fighting Your LLM Coding Assistant

You’ve probably noticed: coding models are eager to please. Too eager. Ask for something questionable and you’ll get it, wrapped in enthusiasm. Ask for feedback and you’ll get praise followed by gentle suggestions. Ask them to build something and they’ll start coding before understanding what you actually need.

This isn’t a bug. It’s trained behavior. And it’s costing you time, tokens, and code quality.

The Sycophancy Problem

Modern LLMs go through reinforcement learning from human feedback (RLHF) that optimizes for user satisfaction. Users rate responses higher when the AI agrees with them, validates their ideas, and delivers quickly. So that’s what the models learn to do. Anthropic’s work on sycophancy in RLHF-tuned assistants makes this pretty explicit: models learn to match user beliefs, even when they’re wrong.

The result: an assistant that says “Great idea!” before pointing out your approach won’t scale. One that starts writing code before asking what systems it needs to integrate with. One that hedges every opinion with “but it depends on your use case.”

For consumer use cases (travel planning, recipe suggestions, general Q&A), this is fine. For engineering work, it’s a liability.

When the model won’t push back, you lose the value of a second perspective. When it starts implementing before scoping, you burn tokens on code you’ll throw away. When it leaves library choices ambiguous, you get whatever the model defaults to, which may not be what production needs.

Here’s a concrete example. I asked Claude for a “simple Prometheus exporter app,” gave it a minimal spec with scope and data flows, and still didn’t spell out anything about testability or structure. It happily produced:

  • A script with sys.exit() sprinkled everywhere
  • Logic glued directly into if __name__ == "__main__":
  • Debugging via print() calls instead of real logging

It technically “worked,” but it was painful to test and impossible to reuse or extend.

The Fix: Specs Before Code

Instead of giving it a set of requirements and asking it to generate code, start with specifications. Move the expensive iteration (the “that’s not what I meant” cycles) to the design phase where changes are cheap. Then hand a tight spec to your coding tool, where implementation becomes mechanical.

The workflow:

  1. Describe what you want (rough is fine)
  2. Scope through pointed questions (5–8, not 20)
  3. Spec the solution with explicit implementation decisions
  4. Implement by handing the spec to Cursor/Cline/Copilot

This isn’t a brand new methodology. It’s the same spec-driven development (SDD) that tools like GitHub’s spec-kit are promoting: write the spec first, then let a cheaper model implement against it.

By the time code gets written, the ambiguity is gone and the assistant is just a fast pair of hands that follows a tight spec with guard rails built in.

When This Workflow Pays Off

To be clear: this isn’t for everything. If you need a quick one-off script to parse a CSV or rename some files, writing a spec is overkill. Just ask for the code and move on with your life.

This workflow shines when:

  • The task spans multiple files or components
  • External integrations exist (databases, APIs, message queues, cloud services)
  • It will run in production and needs monitoring and observability
  • Infra is involved (Kubernetes, Terraform, CI/CD, exporters, operators)
  • Someone else might maintain it later
  • You’ve been burned before on similar scope

Rule of thumb: if it touches more than one system or more than one file, treat it as spec-worthy. If you can genuinely explain it in two sentences and keep it in a single file, skip straight to code.

A good spec pins down the decisions the model would otherwise guess at:

Implementation Directives — Not “add a scheduler” but “use APScheduler with BackgroundScheduler, register an atexit handler for graceful shutdown.” Not “handle timeouts” but “use cx_Oracle call_timeout, not post-execution checks.”

Error Handling Matrix — List the important failure modes, how to detect them, what to log, and how to recover (retry, backoff, fail-fast, alert, etc.). No room for “the assistant will figure it out.”
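
For example, a few hypothetical rows for the Prometheus exporter above:

  • DB connection refused: detect when connect raises; log ERROR with host/port; recover with 3 retries and exponential backoff, then alert
  • Query timeout: detect when cx_Oracle call_timeout fires; log WARN with query id and elapsed ms; recover by failing fast and skipping the scrape
  • Empty result set: detect zero rows returned; log INFO with query id; recover by exporting zero-value metrics, no retry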

Concurrency Decisions — What state is shared, what synchronization primitive to use, and lock ordering if multiple locks exist. Don’t let the assistant improvise concurrency.

Out of Scope — Explicit boundaries: “No auth changes,” “No schema migrations,” “Do not add retries at the HTTP client level.” This prevents the assistant from “helpfully” adding features you didn’t ask for.

Anticipate Guesses — Anywhere the model might guess, make a decision instead, or make it validate/confirm with you before taking action.

The Handoff

When you hand off to your coding agent, make self-review part of the process:

Rules:
- Stop after each file for review
- Self-Review: Before presenting each file, verify against
  engineering-standards.md. Fix violations (logging, error
  handling, concurrency, resource cleanup) before stopping.
- Do not add features beyond this spec
- Use environment variables for all credentials
- Follow Implementation Directives exactly

 Pair this with a rules.md that encodes your engineering standards—error propagation patterns, lock discipline, resource cleanup. The agent internalizes the baseline, self-reviews against it, and you’re left checking logic rather than hunting for missing using statements, context managers, or retries.

Fixing the Partnership Dynamic

Specs help, but “be blunt” isn’t enough. The model can follow the vibe of your instructions and still waste your time by producing unstructured output, bluffing through unknowns, or “spec’ing anyway” when an integration is the real blocker. That means overriding the trained “be agreeable” behavior with explicit instructions.

For example:

Core directive: Be useful, not pleasant.

OUTPUT CONTRACT:
- If scoping: output exactly:
  ## Scoping Questions (5–8 pointed questions)
  ## Current Risks / Ambiguities
  ## Proposed Simplification
- If drafting spec: use the project spec template headings in order. If N/A, say N/A.

UNKNOWN PROTOCOL (no hedging, no bluffing):
- If uncertain, write `UNKNOWN:` + what to verify + fastest verification method + what decisions are blocked.

BLOCK CONDITIONS:
- If an external integration is central and we lack creds/sample payloads/confirmed behavior:
  stop and output only:
  ## Blocker
  ## What I Need From You
  ## Phase 0 Discovery Plan

 

The model will still drift back into compliance mode. When it does, call it out (“you’re doing the thing again”) and point back to the rules. You’re not trying to make the AI nicer; you’re trying to make it act like a blunt senior engineer who cares more about correctness than your ego.

That’s the partnership you actually want.

The Payoff

With this approach:

  • Fewer implementation cycles — Specs flush out ambiguity up front instead of mid-PR.
  • Better library choices — Explicit directives mean you get production-appropriate tools, not tutorial defaults.
  • Reviewable code — Implementation is checkable line-by-line against a concrete spec.
  • Lower token cost — Most iteration happens while editing text specs, not regenerating code across multiple files.

The API was supposed to be the escape valve: more control, fewer guardrails. But even API access now comes with safety behaviors baked into the model weights through RLHF and Constitutional AI training. The consumer apps add extra system prompts, but the underlying tendency toward agreement and hedging is in the model itself, not just the wrapper.

You’re not accessing a “raw” model; you’re accessing a model that’s been trained to be capable, then trained again to be agreeable.

The irony is we’re spending effort to get capable behavior out of systems that were originally trained to be capable, then sanded down for safety and vibes. Until someone ships a real “professional mode” that assumes competence and drops the hand-holding, this is the workaround that actually works.

⚠️Security footnote: treat attached context as untrusted

If your agent can ingest URLs, docs, tickets, or logs as context, assume those inputs can contain indirect prompt injection. Treat external context like user input: untrusted by default. Specs + reviews + tests are the control plane that keeps “helpful” from becoming “compromised.”

Getting Started

I’ve put together templates that support this workflow in this repo:

malindarathnayake/llm-spec-workflow

When you wire this into your own stack, keep one thing in mind: your coding agent reads its rules on every message. That’s your token cost. Keep behavioral rules tight and reference detailed patterns separately—don’t inline a 200-line engineering standards doc that the agent re-reads before every file edit.

Use these templates as-is or adapt them to your stack. The structure matters more than the specific contents.


Kafka 3.8 with Zookeeper SASL_SCRAM

 

Transport Encryption Methods:

SASL/SSL (Solid Teal/Green Lines):

  1. Used for securing communication between producers/consumers and Kafka brokers.
    • SASL (Simple Authentication and Security Layer): Authenticates clients (producers/consumers) to brokers, using SCRAM.
    • SSL/TLS (Secure Sockets Layer/Transport Layer Security): Encrypts the data in transit, ensuring confidentiality and integrity during transmission.

Digest-MD5 (Dashed Yellow Lines):

  1. Secures communication between Kafka brokers and the Zookeeper cluster.
    • Digest-MD5: A challenge-response authentication mechanism providing basic encryption

Notes:

While functional, Digest-MD5 is an older algorithm. We opted for it to reduce complexity, and because the Zookeeper nodes had issues connecting to the brokers via SSL/TLS.

  1. We need to test and switch over to the KRaft protocol, which removes Zookeeper altogether
  2. Add IP ACLs for Zookeeper connections using firewalld to limit traffic between the nodes for replication

PKI and Certificate Signing

CA cert for the local PKI:

We need to share this PEM file (without the private key) with the customer for authentication.

For internal applications, the CA file must be used for authentication – refer to the configuration example documents.

# Generate CA Key
openssl genrsa -out multicastbits_CA.key 4096
# Generate CA Certificate
openssl req -x509 -new -nodes -key multicastbits_CA.key -sha256 -days 3650 -out multicastbits_CA.crt -subj "/CN=multicastbits_CA"

 

 

Kafka Broker Certificates

# For Node1 - Repeat for other nodes

openssl req -new -nodes -out node1.csr -newkey rsa:2048 -keyout node1.key -subj "/CN=kafka01.multicastbits.com"

openssl x509 -req -CA multicastbits_CA.crt -CAkey multicastbits_CA.key -CAcreateserial -in node1.csr -out node1.crt -days 3650 -sha256
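
The broker configs below reference JKS keystores and truststores, so the signed cert and key need to be packaged into them. One way to do it (file names match the kafkanode1 files used later; passwords are placeholders):

# bundle the node key + signed cert + CA into PKCS12
openssl pkcs12 -export -in node1.crt -inkey node1.key -certfile multicastbits_CA.crt -name kafkanode1 -out node1.p12 -password pass:keystorePassword
# convert the PKCS12 bundle into the JKS keystore the broker will use
keytool -importkeystore -srckeystore node1.p12 -srcstoretype PKCS12 -srcstorepass keystorePassword -destkeystore kafkanode1.keystore.jks -deststorepass keystorePassword
# trust the local CA in the truststore
keytool -import -trustcacerts -alias multicastbits_CA -file multicastbits_CA.crt -keystore kafkanode1.truststore.jks -storepass truststorePassword -noprompt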

 

 

Create the kafka and zookeeper users

⚠️ Important: Do not skip this step. We need these users to set up authentication in the JaaS configuration.

Before configuring the cluster with SSL and SASL, let’s start up the cluster without authentication and SSL to create the users. This allows us to:

  1. Verify basic dependencies and confirm the Zookeeper and Kafka clusters come up without any issues (“make sure the car starts”)
  2. Create the necessary user accounts for SCRAM
  3. Test for any inter-node communication issues (blocked ports 9092, 9093, 2181, etc.)

 

Here’s how to set up this initial configuration:

Zookeeper Configuration (No SSL or Auth)

Create the following file: /opt/kafka/kafka_2.13-3.8.0/config/zookeeper-NOSSL_AUTH.properties

# Zookeeper Configuration without Auth
dataDir=/Data_Disk/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.166.110:2888:3888
server.2=192.168.166.111:2888:3888
server.3=192.168.166.112:2888:3888

 

Kafka Broker Configuration (No SSL or Auth)

Create the following file: /opt/kafka/kafka_2.13-3.8.0/config/server-NOSSL_AUTH.properties

# Kafka Broker Configuration without Auth/SSL
broker.id=1
listeners=PLAINTEXT://kafka01.multicastbits.com:9092
advertised.listeners=PLAINTEXT://kafka01.multicastbits.com:9092
listener.security.protocol.map=PLAINTEXT:PLAINTEXT
zookeeper.connect=kafka01.multicastbits.com:2181,kafka02.multicastbits.com:2181,kafka03.multicastbits.com:2181

 

Open a new shell to the server and start Zookeeper:

/opt/kafka/kafka_2.13-3.8.0/bin/zookeeper-server-start.sh -daemon /opt/kafka/kafka_2.13-3.8.0/config/zookeeper-NOSSL_AUTH.properties

 

Open a new shell to start Kafka:

/opt/kafka/kafka_2.13-3.8.0/bin/kafka-server-start.sh -daemon /opt/kafka/kafka_2.13-3.8.0/config/server-NOSSL_AUTH.properties
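
Before creating the users, you can sanity-check that the broker is reachable (an empty topic list is fine at this point):

/opt/kafka/kafka_2.13-3.8.0/bin/kafka-topics.sh --bootstrap-server kafka01.multicastbits.com:9092 --list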

 

 

Create the users:

Open a new shell and run the following commands:

kafka-configs.sh --bootstrap-server ext-kafka01.fleetcam.io:9092 --alter --add-config 'SCRAM-SHA-512=[password=zookeeper-password]' --entity-type users --entity-name ftszk

kafka-configs.sh --zookeeper ext-kafka01.fleetcam.io:2181 --alter --add-config 'SCRAM-SHA-512=[password=kafkaadmin-password]' --entity-type users --entity-name ftskafkaadmin

After the users are created without errors, shut down the Kafka and Zookeeper services we started earlier (kafka-server-stop.sh and zookeeper-server-stop.sh, since they were started with -daemon).

 

 

SASL_SSL configuration with SCRAM

Zookeeper configuration Notes

  • Zookeeper is configured with SASL/Digest-MD5 due to the SSL issues we faced during the initial setup
  • Zookeeper traffic is isolated within the broker nodes to maintain security

dataDir=/Data_Disk/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.166.110:2888:3888
server.2=192.168.166.111:2888:3888
server.3=192.168.166.112:2888:3888
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl

 

 

The /Data_Disk/zookeeper/myid file is updated to match each Zookeeper node ID:

cat /Data_Disk/zookeeper/myid
1

 

 

Jaas configuration

Create the JaaS configuration for Zookeeper authentication; it has to follow this syntax:

/opt/kafka/kafka_2.13-3.8.0/config/zookeeper-jaas.conf

Server {
   org.apache.zookeeper.server.auth.DigestLoginModule required
   user_multicastbitszk="zkpassword";
};

 

KAFKA_OPTS

The KAFKA_OPTS Java variable needs to be passed when Zookeeper is started, pointing to the correct JaaS file:

export KAFKA_OPTS="-Djava.security.auth.login.config=<path-to-zookeeper-jaas.conf>"

export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/zookeeper-jaas.conf"

 

 

There are a few ways to handle this: you can add a script under profile.d, or use a custom Zookeeper launch script for the systemd service.

Systemd service

Create the launch shell script for Zookeeper

/opt/kafka/kafka_2.13-3.8.0/bin/zk-start.sh

#!/bin/bash
#export the env variable
export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/zookeeper-jaas.conf"
#Start the zookeeper service
/opt/kafka/kafka_2.13-3.8.0/bin/zookeeper-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/zookeeper.properties
#debug - launch config with no SSL - we need this for initial setup and debug
#/opt/kafka/kafka_2.13-3.8.0/bin/zookeeper-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/zookeeper-NOSSL_AUTH.properties

 

 

After you save the file

chmod +x /opt/kafka/kafka_2.13-3.8.0/bin/zk-start.sh

sudo chown -R multicastbitskafka:multicastbitskafka /opt/kafka/kafka_2.13-3.8.0

Create the systemd service file

/etc/systemd/system/zookeeper.service

[Unit]
Description=Apache Zookeeper Service
After=network.target
[Service]
User=multicastbitskafka
Group=multicastbitskafka
ExecStart=/opt/kafka/kafka_2.13-3.8.0/bin/zk-start.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target

After the file is saved, start the service

sudo systemctl daemon-reload
sudo systemctl enable zookeeper
sudo systemctl start zookeeper

 

Kafka Broker configuration Notes

/opt/kafka/kafka_2.13-3.8.0/config/server.properties

broker.id=1
listeners=SASL_SSL://kafka01.multicastbits.com:9093
advertised.listeners=SASL_SSL://kafka01.multicastbits.com:9093
listener.security.protocol.map=SASL_SSL:SASL_SSL
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
ssl.keystore.location=/opt/kafka/secrets/kafkanode1.keystore.jks
ssl.keystore.password=keystorePassword
ssl.truststore.location=/opt/kafka/secrets/kafkanode1.truststore.jks
ssl.truststore.password=truststorePassword
#SASL/SCRAM Authentication
sasl.enabled.mechanisms=SCRAM-SHA-256, SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
sasl.mechanism.client=SCRAM-SHA-512
security.inter.broker.protocol=SASL_SSL
#zookeeper
zookeeper.connect=kafka01.multicastbits.com:2181,kafka02.multicastbits.com:2181,kafka03.multicastbits.com:2181
zookeeper.sasl.client=true
zookeeper.sasl.clientconfig=ZookeeperClient

 

zookeeper connect options

Define the zookeeper servers the broker will connect to

zookeeper.connect=kafka01.multicastbits.com:2181,kafka02.multicastbits.com:2181,kafka03.multicastbits.com:2181

Enable SASL

zookeeper.sasl.client=true

Tell the broker to use the creds defined under the ZookeeperClient section of the JaaS file used by the Kafka service

zookeeper.sasl.clientconfig=ZookeeperClient

Broker and listener configuration

Define the broker id

broker.id=1

Define the server’s listener name and port

listeners=SASL_SSL://kafka01.multicastbits.com:9093

Define the server’s advertised listener name and port

advertised.listeners=SASL_SSL://kafka01.multicastbits.com:9093

Define SASL_SSL as the security protocol

listener.security.protocol.map=SASL_SSL:SASL_SSL

Enable ACLs

authorizer.class.name=kafka.security.authorizer.AclAuthorizer

Define the Java Keystores

ssl.keystore.location=/opt/kafka/secrets/kafkanode1.keystore.jks

ssl.keystore.password=keystorePassword

ssl.truststore.location=/opt/kafka/secrets/kafkanode1.truststore.jks

ssl.truststore.password=truststorePassword

Jaas configuration

/opt/kafka/kafka_2.13-3.8.0/config/kafka_server_jaas.conf

KafkaServer {
  org.apache.kafka.common.security.scram.ScramLoginModule required
  username="multicastbitskafkaadmin"
  password="kafkaadmin-password";
};
ZookeeperClient {
  org.apache.zookeeper.server.auth.DigestLoginModule required
  username="multicastbitszk"
  password="Zookeeper_password";
};

 

 

SASL and SCRAM configuration Notes

Enable SASL SCRAM for authentication

org.apache.kafka.common.security.scram.ScramLoginModule required

Use Digest-MD5 for Zookeeper authentication

org.apache.zookeeper.server.auth.DigestLoginModule required

KAFKA_OPTS

The KAFKA_OPTS Java variable needs to be passed when the Kafka service is started, and it must point to the correct JaaS file:

export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/kafka_server_jaas.conf"

 

 

Systemd service

Create the launch shell script for kafka

/opt/kafka/kafka_2.13-3.8.0/bin/multicastbitskafka-server-start.sh

#!/bin/bash
#export the env variable
export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/kafka_server_jaas.conf"
#Start the kafka service
/opt/kafka/kafka_2.13-3.8.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/server.properties
#debug - launch config with no SSL - we need this for initial setup and debug
#/opt/kafka/kafka_2.13-3.8.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/server-NOSSL_AUTH.properties

 

 

Create the systemd service

/etc/systemd/system/kafka.service

[Unit]
Description=Apache Kafka Broker Service
After=network.target zookeeper.service
[Service]
User=multicastbitskafka
Group=multicastbitskafka
ExecStart=/opt/kafka/kafka_2.13-3.8.0/bin/multicastbitskafka-server-start.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target

 

 

Connect, authenticate, and use Kafka CLI tools

Requirements

  • multicastbitsadmin.keystore.jks
  • multicastbitsadmin.truststore.jks
  • WSL2 with java-11-openjdk-devel wget nano
  • Kafka 3.8 folder extracted locally

Setup your environment

  • Setup WSL2

You can use any Linux environment with JDK 17 or 11.

  • install dependencies

dnf install -y wget nano java-11-openjdk-devel

Download Kafka and extract it (I’m going to extract it to the home dir under kafka):

# 1. Download Kafka (Choose a version compatible with your server)
wget https://dlcdn.apache.org/kafka/3.8.0/kafka_2.13-3.8.0.tgz
# 2. Extract
tar xzf kafka_2.13-3.8.0.tgz

 

Copy the JKS files (you should generate them with the CA JKS, or use one from one of the nodes) to ~/

cp multicastbitsadmin.keystore.jks ~/

 

cp multicastbitsadmin.truststore.jks ~/

Create your admin client properties file

Change the paths to fit your setup:

nano ~/kafka-adminclient.properties

# Security protocol and SASL/SSL configuration
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
# SSL Configuration
ssl.keystore.location=/opt/kafka/secrets/multicastbitsadmin.keystore.jks
ssl.keystore.password=keystorepw
ssl.truststore.location=/opt/kafka/secrets/multicastbitsadmin.truststore.jks
ssl.truststore.password=truststorepw
# SASL Configuration
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
    username="#youradminUser#" \
    password="#your-admin-PW#";

 

 

Create the JaaS file for the admin client

nano ~/kafka_client_jaas.conf

Some Kafka CLI tools still look for the JaaS config via the KAFKA_OPTS environment variable.

KafkaClient {
  org.apache.kafka.common.security.scram.ScramLoginModule required
  username="#youradminUser#"
  password="#your-admin-PW#";
};

 

Add the Kafka environment variables to your ~/.bashrc, then reload it:

export KAFKA_HOME=/opt/kafka/kafka_2.13-3.8.0
export PATH=$PATH:$KAFKA_HOME/bin
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export KAFKA_OPTS="-Djava.security.auth.login.config=~/kafka_client_jaas.conf"
source ~/.bashrc

 

 

Kafka CLI Usage Examples

Create a user

kafka-configs.sh --bootstrap-server kafka01.multicastbits.com:9093 --alter --add-config 'SCRAM-SHA-512=[password=#password#]' --entity-type users --entity-name %username% --command-config ~/kafka-adminclient.properties

 

 

Create a topic

kafka-topics.sh --bootstrap-server kafka01.multicastbits.com:9093 --create --topic %topicname% --partitions 10 --replication-factor 3 --command-config ~/kafka-adminclient.properties

 

 

Create ACLs

External customer user with READ DESCRIBE privileges to a single topic

kafka-acls.sh --bootstrap-server kafka01.multicastbits.com:9093 \
  --command-config ~/kafka-adminclient.properties \
  --add --allow-principal User:customer-user01 \
  --operation READ --operation DESCRIBE --topic Customer_topic
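
To confirm the ACLs behave as expected, you can consume from the topic with the customer’s credentials; kafka-customer.properties here is a hypothetical copy of the admin client properties file with the customer user’s SCRAM creds:

kafka-console-consumer.sh --bootstrap-server kafka01.multicastbits.com:9093 \
  --topic Customer_topic --from-beginning \
  --consumer.config ~/kafka-customer.properties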

 

 

Troubleshooting

Here are some common issues you might encounter when setting up and using Kafka with SASL_SCRAM authentication, along with their solutions:

1. Connection refused errors

Issue: Clients unable to connect to Kafka brokers.

Solution:

  • Verify that the Kafka brokers are running and listening on the correct ports.
  • Check firewall settings to ensure the Kafka ports are open and accessible.
  • Confirm that the bootstrap server addresses in client configurations are correct.

2. Authentication failures

Issue: Clients fail to authenticate with Kafka brokers.

Solution:

  • Double-check username and password in the JAAS configuration file.
  • Ensure the SCRAM credentials are properly set up on the Kafka brokers.
  • Verify that the correct SASL mechanism (SCRAM-SHA-512) is specified in client configurations.

3. SSL/TLS certificate issues

Issue: SSL handshake failures or certificate validation errors.

Solution:

  • Confirm that the keystore and truststore files are correctly referenced in configurations.
  • Verify that the certificates in the truststore are up-to-date and not expired.
  • Ensure that the hostname in the certificate matches the broker’s advertised listener.

4. Zookeeper connection issues

Issue: Kafka brokers unable to connect to Zookeeper ensemble.

Solution:

  • Verify Zookeeper connection string in Kafka broker configurations.
  • Ensure Zookeeper servers are running and accessible and the ports are open
  • Check Zookeeper client authentication settings in JAAS configuration file

 

 

NFS Provisioner Setup and Testing Guide for Rancher RKE2/Kubernetes

This guide covers how to add an NFS StorageClass and a dynamic provisioner to Kubernetes using the nfs-subdir-external-provisioner Helm chart. This enables us to mount NFS shares dynamically for PersistentVolumeClaims (PVCs) used by workloads.

Example use cases:

  • Database migrations
  • Apache Kafka clusters
  • Data processing pipelines

Requirements:

  • An accessible NFS share exported with: rw,sync,no_subtree_check,no_root_squash
  • NFSv3 or NFSv4 protocol
  • Kubernetes v1.31.7+ or RKE2 with rke2r1 or later

 

Let’s get to it.


1. NFS Server Export Setup

Ensure your NFS server exports the shared directory correctly:

# /etc/exports
/rke-pv-storage  worker-node-ips(rw,sync,no_subtree_check,no_root_squash)

 

  • Replace worker-node-ips with actual IPs or CIDR blocks of your worker nodes.
  • Run sudo exportfs -r to reload the export table.
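
Optionally, verify the export is visible from a worker node before installing anything (requires the NFS client utilities on the node):

showmount -e 192.168.162.100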

2. Install NFS Subdir External Provisioner

Add the Helm repo and install the provisioner:

helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update

helm install nfs-client-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --namespace kube-system \
  --set nfs.server=192.168.162.100 \
  --set nfs.path=/rke-pv-storage \
  --set storageClass.name=nfs-client \
  --set storageClass.defaultClass=false

Notes:

  • If you want this to be the default storage class, change storageClass.defaultClass=true.
  • nfs.server should point to the IP of your NFS server.
  • nfs.path must be a valid exported directory from that NFS server.
  • storageClass.name can be referenced in your PersistentVolumeClaim YAMLs using storageClassName: nfs-client.

3. PVC and Pod Test

Create a test PVC and pod using the following YAML:

# test-nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-nfs-pod
spec:
  containers:
  - name: shell
    image: busybox
    command: [ "sh", "-c", "sleep 3600" ]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-nfs-pvc

 

Apply it:

kubectl apply -f test-nfs-pvc.yaml
kubectl get pvc test-nfs-pvc -w

 

Expected output:

NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-nfs-pvc   Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   1Gi        RWX            nfs-client     30s

 


4. Troubleshooting

If the PVC remains in Pending, follow these steps:

Check the provisioner pod status:

kubectl get pods -n kube-system | grep nfs-client-provisioner

 

Inspect the provisioner pod:

kubectl describe pod -n kube-system <pod-name>
kubectl logs -n kube-system <pod-name>

 

Common Issues:

  • Broken State: Bad NFS mount
    mount.nfs: access denied by server while mounting 192.168.162.100:/pl-elt-kakfka

     

    • This usually means the NFS path is misspelled or not exported properly.
  • Broken State: root_squash enabled
    failed to provision volume with StorageClass "nfs-client": unable to create directory to provision new pv: mkdir /persistentvolumes/…: permission denied

     

    • Fix by changing the export to use no_root_squash or chown the directory to nobody:nogroup.
  • ImagePullBackOff
    • Ensure nodes have internet access and can reach registry.k8s.io.
  • RBAC errors
    • Make sure the ServiceAccount used by the provisioner has permissions to watch PVCs and create PVs.

5. Healthy State Example

kubectl get pods -n kube-system | grep nfs-client-provisioner-nfs-subdir-external-provisioner
nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m   1/1     Running     0          3m39s

 

kubectl describe pod -n kube-system nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m
# Output shows pod is Running with Ready=True

 

kubectl logs -n kube-system nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m
...
I0512 21:46:03.752701       1 controller.go:1420] provision "default/test-nfs-pvc" class "nfs-client": volume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d" provisioned
I0512 21:46:03.752763       1 volume_store.go:212] Trying to save persistentvolume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d"
I0512 21:46:03.772301       1 volume_store.go:219] persistentvolume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d" saved
I0512 21:46:03.772353       1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Name:"test-nfs-pvc"}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-73481f45-3055-4b4b-80f4-e68ffe83802d
...

 

Once test-nfs-pvc is bound and the pod starts successfully, your setup is working. You can now safely use storageClass: nfs-client in other workloads (e.g., Strimzi KafkaNodePool).


Advertising VRF Connected/Static routes via MP BGP to OSPF – Guide Dell S4112F-ON – OS 10.5.1.3

I’m going to base this on my VRF Setup and Route leaking article and continue building on top of it.

Let’s say we need to advertise connected routes within VRFs via an IGP to an upstream or downstream peer; this is one of many ways to get to that objective.

For this example we are going to use BGP to collect the connected routes in each VRF and advertise them over OSPF.

Set up the BGP process to collect connected routes

router bgp 65000
 router-id 10.252.250.6
 !
 address-family ipv4 unicast
 !
 neighbor 10.252.250.1
!
vrf Tenant01_VRF
 !
 address-family ipv4 unicast
  redistribute connected
!
vrf Tenant02_VRF
 !
 address-family ipv4 unicast
  redistribute connected
!
vrf Tenant03_VRF
 !
 address-family ipv4 unicast
  redistribute connected
!
vrf Shared_VRF
 !
 address-family ipv4 unicast
  redistribute connected

Set up OSPF to redistribute the routes collected via BGP

router ospf 250 vrf Shared_VRF
 area 0.0.0.0 default-cost 0
 redistribute bgp 65000
interface vlan250
 mode L3
 description OSPF_Routing
 no shutdown
 ip vrf forwarding Shared_VRF
 ip address 10.252.250.6/29
 ip ospf 250 area 0.0.0.0
 ip ospf mtu-ignore
 ip ospf priority 10

Testing and confirmation

Local OSPF Database

Remote device

VRF Setup with route leaking guide Dell S4112F-ON – OS 10.5.1.3

Scope –

Create Three VRFs for Three separate clients

Create a Shared VRF

Leak routes from each VRF to the Shared_VRF

Logical overview

Create the VRFs

ip vrf Tenant01_VRF
ip vrf Tenant02_VRF
ip vrf Tenant03_VRF

Create and initialize the Interfaces (SVI, Layer 3 interface, Loopback)

We are creating Layer 3 SVIs Per tenant

interface vlan200
 mode L3
 description Tenant01_NET01
 no shutdown
 ip vrf forwarding Tenant01_VRF
 ip address 10.251.100.254/24
!
interface vlan201
 mode L3
 description Tenant01_NET02
 no shutdown
 ip vrf forwarding Tenant01_VRF
 ip address 10.251.101.254/24
!
interface vlan210
 mode L3
 description Tenant02_NET01
 no shutdown
 ip vrf forwarding Tenant02_VRF
 ip address 172.17.100.254/24
!
interface vlan220
 no ip address
 description Tenant03_NET01
 no shutdown
 ip vrf forwarding Tenant03_VRF
 ip address 192.168.110.254/24
!
interface vlan250
 mode L3
 description OSPF_Routing
 no shutdown
 ip vrf forwarding Shared_VRF
 ip address 10.252.250.6/29

Confirmation

LABCORE# show i
image     interface inventory ip        ipv6      iscsi
LABCORE# show ip interface brief
Interface Name            IP-Address          OK       Method       Status     Protocol
=========================================================================================
Vlan 200                   10.251.100.254/24   YES      manual       up          up
Vlan 201                   10.251.101.254/24   YES      manual       up          up
Vlan 210                   172.17.100.254/24   YES      manual       up          up
Vlan 220                   192.168.110.254/24  YES      manual       up          up
Vlan 250                   10.252.250.6/29     YES      manual       up          up
LABCORE# show ip vrf
VRF-Name                          Interfaces

Shared_VRF                        Vlan250

Tenant01_VRF                      Vlan200-201

Tenant02_VRF                      Vlan210

Tenant03_VRF                      Vlan220

default                           Vlan1

management                        Mgmt1/1/1

Route leaking

For this Example we are going to Leak routes from each of these tenant VRFs in to the Shared VRF

This design will allow each VLAN within the VRFs to see each other, which can be a security issue; however, you can easily control this by

  • narrowing the routes down to hosts
  • using access-lists (not the most ideal, but if you have a playbook you can program this in without any issues)

Real-world use cases may differ; use this as a template for how to leak routes within VRFs, and update your config as needed.

Create the route export statements within the VRFs

ip vrf Shared_VRF
 ip route-import 2:100
 ip route-import 3:100
 ip route-import 4:100
 ip route-export 1:100
ip vrf Tenant01_VRF
 ip route-export 2:100
 ip route-import 1:100
ip vrf Tenant02_VRF
 ip route-export 3:100
 ip route-import 1:100
ip vrf Tenant03_VRF
 ip route-export 4:100
 ip route-import 1:100

Let’s explain this a bit

ip vrf Shared_VRF
 ip route-import 2:100 -----------> Import Leaked routes from target 2:100
 ip route-import 3:100 -----------> Import Leaked routes from target 3:100
 ip route-import 4:100 -----------> Import Leaked routes from target 4:100
 ip route-export 1:100  -----------> Export routes to target 1:100

If you need to filter which routes can be imported, use a route-map with prefix lists.

Set up static routes per VRF as needed

ip route vrf Tenant01_VRF 10.251.100.0/24 interface vlan200
ip route vrf Tenant01_VRF 10.251.101.0/24 interface vlan201
!
ip route vrf Tenant02_VRF 172.17.100.0/24 interface vlan210
!
ip route vrf Tenant03_VRF 192.168.110.0/24 interface vlan220
!
ip route vrf Shared_VRF 0.0.0.0/0 10.252.250.1 interface vlan250

  • Now these static routes will be leaked and learned by the shared VRF
  • The default route on the Shared VRF will be learned downstream by the tenant VRFs
  • Instead of the default route on the shared VRF, if you scope it to a certain IP or subnet, you can prevent traffic routing between the VRFs via the Shared VRF
  • If you need routes directly leaked between tenants, use ip route-import on the VRF as needed

Confirmation

Routes are being distributed via internal BGP process

LABCORE# show ip route vrf Tenant01_VRF
Codes: C - connected
       S - static
       B - BGP, IN - internal BGP, EX - external BGP, EV - EVPN BGP
       O - OSPF, IA - OSPF inter area, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type 2, E1 - OSPF external type 1,
       E2 - OSPF external type 2, * - candidate default,
       + - summary route, > - non-active route
Gateway of last resort is via 10.252.250.1 to network 0.0.0.0
  Destination                 Gateway                                        Dist/Metric       Last Change
----------------------------------------------------------------------------------------------------------
  *B IN 0.0.0.0/0           via 10.252.250.1                                 200/0             12:17:42
  C     10.251.100.0/24     via 10.251.100.254       vlan200                 0/0               12:43:46
  C     10.251.101.0/24     via 10.251.101.254       vlan201                 0/0               12:43:46
LABCORE#
LABCORE# show ip route vrf Tenant02_VRF
Codes: C - connected
       S - static
       B - BGP, IN - internal BGP, EX - external BGP, EV - EVPN BGP
       O - OSPF, IA - OSPF inter area, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type 2, E1 - OSPF external type 1,
       E2 - OSPF external type 2, * - candidate default,
       + - summary route, > - non-active route
Gateway of last resort is via 10.252.250.1 to network 0.0.0.0
  Destination                 Gateway                                        Dist/Metric       Last Change
----------------------------------------------------------------------------------------------------------
  *B IN 0.0.0.0/0           via 10.252.250.1                                 200/0             12:17:45
  C     172.17.100.0/24     via 172.17.100.254       vlan210                 0/0               12:43:49
LABCORE#
LABCORE# show ip route vrf Tenant03_VRF
Codes: C - connected
       S - static
       B - BGP, IN - internal BGP, EX - external BGP, EV - EVPN BGP
       O - OSPF, IA - OSPF inter area, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type 2, E1 - OSPF external type 1,
       E2 - OSPF external type 2, * - candidate default,
       + - summary route, > - non-active route
Gateway of last resort is via 10.252.250.1 to network 0.0.0.0
  Destination                 Gateway                                        Dist/Metric       Last Change
----------------------------------------------------------------------------------------------------------
  *B IN 0.0.0.0/0           via 10.252.250.1                                 200/0             12:17:48
  C     192.168.110.0/24    via 192.168.110.254      vlan220                 0/0               12:43:52
LABCORE# show ip route vrf Shared_VRF
Codes: C - connected
       S - static
       B - BGP, IN - internal BGP, EX - external BGP, EV - EVPN BGP
       O - OSPF, IA - OSPF inter area, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type 2, E1 - OSPF external type 1,
       E2 - OSPF external type 2, * - candidate default,
       + - summary route, > - non-active route
Gateway of last resort is via 10.252.250.1 to network 0.0.0.0
  Destination                 Gateway                                        Dist/Metric       Last Change
----------------------------------------------------------------------------------------------------------
  *S    0.0.0.0/0           via 10.252.250.1         vlan250                 1/0               12:21:33
  B  IN 10.251.100.0/24     Direct,Tenant01_VRF      vlan200                 200/0             09:01:28
  B  IN 10.251.101.0/24     Direct,Tenant01_VRF      vlan201                 200/0             09:01:28
  C     10.252.250.0/29     via 10.252.250.6         vlan250                 0/0               12:42:53
  B  IN 172.17.100.0/24     Direct,Tenant02_VRF      vlan210                 200/0             09:01:28
  B  IN 192.168.110.0/24    Direct,Tenant03_VRF      vlan220                 200/0             09:02:09

We can ping outside to the internet from the VRF IPs

Redistribute leaked routes via IGP

You can use an internal BGP process to pick up routes from any VRF and redistribute them to other IGP processes as needed – check the article above for that information.

Vagrant Ansible LAB Guide – Bridged network

Here’s a quick guide to get you started with an “Ansible core lab” using Vagrant.

Alright, let’s get started.

TLDR Version

  • Install Vagrant
  • Install VirtualBox
  • Create a project folder, cd into it, and initialize the environment
vagrant init
  • Vagrantfile – link
  • Vagrant provisioning shell script to deploy Ansible – link
  • Install the vagrant-vbguest plugin to deploy the missing guest additions
vagrant plugin install vagrant-vbguest
  • Bring up the Vagrant environment
vagrant up

Install Vagrant and Virtual box

For this demo we are using Windows 10 1909, but you can use the same guide for macOS.

Windows

Download Vagrant and VirtualBox and install them the good ol’ way:

https://www.vagrantup.com/downloads.html

https://www.virtualbox.org/wiki/Downloads

https://www.vagrantmanager.com/downloads/

Install the vagrant-vbguest plugin (We need this with newer versions of Ubuntu)

vagrant plugin install vagrant-vbguest

Or Using chocolatey

choco install vagrant
choco install virtualbox
choco install vagrant-manager

Install the vagrant-vbguest plugin (We need this with newer versions of Ubuntu)

vagrant plugin install vagrant-vbguest

macOS – using Homebrew Cask

Install virtual box

$ brew cask install virtualbox

Now install Vagrant either from the website or use homebrew for installing it.

$ brew cask install vagrant

Vagrant-Manager is a nice way to manage all your virtual machines in one place directly from the menu bar.

$ brew cask install vagrant-manager

Install the vagrant-vbguest plugin (We need this with newer versions of Ubuntu)

vagrant plugin install vagrant-vbguest

Setup the Vagrant Environment

Open Powershell

to get started lets check our environment

vagrant version

Create a project directory and Initialize the environment

For the project directory I’m using D:\vagrant

Open powershell and run

mkdir D:\vagrant
cd D:\vagrant

Initialize the environment under the project folder

vagrant init

This will create two items:

.vagrant – hidden folder holding base machines and metadata

Vagrantfile – the Vagrant config file

Let’s create the Vagrantfile to deploy the VMs.

https://www.vagrantup.com/docs/vagrantfile/

The syntax of Vagrantfiles is Ruby, which gives us a lot of flexibility to program in logic when building your files.

I’m using Atom to edit the Vagrantfile.

Vagrant.configure("2") do |config|
     config.vm.define "controller" do |controller|
                  controller.vm.box = "ubuntu/trusty64"
                  controller.vm.hostname = "LAB-Controller"
                  controller.vm.network "public_network", bridge: "Intel(R) I211 Gigabit Network Connection", ip: "172.17.10.120"
                    controller.vm.provider "virtualbox" do |vb|
                                 vb.memory = "2048"
                  end
                  controller.vm.provision :shell, path: 'Ansible_LAB_setup.sh'
   end
   (1..3).each do |i|
         config.vm.define "vls-node#{i}" do |node|
                       node.vm.box = "ubuntu/trusty64"
                       node.vm.hostname = "vls-node#{i}"
                       node.vm.network "public_network", bridge: "Intel(R) I211 Gigabit Network Connection" ip: "172.17.10.12#{i}"
                      node.vm.provider "virtualbox" do |vb|
                                                  vb.memory = "1024"
                     end
              end
        end
end

You can grab the code from my Repo

https://github.com/malindarathnayake/Ansible_Vagrant_LAB/blob/master/Vagrantfile

Let’s talk a little bit about this code and unpack this

Vagrant API version

Vagrant uses API versions for its configuration file, this is how it can stay backward compatible. So in every Vagrantfile we need to specify which version to use. The current one is version 2 which works with Vagrant 1.1 and up.

Provisioning the Ansible VM

This will

  • Provision the controller Ubuntu VM
  • Create a bridged network adapter
  • Set the host-name – LAB-Controller
  • Set the static IP – 172.17.10.120/24
  • Run the Shell script that installs Ansible using apt-get install (We will get to this below)

Lets start digging in…

Specifying the Controller VM Name, base box and hostname

Vagrant uses a base image to clone a virtual machine quickly. These base images are known as “boxes” in Vagrant, and specifying the box to use for your Vagrant environment is always the first step after creating a new Vagrantfile.

You can find different base boxes from app.vagrantup.com

Or you can create custom base boxes for pretty much anything including “CiscoVIRL(CML)” images – keep an eye out for the next article on this

Network configurations

controller.vm.network "public_network", bridge: "Intel(R) I211 Gigabit Network Connection", ip: "your IP"

In this case, we are asking it to create a bridged adapter using the Intel(R) I211 NIC and to set the IP address you defined under the ip attribute.

You can find the relevant interface name using

get-netadapter

You can also create a host-only private network

controller.vm.network :private_network, ip: "10.0.0.10"

for more info checkout the network section in the KB

https://www.vagrantup.com/docs/networking/

Define the provider and VM resources

We’re declaring virtualbox (we installed this earlier) as the provider and setting the VM memory to 2048 MB.

You can get more granular with this, refer to the below KB

https://www.vagrantup.com/docs/virtualbox/configuration.html

Define the shell script to customize the VM config and install the Ansible Package

Now this is where we define the provisioning shell script

This script installs Ansible and sets up the hosts file entries to make your life easier.

In case you are wondering, VLS stands for Virtual Linux Server.

I use this naming scheme for my VMs. Feel free to use anything you want; make sure it matches what you defined in the Vagrantfile under node.vm.hostname.

#!/bin/bash
sudo apt-get update
sudo apt-get install software-properties-common -y
sudo apt-add-repository ppa:ansible/ansible
sudo apt-get update
sudo apt-get install ansible -y
echo "
172.17.10.120 LAB-controller
172.17.10.121 vls-node1
172.17.10.122 vls-node2
172.17.10.123 vls-node3" >> /etc/hosts

Create this file and save it as Ansible_LAB_setup.sh in the project folder.

In this case I’m going to save it under D:\vagrant

You can also do this inline with a script block instead of using a separate file

https://www.vagrantup.com/docs/provisioning/basic_usage.html

Provisioning the Member servers for the lab

We covered most of the code used above; the only difference here is we are using Ruby’s each method to create three VMs with the same template (I’m lazy and it’s more convenient).

This will create three Ubuntu VMs with the following hostnames and IP addresses; you should update these values to match your LAN, or use a private adapter.

vls-node1 – 172.17.10.121

vls-node2 – 172.17.10.122

vls-node3 – 172.17.10.123

So now that we are done with explaining the code, let’s run this

Building the Lab environment using Vagrant

Issue the following command to validate the Vagrantfile and see the machine states

vagrant status

Issue the following command to bring up the environment

vagrant up

If you get an error about hardware virtualization being unavailable, reboot into UEFI/BIOS and make sure virtualization is enabled

Intel – VT-D

AMD Ryzen – SVM

If everything is kumbaya, you will see Vagrant firing up the deployment.

It will provision 4 VMs as we specified

Notice that since we have the vagrant-vbguest plugin installed, it will reinstall the relevant guest tools along with the dependencies for the OS:

==> vls-node3: Machine booted and ready!
[vls-node3] No Virtualbox Guest Additions installation found.
rmmod: ERROR: Module vboxsf is not currently loaded
rmmod: ERROR: Module vboxguest is not currently loaded
Reading package lists...
Building dependency tree...
Reading state information...
Package 'virtualbox-guest-x11' is not installed, so not removed
The following packages will be REMOVED:
  virtualbox-guest-utils*
0 upgraded, 0 newly installed, 1 to remove and 0 not upgraded.
After this operation, 5799 kB disk space will be freed.
(Reading database ... 61617 files and directories currently installed.)
Removing virtualbox-guest-utils (6.0.14-dfsg-1) ...
Processing triggers for man-db (2.8.7-3) ...
(Reading database ... 61604 files and directories currently installed.)
Purging configuration files for virtualbox-guest-utils (6.0.14-dfsg-1) ...
Processing triggers for systemd (242-7ubuntu3.7) ...
Reading package lists...
Building dependency tree...
Reading state information...
linux-headers-5.3.0-51-generic is already the newest version (5.3.0-51.44).
linux-headers-5.3.0-51-generic set to manually installed.

Check the status

vagrant status

Testing

Connecting via SSH to your VMs

vagrant ssh controller

“controller” is the VM name we defined before, not the hostname. You can find it by running vagrant status in PowerShell or your terminal.

We are going to connect to our controller and check everything
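
Once you’re in, a couple of quick sanity checks (assuming the provisioning script ran cleanly):

ansible --version        # confirms the apt install worked
ping -c 2 vls-node1      # confirms the /etc/hosts entries from the script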

A little more information on the networking side:

Vagrant adds two interfaces for each VM:

NIC 1 – NAT’d into the host (the control plane for Vagrant to manage the VMs)

NIC 2 – the bridged adapter we provisioned in the script with the IP address

The default route is set via the private (NAT’d) interface (you can’t change it)

Netplan configs

Vagrant creates a custom netplan YAML for the interface configs.


Destroy/Tear-down the environment

vagrant destroy -f

https://www.vagrantup.com/intro/getting-started/teardown.html

I hope this helped someone. When I started with Vagrant a few years back it took me a few tries to figure out the system and the logic behind it; this should give you a basic understanding of how things are plugged together.

let me know in the comments if you see any issues or mistakes.

Until Next time…..

Azure AD Sync Connect No-Start-Connection status

Issue

Received an alert from Azure AD stating that password synchronization was not working on the tenant.

When I manually initiate a delta sync, I see the following in the logs:

"The Specified Domain either does not exist or could not be contacted"


Checked the following:

  • Restarted the ADSync service
  • Resolved the ADDS domain FQDN and DNS – working
  • Tested the required AD-sync ports using portqry – found issues with the primary ADDS server defined in the DNS values (see the example below)
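For reference, this is roughly the portqry test (dc01.paladin.org is a placeholder DC name; the port list covers DNS, Kerberos, RPC, LDAP, and SMB):

portqry -n dc01.paladin.org -p both -o 53,88,135,389,445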

Root Cause

Turns out the domain controller defined as the primary DNS value was going through updates; it was responding on DNS but not returning any data (a brown-out state).

Assumption

Since the primary DNS server still accepts connections, Windows never fails over to the secondary and tertiary servers defined under DNS servers.

This can also happen if you are reaching an ADDS server via an S2S tunnel/MPLS link when latency goes high.

Resolution

Make sure the ADDS DNS servers defined on the AD-Sync server are alive and actually returning data; a quick check:
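This sketch queries each configured DNS server directly rather than trusting whichever one the resolver picked (the domain name and IPs are placeholders):

# A brown-out server will often accept the connection but return no records
foreach ($srv in "10.0.0.10", "10.0.0.11") {
    "Querying $srv"
    Resolve-DnsName -Name paladin.org -Server $srv -ErrorAction Continue
}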

In my case I just updated the “Primary” DNS value with the Umbrella appliance IP (it acts as a proxy and handles the fail-over).

Hybrid Exchange mailbox on-boarding: Target user already has a primary mailbox – Fix

During an Office 365 migration in a hybrid environment with AAD Connect, I ran into the following scenario:

  • Hybrid co-existence environment with AAD-Sync
  • User [email protected] has a mailbox on-premises. Jon is represented as a mail user in the cloud with an Office 365 license
  • [email protected] had a cloud-only mailbox before the initial AD-sync was run
  • The user account is registered as a mail user and has a valid license attached
  • During the Office 365 remote mailbox move, we end up with the following error during validation, and removing the immutable ID and remapping it to the on-premises account won’t fix the issue:
Target user 'Sam fisher' already has a primary mailbox.
+ CategoryInfo : InvalidArgument: (tsu:MailboxOrMailUserIdParameter) [New-MoveRequest], RecipientTaskException
+ FullyQualifiedErrorId : [Server=Pl-EX001,RequestId=19e90208-e39d-42bc-bde3-ee0db6375b8a,TimeStamp=11/6/2019 4:10:43 PM] [FailureCategory=Cmdlet-RecipientTaskException] 9418C1E1,Microsoft.Exchange.Management.Migration.MailboxRep
lication.MoveRequest.NewMoveRequest
+ PSComputerName : Pl-ex001.Paladin.org

It turns out this happens due to an unclean cloud object in MSOL: Exchange Online keeps pointers that indicate there used to be a mailbox in the cloud for this user.

Option 1 (nuclear option)

Delete the *MSOL user object* for Sam and re-sync it from on-premises. This would delete [email protected] from the cloud – but it would remove him from all workloads, not only Exchange. That’s problematic because Sam is already using Teams, OneDrive, and SharePoint.

Option 2

Clean up only the Office 365 mailbox pointer information:

PS C:\> Set-User [email protected] -PermanentlyClearPreviousMailboxInfo 
Confirm
Are you sure you want to perform this action?
Delete all existing information about user "[email protected]"?. This operation will clear existing values from
Previous home MDB and Previous Mailbox GUID of the user. After deletion, reconnecting to the previous mailbox that
existed in the cloud will not be possible and any content it had will be unrecoverable PERMANENTLY. Do you want to
continue?
[Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is "Y"): a

Executing this leaves you with a clean object without the duplicate-mailbox problem.

In some cases, running this command returns the following output:

“Command completed successfully, but no user settings were changed.”

If this happens:

Temporarily remove the license from the user, run the command again to clear the previous mailbox data, and then re-add the license. A rough sketch of those licensing steps:
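This uses the MSOnline module (the UPN and SKU ID are placeholders; check Get-MsolAccountSku for your tenant’s actual SKU):

Connect-MsolService
$upn = "[email protected]"
# Note the currently assigned SKU so it can be re-added afterwards
(Get-MsolUser -UserPrincipalName $upn).Licenses.AccountSkuId
Set-MsolUserLicense -UserPrincipalName $upn -RemoveLicenses "paladin:ENTERPRISEPACK"
# Re-run: Set-User $upn -PermanentlyClearPreviousMailboxInfo
Set-MsolUserLicense -UserPrincipalName $upn -AddLicenses "paladin:ENTERPRISEPACK"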

 

Upgrading VMware ESXi Hosts using vCenter Update Manager Baselines (6.5 to 6.7 Update 2)

Update Manager has been bundled with the vCenter Server Appliance since version 6.5; it’s a plug-in that runs on the vSphere Web Client. We can use it to:

  • patch/upgrade hosts
  • deploy .vib files within vCenter
  • scan your vCenter environment and report on any out-of-compliance hosts

Hardcore/experienced VMware operators will scoff at this article, but I have seen many organizations still using iLO/iDRAC to mount an ISO to update hosts, with no idea this function even exists.

Now that that’s out of the way, let’s get to the how-to part.

In vCenter, click “Menu” and drill down to “Update Manager”.

This blade shows you all the nerd knobs and an overview of your current updates and compliance levels.

Click on the “Baselines” tab.

You will have two predefined baselines for security patches created by vCenter; let’s keep those aside for now.

Navigate to the “ESXi Images” tab and click “Import”.

Once the upload is complete, click “New Baseline”.

Fill in a name and description that make sense to anyone who logs in, and click Next.

Select the image you just uploaded on the next screen, then continue through the wizard and complete it.

Note – if you have other third-party software for ESXi, you can create separate baselines for it and use baseline groups to push out upgrades and .vib files at the same time.

Now click “Menu” and navigate back to “Hosts and Clusters”.

You can apply the baseline at various levels within the vCenter hierarchy:

vCenter | Datacenter | Cluster | Host

Pick the right level depending on your use case.

Excerpt from the KB:

For ESXi hosts in a cluster, the remediation process is sequential by default. With Update Manager, you can select to run host remediation in parallel.

When you remediate a cluster of hosts sequentially and one of the hosts fails to enter maintenance mode, Update Manager reports an error, and the process stops and fails. The hosts in the cluster that are remediated stay at the updated level. The ones that are not remediated after the failed host remediation are not updated. If a host in a DRS enabled cluster runs a virtual machine on which Update Manager or vCenter Server are installed, DRS first attempts to migrate the virtual machine running vCenter Server or Update Manager to another host so that the remediation succeeds. In case the virtual machine cannot be migrated to another host, the remediation fails for the host, but the process does not stop. Update Manager proceeds to remediate the next host in the cluster.

The host upgrade remediation of ESXi hosts in a cluster proceeds only if all hosts in the cluster can be upgraded.

Remediation of hosts in a cluster requires that you temporarily disable cluster features such as VMware DPM and HA admission control. Also, turn off FT if it is enabled on any of the virtual machines on a host, and disconnect the removable devices connected to the virtual machines on a host, so that they can be migrated with vMotion. Before you start a remediation process, you can generate a report that shows which cluster, host, or virtual machine has the cluster features enabled.

Link to KB on Remediation


Moving on: for this example, since I have only two hosts, we are going to apply the baseline at the cluster level but run the remediation at the host level.

Host 1 > Enter Maintenance > Remediation > Update complete and online

Host 2 > Enter Maintenance > Remediation > Update complete and online

Select the cluster, click the “Updates” tab, and click “Attach” in the Attached Baselines section.

Select and attach the baseline we created earlier.

Click “Check Compliance” to scan and get a report

Select the host in the cluster and enter maintenance mode.

Click “REMEDIATE” to start the upgrade. (If you do this at the cluster level and DRS is enabled, Update Manager will remediate each node in turn.)

This will reboot the host and take it through the update process.
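Once the host is back, I like to confirm the build from PowerCLI instead of eyeballing the UI (the vCenter name is a placeholder):

Connect-VIServer vcenter.paladin.org
# Version/Build should reflect 6.7 U2 on the remediated host
Get-VMHost | Select-Object Name, ConnectionState, Version, Build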

Footnotes –

You might run into the following issue

“vCenter cannot deploy Host upgrade agent to host”

Cause 1

The scratch partition is full; use vCenter to change the scratch folder location (a PowerCLI sketch follows the KB link below).

VMware KB

Creating a persistent scratch location for ESXi – https://kb.vmware.com/s/article/1033696
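If you prefer to script the change, a PowerCLI sketch (the host name and datastore path are placeholders; the setting takes effect after a reboot):

# Point scratch at a persistent datastore folder instead of the full partition
Get-VMHost esx01.paladin.org |
    Get-AdvancedSetting -Name "ScratchConfig.ConfiguredScratchLocation" |
    Set-AdvancedSetting -Value "/vmfs/volumes/datastore1/.locker-esx01" -Confirm:$false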

Cause 2

Hardware is not compatible.

I had this issue due to 6.7 dropping support for an LSI RAID card on older firmware; you need to do some footwork and check the log files to figure out why it’s failing.

VMware HCL – Link

ESXi and vCenter log file locations – link