Kafka 3.8 with Zookeeper SASL_SCRAM

 

Transport Encryption Methods:

SASL/SSL (Solid Teal/Green Lines):

  1. Used for securing communication between producers/consumers and Kafka brokers.
    • SASL (Simple Authentication and Security Layer): Authenticates clients (producers/consumers) to brokers, using SCRAM .
    • SSL/TLS (Secure Sockets Layer/Transport Layer Security): Encrypts the data in transit, ensuring confidentiality and integrity during transmission.

Digest-MD5 (Dashed Yellow Lines):

  1. Secures communication between Kafka brokers and the Zookeeper cluster.
    • Digest-MD5: A challenge-response authentication mechanism providing basic encryption

Notes:

While functional, Digest-MD5 is an older algorithm. we opted for this to reduce complexity and the fact the zookeepers have issues with connecting with Brokers via SSL/TLS

  1. We need to test and switch over KRAFT Protocol, this removes the use of Zookeeper altogether
  2. Add IP ACLs for Zookeeper connections using firewalld to limit traffic between the nodes for replication

PKI and Certificate Signing

CA cert for local PKI,

We need to share this PEM file(without the private key) with the customer to authenticate

Internal applications the CA file must be used for authentication – Refer to the Configuration example documents

# Generate CA Key
openssl genrsa -out multicastbits_CA.key 4096
# Generate CA Certificate
openssl req -x509 -new -nodes -key multicastbits_CA.key -sha256 -days 3650 -out multicastbits_CA.crt -subj "/CN=multicastbits_CA"

 

 

Kafka Broker Certificates

# For Node1 - Repeat for other nodes

openssl req -new -nodes -out node1.csr -newkey rsa:2048 -keyout node1.key -subj "/CN=kafka01.multicastbits.com"

openssl x509 -req -CA multicastbits_CA.crt -CAkey multicastbits_CA.key -CAcreateserial -in node1.csr -out node1.crt -days 3650 -sha256

 

 

Create the kafka and zookeeper users

⚠️ Important: Do not skip this step. we need these users to setup Authentication in JaaS configuration

Before configuring the cluster with SSL and SASL, let’s start up the cluster without authentication and SSL to create the users. This allows us to:

  1. Verify basic dependencies and confirm the zookeeper and Kafka clusters are coming up without any issues “make sure the car starts”
  2. Create necessary user accounts for SCRAM
  3. Test for any inter-node communication issues (Blocked Ports 9092, 9093 ,2181 etc)

 

Here’s how to set up this initial configuration:

Zookeeper Configuration (No SSL or Auth)

Create the following file: /opt/kafka/kafka_2.13-3.8.0/config/zookeeper-NOSSL_AUTH.properties

# Zookeeper Configuration without Auth
dataDir=/Data_Disk/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.166.110:2888:3888
server.2=192.168.166.111:2888:3888
server.3=192.168.166.112:2888:3888

 

Kafka Broker Configuration (No SSL or Auth)

Create the following file: /opt/kafka/kafka_2.13-3.8.0/config/server-NOSSL_AUTH.properties

# Kafka Broker Configuration without Auth/SSL
broker.id=1
listeners=PLAINTEXT://kafka01.multicastbits.com:9092
advertised.listeners=PLAINTEXT://kafka01.multicastbits.com:9092
listener.security.protocol.map=PLAINTEXT:PLAINTEXT
zookeeper.connect=kafka01.multicastbits.com:2181,kafka02.multicastbits.com:2181,kafka03.multicastbits.com:2181

 

Open a new shell to the server Start Zookeeper:

/opt/kafka/kafka_2.13-3.8.0/bin/zookeeper-server-start.sh -daemon /opt/kafka/kafka_2.13-3.8.0/config/zookeeper-NOSSL_AUTH.properties

 

Open a new shell to start Kafka:

/opt/kafka/kafka_2.13-3.8.0/bin/kafka-server-start.sh -daemon /opt/kafka/kafka_2.13-3.8.0/config/server-NOSSL_AUTH.properties

 

 

Create the users:

Open a new shell and run the following commands:

kafka-configs.sh --bootstrap-server ext-kafka01.fleetcam.io:9092 --alter --add-config 'SCRAM-SHA-512=[password=zookeeper-password]' --entity-type users --entity-name ftszk

kafka-configs.sh --zookeeper ext-kafka01.fleetcam.io:2181 --alter --add-config 'SCRAM-SHA-512=[password=kafkaadmin-password]' --entity-type users --entity-name ftskafkaadminAfter the users are created without errors, press Ctrl+C to shut down the services we started earlier.

 

 

SASL_SSL configuration with SCRAM

Zookeeper configuration Notes

  • Zookeeper is configured with SASL/MD5 due to the SSL issues we faced during the initial setup
  • Zookeeper Traffic is isolated with in the Broker nodes to maintain security
dataDir=/Data_Disk/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.166.110:2888:3888
server.2=192.168.166.111:2888:3888
server.3=192.168.166.112:2888:3888
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl

 

 

/Data_Disk/zookeeper/myid file is updated corresponding to the zookeeper nodeID

cat /Data_Disk/zookeeper/myid
1

 

 

Jaas configuration

Create the Jaas configuration for zookeeper authentication, it has the follow this syntax

/opt/kafka/kafka_2.13-3.8.0/config/zookeeper-jaas.conf

Server {
   org.apache.zookeeper.server.auth.DigestLoginModule required
   user_multicastbitszk="zkpassword";
};

 

KafkaOPTS

KafkaOPTS Java varible need to be passed when the zookeeper is started to point to the correct JaaS file

export KAFKA_OPTS="-Djava.security.auth.login.config="Path to the zookeeper-jaas.conf"

export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/zookeeper-jaas.conf"

 

 

There are few ways to handle this, you can add a script under profile.d or use a custom Zookeeper launch script for the systemd service

Systemd service

Create the launch shell script for Zookeeper

/opt/kafka/kafka_2.13-3.8.0/bin/zk-start.s

#!/bin/bash
#export the env variable
export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/zookeeper-jaas.conf"
#Start the zookeeper service
/opt/kafka/kafka_2.13-3.8.0/bin/zookeeper-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/zookeeper.properties
#debug - launch config with no SSL - we need this for initial setup and debug
#/opt/kafka/kafka_2.13-3.8.0/bin/zookeeper-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/zookeeper-NOSSL_AUTH.properties

 

 

After you save the file

chomod +x /opt/kafka/kafka_2.13-3.8.0/bin/zk-start.s

sudo chown -R multicastbitskafka:multicastbitskafka /opt/kafka/kafka_2.13-3.8.0

Create the systemd service file

/etc/systemd/system/zookeeper.service

[Unit]
Description=Apache Zookeeper Service
After=network.target
[Service]
User=multicastbitskafka
Group=multicastbitskafka
ExecStart=/opt/kafka/kafka_2.13-3.8.0/bin/zk-start.sh
Restart=on-failure
[Install]

 

WantedBy=multi-user.target

After the file is saved, start the service

sudo systemctl daemon-reload.
sudo systemctl enable zookeeper
sudo systemctl start zookeeper

 

Kafka Broker configuration Notes

/opt/kafka/kafka_2.13-3.8.0/config/server.properties

broker.id=1
listeners=SASL_SSL://kafka01.multicastbits.com:9093
advertised.listeners=SASL_SSL://kafka01.multicastbits.com:9093
listener.security.protocol.map=SASL_SSL:SASL_SSL
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
ssl.keystore.location=/opt/kafka/secrets/kafkanode1.keystore.jks
ssl.keystore.password=keystorePassword
ssl.truststore.location=/opt/kafka/secrets/kafkanode1.truststore.jks
ssl.truststore.password=truststorePassword
#SASL/SCRAM Authentication
sasl.enabled.mechanisms=SCRAM-SHA-256, SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
sasl.mechanism.client=SCRAM-SHA-512
security.inter.broker.protocol=SASL_SSL
#zookeeper
zookeeper.connect=kafka01.multicastbits.com:2181,kafka02.multicastbits.com:2181,kafka03.multicastbits.com:2181
zookeeper.sasl.client=true
zookeeper.sasl.clientconfig=ZookeeperClient

 

zookeeper connect options

Define the zookeeper servers the broker will connect to

zookeeper.connect=kafka01.multicastbits.com:2181,kafka02.multicastbits.com:2181,kafka03.multicastbits.com:2181

Enable SASL

zookeeper.sasl.client=true

Tell the broker to use the creds defined under ZookeeperClient section on the JaaS file used by the kafka service

zookeeper.sasl.clientconfig=ZookeeperClient

Broker and listener configuration

Define the broker id

broker.id=1

Define the servers listener name and port

listeners=SASL_SSL://kafka01.multicastbits.com:9093

Define the servers advertised listener name and port

advertised.listeners=SASL_SSL://kafka01.multicastbits.com:9093

Define the SASL_SSL for security protocol

listener.security.protocol.map=SASL_SSL:SASL_SSL

Enable ACLs

authorizer.class.name=kafka.security.authorizer.AclAuthorizer

Define the Java Keystores

ssl.keystore.location=/opt/kafka/secrets/kafkanode1.keystore.jks

ssl.keystore.password=keystorePassword

ssl.truststore.location=/opt/kafka/secrets/kafkanode1.truststore.jks

ssl.truststore.password=truststorePassword

Jaas configuration

/opt/kafka/kafka_2.13-3.8.0/config/kafka_server_jaas.conf

KafkaServer {
  org.apache.kafka.common.security.scram.ScramLoginModule required
  username="multicastbitskafkaadmin"
  password="kafkaadmin-password";
};
ZookeeperClient {
  org.apache.zookeeper.server.auth.DigestLoginModule required
  username="multicastbitszk"
  password="Zookeeper_password";
};

 

 

SASL and SCRAM configuration Notes

Enable SASL SCRAM for authentication

org.apache.kafka.common.security.scram.ScramLoginModule required

Use MD5 for Zookeeper authentication

org.apache.zookeeper.server.auth.DigestLoginModule required

KafkaOPTS

KafkaOPTS Java variable need to be passed and must point to the correct JaaS file, when the kafka service is started

export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/kafka_server_jaas.conf"

 

 

Systemd service

Create the launch shell script for kafka

/opt/kafka/kafka_2.13-3.8.0/bin/multicastbitskafka-server-start.sh

#!/bin/bash
#export the env variable
export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/kafka_server_jaas.conf"
#Start the kafka service
/opt/kafka/kafka_2.13-3.8.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/server.properties
#debug - launch config with no SSL - we need this for initial setup and debug
#/opt/kafka/kafka_2.13-3.8.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/server-NOSSL_AUTH.properties

 

 

Create the systemd service

/etc/systemd/system/kafka.service

[Unit]
Description=Apache Kafka Broker Service
After=network.target zookeeper.service
[Service]
User=multicastbitskafka
Group=multicastbitskafka
ExecStart=/opt/kafka/kafka_2.13-3.8.0/bin/multicastbitskafka-server-start.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target

 

 

Connect authenticate and use Kafka CLI tools

Requirements

  • multicastbitsadmin.keystore.jks
  • multicastbitsadmin.truststore.jks
  • WSL2 with java-11-openjdk-devel wget nano
  • Kafka 3.8 folder extracted locally

Setup your environment

  • Setup WSL2

You can use any Linux environment with JDK17 or 11

  • install dependencies

dnf install -y wget nano java-11-openjdk-devel

Download Kafka and extract it (in going to extract it to the home DIR under kafka)

# 1. Download Kafka (Choose a version compatible with your server)
wget https://dlcdn.apache.org/kafka/3.8.0/kafka_2.13-3.8.0.tgz
# 2. Extract
tar xzf kafka_2.13-3.8.0.tgz

 

Copy the jks files (You should generate them with the CA JKS, or use one from one of the nodes) to ~/

cp multicastbitsadmin.keystore.jks ~/

 

cp multicastbitsadmin.truststore.jks ~/

Create your admin client properties file

change the path to fit your setup

nano ~/kafka-adminclient.properties

# Security protocol and SASL/SSL configuration
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
# SSL Configuration
ssl.keystore.location=/opt/kafka/secrets/multicastbitsadmin.keystore.jks
ssl.keystore.password=keystorepw
ssl.truststore.location=/opt/kafka/secrets/multicastbitsadmin.truststore.jks
ssl.truststore.password=truststorepw
# SASL Configuration
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required 
    username="#youradminUser#" 
		password="#your-admin-PW#";

 

 

Create the JaaS file for the admin client

nano ~/kafka_client_jaas.conf

Some kafka-cli tools still look for the jaas.conf under KAFKA_OPTS environment variable

KafkaClient {
  org.apache.kafka.common.security.scram.ScramLoginModule required
  username="#youradminUser#"
  password="#your-admin-PW#";
};

 

Export the Kafka environment variables

export KAFKA_HOME=/opt/kafka/kafka_2.13-3.8.0
export PATH=$PATH:$KAFKA_HOME/bin
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export KAFKA_OPTS="-Djava.security.auth.login.config=~/kafka_client_jaas.conf"
source ~/.bashrc

 

 

Kafka CLI Usage Examples

Create a user

kafka-configs.sh --bootstrap-server kafka01.multicastbits.com:9093 --alter --add-config 'SCRAM-SHA-512=[password=#password#]' --entity-type users --entity-name %username%--command-config ~/kafka-adminclient.properties

 

 

Create a topic

kafka-topics.sh --bootstrap-server kafka01.multicastbits.com:9093 --create --topic %topicname% --partitions 10 --replication-factor 3 --command-config ~/kafka-adminclient.properties

 

 

Create ACLs

External customer user with READ DESCRIBE privileges to a single topic

kafka-acls.sh --bootstrap-server kafka01.multicastbits.com:9093 
  --command-config ~/kafka-adminclient.properties 
  --add --allow-principal User:customer-user01 
  --operation READ --operation DESCRIBE --topic Customer_topic

 

 

Troubleshooting

Here are some common issues you might encounter when setting up and using Kafka with SASL_SCRAM authentication, along with their solutions:

1. Connection refused errors

Issue: Clients unable to connect to Kafka brokers.

Solution:

  • Verify that the Kafka brokers are running and listening on the correct ports.
  • Check firewall settings to ensure the Kafka ports are open and accessible.
  • Confirm that the bootstrap server addresses in client configurations are correct.

2. Authentication failures

Issue: Clients fail to authenticate with Kafka brokers.

Solution:

  • Double-check username and password in the JAAS configuration file.
  • Ensure the SCRAM credentials are properly set up on the Kafka brokers.
  • Verify that the correct SASL mechanism (SCRAM-SHA-512) is specified in client configurations.

3. SSL/TLS certificate issues

Issue: SSL handshake failures or certificate validation errors.

Solution:

  • Confirm that the keystore and truststore files are correctly referenced in configurations.
  • Verify that the certificates in the truststore are up-to-date and not expired.
  • Ensure that the hostname in the certificate matches the broker’s advertised listener.

4. Zookeeper connection issues

Issue: Kafka brokers unable to connect to Zookeeper ensemble.

Solution:

  • Verify Zookeeper connection string in Kafka broker configurations.
  • Ensure Zookeeper servers are running and accessible and the ports are open
  • Check Zookeeper client authentication settings in JAAS configuration file

 

 

NFS Provisioner Setup and Testing Guide for Rancher RKE2/Kubernetes

This guide covers how to add an NFS StorageClass and a dynamic provisioner to Kubernetes using the nfs-subdir-external-provisioner Helm chart. This enables us to mount NFS shares dynamically for PersistentVolumeClaims (PVCs) used by workloads.

Example use cases:

  • Database migrations
  • Apache Kafka clusters
  • Data processing pipelines

Requirements:

  • An accessible NFS share exported with: rw,sync,no_subtree_check,no_root_squash
  • NFSv3 or NFSv4 protocol
  • Kubernetes v1.31.7+ or RKE2 with rke2r1 or later

 

lets get to it


1. NFS Server Export Setup

Ensure your NFS server exports the shared directory correctly:

# /etc/exports
/rke-pv-storage  worker-node-ips(rw,sync,no_subtree_check,no_root_squash)

 

  • Replace worker-node-ips with actual IPs or CIDR blocks of your worker nodes.
  • Run sudo exportfs -r to reload the export table.

2. Install NFS Subdir External Provisioner

Add the Helm repo and install the provisioned:

helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update

helm install nfs-client-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --namespace kube-system \
  --set nfs.server=192.168.162.100 \
  --set nfs.path=/rke-pv-storage \
  --set storageClass.name=nfs-client \
  --set storageClass.defaultClass=false

Notes:

  • If you want this to be the default storage class, change storageClass.defaultClass=true.
  • nfs.server should point to the IP of your NFS server.
  • nfs.path must be a valid exported directory from that NFS server.
  • storageClass.name can be referenced in your PersistentVolumeClaim YAMLs using storageClassName: nfs-client.

3. PVC and Pod Test

Create a test PVC and pod using the following YAML:

# test-nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-nfs-pod
spec:
  containers:
  - name: shell
    image: busybox
    command: [ "sh", "-c", "sleep 3600" ]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-nfs-pvc

 

Apply it:

kubectl apply -f test-nfs-pvc.yaml
kubectl get pvc test-nfs-pvc -w

 

Expected output:

NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-nfs-pvc   Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   1Gi        RWX            nfs-client     30s

 


4. Troubleshooting

If the PVC remains in Pending, follow these steps:

Check the provisioner pod status:

kubectl get pods -n kube-system | grep nfs-client-provisioner

 

Inspect the provisioner pod:

kubectl describe pod -n kube-system <pod-name>
kubectl logs -n kube-system <pod-name>

 

Common Issues:

  • Broken State: Bad NFS mount
    mount.nfs: access denied by server while mounting 192.168.162.100:/pl-elt-kakfka

     

    • This usually means the NFS path is misspelled or not exported properly.
  • Broken State: root_squash enabled
    failed to provision volume with StorageClass "nfs-client": unable to create directory to provision new pv: mkdir /persistentvolumes/…: permission denied

     

    • Fix by changing the export to use no_root_squash or chown the directory to nobody:nogroup.
  • ImagePullBackOff
    • Ensure nodes have internet access and can reach registry.k8s.io.
  • RBAC errors
    • Make sure the ServiceAccount used by the provisioner has permissions to watch PVCs and create PVs.

5. Healthy State Example

kubectl get pods -n kube-system | grep nfs-client-provisioner-nfs-subdir-external-provisioner
nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m   1/1     Running     0          3m39s

 

kubectl describe pod -n kube-system nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m
# Output shows pod is Running with Ready=True

 

kubectl logs -n kube-system nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m
...
I0512 21:46:03.752701       1 controller.go:1420] provision "default/test-nfs-pvc" class "nfs-client": volume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d" provisioned
I0512 21:46:03.752763       1 volume_store.go:212] Trying to save persistentvolume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d"
I0512 21:46:03.772301       1 volume_store.go:219] persistentvolume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d" saved
I0512 21:46:03.772353       1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Name:"test-nfs-pvc"}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-73481f45-3055-4b4b-80f4-e68ffe83802d
...

 

Once test-nfs-pvc is bound and the pod starts successfully, your setup is working. You can now safely use storageClass: nfs-client in other workloads (e.g., Strimzi KafkaNodePool).


Find the PCI-E Slot number of PCI-E Add On card GPU, NIC, etc on Linux/Proxmox

i was working on a v-GPU POC using PVE Since Broadcom Screwed us with the Vsphere licensing costs (New post incoming about this adventure)

anyway i needed to find the PCI-E Slot used for the A4000 GPU on the host to disable it for troubleshooting

Guide

First we need to find the occupied slots and the Bus address for each slot

sudo dmidecode -t slot | grep -E "Designation|Usage|Bus Address"

Output will show the Slot ID, Usage and then the Bus Address

        Designation: CPU SLOT1 PCI-E 4.0 X16
        Current Usage: Available
        Bus Address: 0000:ff:00.0
        Designation: CPU SLOT2 PCI-E 4.0 X8
        Current Usage: In Use
        Bus Address: 0000:41:00.0
        Designation: CPU SLOT3 PCI-E 4.0 X16
        Current Usage: In Use
        Bus Address: 0000:c1:00.0
        Designation: CPU SLOT4 PCI-E 4.0 X8
        Current Usage: Available
        Bus Address: 0000:ff:00.0
        Designation: CPU SLOT5 PCI-E 4.0 X16
        Current Usage: In Use
        Bus Address: 0000:c2:00.0
        Designation: CPU SLOT6 PCI-E 4.0 X16
        Current Usage: Available
        Bus Address: 0000:ff:00.0
        Designation: CPU SLOT7 PCI-E 4.0 X16
        Current Usage: In Use
        Bus Address: 0000:81:00.0
        Designation: PCI-E M.2-M1
        Current Usage: Available
        Bus Address: 0000:ff:00.0
        Designation: PCI-E M.2-M2
        Current Usage: Available
        Bus Address: 0000:ff:00.0

We can use lspci -s #BusAddress# to locate whats installed on each slot

lspci -s 0000:c2:00.0
c2:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)

lspci -s 0000:81:00.0
81:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)

Im sure there is a much more elegant way to do this, but this worked as a quick ish way to find what i needed. if you know a better way please share in the comments

Until next time!!!

Reference –

https://stackoverflow.com/questions/25908782/in-linux-is-there-a-way-to-find-out-which-pci-card-is-plugged-into-which-pci-sl

Use Mailx to send emails using office 365

just something that came up while setting up a monitoring script using mailx, figured ill note it down here so i can get it to easily later when I need it 😀

Important prerequisites

  • You need to enable smtp basic Auth on Office 365 for the account used for authentication
  • Create an App password for the user account
  • nssdb folder must be available and readable by the user running the mailx command

Assuming all of the above prerequisite are $true we can proceed with the setup

Install mailx

RHEL/Alma linux

sudo dnf install mailx

NSSDB Folder

make sure the nssdb folder must be available and readable by the user running the mailx command

certutil -L -d /etc/pki/nssdb

The Output might be empty, but that’s ok; this is there if you need to add a locally signed cert or another CA cert manually, Microsoft Certs are trusted by default if you are on an up to date operating system with the local System-wide Trust Store

Reference – RHEL-sec-shared-system-certificates

Configure Mailx config file

sudo nano /etc/mail.rc

Append/prepend the following lines and Comment out or remove the same lines already defined on the existing config files

set smtp=smtp.office365.com
set smtp-auth-user=###[email protected]###
set smtp-auth-password=##Office365-App-password#
set nss-config-dir=/etc/pki/nssdb/
set ssl-verify=ignore
set smtp-use-starttls
set from="###[email protected]###"

This is the bare minimum needed other switches are located here – link

Testing

echo "Your message is sent!" | mailx -v -s "test" [email protected]

-v switch will print the verbos debug log to console

Connecting to 52.96.40.242:smtp . . . connected.
220 xxde10CA0031.outlook.office365.com Microsoft ESMTP MAIL Service ready at Sun, 6 Aug 2023 22:14:56 +0000
>>> EHLO vls-xxx.multicastbits.local
250-MN2PR10CA0031.outlook.office365.com Hello [167.206.57.122]
250-SIZE 157286400
250-PIPELINING
250-DSN
250-ENHANCEDSTATUSCODES
250-STARTTLS
250-8BITMIME
250-BINARYMIME
250-CHUNKING
250 SMTPUTF8
>>> STARTTLS
220 2.0.0 SMTP server ready
>>> EHLO vls-xxx.multicastbits.local
250-xxde10CA0031.outlook.office365.com Hello [167.206.57.122]
250-SIZE 157286400
250-PIPELINING
250-DSN
250-ENHANCEDSTATUSCODES
250-AUTH LOGIN XOAUTH2
250-8BITMIME
250-BINARYMIME
250-CHUNKING
250 SMTPUTF8
>>> AUTH LOGIN
334 VXNlcm5hbWU6
>>> Zxxxxxxxxxxxc0BmdC1zeXMuY29t
334 UGsxxxxxmQ6
>>> c2Rxxxxxxxxxxducw==
235 2.7.0 Authentication successful
>>> MAIL FROM:<###[email protected]###>
250 2.1.0 Sender OK
>>> RCPT TO:<[email protected]>
250 2.1.5 Recipient OK
>>> DATA
354 Start mail input; end with <CRLF>.<CRLF>
>>> .
250 2.0.0 OK <[email protected]> [Hostname=Bsxsss744.namprd11.prod.outlook.com]
>>> QUIT
221 2.0.0 Service closing transmission channel 

Now you can use this in your automation scripts or timers using the mailx command

#!/bin/bash

log_file="/etc/app/runtime.log"
recipient="[email protected]"
subject="Log file from /etc/app/runtime.log"

# Check if the log file exists
if [ ! -f "$log_file" ]; then
  echo "Error: Log file not found: $log_file"
  exit 1
fi

# Use mailx to send the log file as an attachment
echo "Sending log file..."
mailx -s "$subject" -a "$log_file" -r "[email protected]" "$recipient" < /dev/null
echo "Log file sent successfully."

Secure it

sudo chown root:root /etc/mail.rc
sudo chmod 600 /etc/mail.rc

The above commands change the file’s owner and group to root, then set the file permissions to 600, which means only the owner (root) has read and write permissions and other users have no access to the file.

Use Environment Variables: Avoid storing sensitive information like passwords directly in the mail.rc file, consider using environment variables for sensitive data and reference those variables in the configuration.

For example, in the mail.rc file, you can set:

set smtp-auth-password=$MY_EMAIL_PASSWORD

You can set the variable using another config file or store it in the Ansible vault during runtime or use something like Hashicorp.

Sure, I would just use Python or PowerShell core, but you will run into more locked-down environments like OCI-managed DB servers with only Mailx is preinstalled and the only tool you can use 🙁

the Fact that you are here means you are already in the same boat. Hope this helped… until next time

Solution – RKE Cluster MetalLB provides Services with IP Addresses but doesn’t ARP for the address

I ran in to the the same issue detailed here working with a RKE cluster

https://github.com/metallb/metallb/issues/1154

After looking around for a few hours digging in to the logs i figured out the issue, hopefully this helps some one else our there in the situation save some time.

Make sure the IPVS mode is enabled on the cluster configuration

If you are using :

RKE2 – edit the cluster.yaml file

RKE1 – Edit the cluster configuration from the rancher UI > Cluster management > Select the cluster > edit configuration > edit as YAML

Locate the services field under rancher_kubernetes_engine_config and add the following options to enable IPVS

    kubeproxy:
      extra_args:
        ipvs-scheduler: lc
        proxy-mode: ipvs

https://www.suse.com/support/kb/doc/?id=000020035

Default

After changes

Make sure the Kernel modules are enabled on the nodes running control planes

Background

Example Rancher – RKE1 cluster

sudo docker ps | grep proxy # find the container ID for kubproxy

sudo docker logs ####containerID###

0313 21:44:08.315888  108645 feature_gate.go:245] feature gates: &{map[]}
I0313 21:44:08.346872  108645 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="nf_conntrack_ipv4"
E0313 21:44:08.347024  108645 server_others.go:107] "Can't use the IPVS proxier" err="IPVS proxier will not be used because the following required kernel modules are not loaded: [ip_vs_lc]"

Kubproxy is trying to load the needed kernel modules and failing to enable IPVS

Lets enable the kernel modules

sudo nano /etc/modules-load.d/ipvs.conf

ip_vs_lc
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack_ipv4

Install ipvsadm to confirm the changes

sudo dnf install ipvsadm -y

Reboot the VM or the Baremetal server

use the sudo ipvsadm to confirm ipvs is enabled

sudo ipvsadm

Testing

kubectl get svc -n #namespace | grep load
arping -I ens192 192.168.94.140
ARPING 192.168.94.140 from 192.168.94.65 ens192
Unicast reply from 192.168.94.140 [00:50:56:96:E3:1D] 1.117ms
Unicast reply from 192.168.94.140 [00:50:56:96:E3:1D] 0.737ms
Unicast reply from 192.168.94.140 [00:50:56:96:E3:1D] 0.845ms
Unicast reply from 192.168.94.140 [00:50:56:96:E3:1D] 0.668ms
Sent 4 probes (1 broadcast(s))
Received 4 response(s)

If you have the service type load balancer on a deployment now you should be able to reach it if the container is responding on the service

helpful Links

https://metallb.universe.tf/configuration/troubleshooting/

https://github.com/metallb/metallb/issues/1154

https://github.com/rancher/rke2/issues/3710

Setup guide for VSFTPD FTP Server – SELinux enforced with fail2ban (RHEL, CentOS, Almalinux)

Few things to note

  • if you want to prevent directory traversal we need to setup chroot with vsftpd (not covered on this KB)
  • For the demo I just used Unencrypted FTP on port 21 to keep things simple, Please utilize SFTP with the letsencrypt certificate for better security. i will cover this on another article and link it here

Update and Install packages we need

sudo dnf update
sudo dnf install net-tools lsof unzip zip tree policycoreutils-python-utils-2.9-20.el8.noarch vsftpd nano setroubleshoot-server -y

Setup Groups and Users and security hardening

if you want to prevent directory traversal we need to setup chroot with vsftpd (not covered on this KB)

Create the Service admin account

sudo useradd ftpadmin
sudo passwd ftpadmin

Create the group

sudo groupadd FTP_Root_RW

Create FTP only user shell for the FTP users

echo -e '#!/bin/sh\necho "This account is limited to FTP access only."' | sudo tee -a /bin/ftponly
sudo chmod a+x /bin/ftponly

echo "/bin/ftponly" | sudo tee -a /etc/shells

Create FTP users

sudo useradd ftpuser01 -m -s /bin/ftponly
sudo useradd ftpuser02 -m -s /bin/ftponly
user passwd ftpuser01 
user passwd ftpuser02

Add the users to the group

sudo usermod -a -G FTP_Root_RW ftpuser01
sudo usermod -a -G FTP_Root_RW ftpuser02

sudo usermod -a -G FTP_Root_RW ftpadmin

Disable SSH Access for the FTP users.

Edit sshd_config

sudo nano /etc/ssh/sshd_config

Add the following line to the end of the file

DenyUsers ftpuser01 ftpuser02

Open ports on the VM Firewall

sudo firewall-cmd --permanent --add-port=20-21/tcp

#Allow the passive Port-Range we will define it later on the vsftpd.conf
sudo firewall-cmd --permanent --add-port=60000-65535/tcp

#Reload the ruleset
sudo firewall-cmd --reload

Setup the Second Disk for FTP DATA

Attach another disk to the VM and reboot if you haven’t done this already

lsblk to check the current disks and partitions detected by the system

lsblk 

Create the XFS partition

sudo mkfs.xfs /dev/sdb
# use mkfs.ext4 for ext4

Why XFS? https://access.redhat.com/articles/3129891

Create the folder for the mount point

sudo mkdir /FTP_DATA_DISK

Update the etc/fstab file and add the following line

sudo nano etc/fstab
/dev/sdb /FTP_DATA_DISK xfs defaults 1 2

Mount the disk

sudo mount -a

Testing

mount | grep sdb

Setup the VSFTPD Data and Log Folders

Setup the FTP Data folder

sudo mkdir /FTP_DATA_DISK/FTP_Root -p

Create the log directory

sudo mkdir /FTP_DATA_DISK/_logs/ -p

Set permissions

sudo chgrp -R FTP_Root_RW /FTP_DATA_DISK/FTP_Root/
sudo chmod 775 -R /FTP_DATA_DISK/FTP_Root/

Setup the VSFTPD Config File

Backup the default vsftpd.conf and create a newone

sudo mv /etc/vsftpd/vsftpd.conf /etc/vsftpd/vsftpdconfback
sudo nano /etc/vsftpd/vsftpd.conf
#KB Link - ####

anonymous_enable=NO
local_enable=YES
write_enable=YES
local_umask=002
dirmessage_enable=YES
ftpd_banner=Welcome to multicastbits Secure FTP service.
chroot_local_user=NO
chroot_list_enable=NO
chroot_list_file=/etc/vsftpd/chroot_list
listen=YES
listen_ipv6=NO

userlist_file=/etc/vsftpd/user_list
pam_service_name=vsftpd
userlist_enable=YES
userlist_deny=NO
listen_port=21
connect_from_port_20=YES
local_root=/FTP_DATA_DISK/FTP_Root/

xferlog_enable=YES
vsftpd_log_file=/FTP_DATA_DISK/_logs/vsftpd.log
log_ftp_protocol=YES
dirlist_enable=YES
download_enable=NO

pasv_enable=Yes
pasv_max_port=65535
pasv_min_port=60000

Add the FTP users to the userlist file

Backup the Original file

sudo mv /etc/vsftpd/user_list /etc/vsftpd/user_listBackup
echo "ftpuser01" | sudo tee -a /etc/vsftpd/user_list
echo "ftpuser02" | sudo tee -a /etc/vsftpd/user_list
sudo systemctl start vsftpd

sudo systemctl enable vsftpd

sudo systemctl status vsftpd

Setup SELinux

instead of putting our hands up and disabling SElinux, we are going to setup the policies correctly

Find the available policies using getsebool -a | grep ftp

getsebool -a | grep ftp

ftpd_anon_write --> off
ftpd_connect_all_unreserved --> off
ftpd_connect_db --> off
ftpd_full_access --> off
ftpd_use_cifs --> off
ftpd_use_fusefs --> off
ftpd_use_nfs --> off
ftpd_use_passive_mode --> off
httpd_can_connect_ftp --> off
httpd_enable_ftp_server --> off
tftp_anon_write --> off
tftp_home_dir --> off
[lxadmin@vls-BackendSFTP02 _logs]$ 
[lxadmin@vls-BackendSFTP02 _logs]$ 
[lxadmin@vls-BackendSFTP02 _logs]$ getsebool -a | grep ftp
ftpd_anon_write --> off
ftpd_connect_all_unreserved --> off
ftpd_connect_db --> off
ftpd_full_access --> off
ftpd_use_cifs --> off
ftpd_use_fusefs --> off
ftpd_use_nfs --> off
ftpd_use_passive_mode --> off
httpd_can_connect_ftp --> off
httpd_enable_ftp_server --> off
tftp_anon_write --> off
tftp_home_dir --> off

Set SELinux boolean values

sudo setsebool -P ftpd_use_passive_mode on

sudo setsebool -P ftpd_use_cifs on

sudo setsebool -P ftpd_full_access 1

    "setsebool" is a tool for setting SELinux boolean values, which control various aspects of the SELinux policy.

    "-P" specifies that the boolean value should be set permanently, so that it persists across system reboots.

    "ftpd_use_passive_mode" is the name of the boolean value that should be set. This boolean value controls whether the vsftpd FTP server should use passive mode for data connections.

    "on" specifies that the boolean value should be set to "on", which means that vsftpd should use passive mode for data connections.

    Enable ftp_home_dir --> on if you are using chroot

Add a new file context rule to the system.

sudo semanage fcontext -a -t public_content_rw_t "/FTP_DATA_DISK/FTP_Root/(/.*)?"
    "fcontext" is short for "file context", which refers to the security context that is associated with a file or directory.

    "-a" specifies that a new file context rule should be added to the system.

    "-t" specifies the new file context type that should be assigned to files or directories that match the rule.

    "public_content_rw_t" is the name of the new file context type that should be assigned to files or directories that match the rule. In this case, "public_content_rw_t" is a predefined SELinux type that allows read and write access to files and directories in public directories, such as /var/www/html.

    "/FTP_DATA_DISK/FTP_Root/(/.)?" specifies the file path pattern that the rule should match. The pattern includes the "/FTP_DATA_DISK/FTP_Root/" directory and any subdirectories or files beneath it. The regular expression "/(.)?" matches any file or directory name that may follow the "/FTP_DATA_DISK/FTP_Root/" directory path.

In summary, this command sets the file context type for all files and directories under the "/FTP_DATA_DISK/FTP_Root/" directory and its subdirectories to "public_content_rw_t", which allows read and write access to these files and directories.

Reset the SELinux security context for all files and directories under the “/FTP_DATA_DISK/FTP_Root/”

sudo restorecon -Rvv /FTP_DATA_DISK/FTP_Root/
    "restorecon" is a tool that resets the SELinux security context for files and directories to their default values.

    "-R" specifies that the operation should be recursive, meaning that the security context should be reset for all files and directories under the specified directory.

    "-vv" specifies that the command should run in verbose mode, which provides more detailed output about the operation.

"/FTP_DATA_DISK/FTP_Root/" is the path of the directory whose security context should be reset.

Setup Fail2ban

Install fail2ban

sudo dnf install fail2ban

Create the jail.local file

This file is used to overwrite the config blocks in /etc/fail2ban/fail2ban.conf
sudo nano /etc/fail2ban/jail.local
vsftpd]
enabled = true
port = ftp,ftp-data,ftps,ftps-data
logpath = /FTP_DATA_DISK/_logs/vsftpd.log
maxretry = 5
bantime = 7200

Make sure to update the logpath directive to match the vsftpd log file we defined on the vsftpd.conf file

sudo systemctl start fail2ban

sudo systemctl enable fail2ban

sudo systemctl status fail2ban
journalctl -u fail2ban  will help you narrow down any issues with the service

Testing

sudo tail -f /var/log/fail2ban.log

Fail2ban injects and manages the following rich rules

Client will fail to connect using FTP until the ban is lifted

Remove the ban IP list

#get the list of banned IPs 
sudo fail2ban-client get vsftpd banned

#Remove a specific IP from the list 
sudo fail2ban-client set vsftpd unbanip <IP>

#Remove/Reset all the the banned IP lists
sudo fail2ban-client unban --all

This should get you up and running, For the demo I just used Unencrypted FTP on port 21 to keep things simple, Please utilize SFTP with the letsencrypt certificate for better security. i will cover this on another article and link it here

Change the location of the Docker overlay2 storage directory

If you found this page you already know why you are looking for this, your server /dev/mapper/cs-root is filled due to /var/lib/docker taking up most of the space

Yes, you can change the location of the Docker overlay2 storage directory by modifying the daemon.json file. Here’s how to do it:

Open or create the daemon.json file using a text editor:

sudo nano /etc/docker/daemon.json

{
    "data-root": "/path/to/new/location/docker"
}

Replace “/path/to/new/location/docker” with the path to the new location of the overlay2 directory.

If the file already contains other configuration settings, add the "data-root" setting to the file under the "storage-driver" setting:

{
    "storage-driver": "overlay2",
    "data-root": "/path/to/new/location/docker"
}

Save the file and Restart docker

sudo systemctl restart docker

Don’t forget to remove the old data

rm -rf /var/lib/docker/overlay2

PowerShell remoting (WinRM) over HTTPS using a AD CS PKI (CA) signed client Certificate

This is a guide to show you how to enroll your servers/desktops to allow powershell remoting (WINRM) over HTTPS

Assumptions

  • You have a working Root CA on the ADDS environment – Guide
  • CRL and AIA is configured properly – Guide
  • Root CA cert is pushed out to all Servers/Desktops – This happens by default

Contents

  1. Setup CA Certificate template
  2. Deploy Auto-enrolled Certificates via Group Policy
  3. Powershell logon script to set the WinRM listener
  4. Deploy the script as a logon script via Group Policy
  5. Testing
1 – Setup CA Certificate template to allow Client Servers/Desktops to checkout the certificate from the CA

Connect to the The Certification Authority Microsoft Management Console (MMC)

Navigate to Certificate Templates > Manage

On the “Certificate templates Console” window > Select Web Server > Duplicate Template

Under the new Template window Set the following attributes

General – Pick a Name and Validity Period – This is up to you

Compatibility – Set the compatibility attributes (You can leave this on the default values, It up to you)

Subject Name – Set ‘Subject Name’ attributes (Important)

Security – Add “Domain Computers” Security Group and Set the following permissions

  • Read – Allow
  • Enroll – Allow
  • Autoenroll – Allow

Click “OK” to save and close out of “Certificate template console”

Issue to the new template

Go back to the “The Certification Authority Microsoft Management Console” (MMC)

Under templates (Right click the empty space) > Select New > Certificate template to Issue

Under the Enable Certificate template window > Select the Template you just created

Allow few minutes for ADDS to replicate and pick up the changes with in the forest

2 – Deploy Auto-enrolled Certificates via Group Policy

Create a new GPO

Windows Settings > Security Settings > Public Key Policies/Certificate Services Client – Auto-Enrollment Settings

Link the GPO to the relevant OU with in your ADDS environment

Note – You can push out the root CA cert as a trusted root certificate with this same policy if you want to force computers to pick up the CA cert,

Testing

If you need to test it gpupdate/force or reboot your test machine, The Server VM/PC will pickup a certificate from ADCS PKI

3 – Powershell logon script to set the WINRM listener

Dry run

  • Setup the log file
  • Check for the Certificate matching the machines FQDN Auto-enrolled from AD CS
  • If exist
    • Set up the HTTPS WInRM listener and bind the certificate
    • Write log
  • else
    • Write log
#Malinda Rathnayake- 2020
#
#variable
$Date = Get-Date -Format "dd_MM_yy"
$port=5986
$SessionRunTime = Get-Date -Format "dd_yyyy_HH-mm"
#
#Setup Logs folder and log File
$ScriptVersion = '1.0'
$locallogPath = "C:\_Scripts\_Logs\WINRM_HTTPS_ListenerBinding"
#
$logging_Folder = (New-Item -Path $locallogPath -ItemType Directory -Name $Date -Force)
$ScriptSessionlogFile = New-Item $logging_Folder\ScriptSessionLog_$SessionRunTime.txt -Force
$ScriptSessionlogFilePath = $ScriptSessionlogFile.VersionInfo.FileName
#
#Check for the the auto-enrolled SSL Cert
$RootCA = "Company-Root-CA" #change This
$hostname = ([System.Net.Dns]::GetHostByName(($env:computerName))).Hostname
$certinfo = (Get-ChildItem -Path Cert:\LocalMachine\My\ |? {($_.Subject -Like "CN=$hostname") -and ($_.Issuer -Like "CN=$RootCA*")})
$certThumbprint = $certinfo.Thumbprint
#
#Script-------------------------------------------------------
#
#Remove the existing WInRM Listener if there is any
Get-ChildItem WSMan:\Localhost\Listener | Where -Property Keys -eq "Transport=HTTPS" | Remove-Item -Recurse -Force
#
#If the client certificate exists Setup the WinRM HTTPS listener with the cert else Write log
if ($certThumbprint){
#
New-Item -Path WSMan:\Localhost\Listener -Transport HTTPS -Address * -CertificateThumbprint $certThumbprint -HostName $hostname -Force
#
netsh advfirewall firewall add rule name="Windows Remote Management (HTTPS-In)" dir=in action=allow protocol=TCP localport=$port
#
Add-Content -Path $ScriptSessionlogFilePath -Value "Certbinding with the HTTPS WinRM HTTPS Listener Completed"
Add-Content -Path $ScriptSessionlogFilePath -Value "$certinfo.Subject"}
else{
Add-Content -Path $ScriptSessionlogFilePath -Value "No Cert matching the Server FQDN found, Please run gpupdate/force or reboot the system"
}

Script is commented with Explaining each section (should have done functions but i was pressed for time, never got around to do it, if you do fix it up and improve this please let me know in the comments :D)

5 – Deploy the script as a logon script via Group Policy

Setup a GPO and set this script as a logon Powershell script

Im using a user policy with GPO Loop-back processing set to Merge applied to the server OU

Testing

To confirm WinRM is listening on HTTPS, type the following commands:

winrm enumerate winrm/config/listener
Winrm get http://schemas.microsoft.com/wbem/wsman/1/config

Sources that helped me

https://docs.microsoft.com/en-us/troubleshoot/windows-client/system-management-components/configure-winrm-for-https

https://gmusumeci.medium.com/get-rid-of-those-annoying-self-signed-certificates-with-microsoft-certificate-services-part-3-9d4b8e819f45

http://vcloud-lab.com/entries/powershell/powershell-remoting-over-https-using-self-signed-ssl-certificate

Azure AD Sync Connect No-Start-Connection status

Issue

Received the following error from the Azure AD stating that Password Synchronization was not working on the tenant.

When i manually initiate a delta sync, i see the following logs

"The Specified Domain either does not exist or could not be contacted"

(click to enlarge)

Checked the following

  • Restarted ADsync Services
  • Resolve the ADDS Domain FQDN and DNS – Working
  • Test required ports for AD-sync using portqry – issues with the Primary ADDS server defined on the DNS values

Root Cause

Turns out the Domain controller Defined as the primary DNS value was pointing was going thorough updates, its responding on the DNS but doesn’t return any data (Brown-out state)

Assumption

when checking DNS since the DNS server is connecting, Windows doesn’t check the secondary and tertiary servers defined under DNS servers.

This might happen if you are using a ADDS server via a S2S tunnel/MPLS when the latency goes high

Resolution

Check make sure your ADDS-DNS servers defined on AD-SYNC server are alive and responding

in my case i just updated the “Primary” DNS value with the umbrella Appliance IP (this act as a proxy and handle the fail-over)

Upgrading VMware EXSI Hosts using Vcenter Update Manager Baseline (6.5 to 6.7 Update 2)

Update Manager is bundled in the vCenter Server Appliance since version 6.5, it’s a plug-in that runs on the vSphere Web Client.  we can use the component to

  • patch/upgrade hosts
  • deploy .vib files within the V-Center
  • Scan your VC environment and report on any out of compliance hosts

Hardcore/Experienced VMware operators will scoff at this article, but I have seen many organizations still using ILO/IDRAC to mount an ISO to update hosts and they have no idea this function even exists.

Now that’s out of the way Let’s get to the how-to part of this

In Vcenter click the “Menu” and drill down to the “Update Manager”

This Blade will show you all the nerd knobs and overview of your current Updates and compliance levels

Click on the “Baselines” Tab

You will have two predefined baselines for security patches created by the Vcenter, let keep that aside for now

Navigate to the “ESXi Images” Tab, and Click “Import”

Once the Upload is complete, Click on “New Baseline”

Fill in the Name and Description that makes sense to anyone that logs in and click Next

Select the image you just Uploaded before on the next Screen and continue through the wizard and complete it

Note – If you have other 3rd party software for ESXI you can create seprate baselines for those and use baseline Groups to push out upgrades and vib files at the same time 

Now click the “Menu” and Navigate Backup to “Hosts and Clusters”

Now you can apply the Baseline this at various levels within the Vcenter Hierarchy

Vcenter | DataCenter | Cluster | Host

Depending on your use case pick the right level

Excerpt from the KB 

For ESXi hosts in a cluster, the remediation process is sequential by default. With Update Manager, you can select to run host remediation in parallel.

When you remediate a cluster of hosts sequentially and one of the hosts fails to enter maintenance mode, Update Manager reports an error, and the process stops and fails. The hosts in the cluster that are remediated stay at the updated level. The ones that are not remediated after the failed host remediation are not updated. If a host in a DRS enabled cluster runs a virtual machine on which Update Manager or vCenter Server are installed, DRS first attempts to migrate the virtual machine running vCenter Server or Update Manager to another host so that the remediation succeeds. In case the virtual machine cannot be migrated to another host, the remediation fails for the host, but the process does not stop. Update Manager proceeds to remediate the next host in the cluster.

The host upgrade remediation of ESXi hosts in a cluster proceeds only if all hosts in the cluster can be upgraded.

Remediation of hosts in a cluster requires that you temporarily disable cluster features such as VMware DPM and HA admission control. Also, turn off FT if it is enabled on any of the virtual machines on a host, and disconnect the removable devices connected to the virtual machines on a host, so that they can be migrated with vMotion. Before you start a remediation process, you can generate a report that shows which cluster, host, or virtual machine has the cluster features enabled.

Link to KB on Remediation


Moving on; for this example, since I have only 2 hosts. we are going apply the baseline at the cluster level but apply the remediation at host level

Host 1 > Enter Maintenance > Remediation > Update complete and online

Host 2 > Enter Maintenance > Remediation > Update complete and online

Select the cluster, Click the “Updates” Tab and click on “Attach” on the Attached baselines section

Select and attach the baseline we created before

Click “Check Compliance” to scan and get a report

Select the host in the cluster, enter maintenance mode

Click “REMEDIATE” to start the upgrade. (if you do this at a cluster level if you have DRS, Update Manager will update each node)

This will reboot the host and go through the update process

Foot Notes –

You might run into the following issue

“vCenter cannot deploy Host upgrade agent to host”

Cause 1

Scratch partition is full use Vcenter and change the scratch folder location

VMWARE KB

Creating a persistent scratch location for ESXi  – https://kb.vmware.com/s/article/1033696

Cause 2

Hardware is not compatible,

I had this issue due to 6.7 dropping support for an LSI Raid card on an older firmware, you need to do some foot work and check the log files to figure out why its failing

Vmware HCL – Link

ESXI and Vcenter log file locations – link