The overthinking tax: token efficiency across six coding models

Posted on July 13, 2026July 13, 2026 by Malinda RathnayakeLeave a Comment

When Sol and Luna Released i cranked it up to ultra and used it for a code review, i was using it as an advisor on my foreman workflow and kept noticing it took 30+ mins to complete and sometimes timeout, while fable on Max effort and Gemini 3.1 Pro worked with no issues

I expected to burn more tokens but the time it took for something simple was just ridiculous, Noticed the same pattern with my Hermes bot

Sol was supposed to be the one!!!

it annoyed me enough to start doing some targeted experiments to find what’s going on since we have way too many LLM written crap when you search for something like this that just regurgitate the same slop.

The setup

Six coding models the same three from-scratch tasks, single-shot, and measured what each one spends: generated tokens, wall-clock, and how much of the output is reasoning versus actual code.

Three tasks, each “reimplement a piece of the Python standard library without importing it”:

a regex matcher (re.fullmatch)
a Python-literal parser (ast.literal_eval)
a POSIX shell splitter (shlex.split)

All three are specified, self-contained, and unambiguous about what counts as correct.

Grading

Borrowed the idea from Kaizen: Metamorphic Fuzzing and Differential Testing for LLM-Translated HPC Applications

instead of a fixed test suite, I generate thousands of random inputs per task and compare each model’s function against the real stdlib function on every one. Agreement over that corpus is the quality score Every run is single-shot, no agentic self-verification: one response, graded as-is.

Test Subjects

Five hosted models plus GLM-5.2, an open-weight coder run at full thinking.

Results – TLDR quick view (for skimmers like me)

Verdict – Sol xhigh

Sol xhigh was the most token- and time-efficient configuration in this three-task benchmark by a wide margin. Luna xhigh generated about 4× as many tokens, while Opus 4.8 max generated about 11.4× as many. Sol also recorded the lowest combined wall-clock time across the suite. Its advantage came from both compact code and a comparatively modest reasoning trace.

At the opposite extreme, Opus 4.8 max was the least token-efficient configuration tested. It generated approximately 281K tokens across the three-task suite, about 98% of which were reasoning tokens.

Reasoning effort is where the tokens go

Higher reasoning effort substantially increased token use and latency, but the quality gains were uneven. Luna max generated about 44% more tokens and took about 57% longer than Luna xhigh. It achieved higher overall agreement, but the improvement was concentrated primarily in the regex task rather than appearing consistently across all three tasks.

Per-task detail

Tokens = coding + reasoning. Quality = agreement with the stdlib oracle over the fuzz corpus (1.000 = no mismatches observed in the tested corpus). Codex models and GLM-5.2: 3 trials (mean). Anthropic models: 1 trial. Effort is matched for Sol vs Luna (both xhigh via Codex); Fable ran high, Opus max, GLM-5.2 at full/default thinking.

Model / effort	quality	coding tok	reasoning tok	total tok	time
regex
Sol xhigh	1.000	7,987	5,230	13,217	304s
Luna xhigh	0.77	25,752	23,076	48,828	487s
Luna max	0.999	30,612	27,528	58,140	752s
Fable 5 high	1.000	3,569	43,288	46,857	601s
Opus 4.8 max	n/a	1,898	129,363	131,261	1681s
GLM-5.2 full	0.81	3,392	34,566	37,958	627s
literal
Sol xhigh	0.93	6,455	3,680	10,135	183s
Luna xhigh	1.000	17,575	14,692	32,267	337s
Luna max	1.000	33,913	30,759	64,672	634s
Fable 5 high	1.000	3,731	42,572	46,303	602s
Opus 4.8 max	1.000	2,939	129,750	132,689	1619s
GLM-5.2 full	0.54	1,693	33,751	35,444	818s
shlex
Sol xhigh	1.000	802	447	1249	35s
Luna xhigh	1.000	8,500	8,136	16,636	174s
Luna max	0.99	9,261	8,941	18,202	185s
Fable 5 high	1.000	447	3,269	3,716	50s
Opus 4.8 max	1.000	598	16,503	17,101	196s
GLM-5.2 full	0.63	183	29,901	30,084	760s

Sol is efficient on both axes. It writes moderate code (~15K total) and reasons moderately (~9K, 38%). Nobody else is balanced like this.
Fable & Opus: compact code, runaway reasoning. Their solutions are the smallest (5–8K code total) — but at high/max effort they generate 89K (Fable) and 276K (Opus) reasoning tokens. Opus-max is 98% reasoning.
Luna is bloated on both. It writes the most code (52–74K — 400+ line solutions) and reasons ~half its tokens. Worst of both patterns.
GLM-5.2 over-reasons and under-delivers. At full thinking it spends 90–99% of its tokens reasoning (103K/trial, ~4× Sol) yet posts the lowest quality in the set (0.54–0.81). Same task, same prompt, its reasoning swung 11K→77K tokens and wall time 62s→1937s — the least predictable model measured. It’s the clearest single example of effort not converting to output.
More tokens ≠ correct. Single-shot, each model drops a task: Sol → literal (0.93), Luna-xhigh → regex (0.77), GLM → all three. The token premium buys no reliability.

Then i got paranoid about my own test methodology and went digging around again to re-confirm

Differential fuzzing

published method Grading against a trusted reference over thousands of fuzzed inputs is “Eq@DFuzz” (arXiv 2602.15761); see also Kaizen (2607.04058), ISSTA’24 oracle-guided selection. Catches what fixed unit tests miss.

Trusted oracle

standard Reference is Python’s own re/ast/shlex; models must match but may not import it (verified — no banned imports).

Token metric

matches field Output+reasoning tokens under fixed settings, jointly with accuracy, is exactly OckBench (2511.05722): “when models have similar accuracy, how much do they diverge in output-token efficiency?”

Self-correcting

verified The harness caught its own bugs twice — 4 mislabeled oracle cases, and an extraction glitch on Opus’s output — both found and fixed, not buried.

Limits

Anthropic reasoning is estimated. The API bundles thinking into output tokens, so Fable/Opus code is measured via tiktoken and reasoning is the remainder. Since their code is small, the reasoning estimate is robust directionally, but the exact split is approximate — and uses the OpenAI tokenizer on Anthropic code.
Provider-native token counts. OpenAI (Sol/Luna) and Anthropic (Fable/Opus) tokenizers differ. Sol-vs-Luna is a clean, same-tokenizer, same-harness comparison; comparisons to Fable/Opus are cross-tokenizer and cross-harness — directional, not exact.
Effort is not uniform. Sol & Luna are effort-matched (Codex xhigh; Luna also at max); Fable ran high, Opus max. Read within-provider first.
Sample size. Codex and GLM n=3 (with large variance); Anthropic n=1 (each run costs $3–4 and ~10–28 min). Treat Anthropic cells as single-point.
Single-shot ≠ cost-per-completed-task. Raw generation only; production cost would add retries for the reliability misses.

Mapping Activation Space: Peeking Inside the Model

Posted on March 10, 2026March 10, 2026 by Malinda RathnayakeLeave a Comment

The previous post covered what happens when you hit enter. Tokens flow through layers, probabilities get shaped, text comes out. System prompts anchor the model in activation space. Temperature controls how tightly it follows that anchor.

This post goes one level deeper, into the model’s layers. What are those activations? What shape do they take?

For this experiment, I used a well-documented workflow built around Google’s Gemma 2 2B model and the Gemma Scope residual stream SAEs. These are Sparse Autoencoders trained by Google on Gemma’s residual stream activations. They act as auxiliary models that decompose dense internal states into sparse, more interpretable features. The tools are Gemma-specific but the concepts apply to any transformer model.

Vectors: The Model’s Internal Language

Think about the gas laws. Compress a gas, it gets hotter. The macroscopic behavior is simple. Underneath, there’s a seething mass of atoms bouncing around, and the real explanation to why it gets hot when compressed lives in those microscopic dynamics.

Neural networks work the same way.

The macroscopic behavior is “the model talks about France.”
The microscopic dynamics are activation patterns flowing through 2304-dimensional space.

We’re mapping the microscopic level

A vector is a list of numbers. In Gemma 2 2B, the vector at each token position is a list of 2304 numbers. That’s the model’s d_model dimension, its internal working width.

No single number in that list is human-readable. But the pattern across all 2304 numbers encodes what the model “knows” at that point in the text.

When Gemma processes “The capital of France is”, it does not have a dedicated slot for “France” and another slot for “capital”. Instead, the model has a specific geometric direction for “France” and another direction for “capital”. It adds those vectors together.

The vector at the last token position is just the mathematical sum of all those active concepts.

Because it was built through vector addition, that single coordinate in 2304-dimensional space contains the combined geometry of “France”, “geography”, and “Paris is next”.

We cannot visualize that space directly, but the model moves through it constantly, and every prediction depends on exactly where that final coordinate lands.

Toy Models of Superposition

The Linear Representation Hypothesis and the Geometry of Large Language Models

The Residual Stream: The Highway Everything Rides On

Vectors are the payload; the Residual stream is the bus. It is the main data pipeline through the transformer.

That 2304-dimensional vector persists across all 26 layers. Each layer does not replace it outright. It reads the current residual stream, computes an update, and adds that update back in.

The token embedding creates the initial vector. Layer 0 attention reads it, computes an update, adds it back. Layer 0 MLP does the same. This continues through all 26 layers, and the final vector gets mapped to token probabilities.
That additive pattern is why the residual stream is such an important place to inspect. Earlier information can remain available while later layers keep reshaping it. What survives, what gets amplified, and what becomes less useful depends on the sequence of updates across the network.

A prompt goes in. The model processes it token by token.
At layer 12, each token has a vector of shape (2304,) in the residual stream.
The SAE encoder maps that dense vector into a sparse feature space (16,384 dimensions for the coarse SAE, 262,144 for the fine SAE) where only a small number of features activate.
The SAE decoder maps the sparse representation back into (2304,) to approximate the original residual state.
A steering vector is the decoder vector for a specific SAE feature. Adding it into the residual stream biases the model toward that feature’s pattern. Phase 2 will test this.

Why Layer 12?

Gemma 2 2B has 26 transformer layers, numbered 0 through 25. I’m hooking into layer 12 — roughly 46% depth.

The hook point is blocks.12.hook_resid_post, capturing the residual stream after layer 12 finishes processing.

Production inference engines and local runners use quantized formats and custom C++ engines for speed, which strips out the Python-level hook system.

To get around this, we load the exact same Gemma 2 2B weights from HuggingFace as native PyTorch tensors (raw weights), which gives us the ability to tap any layer.

Gemma Scope trained SAEs on every layer of Gemma 2 2B, so layer 12 was a deliberate choice was based on Google’s own interpretability research papers.

Announcing Gemma Scope 2 — AI Alignment Forum

Gemma_Scope_2_Technical_Paper.pdf

What Does “Mapping Activations” Mean?

Coarse (16K SAE): fewer features, each covering more conceptual ground.
Fine (262K SAE): more features, each with higher resolution on specific facets.

An SAE decomposes a single 2304-number vector into thousands of sparse components called features.

The residual stream is dense and compressed. Gemma has to represent a huge amount of structure in just 2304 dimensions, so many different patterns get packed together in superposition. That means individual residual dimensions are not clean, human-readable concept slots. A single dimension can participate in multiple unrelated behaviors depending on the context.

This is where the Sparse Autoencoder helps. Instead of trying to interpret the raw 2304-dimensional state directly, the SAE projects it into a much larger sparse feature space, such as 16K or 262K features. In that expanded space, only a small number of features activate for a given input, and those features are often easier to interpret than the original dense residual dimensions.

So back to the actual numbers. The SAE takes that tightly packed, dense circuitry and expands it out into a much wider space so we can see which patterns actually fired.

How sparse is sparse, what is a SAE?

Think of an audio spectrum analyzer. It takes a single, dense audio wave where the kick drum, bass, and vocals are all mashed together, and splits them out into distinct frequency bands. Most bands stay flat. Only the frequencies actually present in the audio spike up.

The SAE is a semantic spectrum analyzer.

The model’s Layer 12 vector is the dense audio wave, a 2304-number mess where all the active concepts are added together.

The SAE takes that dense wave and splits it out across 16,384 distinct “frequency bands” (features).

Because a single sentence only contains a handful of concepts, the SAE only spikes about 82 of those bands to represent the entire input.

The other 16,000+ bands stay flat at zero simply because those concepts are not in the sentence. Out of 16,384 possible patterns the SAE can detect, a typical input lights up fewer than 1% of them. The rest stay silent. That is what sparse means.

Feature #3999:   mean activation 3.67, selectivity 0.97  ← France (strongest signal)
Feature #11333:  mean activation 4.48, selectivity 0.89  ← France-related
Feature #9473:   mean activation 1.21, selectivity 0.88  ← France-related
Feature #14805:  mean activation 1.24, selectivity 0.86  ← France-related
Feature #6211:   mean activation 1.27, selectivity 0.86  ← France-related
...
Feature #6,004:  activation 0.00                         ← silent
Feature #6,005:  activation 0.00                         ← silent
[~11,000 more at zero]

The key column we should focus on is selectivity: how exclusively does a feature fire on France prompts versus neutral ones?

Feature #3999 scores 0.97. This means it almost exclusively fires on France-related input. Notice #11333 actually fires harder (mean 4.48 vs 3.67) but has lower selectivity. It bleeds into neutral prompts too. A feature that fires on everything is not telling you anything useful, no matter how loud it is.

I chose France as a sanity-check target because the interpretability method was already grounded in prior work, and Gemma Scope gave me a validated SAE setup to test whether my own pipeline outputted correct Data.

The experiment follows standard A/B logic.

Run 400 prompts total through Gemma: 200 France-themed prompts and 200 neutral filler prompts.
Capture the layer 12 activations for both sets.
Decompose both through the same SAEs.
Use Python to compare features that fire consistently on the France set but stay quiet on the neutral sets.

Now we have the candidates for “France features.” Features that fire on everything else can be safely considered noise.

Two Resolutions, One Concept

The 16K SAE (coarse) might give you one feature, call it #3999, that broadly responds to “France.” One fat bucket for the whole concept.

The 262K SAE (fine) might split that same concept into multiple features.

Feature #86473 fires on a subset of France prompts. Feature #249284 fires on a different subset. Feature #243533 on another.

Each one potentially encodes a different facet (cuisine, geography, landmarks, language), though we do not know which facets they are yet. More on that problem in a minute.

This is hierarchical feature decomposition.

Same logic as hand-designing a vision network:

Edges combine into shapes
Shapes combine into objects
Objects together in a scene change the meaning again (meaning as in what regions get excited to infer what it is)

but instead of happening across multiple sequential layers, this hierarchy exists at different levels of granularity within the exact same layer.

The coarse SAE has a smaller dictionary, so it is forced to find the whole object.

The fine SAE has a massive dictionary, so it can afford to isolate the specific parts and edges. Nobody hand-wired these decompositions. They emerged naturally from the training data.

The cross-resolution mapping step figures out which fine features correspond to which coarse features. It measures two things:

Activation correlation: Do they fire on the same prompts? (behavioral evidence)
Decoder cosine similarity: Do their decoder vectors point in the same direction in the 2304-dimensional residual stream? (Geometric evidence)

Here are the actual results from the experiment.

The Heatmap: Decoder Cosine Similarity

The heatmap shows the same data from a different angle. Each cell measures how much a coarse features and a fine features point in the same direction inside the model.

High values mean they’re encoding the same thing at different zoom levels.
Low values mean they’re unrelated.

Coarse	Strongest Fine Match	Similarity	Reading
C-3999	F-249284	0.80	Near-identical direction. Almost certainly the same concept at different granularity
C-11333	F-86473	0.54	Strong overlap, plus two more matches at 0.42 and 0.34
C-9473	F-243533	0.47	One clear sub-feature

Values near zero mean unrelated. Above 0.3 is meaningful geometric correspondence.

Each of the 262,000 fine features was compared against C-3999 to see how closely they point in the same direction. Almost all of them scored near zero, meaning no geometric alignment.
The histogram shows where the crowd is: piled up at zero, with a long empty stretch before the red line at 0.3. The handful of features past that line are the ones that actually share a direction with C-3999. That’s why 0.3 is the cutoff. It’s where the crowd ends and the signal begins.

Coarse-to-Fine Feature Decomposition

Every prompt was decoded through both SAEs at the same time. The coarse SAE gave us three broad France features. The fine SAE gave us six narrow ones.

The interactive dashboard below maps every possible pair of these coarse and fine features to measure how strong their relationship actually is.

All data shown is derived from real activations collected by running 400 prompts through Gemma 2 2B and decomposing layer 12 through both Gemma Scope SAEs.

The left side shows the decomposition graph. It maps exactly which fine features branch off from our three coarse anchors.

Link width represents behavioral correlation (do they fire together?).
Link color represents geometric similarity (do they point the same way?).

The right side plots these exact same pairs in metric space:

The X-axis is behavioral correlation: do they fire on the exact same prompts?
The Y-axis is geometric similarity: do their vectors point in the same direction inside the model?

Look at the color coding on the scatter plot. It tells the whole story:

Blue dots (Top Right): These are the true sub-feature matches. They have high correlation and high geometric similarity.

They co-fire AND point the same way inside the model. The strongest pair is C-3999 to F-249284.

Orange dots (Middle): These are partial overlaps.

They fire together often, but their geometry is drifting apart.

Red dots (Bottom): These are co-occurring concepts. They sit in the lower half with moderate correlation but cosine similarity near zero. These features fire on many of the same prompts but point in completely different directions.

They co-occur with France, but they are not encoding the same concept. They are related ideas that travel together, not the same idea at two zoom levels.

Reading the scatter plot

To better understand the graph, think of these features as passive sensors hooked up to the Layer 12 data bus:

1. The Empty Top-Left (High Similarity, Low Correlation)
If Sensor A and Sensor B point in the exact same direction, they are going to catch the exact same meandering vectors. Always. You physically cannot have a vector pass by that trips Sensor A but misses Sensor B. That is why high geometric similarity mathematically forces high correlation. The top-left is empty because it defies physics.

2. The Bottom-Right (Low Similarity, High Correlation)
Sensor A points at “France”. Sensor B points at “Food”. They point in completely different directions (low geometric similarity). But when the vector for “French Cuisine” meanders down the data bus, it contains enough geometry to trip both sensors at the same time. They fire together (high correlation) even though they are looking for different things. These are your co-occurring concepts.

3. Low Selectivity (The noisy features)
If a sensor points in a direction that catches “France” but also catches a bunch of other random vectors meandering by (like “Germany” or “cheese”), it will have a lower selectivity score. This is exactly what you saw with Feature #11333 earlier in the post—it fired loud, but it fired on too much unrelated traffic to be a clean “France” feature.

The Interpretation Gap

C-3999 gets labeled “France” because it reliably activates on France prompts and not on neutral ones. That’s standard practice in interpretability work. Label a feature by what activates it.

But the model doesn’t know the word “France.” It has a direction in 2304-dimensional space that, for whatever internal reason, turned out to be useful for predicting the next token when France-related patterns show up. We call it “France” because that’s the human category our test prompts were organized around. The model’s internal geometry might carve the world along boundaries that don’t align with our conceptual categories at all — we just can’t tell, because we only test with prompts organized around our categories.

What we actually know	What we’re assuming
C-3999 fires on France prompts, not neutral	C-3999 “means” France
F-86473 fires on a subset of France prompts	F-86473 is a France sub-concept
C-3999 and F-249284 point in similar directions	They encode related meanings
Injecting C-3999’s direction changes output toward France-like text	The feature is causally involved in France generation

The labels (“French cuisine,” “Paris landmarks”) are human interpretations based on which prompts activate a feature. The model doesn’t have those labels. It has frozen weights and an activation landscape that we’re projecting human categories onto.

Neural nets don’t memorize data. They find regularities and generalize those regularities to new data. A model will generate plausible text about unicorns even though it’s never seen one described as real, because it’s learned the relational structure of mythical creatures, horses, and horns. The internal representation that enables that generalization doesn’t need to map onto our concept of “unicorn.” It just needs to be useful for next-token prediction. When we label a feature “France,” we’re assuming the model’s useful regularity aligns with our semantic boundary. Sometimes it does. Sometimes we don’t know.

This is the wall that people like Neel Nanda keep writing about in mechanistic interpretability research. It was interesting to actually hit it myself. I can identify which features fire and when.

We can measure geometric relationships between them. But mapping that to human-readable meaning is always an inference, never ground truth.

When I started this project, I wanted to build something like the Activation Space Navigator from the previous post, but using real model data derived from SAEs, I pictured clean clusters with labeled regions where you could point and say, “that’s France.”

The real data did not look like that. What it gave me instead were directions in the model’s internal representation that reliably correlate with France-themed input.

Every feature in the SAE is a literal 2304-dimensional vector stored in the SAE’s decoder matrix. Feature C-3999 is just row #3999 in that matrix. It acts as a static reference coordinate for “France”.

This exact mechanic applies to any concept. If we were testing Python code, HTML tables, or HTTP status codes, there would be a different row acting as the reference coordinate for that specific pattern.

The reference vectors in the SAE do not move. They are fixed in place like highway signs. The model’s dynamic state passes by them.

When the France prompt generates a vector that passes close to the C-3999 sign, that specific band on our semantic spectrum analyzer spikes.

Neutral prompts pass further away, so the band stays flat. Those spikes are the sparse values we actually record.

One thing worth naming: this is all an approximation. The SAE reconstructs the residual stream from its learned directions, and the reconstruction is not perfect. Some signal is always lost. We are working with a useful approximation of the model’s internal state, not the thing itself.

So it is not that “France activated these regions of the model’s weights.” The weights are frozen. The model has learned internal directions for France-related patterns, and when France text flows through, the residual stream aligns with those directions. That alignment is what we are measuring.

Closing the loop

The previous post described system prompts as activation-space manipulation. This experiment gives me supporting evidence for that framing.

The directions those prompts appear to push activations toward are measurable, and some of that structure can be decomposed into narrower features.

What I found is suggestive structure, not full semantic ground truth.

The coarse-to-fine matches look real enough to justify the next step, which is testing whether steering along those directions’ changes generation in a targeted way.

What’s Next for This Project

The activation mapping is done. I found and observed suggestive structure:

coarse France-related features that appear to decompose into finer sub-features, supported by both behavioral correlation and geometric similarity.

The GO/NO-GO question was whether the coarse-to-fine mapping would produce 3+ meaningful sub-features per coarse anchor.

C-11333 has three above threshold. That’s a GO.

Phase 2 is the actual Steering Experiment.

Take those mapped features and test whether multi-resolution steering (coarse “France” + fine sub-features) produces better, more targeted output than single-resolution steering alone.

If it does, that’s evidence the cross-resolution structure isn’t just a statistical artifact. It’s a lever we can pull to tweak the behavior of a model

If it doesn’t, the structure is real but doesn’t actually control what the model generates. Correlation isn’t causation, even inside a neural network.

Either way, I’ll know more about what those frozen weights are actually doing when we hit enter.

References

After the Weights Freeze: What Happens When You Hit Enter – the companion post on inference, temperature, and system prompts
Steering GPT-2-XL by adding an activation vector – the foundational activation steering paper
Mapping the Mind of a Large Language Model – Anthropic’s feature visualization work
Gemma Scope – Google’s Sparse Autoencoders trained on Gemma 2
TransformerLens – the library used for hooking into the residual stream
SAE Lens – library for loading and running Sparse Autoencoders

After the Weights Freeze: What Happens When You Hit Enter

Posted on February 23, 2026March 10, 2026 by Malinda RathnayakeLeave a Comment

In the last post I tried to explain how an LLM gets built. Billions of numbers, adjusted one fraction at a time, until structure emerges from prediction pressure. Circuits form. Clusters of meaning self-organize.

But that post ends where the interesting part begins. Now the model exists. The weights are frozen. Training is done.

Now we type something and hit enter. What actually happens?

This is the post I wish I’d had when I started using these tools.

The Forward Pass

We type a message, Text gets chopped into tokens.

Subword chunks, not full words. “Understanding” becomes something like ["under", "standing"]. Your message might be 20 words but 30+ tokens.

Those tokens flow forward through the model’s layers. Every layer transforms the representation. The attention mechanism lets each token look back at every other token in the context and decide what’s relevant.

The weights don’t change during this process. They’re frozen from training. The model is just running, applying its learned patterns to your specific input.

What comes out is a list.

Lets Say we type: "Write a short paragraph about Kafka vs RabbitMQ"

The model tokenizes that, processes it through every layer, and has to pick the very first token of its response.

To do that, it computes a score for every token in its vocabulary.

What’s a vocabulary?

The vocabulary is the fixed list of every token the model knows, built before training using byte-pair encoding on a massive text corpus that LLM companies have scraped off the internet.

For GPT-2, that’s 50,257 tokens. For newer models it’s larger, often 100k+.

The output is a probability distribution across that entire vocabulary, every time, for every single token it generates.

We are going to use GPT-2 as an example to explain the concept

For that first token, the raw scores (logits) might look something like this:

Token 8,527  ("When"):     0.1263
Token 16,401 ("Kafka"):    0.0891
Token 3,198  ("The"):      0.0734
Token 11,045 ("Both"):     0.0622
Token 23,189 ("Apache"):   0.0418
Token 1,550  ("In"):       0.0387
Token 42,007 ("Choosing"): 0.0095
Token 7,904  ("Message"):  0.0071
Token 33,421 ("While"):    0.0068
Token 50,012 ("ĠðŁ"):     0.0000003
Token 831    (" Q"):        0.0000001
...
[50,246 more entries trailing into the decimals]

Every generation step. Fifty thousand scores. The model doesn’t “think of” the top five and pick one.

It produces all 50,257 simultaneously and the sampling process decides which one wins. Most of that list is near-zero noise.

Tokens like emoji fragments and random punctuation that have no business starting a paragraph about message brokers. But they’re scored anyway. Every time.

This is the fundamental object we’re manipulating every time we use these tools.

A probability distribution over the entire vocabulary, shaped by everything the model has seen so far in the context window.

Let’s hang onto that mental image. It will make everything else in this post make sense.

The Dice Roll in Practice

Other post covered temperature conceptually. Low temp means predictable, high temp means creative. But knowing how it works changes how we use it.

The model produces those raw scores (logits) for all 50,257 tokens. Temperature divides those scores before they get converted to probabilities. That division matters.

Lets use our Kafka vs RabbitMQ prompt and trace what happens.

Low temperature (0.2): Stick to the spec

The division amplifies the gaps between scores. “When” was already the top pick, and after low-temp scaling it dominates. The model opens with “When” almost every time. Run it five times:

Run 1: "When comparing Kafka and RabbitMQ, the key distinction lies in..."
Run 2: "When choosing between Kafka and RabbitMQ, it's important to..."
Run 3: "When evaluating message brokers, Kafka and RabbitMQ represent..."
Run 4: "When comparing Kafka and RabbitMQ, the key distinction lies in..."
Run 5: "When choosing between Kafka and RabbitMQ, the fundamental..."

Nearly identical openings. The first token barely varies, and that initial choice constrains everything that follows

Runs 1 and 4 might be word-for-word identical for the first 20 tokens before the dice diverge.

High temperature (1.0): Creative

The division shrinks the gaps. “When” is still the most probable, but “Kafka,” “Both,” “Apache,” “Choosing” all get a real shot. The outputs sprawl:

Run 1: "Both systems handle messaging but their philosophies diverge..."
Run 2: "Kafka treats the log as the fundamental abstraction..."
Run 3: "Choosing between these two usually comes down to whether..."
Run 4: "Apache Kafka and RabbitMQ solve overlapping problems from..."
Run 5: "In the messaging landscape, Kafka and RabbitMQ occupy..."

The model is running the same weights and producing the same kind of distribution, temperature just changes how adventurous you are when sampling from it.

Each choice cascades. Once the model starts with “Kafka treats the log,” the next token distribution shifts entirely compared to starting with “Both systems handle.”

Temperature = 0: greedy decoding. Always pick the highest-scoring token. Completely deterministic: same input, same output, every time. No dice roll at all.

Then there’s the filtering that happens before the roll.

Top-k says “only consider the k highest-scoring tokens” (exclude the rest, then renormalize).
Top-p (nucleus sampling) says “start from the top and keep adding tokens until their cumulative probability reaches p (a threshold you choose, like 0.9), then exclude the rest.”
Most production systems use some combination of all three. (If you want the full technical breakdown of these decoding methods, Hugging Face’s walkthrough is excellent.)

This is why “regenerate” gives us a different response. Same weights, same context, same list of 50,257 scores. Different roll of the dice. The terrain is identical. The path through it changes.

Modern agentic coding tools like Cursor/Cline/Codex use separate modes for planning vs coding/debugging to take advantage of this, often along with different system prompts/constraints.

Planning needs to explore options, consider architectures, think laterally. That’s higher temperature territory.

Writing the actual code from the plan needs to be precise and deterministic. That’s lower temperature.

Same model behind both modes. Different sampling strategy for different phases of the work.

Where the knobs actually are

If you’re calling the model via an API, you can usually tune these parameters to match your needs. If you’re using a chat UI/tool, the app typically picks defaults for you.

With the Claude API, we get temperature (0.0 to 1.0, defaults to 1.0), top_k, and top_p. Anthropic’s guidance is to just use temperature and leave the others alone.

With Google’s Gemini, we get temperature controls directly in the AI Studio UI. No API needed, just a slider. Their range goes from 0.0 to 2.0.

Temperature 0.5 on Claude and temperature 0.5 on Gemini don’t produce the same behavior.

Each provider trains and tunes differently, so the same number produces different sampling characteristics. It’s the same concept across all of them, but we can’t just copy settings between providers and expect identical results.

System Prompts as Activation Space Anchoring

Before I started going down this rabbit hole, I assumed system prompts worked like config flags.

“Set the model to be a Python expert.” “Tell it to be concise.” Flips a switch in the model and changes the behavior. I think most people using these tools have that same mental model.

Turns out I was wrong. And understanding what’s actually happening made me noticeably better at using these tools.

A system prompt is text. It gets tokenized and fed into the model as the first tokens in the context window. Those tokens flow through the same layers as everything else. They produce activations, patterns of neural activity inside the model. And those activations influence every token that comes after.

Check the galaxy map from Anthropic’s feature visualization, where concepts cluster into neighborhoods

Code near code, legal language near legal language, casual conversation near casual conversation

The system prompt doesn’t tell the model which neighborhood to visit. It starts the model in that neighborhood.

When you write "You are a senior Python developer who writes production code with proper error handling, type hints, and logging"

every one of those tokens activates features in the model. “Senior” pulls toward experienced patterns.

“Production” pulls toward robustness. “Error handling,” “type hints,” “logging” each activate their own clusters.

Those activations become part of the context. Every subsequent token the model generates is influenced by them because the attention mechanism lets every new token look back at the system prompt tokens.

(For Claude API, Anthropic has documentation on how system prompts work at the implementation level.)

The system prompt seeds the context window with tokens that bias which internal features and clusters activate. It pulls the model into a specific region of its representation space (I’m using ‘space’ loosely here, it’s more ‘internal state’ more than geometry).

this is activation space anchoring (this is an analogy for shifting internal representations, not a literal coordinate system like a map.)

(This isn’t just theory. we can literally steer GPT-2’s behavior by adding activation vectors into its forward pass. Add a “wedding” vector and the model talks about weddings. Add an “anger” vector and it gets hostile. The activations are the steering mechanism, and system prompts are doing a version of the same thing through natural language.)

And this connects directly to the probability list. The system prompt doesn’t add new tokens to the vocabulary. It doesn’t unlock hidden capabilities.

What it does is reshape the probability distribution over those same 50,257 tokens.

Tokens related to the system prompt’s domain get boosted. so it will assign high probability to tokens in that subjects domain

Take our Kafka vs RabbitMQ prompt again. Without a system prompt, the first-token distribution had “When” on top, “Kafka” and “Both” trailing behind, a generic opening for a generic comparison.

Now add a system prompt: "You are a senior distributed systems architect. Prioritize throughput, partition tolerance, and operational tradeoffs. Be direct."

The same prompt. But those system prompt tokens have been flowing through the model’s layers, activating features related to distributed systems, performance, architecture. By the time the model gets to our question, the probability landscape has shifted:

Token 16,401 ("Kafka"):    0.1534   (was 0.0891)
Token 3,198  ("The"):      0.0812   (was 0.0734)
Token 6,571  ("At"):       0.0498   (new in top 10)
Token 11,045 ("Both"):     0.0411   (was 0.0622)
Token 8,527  ("When"):     0.0389   (was 0.1263, dropped hard)
Token 19,888 ("From"):     0.0285   (new in top 10)
Token 23,189 ("Apache"):   0.0271   (was 0.0418)

“When,” the safe essay-style opener, dropped from first place to fifth. “Kafka” jumped to the top. The model is more likely to lead with the technical substance rather than a comparison framework. That "Be direct" token cluster suppressed the hedging openers. The distributed systems context boosted tokens that lead to architectural analysis.

Same vocabulary. Same 50,257 entries. Different weights across the list.

Simplified interactive visualization of Activation Space Anchoring

Activation Space Navigator

This is a toy 2D visualization. Real activation space is enormous and multidimensional, but it’s faithful to the idea that prompts steer the model by shifting internal activations.

Temperature + System Prompt: Two Knobs, One Process

Once you start seeing it this way, temperature and system prompts stop being separate concepts.

The system prompt shapes which probabilities are high and which are low. It sculpts the distribution. Boosts code tokens, suppresses casual ones, or whatever the prompt content biases toward.

Temperature controls how strictly the model follows that shaped distribution.

Low temperature means “stick to what the system prompt is pushing you toward.”

High temperature means “the system prompt set a direction, but feel free to wander.”

They’re two knobs on the same process. One shapes the probability field. The other controls how tightly the model walks along its ridges.

MoE Routing: When the Architecture Gets Involved

Some models take this further. Mixture of Experts (MoE) architectures, used in models like Gemini and DeepSeek, don’t activate all their parameters for every token. They route each token through a subset of specialized “expert” subnetworks. (Hugging Face has a solid explainer on how MoE works with a full architecture breakdown.)

tldr;

In a MoE model, the system prompt tokens flow through the network and produce hidden states, just like in a dense model. But the way they influence routing is indirect, and it matters to get this right.

The router itself is stateless. It’s a simple feed-forward layer that looks at one token’s hidden state and decides which experts to use. It has no memory of what came before. So the system prompt tokens don’t “tilt” the router or bias it over time.

What actually happens is the Attention mechanism does the work first. When a token from your actual question (say “Kafka”) is being processed, it attends back to the system prompt tokens (“You are a distributed systems architect”).

That attention pulls system-prompt context into the current token’s hidden state vector. By the time that enriched “Kafka” vector reaches the MoE layer, it looks different than it would without the system prompt. The router sees that specific vector, evaluates it, and routes it to the experts that match. A “Kafka” vector colored by distributed systems context gets routed differently than a “Kafka” vector colored by literary analysis context.

It’s not a clean “wake up the code expert” signal. It’s per-token and indirect. The system prompt infects each new token through Attention, and that infected representation is what the router evaluates.

Implementation details vary by architecture, but the core idea is the same: routing decisions are made per-token from that token’s current hidden state.

The effect is real, but the mechanism is Attention doing the heavy lifting before the router ever sees the token.

This is very similar to the activation anchoring principle, but operating at an additional architectural level. Not just biasing which features activate within a single network, but biasing which sub-networks get used at all.

Why Models Drift in Long Conversations

This one drove me nuts before I understood the mechanism.

we write a careful system prompt. The model follows it perfectly for 10 messages. By message 20, it’s drifting. The tone shifts.

It starts complimenting you. It forgets constraints you set. With some models, the anti-sycophancy instructions you wrote might as well not exist after enough back-and-forth.

The architecture explains exactly why.

Attention has a cost that scales with context length. As the conversation grows, each new token has more previous tokens to attend to. The system prompt tokens are still there, they haven’t been deleted, but they’re now a small fraction of a much larger context window.

Think of it like a voice in a growing crowd. Your system prompt is a person at the front of the room speaking clearly. When there are 10 people in the room, everyone hears this them fine. When there are 500 people all talking, that original voice gets harder to pick out.

As context grows, relevant instructions can lose salience among competing tokens; re-anchoring helps reenforce intended context.

Transformers don’t inherently know word order, so they use positional encodings (like RoPE, Rotary Position Embedding) to inject position information into each token.

These encodings bias the attention mechanism to favor tokens that are physically closer. As the conversation gets longer, the physical distance between the current token and the system prompt grows.

Now when we Combine that distance penalty with the fact that recent back-and-forth dialogue we built up in the chat, the system prompt’s anchoring effect fades.

And what fills the gap is the model’s base personality. The behaviors baked in during RLHF and preference tuning.

The agreeable, helpful, slightly sycophantic tendencies that training optimized for. The system prompt was overriding those tendencies, but as its influence weakens, the base behavior seeps back through.

This is why context window isn’t just a memory constraint. It’s a behavioral stability constraint.

A model with a 128k context window doesn’t just remember more, it maintains system prompt influence over a longer conversation.

( “Lost in the Middle” Paper shows language models perform best when relevant information is at the beginning or end of the context, and significantly worse when it’s buried in the middle. system prompt sits at the very beginning, which helps, but distance penalty applies)

Practical Implications

Dense system prompts beat fluffy ones.

Length isn’t the problem. Anthropic’s own default system prompt for Claude is thousands of tokens long, and it works. A 2,000-token prompt packed with dense architectural constraints, few-shot examples, strict schemas, and specific behavioral rules.

This creates a massive anchor in the context that practically forces the model into a specific behavioral subspace.

But a 2,000-token prompt full of vague running sentences (“Be a helpful, friendly, synergistic assistant who always puts the user first”) is actively sabotaging the prompt and just burning tokens and a little hole in your wallet and warming our planet.

Every token in the system prompt must earn its keep. The failure mode isn’t “too long,” it’s “too much noise.” Contradictory instructions, redundant phrasing, and generic filler all dilute the signal of the tokens that actually matter.

Domain context is activation anchoring.

When we paste a code file, an API schema, or a data model into the context, we are not just “giving the model information.” Its flooding the context with domain-specific tokens that bias the entire activation landscape.

This is why RAG (Retrieval-Augmented Generation) is popular. Not just because the model “reads” the retrieved documents, but because those documents’ tokens reshape the probability distribution toward domain-relevant outputs.

Temperature stacking with system prompts.

Now we can be deliberate: use a tight system prompt to sculpt the distribution, then use temperature to control variance within that sculpted space.

Tight prompt + low temp for implementation.

Tight prompt + higher temp for exploring design alternatives. Same anchor, different sampling discipline.

Mitigations

Refresh the system prompt in long conversations. when you are 30 messages deep and the model is drifting, restating the key constraints will re-anchor the model. we are injecting fresh system-prompt-like tokens closer to the model’s current attention window, boosting their influence relative to the stale tokens at the beginning.

Use spec-based development and write skills. Every modern agent supports them. A spec is a dense, structured document that front-loads context.

Skills are reusable instruction sets that get injected into the system prompt. Both are mechanisms for packing the context window with high-signal tokens that keep the model anchored to what we actually want. I wrote about this workflow in a previous post.

Same Patterns, Different Layer

At the inference layer, the mechanism is different but the shape is the same.

We write a prompt. Those tokens create activation patterns. Those patterns bias a probability distribution. Sampling selects from that distribution. The output feeds back in and the loop continues. Simple operations, iterated, producing behavior that looks like understanding.

The system prompt anchors activation space the same way training data anchors weight space: through statistical pressure on what comes next.

The patterns repeat across layers of the system. Training, architecture, inference, usage. Layers within layers across densely packed weights in the network.

This is not a deep insight. but once we see the machinery, the mystique fades. The model isn’t doing something magical when it writes good code or drifts into sycophancy.

It’s doing math on probability distributions. Understanding that makes us better at using them.

“When you hit enter”, you are querying a frozen snapshot. The model cannot learn from your prompt. Even if you use RAG or an agent to inject additional context, you are only modifying the input state, the model itself remains static, routing those new tokens through the exact same frozen circuitry.

This is why the biggest lever for making a model smarter is packing more high-signal data into the weights before the freeze. And that single fact is driving the entire AI economy we have today in 2025-2026. It’s why AI labs are scraping every corner of the internet, triggering massive copyright lawsuits from publishers and artists.

The more impactful issue today is the violently expensive infrastructure required to store and process it all. To build and run these frozen matrices, High Bandwidth Memory (HBM) for AI accelerators is currently eating the global supply of DRAM wafers. which is why a standard DDR5 kit costs roughly twice what it did a year ago.

Well, if you got this far, thanks for reading and I hope this helped, until next time!!!!

References and Further Reading

How to generate text: using different decoding methods for language generation with Transformers – Hugging Face
Mixture of Experts Explained – Hugging Face
Lost in the Middle: How Language Models Use Long Contexts – Liu et al., 2023
System Prompts – Claude – Anthropic
Steering GPT-2-XL by adding an activation vector – AI Alignment Forum
Mapping the Mind of a Large Language Model – Anthropic
Fractals All the Way Down – Post 1 in this series
Stop Fighting Your LLM Coding Assistant – The practical workflow post

Fractals All the Way Down

Posted on February 20, 2026February 23, 2026 by Malinda RathnayakeLeave a Comment

I’m not a machine learning engineer. But I work deep enough in systems that when something doesn’t make sense architecturally, it bothers me. And LLMs didn’t make sense.

On paper, all they do is predict the next word. In practice, they write code, solve logic problems, and explain concepts better than most people can. I wanted to know what was in that gap.

I did some digging. And the answer wasn’t that someone sat down and programmed reasoning into these systems. Nobody did. Apparently it emerged. Simple math, repeated at scale, producing structure that looks intentional but isn’t.

But that simplicity didn’t come from nowhere.

Claude Shannon was running letter-guessing games in the 1950s, proving that language has predictable statistical structure.

Rosenblatt built the first neural network around the same time.

Backpropagation matured in the ’80s but computers were too slow and data was too small but the idea kept dying and getting resurrected for decades.

Then in 2017, a team at Google Brain published a paper called “Attention Is All You Need” and introduced the Transformer architecture.

This crystallized the earlier attention ideas into something that scaled.

Not a new idea so much as the right idea finally meeting the infrastructure that could support it.

GPUs that could parallelize the math.
High-speed internet that made massive datasets collectible.
Faster CPUs, SSDs, and RAM that kept feeding an exponential curve of compute and throughput.

Each piece was evolving on its own timeline and they all converged around the same window. GPT, Claude, Gemini, all of it traces back to that paper landing at the exact moment the hardware could actually run what it described.

From what I’ve learned and what I understand, here’s what happens under the hood.

One Moment in Time

The model sees a sequence of tokens and has to guess the next one.

Not full words “tokens”. Tokens are chunks: subwords, punctuation, sometimes pieces of words. “Unbelievable” might get split into “un,” “believ,” “able.” This is why models can handle rare words they’ve never seen whole they know the parts.

It’s also why current models can be weirdly bad at things like the infamous “how many r’s in strawberry” question and exact arithmetic. Because the model reads ‘strawberry’ as two chunks 'straw' and 'berry' it literally cannot see the individual letters inside them.”

But the principle is the same.

Every capability, every impressive demo, every unnerving conversation anyone’s ever had with an LLM comes back to this single act a mathematical system producing a weighted list of what might come next. “The cat sat on the…” and the model outputs something like:

mat:    35%
floor:  20%
roof:   15%
dog:     5%
piano:   3%
...thousands more trailing off into the decimals

Those probabilities aren’t hand-coded. They come from the model’s weights and billions of numbers that were adjusted, one tiny fraction at a time, by showing the model real human text and punishing it for guessing wrong.

The process looks like this:

Let’s take a real sentence “The capital of France is Paris”

Then we feed it in one piece at a time.

The model sees “The” and guesses the next token. The actual answer was “capital.” Wrong guess? Adjust the weights.
Now it sees “The capital” and guesses again. Actual answer: “of.” Adjust. “The capital of” → “France” → adjust. Over and over.

Do this across hundreds of billions to trillions of tokens from real human text and the weights slowly encode patterns of grammar, facts, reasoning structure, tone, everything.

That’s pretraining. Real data as the baseline. Prediction as the mechanism. The model is learning to mimic the statistical patterns of language at a depth that’s hard to overstate.

Then we Loop It

One prediction isn’t useful. But chain them together and something starts to happen.

The model picks a token, appends it, and predicts the next one. Repeat.

That’s the autoregressive loop: the system feeds its own output back in, one token at a time.

Conceptually it reprocesses the whole context each step; but in practice it caches(KV cache) intermediate computations so each new token is incremental. But the mental model of “reads it all again” is the right way to think about what it’s doing.

the model can “look back” at everything that came before and not just the last few tokens this is the core innovation of the Transformer architecture.

Older approaches like RNNs, compressed the entire history into a single state vector, like trying to remember a whole book by the feeling it left you with.

Transformers use a mechanism called Attention

which is essentially content-addressable memory over the entire context window each token issues a query and retrieves the most relevant pieces of the past.

Instead of compressing history into one state, the model can directly reach back and pull information from any earlier token

which is why it can track entities across paragraphs, resolve references, and maintain coherent structure over long passages.

It’s also why “context window” is a real architectural constraint. There’s a hard limit on how far back the model can look, and when conversations exceed that limit, things start falling off the edge.

🗨️ Right here, with just these two pieces “next-token prediction and the loop” we already have something that can generate coherent paragraphs of text. No special architecture for understanding. Just a prediction engine running in a loop, and the patterns baked into its weights doing the rest.

But this creates a question: if the model only ever produces a probability list, how do we actually pick which token to use?

Rolling the Dice

This is where sampling comes in.

The model gives us a weighted list.

we roll a weighted die.

Temperature controls how hard we shake it and it reshapes the probability distribution.

🗨️ The raw scores are divided by the Temperature number before being converted to probabilities.

Gentle shake (low temperature) and the die barely tumbles and it lands on the heaviest side almost every time. The gaps between scores get stretched wide, so the top answer dominates. “Mat.” Safe. Predictable.

Shake it hard (high temperature) and everything’s in play. The gaps shrink, the scores flatten out, and long shots get a real chance. “Piano.” Creative. Surprising. Maybe nonsensical.

But temperature isn’t the only knob. There’s also top-k and top-p (nucleus) sampling, which control which candidates are even allowed into the roll.

Top-k says “only consider the 40 most probable tokens.”

Top-p says “only consider enough tokens to cover 95% of the total probability mass.”

These methods trim the long tail of weird, unlikely completions before the die is even cast. Most production systems use some combination of all three.

The weights of the model don’t change between rolls. it’s the same brain, the same probabilities, but different luck on each draw.

This matters because it’s how we can run the same model multiple times on the same prompt and get completely different outputs. Same terrain, different path taken. The randomness is a feature, not a bug.

Run that whole loop five times on the same input and we might get:

Run 1: "The cat sat on the mat and purred."
Run 2: "The cat sat on the mat quietly."
Run 3: "The cat sat on the roof again."
Run 4: "The cat sat on the piano bench."
Run 5: "The cat sat on the mat and slept."

Same model. Same weights. Same starting text. Five different outputs, because the dice rolled differently at each step and those differences cascaded.

Teaching the Model What “Good” Means

Pretraining gets us a model that knows what language looks like. It can write fluently, complete sentences, even produce things that resemble reasoning.

But it has no concept of “helpful” or “safe” or “that’s actually a good answer.” It’s just mimicking patterns. To get from raw prediction engine to something that feels like a useful assistant, we need another layer.

This is where Reinforcement Learning from Human Feedback (RLHF) comes in. which is essentially a feedback loop that turns a raw prediction engine into something with opinions

First, there’s supervised fine-tuning (SFT).

Take the pretrained model and train it further on curated examples of good assistant behavior

high-quality question-and-answer pairs
helpful explanations
well-structured responses

This is the “be helpful” pass. It gets the model into the right ballpark before the more nuanced optimization begins.

Preference optimization stage.

Take the fine-tuned model. Give it a prompt. Let it generate multiple candidate outputs using different sampling runs

same weights, different dice rolls, different results. Then a completely separate model “a reward model”, trained specifically to judge quality reads all the candidates and scores them. “Run 1 is an 8.5. Run 4 is a 4.”

Training: Take that ranking and tell the original model to adjust its weights so outputs like Run 1 become more probable and outputs like Run 4 become less probable.

Nudge billions of weights slightly. Repeat across millions of prompts. Sometimes the “judge” is trained from human preferences; sometimes it’s trained from AI feedback — same destination, different math.

The models we interact with today are the result of all that shaping. One set of weights that already absorbed the judge’s preferences. Often the judge doesn’t run at inference time its preferences are mostly baked into the weights though some systems still layer on lightweight filters or reranking.

Then It Gets Weird

Train a small model to predict the next token and it mostly learns surface stuff: grammar, common phrases, local pattern matching.

"The sky is ___" → "blue."

Exactly what we can expect from a prediction engine.

But scale the same system up with more parameters, more data, more compute and new behaviors start showing up that nobody explicitly programmed.

A larger model can suddenly do things like:

Arithmetic-like behavior. Nobody gave it a calculator. It just saw enough examples of “2 + 3 = 5” and “147 + 38 = 185” that learning a procedure (or something procedure-shaped) reduced prediction error. Sometimes it’s memorization, sometimes it’s a learned algorithm, and often it’s a messy blend.
Code synthesis. Not just repeating snippets it saw, but generating new combinations that compile and run.
Translation and transfer. Languages, formats, and styles it barely saw during training suddenly become usable.
Multi-step reasoning traces. Following constraints, tracking entities, resolving ambiguity, and doing “if-then” logic over several steps.

The unsettling part to me at least is how these abilities appear.

Some researchers argue these cliffs are partially measurement artifacts, a function of how benchmarks score rather than a true discontinuity.

But the visible shift in capabilities with scale is hard to deny. A model at 10 billion parameters can’t do a task at all. Same architecture at 100 billion, suddenly it blooms into something new.

Like a phase transition

water isn’t “kind of ice” at 1°C It’s still liquid. At 0°C it transforms into something structurally different.

The researchers call these emergent capabilities, which is a polite way of saying “we didn’t plan this and we’re not entirely sure why it happens.” This is why people like Andrej Karpathy openly say they don’t fully understand frontier models. Meanwhile the CEOs selling them have every incentive to amplify that mystique

A human didn’t code a reasoning module. The model needed to predict the next token in text that contained reasoning, so it built internal machinery that represents how reasoning works. Because that was the best strategy for getting the prediction right.

Once researchers realized these abilities were appearing, they started shaping the conditions that strengthen them:

curating training data with more reasoning-heavy text
fine-tuning on chain-of-thought examples that show working step by step,
using preference tuning / RLHF to reward clearer logic and more helpful outputs

The engineering in frontier models is more like gardening than architecture. They’re creating conditions for capabilities to grow stronger. They still can’t fully predict what will emerge next.

Looking Inside

So if nobody designed these capabilities, what’s actually happening in the weights?

This is the question that drives a field called “Mechanistic interpretability”

Here is a great blog post that helped me wrap my head around this

https://www.neelnanda.io/mechanistic-interpretability/glossary

Researchers are opening the black box and tracing what happens inside. The model is just billions of numbers organized into layers. When text comes in, it flows through these layers and gets transformed at each step. Each layer is a giant grid of math operations. After training, nobody assigned roles to any of these. But when researchers started looking at what individual neurons and groups of neurons actually do, they found structure.

Think of it like a brain scan. You put a person in an MRI, show them a face, and a specific region lights up every time. Nobody wired that region to be “the face area.” It self-organized during development. But it’s real, consistent, and doing a specific job.

The same thing happens inside these models.

Take a sentence like “John gave the ball to Mary. What did Mary receive?”

To answer this, the model needs to figure out that

John is the giver and Mary is the receiver,

track that the ball is the object being transferred,

and connect “receive” back to “the ball.”

When researchers traced which weights activated during this task, they found consistent substructures distributed patterns of neurons that reliably participate in the same kind of computation. Not random activation but structured pathways that behave like circuits. One pattern identifies subject-object relationships and feeds into another that tracks the object, which feeds into another that resolves the reference. in reality it looks messier and more distributed than a clean pipeline diagram, but the functional structure is real and reproducible and visually noticeable

it’s a circuit that just naturally emerged due to Prediction pressure during training forcing the weights to self-organize into reliable pathways because language is full of patterns like this

And these smaller circuits compose combine and feed into complex circuits. Object-tracking feeds into reasoning feeds into analogy. It’s hierarchical self-organization layers of structure built on top of each other, none of it hand-designed.

Anthropic published research mapping millions of features inside their model.

Mapping the Mind of a Large Language Model Anthropic

https://thesephist.com/posts/prism/

Nomic Atlas (Visual Representation)

They found individual features that represent specific concepts. Not “neuron 4,517 does something vague” but “this feature activates for deception,” “this one activates for code,” “this one activates for the Golden Gate Bridge.” Mapped into clusters,

Related concepts group near each other like neighborhoods in a city. A concept like “inner conflict” sits near “balancing tradeoffs,” which sits near “opposing principles.” It looks like a galaxy map of meanings and ideas that nobody drew.

some models like DeepSeek (Mixture of Experts) take this further.

They didn’t just develop one set of circuits. They train many specialized sub-networks within a single model and route each input to the most relevant ones.

Ask it a coding question and one subset of weights fires.
Ask it a history question and a different subset activates.

The model self-organized not just circuits, but entire specialized regions and a traffic controller to direct inputs between them. Same principle, one level up.

Spirographs and fractals

This is where the overall concept it self clicked for me.

Strictly speaking, neural networks are not closed mathematical loops. Conceptually, however, a spirograph illustrates exactly how they operate:

🗨️ Simple operations, iterated across a massive space, producing complex structure that looks designed but emerged on its own.

A spirograph is one circle rolling around another. Dead simple rule. Keep going and we get intricate symmetry that feels intentional. Change one tiny thing like shifting the pen hole slightly off-center, change the radius and now we get a completely different pattern.

Training is like that: same architecture, same objective, small changes in data mix or learning rate can yield meaningfully different internal structure.

And like fractals, the deeper we look, the more structure we find. Researchers keep uncovering smaller, sharper circuits. The same motifs repeat at different scales. The interesting behavior lives right on the boundary between order and randomness.

It’s the same pattern we can see in nature: simple rules, iterated, producing shapes that look designed.

Closing out the loop

In school I used to draw circles over and over with a compass, watching patterns appear that I didn’t plan.

Years later, I found myself messing around with Google’s DeepDream feeding images into a neural network and watching it project trippy, hallucinatory patterns back.

I thought I was making trippy images. What I was also seeing was the network’s internal pattern library being cranked to maximum.

The training objective is trivially simple “guess the next word”

But the internal machinery that emerges to get good at that objective ends up resembling understanding.

And “Resembles” is doing a lot of work there whether it’s true understanding, or an imitation so sophisticated the difference stops mattering in practice.

Or maybe it’s simpler than that. We trained it on patterns and concepts and texts created by organic brains which are themselves complex math engines. As a side effect, it took on the shape of the neurons that birthed it. Like DNA from mother and father forming how we look.

Just like we see in mother nature “It’s fractals all the way down”

Software is Just Loops and State

Posted on January 27, 2026January 27, 2026 by Malinda RathnayakeLeave a Comment

A program is a collection of loops. Some loops read state. Some loops write state.

The state lives somewhere – a database, Kafka, Redis, a file, memory.

And then other loops wake up and react to that state, and produce new state, and emit it somewhere else. And it keeps happening.

That’s it. That’s all software is.

The code is just the implementation detail of how the loops run. The architecture is really about where the state lives and what happens when a loop falls behind or dies.

When you zoom out

you can also see it everywhere.

A user clicks a button. That’s an event. It ripples through your frontend, hits an API, touches a database, maybe emits to a queue, wakes up some worker, which writes somewhere else, which triggers a notification, which reaches a human, who reacts.

Zoom out further and companies work this way. Markets work this way. Ecosystems.

Events have directionality. They ripple. They hit nodes. The nodes react and emit. The ripple continues.

It’s the same pattern at every scale.

The question

So when I’m stuck on an architecture decision, I ask:

Where does the state live?
What loops are reading it?
What loops are writing it?
What happens when a loop dies or falls behind?
What is the required lag between the write and the read?

That’s usually enough to untangle it and get me going again.

This isn’t a formal definition, just a practical lens I’ve found useful

Kubernetes Loop

Posted on December 14, 2025January 27, 2026 by Malinda RathnayakeLeave a Comment

The Architecture of Trust
Role of the API server
Role of etcd cluster
How the Loop Actually Works
As an example, let’s look at a simple nginx workload deployment
1) Intent (Desired State)
2) Watch (The Trigger)
3) Reconcile (Close the Gap)
4) Status (Report Back)
The Loop Doesn’t Protect You From Yourself
Why This Pattern Matters Outside Kubernetes
Ref

I’ve been diving deep into systems architecture lately, specifically Kubernetes

Strip away the UIs, the YAML, and the ceremony, and Kubernetes boils down to:

A very stubborn event driven collection of control loops

aka the reconciliation (Control) loop, and everything I read is calling this the “gold standard” for distributed control planes.

Because it decomposes the control plane into many small, independent loops, each continuously correcting drift rather than trying to execute perfect one-shot workflows. these loops are triggered by events or state changes, but what they do is determined by the the spec. vs observed state (status)

Now we have both:

spec: desired state
status: observed state

Kubernetes lives in that gap.

When spec and status match, everything’s quiet. When they don’t, something wakes up to ensure current state matches the declared state.

The Architecture of Trust

In Kubernetes, they don’t coordinate via direct peer-to-peer orchestration; They coordinate by writing to and watching one shared “state.”

That state lives behind the API server, and the API server validates it and persists it into etcd.

Role of the API server

The API server is the front door to the cluster’s shared truth: it’s the only place that can accept, validate, and persist declared intent as Kubernetes API objects (metadata/spec/status).

When you install a CRD, you’re extending the API itself with a new type (a new endpoint) or a schema the API server can validate against

When we use kubectl apply (or any client) to submit YAML/JSON to the API server, the API server validates it (built-in rules, CRD OpenAPI v3 schema / CEL rules, and potentially admission webhooks) and rejects invalid objects before they’re stored.

If the request passes validation, the API server persists the object into etcd (the whole API object, not just “intent”), and controllers/operators then watch that stored state and do the reconciliation work to make reality match it.

Once stored, controllers/operators (loops) watch those objects and run reconciliation to push the real world toward what’s declared.

it turns out In practice, most controllers don’t act directly on raw watch events, they consume changes through informer caches and queue work onto a rate-limited workqueue. They also often watch related/owned resources (secondary watches), not just the primary object, to stay convergent.

spec is often user-authored as discussed above, but it isn’t exclusively human-written, the scheduler and some controllers also update parts of it (e.g., scheduling decisions/bindings and defaulting).

Role of etcd cluster

etcd is the control plane’s durable record of “the authoritative reference for what the cluster believes that should exist and what it currently reports.”

If an intent (an API object) isn’t in etcd, controllers can’t converge on it—because there’s nothing recorded to reconcile toward

This makes the system inherently self-healing because it trusts the declared state and keeps trying to morph the world to match until those two align.

One tidbit worth noting:

In production, Nodes, runtimes, cloud load balancers can drift independently. Controllers treat those systems as observed state, and they keep measuring reality against what the API says should exist.

How the Loop Actually Works

Kubernetes isn’t one loop. It’s a bunch of loops(controllers) that all behave the same way:

read desired state (what the API says should exist)
observe actual state (what’s really happening)
calculate the diff
push reality toward the spec

As an example, let’s look at a simple nginx workload deployment

1) Intent (Desired State)

To Deploy the Nginx workload. You run:

kubectl apply -f nginx.yaml

The API server validates the object (and its schema, if it’s a CRD-backed type) and writes it into etcd.

At that point, Kubernetes has only recorded your intent. Nothing has “deployed” yet in the physical sense. The cluster has simply accepted:

“This is what the world should look like.”

2) Watch (The Trigger)

Controllers and schedulers aren’t polling the cluster like a bash script with a sleep 10.

They watch the API server.

When desired state changes, the loop responsible for it wakes up, runs through its logic, and acts:

“New desired state: someone wants an Nginx Pod.”

watches aren’t gospel. Events can arrive twice, late, or never, and your controller still has to converge. Controllers use list+watch patterns with periodic resync as a safety net. The point isn’t perfect signals it’s building a loop that stays correct under imperfect signals.

Controllers also don’t spin constantly they queue work. Events enqueue object keys; workers dequeue and reconcile; failures requeue with backoff. This keeps one bad object from melting the control plane.

3) Reconcile (Close the Gap)

Here’s the mental map that made sense to me:

Kubernetes is a set of level-triggered control loops. You declare desired state in the API, and independent loops keep working until the real world matches what you asked for.

Controllers (Deployment/ReplicaSet/etc.) watch the API for desired state and write more desired state.
- Example: a Deployment creates/updates a ReplicaSet; a ReplicaSet creates/updates Pods.
The scheduler finds Pods with no node assigned and picks a node.
- It considers resource requests, node capacity, taints/tolerations, node selectors, (anti)affinity, topology spread, and other constraints.
- It records its decision by setting spec.nodeName on the Pod.
The kubelet on the chosen node notices “a Pod is assigned to me” and makes it real.
- pulls images (if needed) via the container runtime (CRI)
- sets up volumes/mounts (often via CSI)
- triggers networking setup (CNI plugins do the actual wiring)
- starts/monitors containers and reports status back to the API

Each component writes its state back into the API, and the next loop uses that as input. No single component “runs the whole workflow.”

One property makes this survivable: reconcile must be safe to repeat (idempotent). The loop might run once or a hundred times (retries, resyncs, restarts, duplicate/missed watch events), and it should still converge to the same end result.

if the desired state is already satisfied, reconcile should do nothing; if something is missing, it should fill the gap, without creating duplicates or making things worse.

When concurrent updates happen (two controllers might try to update the same object at the same time)

Kubernetes handles this with optimistic concurrency. Every object has a resourceVersion (what version of this object did you read?”). If you try to write an update using an older version, the API server rejects it (often as a conflict).

Then the flow is: re-fetch the latest object, apply your change again, and retry.

4) Status (Report Back)

Once the pod is actually running, status flows back into the API.

The Loop Doesn’t Protect You From Yourself

What if the declared state says to delete something critical like kube-proxy or a CNI component? The loop doesn’t have opinions. It just does what the spec says.

A few things keep this from being a constant disaster:

Control plane components are special. The API server, etcd, scheduler, controller-manager these usually run as static pods managed directly by kubelet, not through the API. The reconciliation loop can’t easily delete the thing running the reconciliation loop as long as its manifest exists on disk.
DaemonSets recreate pods. Delete a kube-proxy pod and the DaemonSet controller sees “desired: 1, actual: 0” and spins up a new one. You’d have to delete the DaemonSet itself.
RBAC limits who can do what. Most users can’t touch kube-system resources.
Admission controllers can reject bad changes before they hit etcd.

But at the end, if your source of truth says “delete this,” the system will try. The model assumes your declared state is correct. Garbage in, garbage out.

This Pattern Outside Kubernetes

This pattern can be useful anywhere you manage state over time.

Scripts are fine until they aren’t:

they assume the world didn’t change since last run
they fail halfway and leave junk behind
they encode “steps” instead of “truth”

A loop is simpler:

define the desired state
store it somewhere authoritative
continuously reconcile reality back to it

Ref

Stop Fighting Your LLM Coding Assistant

Posted on December 11, 2025December 15, 2025 by Malinda RathnayakeLeave a Comment

You’ve probably noticed: coding models are eager to please. Too eager. Ask for something questionable and you’ll get it, wrapped in enthusiasm. Ask for feedback and you’ll get praise followed by gentle suggestions. Ask them to build something and they’ll start coding before understanding what you actually need.

This isn’t a bug. It’s trained behavior. And it’s costing you time, tokens, and code quality.

The Sycophancy Problem

Modern LLMs go through reinforcement learning from human feedback (RLHF) that optimizes for user satisfaction. Users rate responses higher when the AI agrees with them, validates their ideas, and delivers quickly. So that’s what the models learn to do. Anthropic’s work on sycophancy in RLHF-tuned assistants makes this pretty explicit: models learn to match user beliefs, even when they’re wrong.

The result: an assistant that says “Great idea!” before pointing out your approach won’t scale. One that starts writing code before asking what systems it needs to integrate with. One that hedges every opinion with “but it depends on your use case.”

For consumer use cases, travel planning, recipe suggestions, general Q&A this is fine. For engineering work, it’s a liability.

When the models won’t push back, you lose the value of a second perspective. When it starts implementing before scoping, you burn tokens on code you’ll throw away. When it leaves library choices ambiguous, you get whatever the model defaults to which may not be what production needs.

Here’s a concrete example. I asked Claude for a “simple Prometheus exporter app,” gave it a minimal spec with scope and data flows, and still didn’t spell out anything about testability or structure. It happily produced:

A script with sys.exit() sprinkled everywhere
Logic glued directly into if __name__ == "__main__":
Debugging via print() calls instead of real logging

It technically “worked,” but it was painful to test, impossible to reuse and extend.

The Fix: Specs Before Code

Instead of giving it a set of requirements and asking to generate code. Start with specifications. Move the expensive iteration the “that’s not what I meant” cycles to the design phase where changes are cheap. Then hand a tight spec to your coding tool where implementation becomes mechanical.

The workflow:

Describe what you want (rough is fine)
Scope through pointed questions (5–8, not 20)
Spec the solution with explicit implementation decisions
Implement by handing the spec to Cursor/Cline/Copilot

This isn’t a brand new methodology. It’s the same spec-driven development (SDD) that tools like github spec-kit is promoting

write the spec first, then let a cheaper model implement against it.

By the time code gets written, the ambiguity is gone and the assistant is just a fast pair of hands that follows a tight spec with guard rails built in.

When This Workflow Pays Off

To be clear: this isn’t for everything. If you need a quick one-off script to parse a CSV or rename some files, writing a spec is overkill. Just ask for the code and move on with your life.

This workflow shines when:

The task spans multiple files or components
External integrations exist (databases, APIs, message queues, cloud services)
It will run in production and needs monitoring and observability
Infra is involved (Kubernetes, Terraform, CI/CD, exporters, operators)
Someone else might maintain it later
You’ve been burned before on similar scope

Rule of thumb: if it touches more than one system or more than one file, treat it as spec-worthy. If you can genuinely explain it in two sentences and keep it in a single file, skip straight to code.

Implementation Directives — Not “add a scheduler” but “use APScheduler with BackgroundScheduler, register an atexit handler for graceful shutdown.” Not “handle timeouts” but “use cx_Oracle call_timeout, not post-execution checks.”

Error Handling Matrix — List the important failure modes, how to detect them, what to log, and how to recover (retry, backoff, fail-fast, alert, etc.). No room for “the assistant will figure it out.”

Concurrency Decisions — What state is shared, what synchronization primitive to use, and lock ordering if multiple locks exist. Don’t let the assistant improvise concurrency.

Out of Scope — Explicit boundaries: “No auth changes,” “No schema migrations,” “Do not add retries at the HTTP client level.” This prevents the assistant from “helpfully” adding features you didn’t ask for.

Anticipate Anywhere the Model might guess, make a decision instead or make it validate/confirm with you before taking action.

The Handoff

When you hand off to your coding agent, make self-review part of the process:

Rules:
- Stop after each file for review
- Self-Review: Before presenting each file, verify against
  engineering-standards.md. Fix violations (logging, error
  handling, concurrency, resource cleanup) before stopping.
- Do not add features beyond this spec
- Use environment variables for all credentials
- Follow Implementation Directives exactly

Pair this with a rules.md that encodes your engineering standards—error propagation patterns, lock discipline, resource cleanup. The agent internalizes the baseline, self-reviews against it, and you’re left checking logic rather than hunting for missing using statements, context managers, or retries.

Fixing the Partnership Dynamic

Specs help, but “be blunt” isn’t enough. The model can follow the vibe of your instructions and still waste your time by producing unstructured output, bluffing through unknowns, or “spec’ing anyway” when an integration is the real blocker. That means overriding the trained “be agreeable” behavior with explicit instructions.

For example:

Core directive: Be useful, not pleasant.

OUTPUT CONTRACT:
- If scoping: output exactly:
  ## Scoping Questions (5–8 pointed questions)
  ## Current Risks / Ambiguities
  ## Proposed Simplification
- If drafting spec: use the project spec template headings in order. If N/A, say N/A.

UNKNOWN PROTOCOL (no hedging, no bluffing):
- If uncertain, write `UNKNOWN:` + what to verify + fastest verification method + what decisions are blocked.

BLOCK CONDITIONS:
- If an external integration is central and we lack creds/sample payloads/confirmed behavior:
  stop and output only:
  ## Blocker
  ## What I Need From You
  ## Phase 0 Discovery Plan

The model will still drift back into compliance mode. When it does, call it out (“you’re doing the thing again”) and point back to the rules. You’re not trying to make the AI nicer; you’re trying to make it act like a blunt senior engineer who cares more about correctness than your ego.

That’s the partnership you actually want.

The Payoff

With this approach:

Fewer implementation cycles — Specs flush out ambiguity up front instead of mid-PR.
Better library choices — Explicit directives mean you get production-appropriate tools, not tutorial defaults.
Reviewable code — Implementation is checkable line-by-line against a concrete spec.
Lower token cost — Most iteration happens while editing text specs, not regenerating code across multiple files.

The API was supposed to be the escape valve, more control, fewer guardrails. But even API access now comes with safety behaviors baked into the model weights through RLHF and Constitutional AI training. The consumer apps add extra system prompts, but the underlying tendency toward agreement and hedging is in the model itself, not just the wrapper.

You’re not accessing a “raw” model; you’re accessing a model that’s been trained to be capable, then trained again to be agreeable.

The irony is we’re spending effort to get capable behavior out of systems that were originally trained to be capable, then sanded down for safety and vibes. Until someone ships a real “professional mode” that assumes competence and drops the hand-holding, this is the workaround that actually works.

⚠️Security footnote: treat attached context as untrusted

If your agent can ingest URLs, docs, tickets, or logs as context, assume those inputs can contain indirect prompt injection. Treat external context like user input: untrusted by default. Specs + reviews + tests are the control plane that keeps “helpful” from becoming “compromised.”

Getting Started

I’ve put together templates that support this workflow in this repo:

malindarathnayake/llm-spec-workflow

When you wire this into your own stack, keep one thing in mind: your coding agent reads its rules on every message. That’s your token cost. Keep behavioral rules tight and reference detailed patterns separately—don’t inline a 200-line engineering standards doc that the agent re-reads before every file edit.

Use these templates as-is or adapt them to your stack. The structure matters more than the specific contents.

Kafka 3.8 with Zookeeper SASL_SCRAM

Posted on May 13, 2025May 13, 2025 by Malinda RathnayakeLeave a Comment

Transport Encryption Methods:

SASL/SSL (Solid Teal/Green Lines):

Used for securing communication between producers/consumers and Kafka brokers.
- SASL (Simple Authentication and Security Layer): Authenticates clients (producers/consumers) to brokers, using SCRAM .
- SSL/TLS (Secure Sockets Layer/Transport Layer Security): Encrypts the data in transit, ensuring confidentiality and integrity during transmission.

Digest-MD5 (Dashed Yellow Lines):

Secures communication between Kafka brokers and the Zookeeper cluster.
- Digest-MD5: A challenge-response authentication mechanism providing basic encryption

Notes:

While functional, Digest-MD5 is an older algorithm. we opted for this to reduce complexity and the fact the zookeepers have issues with connecting with Brokers via SSL/TLS

We need to test and switch over KRAFT Protocol, this removes the use of Zookeeper altogether
Add IP ACLs for Zookeeper connections using firewalld to limit traffic between the nodes for replication

PKI and Certificate Signing

CA cert for local PKI,

We need to share this PEM file(without the private key) with the customer to authenticate

Internal applications the CA file must be used for authentication – Refer to the Configuration example documents

# Generate CA Key
openssl genrsa -out multicastbits_CA.key 4096
# Generate CA Certificate
openssl req -x509 -new -nodes -key multicastbits_CA.key -sha256 -days 3650 -out multicastbits_CA.crt -subj "/CN=multicastbits_CA"

Kafka Broker Certificates

# For Node1 - Repeat for other nodes

openssl req -new -nodes -out node1.csr -newkey rsa:2048 -keyout node1.key -subj "/CN=kafka01.multicastbits.com"

openssl x509 -req -CA multicastbits_CA.crt -CAkey multicastbits_CA.key -CAcreateserial -in node1.csr -out node1.crt -days 3650 -sha256

Create the kafka and zookeeper users

⚠️ Important: Do not skip this step. we need these users to setup Authentication in JaaS configuration

Before configuring the cluster with SSL and SASL, let’s start up the cluster without authentication and SSL to create the users. This allows us to:

Verify basic dependencies and confirm the zookeeper and Kafka clusters are coming up without any issues “make sure the car starts”
Create necessary user accounts for SCRAM
Test for any inter-node communication issues (Blocked Ports 9092, 9093 ,2181 etc)

Here’s how to set up this initial configuration:

Zookeeper Configuration (No SSL or Auth)

Create the following file: /opt/kafka/kafka_2.13-3.8.0/config/zookeeper-NOSSL_AUTH.properties

# Zookeeper Configuration without Auth
dataDir=/Data_Disk/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.166.110:2888:3888
server.2=192.168.166.111:2888:3888
server.3=192.168.166.112:2888:3888

Kafka Broker Configuration (No SSL or Auth)

Create the following file: /opt/kafka/kafka_2.13-3.8.0/config/server-NOSSL_AUTH.properties

# Kafka Broker Configuration without Auth/SSL
broker.id=1
listeners=PLAINTEXT://kafka01.multicastbits.com:9092
advertised.listeners=PLAINTEXT://kafka01.multicastbits.com:9092
listener.security.protocol.map=PLAINTEXT:PLAINTEXT
zookeeper.connect=kafka01.multicastbits.com:2181,kafka02.multicastbits.com:2181,kafka03.multicastbits.com:2181

Open a new shell to the server Start Zookeeper:

/opt/kafka/kafka_2.13-3.8.0/bin/zookeeper-server-start.sh -daemon /opt/kafka/kafka_2.13-3.8.0/config/zookeeper-NOSSL_AUTH.properties

Open a new shell to start Kafka:

/opt/kafka/kafka_2.13-3.8.0/bin/kafka-server-start.sh -daemon /opt/kafka/kafka_2.13-3.8.0/config/server-NOSSL_AUTH.properties

Create the users:

Open a new shell and run the following commands:

kafka-configs.sh --bootstrap-server ext-kafka01.fleetcam.io:9092 --alter --add-config 'SCRAM-SHA-512=[password=zookeeper-password]' --entity-type users --entity-name ftszk

kafka-configs.sh --zookeeper ext-kafka01.fleetcam.io:2181 --alter --add-config 'SCRAM-SHA-512=[password=kafkaadmin-password]' --entity-type users --entity-name ftskafkaadminAfter the users are created without errors, press Ctrl+C to shut down the services we started earlier.

SASL_SSL configuration with SCRAM

Zookeeper configuration Notes

Zookeeper is configured with SASL/MD5 due to the SSL issues we faced during the initial setup
Zookeeper Traffic is isolated with in the Broker nodes to maintain security

dataDir=/Data_Disk/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.166.110:2888:3888
server.2=192.168.166.111:2888:3888
server.3=192.168.166.112:2888:3888
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl

/Data_Disk/zookeeper/myid file is updated corresponding to the zookeeper nodeID

cat /Data_Disk/zookeeper/myid
1

Jaas configuration

Create the Jaas configuration for zookeeper authentication, it has the follow this syntax

/opt/kafka/kafka_2.13-3.8.0/config/zookeeper-jaas.conf

Server {
   org.apache.zookeeper.server.auth.DigestLoginModule required
   user_multicastbitszk="zkpassword";
};

KafkaOPTS

KafkaOPTS Java varible need to be passed when the zookeeper is started to point to the correct JaaS file

export KAFKA_OPTS="-Djava.security.auth.login.config="Path to the zookeeper-jaas.conf"

export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/zookeeper-jaas.conf"

There are few ways to handle this, you can add a script under profile.d or use a custom Zookeeper launch script for the systemd service

Systemd service

Create the launch shell script for Zookeeper

/opt/kafka/kafka_2.13-3.8.0/bin/zk-start.s

#!/bin/bash
#export the env variable
export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/zookeeper-jaas.conf"
#Start the zookeeper service
/opt/kafka/kafka_2.13-3.8.0/bin/zookeeper-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/zookeeper.properties
#debug - launch config with no SSL - we need this for initial setup and debug
#/opt/kafka/kafka_2.13-3.8.0/bin/zookeeper-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/zookeeper-NOSSL_AUTH.properties

After you save the file

chomod +x /opt/kafka/kafka_2.13-3.8.0/bin/zk-start.s

sudo chown -R multicastbitskafka:multicastbitskafka /opt/kafka/kafka_2.13-3.8.0

Create the systemd service file

/etc/systemd/system/zookeeper.service

[Unit]
Description=Apache Zookeeper Service
After=network.target
[Service]
User=multicastbitskafka
Group=multicastbitskafka
ExecStart=/opt/kafka/kafka_2.13-3.8.0/bin/zk-start.sh
Restart=on-failure
[Install]

WantedBy=multi-user.target

After the file is saved, start the service

sudo systemctl daemon-reload.
sudo systemctl enable zookeeper
sudo systemctl start zookeeper

Kafka Broker configuration Notes

/opt/kafka/kafka_2.13-3.8.0/config/server.properties

broker.id=1
listeners=SASL_SSL://kafka01.multicastbits.com:9093
advertised.listeners=SASL_SSL://kafka01.multicastbits.com:9093
listener.security.protocol.map=SASL_SSL:SASL_SSL
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
ssl.keystore.location=/opt/kafka/secrets/kafkanode1.keystore.jks
ssl.keystore.password=keystorePassword
ssl.truststore.location=/opt/kafka/secrets/kafkanode1.truststore.jks
ssl.truststore.password=truststorePassword
#SASL/SCRAM Authentication
sasl.enabled.mechanisms=SCRAM-SHA-256, SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
sasl.mechanism.client=SCRAM-SHA-512
security.inter.broker.protocol=SASL_SSL
#zookeeper
zookeeper.connect=kafka01.multicastbits.com:2181,kafka02.multicastbits.com:2181,kafka03.multicastbits.com:2181
zookeeper.sasl.client=true
zookeeper.sasl.clientconfig=ZookeeperClient

zookeeper connect options

Define the zookeeper servers the broker will connect to

zookeeper.connect=kafka01.multicastbits.com:2181,kafka02.multicastbits.com:2181,kafka03.multicastbits.com:2181

Enable SASL

zookeeper.sasl.client=true

Tell the broker to use the creds defined under ZookeeperClient section on the JaaS file used by the kafka service

zookeeper.sasl.clientconfig=ZookeeperClient

Broker and listener configuration

Define the broker id

broker.id=1

Define the servers listener name and port

listeners=SASL_SSL://kafka01.multicastbits.com:9093

Define the servers advertised listener name and port

advertised.listeners=SASL_SSL://kafka01.multicastbits.com:9093

Define the SASL_SSL for security protocol

listener.security.protocol.map=SASL_SSL:SASL_SSL

Enable ACLs

authorizer.class.name=kafka.security.authorizer.AclAuthorizer

Define the Java Keystores

ssl.keystore.location=/opt/kafka/secrets/kafkanode1.keystore.jks

ssl.keystore.password=keystorePassword

ssl.truststore.location=/opt/kafka/secrets/kafkanode1.truststore.jks

ssl.truststore.password=truststorePassword

Jaas configuration

/opt/kafka/kafka_2.13-3.8.0/config/kafka_server_jaas.conf

KafkaServer {
  org.apache.kafka.common.security.scram.ScramLoginModule required
  username="multicastbitskafkaadmin"
  password="kafkaadmin-password";
};
ZookeeperClient {
  org.apache.zookeeper.server.auth.DigestLoginModule required
  username="multicastbitszk"
  password="Zookeeper_password";
};

SASL and SCRAM configuration Notes

Enable SASL SCRAM for authentication

org.apache.kafka.common.security.scram.ScramLoginModule required

Use MD5 for Zookeeper authentication

org.apache.zookeeper.server.auth.DigestLoginModule required

KafkaOPTS

KafkaOPTS Java variable need to be passed and must point to the correct JaaS file, when the kafka service is started

export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/kafka_server_jaas.conf"

Systemd service

Create the launch shell script for kafka

/opt/kafka/kafka_2.13-3.8.0/bin/multicastbitskafka-server-start.sh

#!/bin/bash
#export the env variable
export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/kafka_2.13-3.8.0/config/kafka_server_jaas.conf"
#Start the kafka service
/opt/kafka/kafka_2.13-3.8.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/server.properties
#debug - launch config with no SSL - we need this for initial setup and debug
#/opt/kafka/kafka_2.13-3.8.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.13-3.8.0/config/server-NOSSL_AUTH.properties

Create the systemd service

/etc/systemd/system/kafka.service

[Unit]
Description=Apache Kafka Broker Service
After=network.target zookeeper.service
[Service]
User=multicastbitskafka
Group=multicastbitskafka
ExecStart=/opt/kafka/kafka_2.13-3.8.0/bin/multicastbitskafka-server-start.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target

Connect authenticate and use Kafka CLI tools

Requirements

multicastbitsadmin.keystore.jks
multicastbitsadmin.truststore.jks
WSL2 with java-11-openjdk-devel wget nano
Kafka 3.8 folder extracted locally

Setup your environment

Setup WSL2

You can use any Linux environment with JDK17 or 11

install dependencies

dnf install -y wget nano java-11-openjdk-devel

Download Kafka and extract it (in going to extract it to the home DIR under kafka)

# 1. Download Kafka (Choose a version compatible with your server)
wget https://dlcdn.apache.org/kafka/3.8.0/kafka_2.13-3.8.0.tgz
# 2. Extract
tar xzf kafka_2.13-3.8.0.tgz

Copy the jks files (You should generate them with the CA JKS, or use one from one of the nodes) to ~/

cp multicastbitsadmin.keystore.jks ~/

cp multicastbitsadmin.truststore.jks ~/

Create your admin client properties file

change the path to fit your setup

nano ~/kafka-adminclient.properties

# Security protocol and SASL/SSL configuration
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
# SSL Configuration
ssl.keystore.location=/opt/kafka/secrets/multicastbitsadmin.keystore.jks
ssl.keystore.password=keystorepw
ssl.truststore.location=/opt/kafka/secrets/multicastbitsadmin.truststore.jks
ssl.truststore.password=truststorepw
# SASL Configuration
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required 
    username="#youradminUser#" 
		password="#your-admin-PW#";

Create the JaaS file for the admin client

nano ~/kafka_client_jaas.conf

Some kafka-cli tools still look for the jaas.conf under KAFKA_OPTS environment variable

KafkaClient {
  org.apache.kafka.common.security.scram.ScramLoginModule required
  username="#youradminUser#"
  password="#your-admin-PW#";
};

Export the Kafka environment variables

export KAFKA_HOME=/opt/kafka/kafka_2.13-3.8.0
export PATH=$PATH:$KAFKA_HOME/bin
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export KAFKA_OPTS="-Djava.security.auth.login.config=~/kafka_client_jaas.conf"
source ~/.bashrc

Kafka CLI Usage Examples

Create a user

kafka-configs.sh --bootstrap-server kafka01.multicastbits.com:9093 --alter --add-config 'SCRAM-SHA-512=[password=#password#]' --entity-type users --entity-name %username%--command-config ~/kafka-adminclient.properties

Create a topic

kafka-topics.sh --bootstrap-server kafka01.multicastbits.com:9093 --create --topic %topicname% --partitions 10 --replication-factor 3 --command-config ~/kafka-adminclient.properties

Create ACLs

External customer user with READ DESCRIBE privileges to a single topic

kafka-acls.sh --bootstrap-server kafka01.multicastbits.com:9093 
  --command-config ~/kafka-adminclient.properties 
  --add --allow-principal User:customer-user01 
  --operation READ --operation DESCRIBE --topic Customer_topic

Troubleshooting

Here are some common issues you might encounter when setting up and using Kafka with SASL_SCRAM authentication, along with their solutions:

1. Connection refused errors

Issue: Clients unable to connect to Kafka brokers.

Solution:

Verify that the Kafka brokers are running and listening on the correct ports.
Check firewall settings to ensure the Kafka ports are open and accessible.
Confirm that the bootstrap server addresses in client configurations are correct.

2. Authentication failures

Issue: Clients fail to authenticate with Kafka brokers.

Solution:

Double-check username and password in the JAAS configuration file.
Ensure the SCRAM credentials are properly set up on the Kafka brokers.
Verify that the correct SASL mechanism (SCRAM-SHA-512) is specified in client configurations.

3. SSL/TLS certificate issues

Issue: SSL handshake failures or certificate validation errors.

Solution:

Confirm that the keystore and truststore files are correctly referenced in configurations.
Verify that the certificates in the truststore are up-to-date and not expired.
Ensure that the hostname in the certificate matches the broker’s advertised listener.

4. Zookeeper connection issues

Issue: Kafka brokers unable to connect to Zookeeper ensemble.

Solution:

Verify Zookeeper connection string in Kafka broker configurations.
Ensure Zookeeper servers are running and accessible and the ports are open
Check Zookeeper client authentication settings in JAAS configuration file

NFS Provisioner Setup and Testing Guide for Rancher RKE2/Kubernetes

Posted on May 13, 2025May 13, 2025 by Malinda RathnayakeLeave a Comment

This guide covers how to add an NFS StorageClass and a dynamic provisioner to Kubernetes using the nfs-subdir-external-provisioner Helm chart. This enables us to mount NFS shares dynamically for PersistentVolumeClaims (PVCs) used by workloads.

Example use cases:

Database migrations
Apache Kafka clusters
Data processing pipelines

Requirements:

An accessible NFS share exported with: rw,sync,no_subtree_check,no_root_squash
NFSv3 or NFSv4 protocol
Kubernetes v1.31.7+ or RKE2 with rke2r1 or later

lets get to it

1. NFS Server Export Setup

Ensure your NFS server exports the shared directory correctly:

# /etc/exports
/rke-pv-storage  worker-node-ips(rw,sync,no_subtree_check,no_root_squash)

Replace worker-node-ips with actual IPs or CIDR blocks of your worker nodes.
Run sudo exportfs -r to reload the export table.

2. Install NFS Subdir External Provisioner

Add the Helm repo and install the provisioned:

helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update

helm install nfs-client-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --namespace kube-system \
  --set nfs.server=192.168.162.100 \
  --set nfs.path=/rke-pv-storage \
  --set storageClass.name=nfs-client \
  --set storageClass.defaultClass=false

Notes:

If you want this to be the default storage class, change storageClass.defaultClass=true.
nfs.server should point to the IP of your NFS server.
nfs.path must be a valid exported directory from that NFS server.
storageClass.name can be referenced in your PersistentVolumeClaim YAMLs using storageClassName: nfs-client.

3. PVC and Pod Test

Create a test PVC and pod using the following YAML:

# test-nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-nfs-pod
spec:
  containers:
  - name: shell
    image: busybox
    command: [ "sh", "-c", "sleep 3600" ]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-nfs-pvc

Apply it:

kubectl apply -f test-nfs-pvc.yaml
kubectl get pvc test-nfs-pvc -w

Expected output:

NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-nfs-pvc   Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   1Gi        RWX            nfs-client     30s

4. Troubleshooting

If the PVC remains in Pending, follow these steps:

Check the provisioner pod status:

kubectl get pods -n kube-system | grep nfs-client-provisioner

Inspect the provisioner pod:

kubectl describe pod -n kube-system <pod-name>
kubectl logs -n kube-system <pod-name>

Common Issues:

Broken State: Bad NFS mount
```
mount.nfs: access denied by server while mounting 192.168.162.100:/pl-elt-kakfka
```
- This usually means the NFS path is misspelled or not exported properly.

Broken State: root_squash enabled

failed to provision volume with StorageClass "nfs-client": unable to create directory to provision new pv: mkdir /persistentvolumes/…: permission denied

Fix by changing the export to use no_root_squash or chown the directory to nobody:nogroup.

ImagePullBackOff
- Ensure nodes have internet access and can reach registry.k8s.io.
RBAC errors
- Make sure the ServiceAccount used by the provisioner has permissions to watch PVCs and create PVs.

5. Healthy State Example

kubectl get pods -n kube-system | grep nfs-client-provisioner-nfs-subdir-external-provisioner
nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m   1/1     Running     0          3m39s

kubectl describe pod -n kube-system nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m
# Output shows pod is Running with Ready=True

kubectl logs -n kube-system nfs-client-provisioner-nfs-subdir-external-provisioner-7992kq7m
...
I0512 21:46:03.752701       1 controller.go:1420] provision "default/test-nfs-pvc" class "nfs-client": volume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d" provisioned
I0512 21:46:03.752763       1 volume_store.go:212] Trying to save persistentvolume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d"
I0512 21:46:03.772301       1 volume_store.go:219] persistentvolume "pvc-73481f45-3055-4b4b-80f4-e68ffe83802d" saved
I0512 21:46:03.772353       1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Name:"test-nfs-pvc"}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-73481f45-3055-4b4b-80f4-e68ffe83802d
...

Once test-nfs-pvc is bound and the pod starts successfully, your setup is working. You can now safely use storageClass: nfs-client in other workloads (e.g., Strimzi KafkaNodePool).

Advertising VRF Connected/Static routes via MP BGP to OSPF – Guide Dell S4112F-ON – OS 10.5.1.3

Posted on July 6, 2020July 6, 2020 by Malinda RathnayakeLeave a Comment

Im going to base this off my VRF Setup and Route leaking article and continue building on top of it

Lets say we need to advertise connected routes within VRFs using IGP to an upstream or downstream iP address this is one of many ways to get to that objective

For this example we are going to use BGP to collect connected routes and advertise that over OSPF

Setup the BGP process to collect connected routes

router bgp 65000
 router-id 10.252.250.6
 !
 address-family ipv4 unicast
 !
 neighbor 10.252.250.1
!
vrf Tenant01_VRF
 !
 address-family ipv4 unicast
  redistribute connected
!
vrf Tenant02_VRF
 !
 address-family ipv4 unicast
  redistribute connected
!
vrf Tenant03_VRF
 !
 address-family ipv4 unicast
  redistribute connected
!
vrf Shared_VRF
 !
 address-family ipv4 unicast
  redistribute connected

Setup OSPF to Redistribute the routes collected via BGP

router ospf 250 vrf Shared_VRF
 area 0.0.0.0 default-cost 0
 redistribute bgp 65000

interface vlan250
 mode L3
 description OSPF_Routing
 no shutdown
 ip vrf forwarding Shared_VRF
 ip address 10.252.250.6/29
 ip ospf 250 area 0.0.0.0
 ip ospf mtu-ignore
 ip ospf priority 10

Testing and confirmation

Local OSPF Database

Remote device