Mapping Activation Space: Peeking Inside the Model

The previous post covered what happens when you hit enter. Tokens flow through layers, probabilities get shaped, text comes out. System prompts anchor the model in activation space. Temperature controls how tightly it follows that anchor.

This post goes one level deeper, into the model’s layers. What are those activations? What shape do they take?

For this experiment, I used a well-documented workflow built around Google’s Gemma 2 2B model and the Gemma Scope residual stream SAEs. These are Sparse Autoencoders trained by Google on Gemma’s residual stream activations. They act as auxiliary models that decompose dense internal states into sparse, more interpretable features. The tools are Gemma-specific but the concepts apply to any transformer model.


Vectors: The Model’s Internal Language

Think about the gas laws. Compress a gas, it gets hotter. The macroscopic behavior is simple. Underneath, there’s a seething mass of atoms bouncing around, and the real explanation to why it gets hot when compressed lives in those microscopic dynamics.

Neural networks work the same way.

  • The macroscopic behavior is “the model talks about France.”
  • The microscopic dynamics are activation patterns flowing through 2304-dimensional space.

We’re mapping the microscopic level

A vector is a list of numbers. In Gemma 2 2B, the vector at each token position is a list of 2304 numbers. That’s the model’s d_model dimension, its internal working width.

No single number in that list is human-readable. But the pattern across all 2304 numbers encodes what the model “knows” at that point in the text.

When Gemma processes “The capital of France is”, it does not have a dedicated slot for “France” and another slot for “capital”. Instead, the model has a specific geometric direction for “France” and another direction for “capital”. It adds those vectors together.

The vector at the last token position is just the mathematical sum of all those active concepts.

Because it was built through vector addition, that single coordinate in 2304-dimensional space contains the combined geometry of “France”, “geography”, and “Paris is next”.

We cannot visualize that space directly, but the model moves through it constantly, and every prediction depends on exactly where that final coordinate lands.

Toy Models of Superposition

The Linear Representation Hypothesis and the Geometry of Large Language Models


The Residual Stream: The Highway Everything Rides On

Vectors are the payload; the Residual stream is the bus. It is the main data pipeline through the transformer.

That 2304-dimensional vector persists across all 26 layers. Each layer does not replace it outright. It reads the current residual stream, computes an update, and adds that update back in.

The token embedding creates the initial vector. Layer 0 attention reads it, computes an update, adds it back. Layer 0 MLP does the same. This continues through all 26 layers, and the final vector gets mapped to token probabilities.
That additive pattern is why the residual stream is such an important place to inspect. Earlier information can remain available while later layers keep reshaping it. What survives, what gets amplified, and what becomes less useful depends on the sequence of updates across the network.

  • A prompt goes in. The model processes it token by token.
  • At layer 12, each token has a vector of shape (2304,) in the residual stream.
  • The SAE encoder maps that dense vector into a sparse feature space (16,384 dimensions for the coarse SAE, 262,144 for the fine SAE) where only a small number of features activate.
  • The SAE decoder maps the sparse representation back into (2304,) to approximate the original residual state.
  • A steering vector is the decoder vector for a specific SAE feature. Adding it into the residual stream biases the model toward that feature’s pattern. Phase 2 will test this.

Why Layer 12?

Gemma 2 2B has 26 transformer layers, numbered 0 through 25. I’m hooking into layer 12 — roughly 46% depth.

The hook point is blocks.12.hook_resid_post, capturing the residual stream after layer 12 finishes processing.

Production inference engines and local runners use quantized formats and custom C++ engines for speed, which strips out the Python-level hook system.

To get around this, we load the exact same Gemma 2 2B weights from HuggingFace as native PyTorch tensors (raw weights), which gives us the ability to tap any layer.

Gemma Scope trained SAEs on every layer of Gemma 2 2B, so layer 12 was a deliberate choice was based on Google’s own interpretability research papers.

Announcing Gemma Scope 2 — AI Alignment Forum

Gemma_Scope_2_Technical_Paper.pdf


What Does “Mapping Activations” Mean?

  • Coarse (16K SAE): fewer features, each covering more conceptual ground.
  • Fine (262K SAE): more features, each with higher resolution on specific facets.

An SAE decomposes a single 2304-number vector into thousands of sparse components called features.

The residual stream is dense and compressed. Gemma has to represent a huge amount of structure in just 2304 dimensions, so many different patterns get packed together in superposition. That means individual residual dimensions are not clean, human-readable concept slots. A single dimension can participate in multiple unrelated behaviors depending on the context.

This is where the Sparse Autoencoder helps. Instead of trying to interpret the raw 2304-dimensional state directly, the SAE projects it into a much larger sparse feature space, such as 16K or 262K features. In that expanded space, only a small number of features activate for a given input, and those features are often easier to interpret than the original dense residual dimensions.

So back to the actual numbers. The SAE takes that tightly packed, dense circuitry and expands it out into a much wider space so we can see which patterns actually fired.

How sparse is sparse, what is a SAE?

Think of an audio spectrum analyzer. It takes a single, dense audio wave where the kick drum, bass, and vocals are all mashed together, and splits them out into distinct frequency bands. Most bands stay flat. Only the frequencies actually present in the audio spike up.

The SAE is a semantic spectrum analyzer.

The model’s Layer 12 vector is the dense audio wave, a 2304-number mess where all the active concepts are added together.

The SAE takes that dense wave and splits it out across 16,384 distinct “frequency bands” (features).

Because a single sentence only contains a handful of concepts, the SAE only spikes about 82 of those bands to represent the entire input.

The other 16,000+ bands stay flat at zero simply because those concepts are not in the sentence. Out of 16,384 possible patterns the SAE can detect, a typical input lights up fewer than 1% of them. The rest stay silent. That is what sparse means.

Feature #3999:   mean activation 3.67, selectivity 0.97  ← France (strongest signal)
Feature #11333:  mean activation 4.48, selectivity 0.89  ← France-related
Feature #9473:   mean activation 1.21, selectivity 0.88  ← France-related
Feature #14805:  mean activation 1.24, selectivity 0.86  ← France-related
Feature #6211:   mean activation 1.27, selectivity 0.86  ← France-related
...
Feature #6,004:  activation 0.00                         ← silent
Feature #6,005:  activation 0.00                         ← silent
[~11,000 more at zero]

 

The key column we should focus on is selectivity: how exclusively does a feature fire on France prompts versus neutral ones?

Feature #3999 scores 0.97. This means it almost exclusively fires on France-related input. Notice #11333 actually fires harder (mean 4.48 vs 3.67) but has lower selectivity. It bleeds into neutral prompts too. A feature that fires on everything is not telling you anything useful, no matter how loud it is.

I chose France as a sanity-check target because the interpretability method was already grounded in prior work, and Gemma Scope gave me a validated SAE setup to test whether my own pipeline outputted correct Data.

The experiment follows standard A/B logic.

  • Run 400 prompts total through Gemma: 200 France-themed prompts and 200 neutral filler prompts.
  • Capture the layer 12 activations for both sets.
  • Decompose both through the same SAEs.
  • Use Python to compare features that fire consistently on the France set but stay quiet on the neutral sets.

Now we have the candidates for “France features.” Features that fire on everything else can be safely considered noise.

Two Resolutions, One Concept

The 16K SAE (coarse) might give you one feature, call it #3999, that broadly responds to “France.” One fat bucket for the whole concept.

The 262K SAE (fine) might split that same concept into multiple features.

Feature #86473 fires on a subset of France prompts. Feature #249284 fires on a different subset. Feature #243533 on another.

Each one potentially encodes a different facet (cuisine, geography, landmarks, language), though we do not know which facets they are yet. More on that problem in a minute.

This is hierarchical feature decomposition.

Same logic as hand-designing a vision network:

  • Edges combine into shapes
  • Shapes combine into objects
  • Objects together in a scene change the meaning again (meaning as in what regions get excited to infer what it is)

but instead of happening across multiple sequential layers, this hierarchy exists at different levels of granularity within the exact same layer.

The coarse SAE has a smaller dictionary, so it is forced to find the whole object.

The fine SAE has a massive dictionary, so it can afford to isolate the specific parts and edges. Nobody hand-wired these decompositions. They emerged naturally from the training data.

The cross-resolution mapping step figures out which fine features correspond to which coarse features. It measures two things:

  • Activation correlation: Do they fire on the same prompts? (behavioral evidence)
  • Decoder cosine similarity: Do their decoder vectors point in the same direction in the 2304-dimensional residual stream? (Geometric evidence)

Here are the actual results from the experiment.

The Heatmap: Decoder Cosine Similarity

The heatmap shows the same data from a different angle. Each cell measures how much a coarse features and a fine features point in the same direction inside the model.

  • High values mean they’re encoding the same thing at different zoom levels.
  • Low values mean they’re unrelated.
CoarseStrongest Fine MatchSimilarityReading
C-3999F-2492840.80Near-identical direction. Almost certainly the same concept at different granularity
C-11333F-864730.54Strong overlap, plus two more matches at 0.42 and 0.34
C-9473F-2435330.47One clear sub-feature

Values near zero mean unrelated. Above 0.3 is meaningful geometric correspondence.

Each of the 262,000 fine features was compared against C-3999 to see how closely they point in the same direction. Almost all of them scored near zero, meaning no geometric alignment.
The histogram shows where the crowd is: piled up at zero, with a long empty stretch before the red line at 0.3. The handful of features past that line are the ones that actually share a direction with C-3999. That’s why 0.3 is the cutoff. It’s where the crowd ends and the signal begins.

Coarse-to-Fine Feature Decomposition

Every prompt was decoded through both SAEs at the same time. The coarse SAE gave us three broad France features. The fine SAE gave us six narrow ones.

The interactive dashboard below maps every possible pair of these coarse and fine features to measure how strong their relationship actually is.

All data shown is derived from real activations collected by running 400 prompts through Gemma 2 2B and decomposing layer 12 through both Gemma Scope SAEs.

The left side shows the decomposition graph. It maps exactly which fine features branch off from our three coarse anchors.

  • Link width represents behavioral correlation (do they fire together?).
  • Link color represents geometric similarity (do they point the same way?).

The right side plots these exact same pairs in metric space:

  • The X-axis is behavioral correlation: do they fire on the exact same prompts?
  • The Y-axis is geometric similarity: do their vectors point in the same direction inside the model?

Look at the color coding on the scatter plot. It tells the whole story:

Blue dots (Top Right): These are the true sub-feature matches. They have high correlation and high geometric similarity.

They co-fire AND point the same way inside the model. The strongest pair is C-3999 to F-249284.

Orange dots (Middle): These are partial overlaps.

They fire together often, but their geometry is drifting apart.

Red dots (Bottom): These are co-occurring concepts. They sit in the lower half with moderate correlation but cosine similarity near zero. These features fire on many of the same prompts but point in completely different directions.

They co-occur with France, but they are not encoding the same concept. They are related ideas that travel together, not the same idea at two zoom levels.

Reading the scatter plot

To better understand the graph, think of these features as passive sensors hooked up to the Layer 12 data bus:

1. The Empty Top-Left (High Similarity, Low Correlation)
If Sensor A and Sensor B point in the exact same direction, they are going to catch the exact same meandering vectors. Always. You physically cannot have a vector pass by that trips Sensor A but misses Sensor B. That is why high geometric similarity mathematically forces high correlation. The top-left is empty because it defies physics.

2. The Bottom-Right (Low Similarity, High Correlation)
Sensor A points at “France”. Sensor B points at “Food”. They point in completely different directions (low geometric similarity). But when the vector for “French Cuisine” meanders down the data bus, it contains enough geometry to trip both sensors at the same time. They fire together (high correlation) even though they are looking for different things. These are your co-occurring concepts.

3. Low Selectivity (The noisy features)
If a sensor points in a direction that catches “France” but also catches a bunch of other random vectors meandering by (like “Germany” or “cheese”), it will have a lower selectivity score. This is exactly what you saw with Feature #11333 earlier in the post—it fired loud, but it fired on too much unrelated traffic to be a clean “France” feature.


The Interpretation Gap

C-3999 gets labeled “France” because it reliably activates on France prompts and not on neutral ones. That’s standard practice in interpretability work. Label a feature by what activates it.

But the model doesn’t know the word “France.” It has a direction in 2304-dimensional space that, for whatever internal reason, turned out to be useful for predicting the next token when France-related patterns show up. We call it “France” because that’s the human category our test prompts were organized around. The model’s internal geometry might carve the world along boundaries that don’t align with our conceptual categories at all — we just can’t tell, because we only test with prompts organized around our categories.

What we actually knowWhat we’re assuming
C-3999 fires on France prompts, not neutralC-3999 “means” France
F-86473 fires on a subset of France promptsF-86473 is a France sub-concept
C-3999 and F-249284 point in similar directionsThey encode related meanings
Injecting C-3999’s direction changes output toward France-like textThe feature is causally involved in France generation

The labels (“French cuisine,” “Paris landmarks”) are human interpretations based on which prompts activate a feature. The model doesn’t have those labels. It has frozen weights and an activation landscape that we’re projecting human categories onto.

Neural nets don’t memorize data. They find regularities and generalize those regularities to new data. A model will generate plausible text about unicorns even though it’s never seen one described as real, because it’s learned the relational structure of mythical creatures, horses, and horns. The internal representation that enables that generalization doesn’t need to map onto our concept of “unicorn.” It just needs to be useful for next-token prediction. When we label a feature “France,” we’re assuming the model’s useful regularity aligns with our semantic boundary. Sometimes it does. Sometimes we don’t know.

This is the wall that people like Neel Nanda keep writing about in mechanistic interpretability research. It was interesting to actually hit it myself. I can identify which features fire and when.

We can measure geometric relationships between them. But mapping that to human-readable meaning is always an inference, never ground truth.

When I started this project, I wanted to build something like the Activation Space Navigator from the previous post, but using real model data derived from SAEs, I pictured clean clusters with labeled regions where you could point and say, “that’s France.”

The real data did not look like that. What it gave me instead were directions in the model’s internal representation that reliably correlate with France-themed input.


Every feature in the SAE is a literal 2304-dimensional vector stored in the SAE’s decoder matrix. Feature C-3999 is just row #3999 in that matrix. It acts as a static reference coordinate for “France”.

This exact mechanic applies to any concept. If we were testing Python code, HTML tables, or HTTP status codes, there would be a different row acting as the reference coordinate for that specific pattern.

The reference vectors in the SAE do not move. They are fixed in place like highway signs. The model’s dynamic state passes by them.

When the France prompt generates a vector that passes close to the C-3999 sign, that specific band on our semantic spectrum analyzer spikes.

Neutral prompts pass further away, so the band stays flat. Those spikes are the sparse values we actually record.

One thing worth naming: this is all an approximation. The SAE reconstructs the residual stream from its learned directions, and the reconstruction is not perfect. Some signal is always lost. We are working with a useful approximation of the model’s internal state, not the thing itself.

So it is not that “France activated these regions of the model’s weights.” The weights are frozen. The model has learned internal directions for France-related patterns, and when France text flows through, the residual stream aligns with those directions. That alignment is what we are measuring.


Closing the loop

The previous post described system prompts as activation-space manipulation. This experiment gives me supporting evidence for that framing.

The directions those prompts appear to push activations toward are measurable, and some of that structure can be decomposed into narrower features.

What I found is suggestive structure, not full semantic ground truth.

The coarse-to-fine matches look real enough to justify the next step, which is testing whether steering along those directions’ changes generation in a targeted way.

What’s Next for This Project

The activation mapping is done. I found and observed suggestive structure:

coarse France-related features that appear to decompose into finer sub-features, supported by both behavioral correlation and geometric similarity.

The GO/NO-GO question was whether the coarse-to-fine mapping would produce 3+ meaningful sub-features per coarse anchor.

C-11333 has three above threshold. That’s a GO.

Phase 2 is the actual Steering Experiment.

Take those mapped features and test whether multi-resolution steering (coarse “France” + fine sub-features) produces better, more targeted output than single-resolution steering alone.

If it does, that’s evidence the cross-resolution structure isn’t just a statistical artifact. It’s a lever we can pull to tweak the behavior of a model

If it doesn’t, the structure is real but doesn’t actually control what the model generates. Correlation isn’t causation, even inside a neural network.

Either way, I’ll know more about what those frozen weights are actually doing when we hit enter.


References

 

ASIC’S in Cisco Catalyst switches

Preface-

After I started working with Open networking switches, wanted to know more about the Cisco catalyst range I work with every day.

Information on older ASICS is very hard to find, but recently they have started to talk a lot about the new chips like UADP 2.0 with the Catalyst 9k / Nexus launch, This is more likely due to the rise of Desegregated Network Operating Systems DNOS such as Cumulus and PICA8, etc forcing customers to be more aware of what’s under the hood rather than listening and believing shiny PDF files with a laundry list of features.

The information was there but scattered all over the web, I went though CiscoLive, TechFieldDay slides/videos, interviews, partner update PDFs, Press leases and whitepapers and even LinkedIn profiles to gather information

If you notice a mistake please let me know

Scope –

we are going to focus on the ASIC’s used in the well-known 2960S/X/XR and the 36xx,37xx,38XX  and the new Cat 9K series

Timeline

 

Summary

Cisco Brought a bunch of companies to acquire the switching technology they needed that later bloomed into the switching platforms we know today

  • Crescendo Communications (1993) – Catalyst 5K and 6K chassis
  • Kalpana (1994) – Catalyst 3K (Fun Fact they invented VLANs that later got standardized as 802.1q)
  • Grand Junction (1995) – Catalyst 17xx, 19xx, 28xx, 29xx
  • Granite Systems (1996) – Catalyst 4K (K series ASIC)

After the original Cisco 3750/2950 switches, Cisco 3xxx/2xxx-G  (G for Gigabit) was released

Next, the Cisco 3xxx-E series with enterprise management functions was released

later, Cisco developed the Cisco 3750-V series with the function of energy-saving version for –E series, later replaced by Cisco 3750 V2 series (Possibly a die shrink)

G series and E series were later phased out and integrated into Cisco X series. which is still being sold and supported

in 2017-2018 Cisco released the catalyst 9k family to replace the 2K and 3K families

Sasquatch ASIC

from what I could  find there are two variants of this ASIC

The initial release in 2003

  • Fixed pipeline ASIC
  • 180 Nano-meter process
  • 60 Million Transistors

Shipped with the 10/100 3750 and 2960

Die Shrink to 130nm in 2005

  • Fixed pipeline ASIC
  • 130 Nano-meter process
  • 210 Million Transistors

Shipped in the 2960-G/3560-G/3750-G series

I couldn’t find much info about the chip design. will update this section when I find more.

 Strider ASIC

Initially Release in 2010

  • Fixed pipeline ASIC
  • Built on the 65-nanometer process
  • 1.3 Billion Transistors

Strider ASIC (circa 2010) was an improved design based on the 3750-E series was first shipped with the 2960-S family.

S88G_ASIC design
S88G ASIC

later in 2012-2013 with a die shrink to 45-nanometer, they managed to fit 2 ASICs in the same silicon, This shipped with the 2960-X/XR which replaced the 2960-S

  • higher stack speeds and features
  • limited layer 3 capabilities IP Lite feature (2960-XR only)
  • Better QoS and Netflow lite

Later down the line they silently rolled the ASIC design to a 32-nanometer process for better yield to achieve cheaper manufacturing costs

this switch is still being sold with no EOL announced as a cheaper Acess layer switch

On a side note – in 2017 Cisco released another version of the 2960 family the WS-2960-L This is a cheaper version built on a Marvel ASIC (Same as the SG55x) with a web UI and fanless design. I personally think this is the next evolution of their SMB market-oriented family the popular Cisco SG-5xx series. for the first the time the 2960 family had a fairly usable and pleasant web interface for the configuration and management. the new 9K series seems to be containing a more polished version of the web-UI

Unified Access Data Plane (UADP)

Due to the limitations in the fixed pipeline architecture and the costs involved with the re-rolling process to fix bugs they needed something new and had three options

As a compromise between all three Cisco Started dreaming up this programmable ASIC design in 2007-2008 the idea was to build a chip with programmable stages that can be updated with firmware updates instead of writing the logic into the silicon permanently.

they released the programmable ASIC technology initially for their QFP (Quantum flow processor) ASIC in the ISR router family to meet the customer needs (service providers and large enterprises)

This chip allowed them to support advanced routing technologies and new protocols without changing hardware simply via firmware updates improving the longevity of the investment allowing them to make more money out of the chips extended life cycle.

Eventually, this technology trickled downstream and the Doppler 1.0 was born

Improvements and features in UADP 1.0/1.1/2.0

  • Programmable stages in the pipeline
  • Cisco intent Driven networking support – DNA Center with ISE
  • Intergarted Stacking Support with Stack power – ASIC is built with pinouts for the stacking fabric allowing faster stacking performance
  • Rapid Recirculation (Encapsulation such as MPLS, VXLAN)
  • TrustSec
  • Advance on-chip QOS
  • Software-defined networking support – integrated NetFlow, SD access
  • Flex Parser – Programmable packet parser allowing them to introduce support for new protocols with firmware updates
  • On-chip Micro-engines – Highly specialized engines within in the chip to perform repetitive tasks such as
    • Reassembly
    • Encryption/Decryption
    •  Fragmentation
  • CAPWAP – Switch can function as a wireless Lan Controller
    • Mobility agent – Offload Qo and other functions from the WLC (IMO Works really nicely with multi-site wireless deployments)
    • Mobility Controller – Fullblown Wireless LAN controller (WLC)
  • Extended life cycle allowed integration of Cisco security technologies such as Cisco DNA + ISE later down the line even on the first generation switches
  • Multigig and 40GE speed support
  • Advanced malware detection capabilities via packet fingerprinting

Legacy Fixed pipeline architecture

Programmable pipeline architecture

Doppler/UADP 1.0 (2013)

While doppler1.0 programmable ASIC handling the Data plane coupled with Cavium CPU for the control plane the first generation of the switches to ship with these chip was the 3650-x and 3850-x gigabit versions

  • Built on 65 Nanometer Process
  • 1.3 billion transistors

UADP 1.1 (2015)

  • Die Shrink to 45 Nanometer
  • 3 billion transistors

UADP 2.0 (2017)

  • Built on 28nm/16nm Process
  • Equipped with an Intel Xeon D (Dual-core X86) CPU for the control plane
  • Open-IOS-XE

7.4 billion transistors

Flexible ASIC Templates –  

Allows Cisco to write templates that can optimize the chip resources for different use cases

the new Catalyst 9000 series will replace the following campus switching families built on the older Strider and more recent UADP 1 and 1.1 ASICS

  • Catalyst 2K —–> Catalyst 9200
  • Catalyst 3K —–> Catalyst 9300
  • Catalyst 4K —–> Catalyst 9400
  • Catalyst 6K —–> Catalyst 9500

I’m will update/fix this post when I find more info about the UADP 2 and the next evolution, stay tuned for a few more articles based on the silicon used in open networking X86 chassis.