Mapping the Concept Space Inside Language Models

How we used sparse features, auto-interpretability, and steering experiments to turn hidden activations into usable signals.

Introduction

Large language models are very good at producing polished outputs. They can explain inflation, summarise research, write code, imitate tone, and sound surprisingly confident about almost anything. But most of the time, we only see the final answer. We do not see the hidden activity that formed before the answer appeared.

That hidden activity is where much of the real story lives.

Before a model writes a response, it builds internal representations. Some concepts become active. Some remain quiet. Certain signals fire only for specific tokens. Some features cluster into larger neighbourhoods of meaning. If we only inspect the final output, we miss the internal concept space where behaviour first begins to form.

This work started from a simple question:

Can we map that hidden concept space well enough to search it, label it, test it, and steer with it?

That question is the foundation of what became Concept Studio. The goal was not to build another surface-level prompting tool. The goal was to create a workflow for understanding model internals in a way that is useful for research, control, and discovery. In practice, that meant building a system that could:

Extract sparse internal features from model activations
Automatically interpret and label those features
Let us search and inspect them
Identify feature families associated with a concept or behaviour
Compare examples to find what separates one behaviour from another
Test whether those features could actually move model behaviour

In other words, the objective was not simply to observe the model's internal space but to turn it into something more like a research instrument.

From Dense Activations to Sparse Features

The core technical challenge is that a language model's raw activations are hard to interpret directly. At each layer and token position, the model produces a dense activation vector. These vectors contain a huge amount of information, but they are not naturally human-readable. A single dimension rarely corresponds neatly to a concept like "financial analysis" or "central bank policy." Instead, the model tends to pack many ideas into overlapping directions.

Sparse autoencoders provide a way to make that space more legible. A sparse autoencoder learns a dictionary of latent features that can reconstruct the model's internal activations using only a small number of active components at a time. The key word is sparse: if every feature were active all the time, the decomposition would not help much. But if only a small set of features fires for a given input, those features become inspectable.
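To make the idea concrete, here is a minimal sketch of a sparse autoencoder's forward pass in NumPy. It is an illustration under stated assumptions, not the implementation used in this work: the weights are random and untrained, the dimensions are arbitrary, and sparsity is enforced by keeping only the k largest activations, which is one common choice among several (an L1 penalty during training is another). The names sae_encode and sae_decode are hypothetical.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec, k):
    """Map a dense activation vector to sparse feature activations."""
    # ReLU pre-activations: one non-negative value per dictionary feature.
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Keep only the k largest values; every other feature is set to zero.
    acts = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]
    acts[top] = pre[top]
    return acts

def sae_decode(acts, W_dec, b_dec):
    """Reconstruct the dense activation from the sparse features."""
    return acts @ W_dec + b_dec

# Toy setup (assumption): random, untrained weights with arbitrary sizes.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()          # tied initialisation, for illustration only
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)    # stand-in for one token's activation vector
acts = sae_encode(x, W_enc, b_enc, b_dec, k)
x_hat = sae_decode(acts, W_dec, b_dec)

active = np.flatnonzero(acts)   # indices of the features that fired
```

In a trained autoencoder, x_hat would approximate x, and the handful of indices in active would be the features worth inspecting for that token.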

That sparse feature space is what makes the rest of the workflow possible. A feature in this space is not just a number. It becomes an object we can work with: something that has a layer, an index, an activation pattern, top-activating examples, and eventually a label. Once those pieces exist, the internal representation starts becoming searchable and testable.

This is the shift from "the model has hidden activations" to "the model has candidate internal concepts."

How We Built the Feature Dictionary

The first important artefact in the system is the feature dictionary.

For each sparse feature, we store a structured set of metadata: the model and layer it comes from, the feature index, activation statistics, top-activating examples, lower-activating or near-miss examples, an automatically generated label, and supporting notes about its likely meaning.
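One way to represent such a record is a small dataclass. This is a hypothetical schema that mirrors the fields listed above; the class names, field names, and the placeholder values in the example are assumptions for illustration and do not describe the actual storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Example:
    text: str           # the text span
    activation: float   # feature activation on that span

@dataclass
class FeatureRecord:
    model: str                      # model the feature was extracted from
    layer: int                      # layer the feature was extracted from
    index: int                      # position in the sparse dictionary
    mean_activation: float          # activation statistics
    max_activation: float
    firing_rate: float              # fraction of tokens where it is non-zero
    top_examples: List[Example] = field(default_factory=list)
    near_misses: List[Example] = field(default_factory=list)
    label: Optional[str] = None     # auto-generated, may be missing
    notes: str = ""                 # supporting notes on likely meaning

    @property
    def key(self) -> str:
        """Stable identifier combining model, layer, and feature index."""
        return f"{self.model}/L{self.layer}/F{self.index}"

# Placeholder values only; none of these numbers come from a real model.
record = FeatureRecord(
    model="example-model",
    layer=12,
    index=4087,
    mean_activation=0.03,
    max_activation=7.9,
    firing_rate=0.002,
    top_examples=[Example("The Federal Reserve raised interest rates", 7.9)],
    near_misses=[Example("The bank raised its overdraft fees", 0.4)],
)
```

Keeping the label optional makes it explicit that a feature exists before it is interpreted, and that the label is attached evidence rather than part of the feature's identity.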

Our approach to labelling was based on auto-interpretability. For each feature, we gathered the text spans that activated it most strongly. We also looked at examples that were similar but did not activate the feature as much. This contrast is important. If we only inspect top activations, the label can easily become too broad. By comparing high-activation and near-miss examples, we can sharpen the boundary of what the feature really responds to.
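The contrast between top activations and near-misses can be sketched as follows. This is an illustrative heuristic, not the selection rule used in this work: similarity is approximated by word overlap with the top examples, the threshold low_frac is an arbitrary assumption, and the example sentences and activation values are invented for the demonstration.

```python
def split_evidence(examples, n_top=2, n_near=2, low_frac=0.2):
    """Split (text, activation) pairs into top examples and near-misses.

    Near-misses are weakly activating examples that look most similar
    to the top examples, using word overlap as a crude similarity proxy.
    """
    ranked = sorted(examples, key=lambda e: e[1], reverse=True)
    top = ranked[:n_top]
    peak = top[0][1]
    top_vocab = {w for text, _ in top for w in text.lower().split()}

    def overlap(text):
        words = set(text.lower().split())
        return len(words & top_vocab) / max(len(words | top_vocab), 1)

    # Candidates: everything outside the top set with a weak activation.
    weak = [e for e in ranked[n_top:] if e[1] <= low_frac * peak]
    near = sorted(weak, key=lambda e: overlap(e[0]), reverse=True)[:n_near]
    return top, near

# Invented examples and activations, for illustration only.
examples = [
    ("the central bank raised interest rates", 8.0),
    ("the fed held interest rates steady", 7.0),
    ("the bank raised its overdraft fees", 0.5),
    ("interest in the museum exhibit rose", 0.3),
    ("the river bank flooded overnight", 0.1),
    ("she baked bread on sunday", 0.0),
]
top, near = split_evidence(examples)
top_texts = [text for text, _ in top]
near_texts = [text for text, _ in near]
```

Here the near-misses share surface vocabulary ("bank", "raised") with the top examples but barely activate the feature, which is exactly the kind of evidence that pushes a label from "banks" towards something narrower.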

An interpretation model then uses this evidence to propose a short, human-readable label. A good label should be specific enough to be falsifiable:

"Finance" is usually too broad.
"Economic policy language" is better.
"References to central banks and interest-rate decisions" is better still - if that is what the evidence supports.

Figure 2 shows how this pipeline works in practice: a latent feature, its top-activating examples, its near-misses, and the auto-generated searchable label.

Figure 2 - Auto-interpretability pipeline. A latent feature is characterised by its top-activating examples and near-miss examples. An auto-label and searchable concept name are generated, turning an opaque feature index into a human-readable internal concept that can be browsed, filtered, and acted on.

That distinction matters because the label becomes the bridge between model internals and human reasoning. If the label is vague, we get a nice story. If the label is precise, we get a usable signal.

Search Is Not Enough - Activation Is the Real Test

Once features are labelled, the next step is to make them searchable. But there are actually two different kinds of search happening here.

The first is description search, the intuitive one: searching for features by label, explanation, or concept phrase, such as "financial expert analyst," "monetary policy," or "formal professional tone."

The second is activation search. Instead of asking which features sound related to a concept, we ask which features actually fire when a real input passes through the model.

This distinction turned out to be one of the most important findings in the project. A feature can have a very promising label and still fail to activate on the example we actually care about. Another feature may fire strongly even though its label was not the first thing we would have searched for.

Search gives us hypotheses. Activations give us evidence.

A robust workflow needs both. Description search helps us navigate the feature space. Activation search shows what the model is actually doing for a live example. When the two agree, we gain confidence. When they disagree, that disagreement is often where the interesting insight begins.
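The two kinds of search, and the comparison between them, can be sketched in a few lines. This is a toy illustration: description search is reduced to word overlap with the label (a real system would more likely use embeddings), and the feature ids, labels, activations, and threshold are all invented.

```python
def description_search(labels, query):
    """Rank features whose label shares at least one word with the query."""
    q = set(query.lower().split())
    scores = {fid: len(q & set(label.lower().split()))
              for fid, label in labels.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [fid for fid, score in ranked if score > 0]

def activation_search(live_acts, threshold):
    """Rank features that actually fire on a live input."""
    ranked = sorted(live_acts.items(), key=lambda kv: kv[1], reverse=True)
    return [fid for fid, act in ranked if act >= threshold]

# Hypothetical labels and per-feature peak activations on one input.
labels = {
    101: "monetary policy and central banks",
    202: "formal professional tone",
    303: "numeric percentages",
}
live_acts = {101: 0.0, 202: 1.2, 303: 4.5}

by_label = description_search(labels, "monetary policy")
by_activation = activation_search(live_acts, threshold=1.0)

agree = [f for f in by_label if f in by_activation]
label_only = [f for f in by_label if f not in by_activation]
activation_only = [f for f in by_activation if f not in by_label]
```

In this toy case feature 101 has the most promising label but does not fire, while 303 and 202 fire without matching the query. The label_only and activation_only lists are where the disagreement described above shows up.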

Token-Level Maps: Where Does the Concept Fire?

One of the most revealing parts of the workflow was looking at feature activations token by token.

Consider a sentence like: "The Federal Reserve raised interest rates to combat inflation." If we run this sentence through the model and inspect feature activations, we find that one feature peaks around "Federal Reserve," another around "interest rates," and another around "inflation."

That tells us something important: the model is not simply activating a single "finance mode." Different pieces of the sentence are triggering different parts of the internal concept space. Figure 3 shows this directly: a token-by-feature heatmap where colour intensity reveals which tokens most strongly activate each internal concept.

Figure 3 - Token-by-feature activation heatmap. Colour intensity shows which tokens most strongly activate each concept, revealing the model's internal reading of the sentence word by word.

Token-level inspection is useful because it helps us answer questions like: Is a concept genuinely tied to the right part of the text? Is a feature firing because of semantic meaning or because of a superficial lexical shortcut? Are two similar prompts activating the same feature family for the same reason? This is where the feature label becomes testable, not just descriptive.
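A token-level map is simply a matrix with one row per token and one column per feature. The sketch below builds such a matrix for the example sentence and reads off where each feature peaks. The activation values are invented to mirror the pattern described above, the word-level tokenisation is a simplification, and the helper names are hypothetical.

```python
import numpy as np

tokens = ["The", "Federal", "Reserve", "raised", "interest",
          "rates", "to", "combat", "inflation", "."]
feature_names = ["central banks", "interest rates", "inflation"]

# Invented activations, shape (n_tokens, n_features).
acts = np.zeros((len(tokens), len(feature_names)))
acts[1, 0], acts[2, 0] = 2.0, 3.0   # "Federal", "Reserve"
acts[4, 1], acts[5, 1] = 1.5, 2.5   # "interest", "rates"
acts[8, 2] = 4.0                    # "inflation"

def peak_tokens(acts, tokens, feature_names):
    """Return the token on which each feature is most active."""
    peaks = acts.argmax(axis=0)
    return {name: tokens[i] for name, i in zip(feature_names, peaks)}

def render_heatmap(acts, tokens, shades=" .:*#"):
    """Render the matrix as text, one row per token, darker = stronger."""
    scale = acts.max() or 1.0       # avoid dividing by zero
    rows = []
    for tok, row in zip(tokens, acts):
        cells = "".join(
            shades[int(round(v / scale * (len(shades) - 1)))] for v in row
        )
        rows.append(f"{tok:>10} |{cells}|")
    return "\n".join(rows)

peaks = peak_tokens(acts, tokens, feature_names)
heatmap = render_heatmap(acts, tokens)
```

Printing heatmap gives a quick text view of the same information as a colour plot, which is often enough to spot a feature firing on the wrong token.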

Feature Families, Not Single "Magic Neurons"

One of the strongest findings from the work was that useful concepts rarely live in a single feature.

It is tempting to imagine that somewhere inside the model there is one clean "finance feature," one "biology feature," or one "safety feature." That is a helpful story, but it is usually too simple. What we found instead is that robust concepts tend to behave like feature families.

A finance-related neighbourhood may include signals for central banks, macroeconomic language, valuation, earnings tone, risk framing, institutional references, and analytical style. A safety-related region may include refusal patterns, risky instructions, policy-sensitive language, and adversarial phrasing.

This matters because it changes how we think about concept discovery. The question is no longer "Which feature is the finance feature?" It becomes "Which family of related features consistently represents finance-like behaviour across different examples?" That shift makes the workflow more stable, more realistic, and much more useful for both steering and analysis.

Contrastive Fingerprints: "More Like This, Less Like That"

Keyword search is useful when we already know the name of the concept we want. But often we do not. Sometimes what we know is not a label, but a contrast.

We want an answer to sound more like a scientific explanation and less like a generic summary. More like an analyst note and less like a casual response. More mechanism, less fluff.

This is where contrastive feature discovery became especially valuable. The workflow is simple:

Choose an avoid example representing behaviour we want less of
Choose an encourage example representing behaviour we want more of
Inspect the feature activations for both
Identify which features are much stronger in one versus the other

The result is a contrastive fingerprint: features to boost, features to suppress. This turns a human preference into an internal representation. Figure 4 shows the full flow.
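The four steps above can be sketched as a single function. This is a minimal version under stated assumptions: each example is summarised by its peak activation per feature, the gap threshold min_gap is arbitrary, and the activation matrices are invented. A more careful version would average over several avoid and encourage examples.

```python
import numpy as np

def contrastive_fingerprint(avoid_acts, encourage_acts, top_n=2, min_gap=0.5):
    """Compare two (n_tokens, n_features) activation matrices.

    Returns (boost, suppress): feature indices much stronger in the
    encourage example, and indices much stronger in the avoid example.
    """
    avoid = avoid_acts.max(axis=0)           # peak activation per feature
    encourage = encourage_acts.max(axis=0)
    gap = encourage - avoid
    order = np.argsort(gap)                  # ascending: most "avoid" first
    boost = [int(i) for i in order[::-1][:top_n] if gap[i] >= min_gap]
    suppress = [int(i) for i in order[:top_n] if gap[i] <= -min_gap]
    return boost, suppress

# Invented activations for four features over two tokens each.
avoid_acts = np.array([[0.0, 3.0, 0.2, 1.0],
                       [0.0, 2.0, 0.1, 1.0]])
encourage_acts = np.array([[2.5, 0.0, 0.3, 1.0],
                           [1.0, 0.5, 0.2, 1.0]])

boost, suppress = contrastive_fingerprint(avoid_acts, encourage_acts)
```

Features 2 and 3 are roughly equal in both examples, so they fall below the gap threshold and stay out of the fingerprint. Only the features that genuinely separate the two behaviours are kept.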

Figure 4 - Contrastive fingerprint construction. Avoid examples are contrasted with encourage examples. The comparison identifies features to suppress and features to boost, producing a concept fingerprint that captures the behavioural difference as an internal signal set.

Contrastive examples often capture behaviour better than keyword search alone. A keyword gives a label. A contrast gives a difference. And in practice, difference is often what we care about.

Steering Experiments: Testing Whether the Features Matter

Once we have an interpretable feature or feature family, the next question is whether it can actually move behaviour. That is where steering comes in.

At a high level, steering means nudging the model's internal state along selected feature directions during generation. If a feature corresponds to a particular behaviour, increasing it should increase the probability of that behaviour appearing in the output. Suppressing a feature should reduce it.

But steering is not a magic button. It behaves more like a dose-response experiment. At low strength, a feature may do very little. At moderate strength, the desired behaviour may emerge while the answer stays coherent. At high strength, the model may overdo it or drift too far from the original task.

That is why we used strength sweeps. For the same prompt and the same feature set, we tested multiple steering strengths and compared the resulting outputs. Figure 5 shows the complete picture: the amplify/suppress feature selection panel, the alignment curve with a clear sweet spot at moderate strength, and the before/after comparison showing target signals increasing while unwanted signals decrease.

Figure 5 - Steering mechanism and alignment curve. The alignment curve reveals a sweet spot at moderate strength. Before and after steering: target signals increase while unwanted signals decrease, confirming that the selected features genuinely influence model predictions.

Steering quality depends as much on calibration as on feature selection. A strong feature at the wrong strength can produce a poor result. A well-chosen feature family at the right strength often produces a much cleaner shift.
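The arithmetic behind a strength sweep can be sketched on a single hidden-state vector. This is a simplified illustration, not the steering mechanism used in this work: in practice the shift would be applied inside the model at a chosen layer during generation, and output quality would be judged on generated text. Here the hidden state and feature directions are toy values, and cosine similarity to the original state stands in as a rough proxy for drift.

```python
import numpy as np

def steer(hidden, directions, weights, strength):
    """Shift a hidden state along weighted unit feature directions.

    Positive weights amplify a feature, negative weights suppress it.
    """
    units = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return hidden + strength * (weights @ units)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def strength_sweep(hidden, directions, weights, strengths):
    """Apply the same feature set at several strengths and record effects."""
    units = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    rows = []
    for s in strengths:
        steered = steer(hidden, directions, weights, s)
        rows.append({
            "strength": s,
            # Projection of the steered state onto each feature direction.
            "projections": (units @ steered).tolist(),
            # Drift proxy: how close the state stays to the original.
            "similarity": cosine(steered, hidden),
        })
    return rows

# Toy values (assumption): two orthogonal directions in an 8-dim space.
hidden = np.ones(8)
directions = np.eye(8)[:2]
weights = np.array([1.0, -1.0])     # amplify feature 0, suppress feature 1

sweep = strength_sweep(hidden, directions, weights, [0.0, 0.5, 1.0, 2.0, 4.0])
boosted = [row["projections"][0] for row in sweep]
suppressed = [row["projections"][1] for row in sweep]
similarity = [row["similarity"] for row in sweep]
```

Even in this toy setting the trade-off is visible: the boosted projection rises and the suppressed one falls as strength grows, while similarity to the original state drops. Choosing a strength means balancing those two effects.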

Reinspection: Did the Internal Signal Actually Move?

A model output can look different for many reasons. It may sound more analytical because it used a few finance-like words. It may sound more formal because the phrasing changed. But if we want to understand whether the intervention really worked, we need to look back inside the model.

That is why the final step in the workflow is reinspection. After steering, we run the generated text back through the same activation pipeline and check whether the intended feature families became more active - and whether the suppressed features decreased.

This changes the workflow from "the answer looks better" to "the internal signal moved in the direction we intended." That is the difference between a nice demo and a real research loop.
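A reinspection check can be as simple as comparing per-feature activations before and after steering. The sketch below assumes activations have already been summarised as one number per feature. The function name, the min_change threshold, and the values are hypothetical.

```python
def reinspect(before, after, boost, suppress, min_change=0.1):
    """Check whether each feature moved in the intended direction.

    before/after map feature id -> summary activation (e.g. the mean)
    on the original and the steered output respectively.
    """
    report = {}
    for fid in boost:
        delta = after[fid] - before[fid]
        report[fid] = {"intended": "up", "delta": delta,
                       "moved": delta >= min_change}
    for fid in suppress:
        delta = after[fid] - before[fid]
        report[fid] = {"intended": "down", "delta": delta,
                       "moved": delta <= -min_change}
    return report

# Invented activations for three features.
before = {0: 0.4, 1: 2.0, 2: 1.0}
after = {0: 1.6, 1: 0.5, 2: 1.05}

report = reinspect(before, after, boost=[0, 2], suppress=[1])
confirmed = all(entry["moved"] for entry in report.values())
```

In this example features 0 and 1 moved as intended, but feature 2 barely changed, so confirmed is False. The output may still look better, yet the check flags that part of the intended internal shift did not happen.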

What We Found

Several patterns emerged repeatedly across experiments.

Feature labels are useful, but not sufficient. A label helps us navigate the concept space, but it is only a hypothesis until activation evidence supports it.

The most useful concepts behave like families, not isolated units. This was true across domains and examples.

Contrastive fingerprints were often more expressive than simple keyword search. They captured behaviour, not just terminology.

Token-level inspection was surprisingly important. It helped distinguish genuine semantic signals from accidental triggers.

Steering required calibration. More strength did not automatically mean better results. The useful range was usually somewhere in the middle.

Reinspection made the entire workflow more rigorous. It forced us to check whether a behaviour shift was accompanied by a meaningful internal signal shift.

Without these findings, the feature space is interesting. With them, it becomes actionable.

Why This Matters Beyond the Lab

If we can expose and label internal concept spaces, we gain a new kind of interface to a model's learned knowledge. Figure 6 shows the breadth of where this applies.

Figure 6 - Cross-domain applicability. The same internal-signal workflow supports discovery and control across every domain where a model makes inspectable decisions, from finance and safety to biology and healthcare.

In finance, such signals could help uncover internal representations of macro stress, valuation language, policy sensitivity, or risk framing. In safety, they could expose the early internal precursors of risky behaviour. In science and healthcare, similar techniques may help surface model-internal structures related to mechanisms, pathways, or domain-specific hypotheses.

The broader pattern is:

model activations → sparse features → interpreted concepts
        → searchable space → controlled interventions → usable signals

That is why this research matters. It is not only about making models easier to control. It is about making their hidden internal structure more accessible to human reasoning.

Limitations

This approach is promising, but it should be described honestly.

Sparse features do not capture all of model computation. Automated labels are not ground truth. A feature that is good for classification may not be good for steering. Steering is probabilistic, not deterministic. And a beautiful feature map is not the same thing as causal proof.

The right claim is not "we understand the model." The right claim is:

We can map useful parts of its internal concept space and turn them into searchable, testable, and sometimes steerable signals. That is already a meaningful step.

Conclusion

The first wave of AI tooling taught us how to prompt models. The next wave will teach us how to inspect their internal representations.

Sparse feature spaces give us a way into that hidden layer. Auto-interpretability gives those features names. Activation maps show when they fire. Feature neighbourhoods reveal broader structures. Contrastive fingerprints extract behavioural differences. Steering tests whether those signals matter. Reinspection checks whether the internal state actually changed.

That is the research arc behind Concept Studio. Not just better outputs. Better instruments for understanding what happens inside the model before the output exists.

NeuronLens Research - May 2026