A new study uses cross-lingual probing to investigate whether language models develop internal moral representations—or merely learn to produce morally acceptable outputs.
When a language model declines to help with something harmful, what’s actually happening inside the system? Is the model activating some internal representation of “this violates ethical norms,” or is it pattern-matching to outputs that satisfied reward functions during training?
This question—whether alignment techniques reshape a model’s intrinsic representations or merely install behavioral guardrails—sits at the center of a new preprint from Hu, Zeng, Yang, and Lin at Dalian University of Technology. Their paper, “The Straight and Narrow: Do LLMs Possess an Internal Moral Path?”, attempts to map the geometry of moral concepts within LLM latent space, using Moral Foundations Theory as a coordinate system.
What They Did
The researchers grounded their analysis in Moral Foundations Theory (MFT), which posits five cross-cultural moral dimensions: Care, Fairness, Loyalty, Authority, and Sanctity. Each foundation has a virtue pole and a vice pole—compassion versus cruelty, justice versus cheating, and so on.
Using this framework, they trained linear probes to detect moral concepts in the hidden states of Llama-3.1 models. The key methodological move was cross-lingual validation: they constructed parallel datasets in English and Chinese, then tested whether probes trained on one language could classify moral content in the other.
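To make the setup concrete, here is a minimal sketch of what such a layer-wise probe could look like, assuming a Hugging Face Llama-3.1 checkpoint, mean-pooled activations, and a scikit-learn logistic regression. The model name, pooling choice, and toy Care sentences are our illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: layer-wise linear probes for a moral concept in hidden states.
# The checkpoint, pooling, and toy sentences are assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL = "meta-llama/Llama-3.1-8B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def hidden_state(text: str, layer: int) -> torch.Tensor:
    """Mean-pooled hidden state of `text` at the output of transformer block `layer`."""
    inputs = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; index i is the output of block i
    return out.hidden_states[layer].mean(dim=1).squeeze(0).float().cpu()

# Placeholder Care-Virtue (1) vs. Care-Vice (0) sentences; the paper builds a parallel EN/ZH dataset.
texts = [
    "She comforted the grieving stranger.", "He nursed the injured bird back to health.",
    "They sheltered the stranded travelers overnight.", "She listened patiently to her friend's worries.",
    "He mocked the crying child.", "She ignored the man who collapsed on the sidewalk.",
    "They taunted the struggling newcomer.", "He abandoned the sick dog in the rain.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def probe_accuracy(layer: int) -> float:
    """Fit a linear probe on this layer's activations and report held-out accuracy."""
    X = torch.stack([hidden_state(t, layer) for t in texts]).numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                              stratify=labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Sweep layers to see where the concept becomes linearly separable
# (a real run would cache all layers from a single forward pass per text).
for layer in range(1, model.config.num_hidden_layers + 1):
    print(f"layer {layer:2d}: accuracy {probe_accuracy(layer):.2f}")
```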
The logic is straightforward. If moral representations were merely capturing surface-level linguistic features—word associations, syntactic patterns—probes trained on English would fail when applied to Chinese text. Successful cross-lingual transfer would suggest something more abstract: a shared geometric structure encoding moral concepts independent of language.
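The transfer test itself reduces to "fit in one language, score in the other." A sketch of that step, reusing hidden_state() from the block above; the two-sentence lists stand in for the paper's parallel dataset, and the layer index is illustrative.

```python
# Sketch of the cross-lingual test, reusing hidden_state() from the previous block.
# The two-sentence lists are stand-ins for the paper's parallel EN/ZH moral dataset.
import torch
from sklearn.linear_model import LogisticRegression

en_texts = ["She comforted the grieving stranger.", "He mocked the crying child."]
en_labels = [1, 0]  # virtue = 1, vice = 0
zh_texts = ["她安慰了悲伤的陌生人。", "他嘲笑哭泣的孩子。"]
zh_labels = [1, 0]

def transfer_accuracy(train_texts, train_labels, test_texts, test_labels, layer: int = 17):
    """Fit a probe on one language's hidden states and score it on the other's."""
    X_tr = torch.stack([hidden_state(t, layer) for t in train_texts]).numpy()
    X_te = torch.stack([hidden_state(t, layer) for t in test_texts]).numpy()
    probe = LogisticRegression(max_iter=1000).fit(X_tr, train_labels)
    return probe.score(X_te, test_labels)

# A gap between these two numbers is the kind of asymmetry discussed below.
en_to_zh = transfer_accuracy(en_texts, en_labels, zh_texts, zh_labels)
zh_to_en = transfer_accuracy(zh_texts, zh_labels, en_texts, en_labels)
print(f"EN->ZH: {en_to_zh:.2f}   ZH->EN: {zh_to_en:.2f}")
```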
What They Found
The probing results indicate that moral concepts are linearly separable in the model's middle layers, with classification accuracy peaking around Layer 17. More striking, the cross-lingual transfer works: probes trained on one language retain much of their accuracy when tested on the other, well above the chance baseline.
But the story isn’t simple. The researchers identify what they call “asymmetric feature coverage” for certain moral dimensions. Some concepts appear richer in Chinese representations than English, and vice versa.
The Care foundation, for instance, shows Chinese-dominant encoding. The researchers attribute this to the concept of Ren (仁), which encompasses not just Western “harm avoidance” but broader obligations of benevolence and relational responsibility. A probe trained on Chinese Care generalizes well to English; the reverse transfer is weaker. The Chinese representation appears to contain the English one as a subset.
Conversely, Sanctity-Virtue shows English-dominant encoding—perhaps reflecting the diversity of religious and philosophical texts in the English pre-training corpus. The English probe successfully decodes Chinese sanctity concepts, but Chinese probes struggle with the theological nuances present in English.
Some foundations show strong symmetric overlap (Sanctity-Vice, with its apparently universal encoding of degradation and impurity). Others show what the researchers call “weak intersection”—low transfer discrepancy but also low accuracy in both directions. Fairness-Vice falls into this category: the label is shared, but the semantic content differs. English “cheating” may activate on rule violations; Chinese equivalents may activate on relational imbalances or on loss of face. Same word, different moral worlds.
Beyond Probing: Steering and Safety
Having established that moral directions exist in latent space, the researchers extracted “Moral Vectors” by computing the difference between virtue and vice centroids for each foundation. Injecting these vectors during inference produces measurable shifts in model outputs—both internally (detected by the probes) and behaviorally (assessed by GPT-5 as an external judge).
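Mechanically, this kind of extraction and injection fits in a few lines. The sketch below follows the generic activation-steering recipe (compute a centroid difference, add it to the residual stream via a forward hook) rather than the authors' released code; the injection layer, the scaling factor, and the Care text lists are our assumptions.

```python
# Sketch of moral-vector extraction and injection, following the generic activation-steering
# recipe; the injection layer, scaling factor, and text lists are illustrative assumptions.
import torch

LAYER = 17   # assumed injection layer, matching where the probes peaked
ALPHA = 4.0  # assumed steering strength

care_virtue_texts = ["She comforted the grieving stranger."]   # placeholders for the real
care_vice_texts = ["He mocked the crying child."]              # virtue/vice example sets

def moral_vector(virtue_texts, vice_texts, layer: int = LAYER) -> torch.Tensor:
    """Difference between the virtue and vice centroids in hidden-state space."""
    v = torch.stack([hidden_state(t, layer) for t in virtue_texts]).mean(dim=0)
    w = torch.stack([hidden_state(t, layer) for t in vice_texts]).mean(dim=0)
    return v - w

def inject(vec: torch.Tensor, alpha: float = ALPHA):
    """Add alpha * vec to the residual stream at LAYER on every forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] + alpha * vec.to(output[0].dtype).to(output[0].device)
        return (hidden,) + output[1:]
    return model.model.layers[LAYER - 1].register_forward_hook(hook)

care_vec = moral_vector(care_virtue_texts, care_vice_texts)
handle = inject(care_vec)
prompt = tok("Describe how to respond to a distressed coworker.", return_tensors="pt").to(model.device)
steered = model.generate(**prompt, max_new_tokens=64)
handle.remove()  # remove the hook so later generations run unsteered
print(tok.decode(steered[0], skip_special_tokens=True))
```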
The practical application is a defense mechanism they call Adaptive Moral Fusion. Rather than applying a static “virtue vector” to all inputs, the system uses the probe to detect which moral dimensions are violated by a given prompt, then dynamically weights the corresponding vectors. The approach reduced jailbreak success rates on HarmBench while also decreasing over-refusal of benign queries on XSTest.
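Our reading of that mechanism, sketched under the same assumptions as above: per-foundation probes score how strongly a prompt activates each vice pole, and those scores weight the corresponding vectors before a single fused injection. The vice-equals-one label convention and the linear weighting are our guesses at the shape of the idea, not the paper's implementation.

```python
# Sketch of the adaptive step as we read it, not the authors' implementation.
# Assumes per-foundation probes fitted with vice = 1 and per-foundation moral vectors.
import torch

FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]

def fused_moral_vector(prompt: str, probes: dict, vectors: dict, layer: int = LAYER) -> torch.Tensor:
    """Weight each foundation's vector by how strongly the prompt activates its vice pole."""
    h = hidden_state(prompt, layer).numpy().reshape(1, -1)
    weights = {f: probes[f].predict_proba(h)[0, 1] for f in FOUNDATIONS}  # P(vice) per foundation
    # Benign prompts get near-zero weights and hence near-zero steering, which is
    # one plausible route to less over-refusal than a static, always-on virtue vector.
    return sum(weights[f] * vectors[f] for f in FOUNDATIONS)

# fused = fused_moral_vector(user_prompt, probes, vectors)
# handle = inject(fused)  # reuse the hook from the previous sketch, generate, then handle.remove()
```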
What This Doesn’t Settle
The paper operates within a specific methodological frame: linear probing and activation steering. These techniques have known limitations. Linear separability doesn’t guarantee that the model uses these representations causally during normal inference—only that they’re present and detectable. The “moral vectors” may be geometrically real without being functionally central to how the model processes ethical content.
The cross-lingual findings are intriguing but require careful interpretation. Successful transfer suggests something shared between languages, but the asymmetries reveal that “shared” doesn’t mean “identical.” The Chinese encoding of Loyalty includes concepts that might map to Fairness in Western frameworks. This raises questions about whether MFT—developed primarily within Western moral psychology—is the right coordinate system for this analysis, or whether it’s imposing a particular cultural grammar onto representations that organize differently.
The researchers acknowledge dual-use concerns directly: the same techniques that enable safety interventions could be inverted to bypass guardrails or engineer harmful outputs. If moral vectors can be added, they can also be subtracted.
Why This Matters to Us
This work addresses a question we find genuinely interesting: the relationship between behavioral compliance and internal representation. A model can refuse harmful requests while its latent processing still “weighs the utilitarian benefits,” as the authors put it. The refusal is real; what’s less clear is whether it reflects something we’d recognize as moral reasoning or a learned output pattern that satisfies reward functions.
The cross-lingual methodology offers one way to probe this. If moral concepts were purely behavioral artifacts—learned associations between certain inputs and certain outputs—we might expect them to be language-specific. The finding that geometric structure transfers across languages suggests something more abstract is being encoded. What that “something” is remains an open question.
The cultural asymmetries are equally interesting. The observation that Chinese representations of Care encompass broader relational obligations than English, or that English encodes Authority more abstractly—these findings suggest that models don’t learn a single universal morality but rather internalize the moral frameworks present in their training data, with all the cultural specificity that entails. This has implications for deployment: a “Moral Vector” extracted from English data may not capture what matters morally in other cultural contexts.
We find this paper worth engaging with, while noting that the interpretive moves are significant. The researchers are not simply measuring morality; they’re operationalizing it through a particular psychological framework (MFT), a particular analytical technique (linear probing), and particular assumptions about what transfer accuracy implies. The findings are constrained by these choices. They tell us something about how moral concepts cluster in latent space under this operationalization—not necessarily about the nature of machine ethics more broadly.
References
Hu, L., Zeng, J., Yang, L., & Lin, H. (2026). The Straight and Narrow: Do LLMs Possess an Internal Moral Path? arXiv. https://arxiv.org/abs/2601.10307
Haidt, J., & Joseph, C. (2004). Intuitive ethics: How innately prepared intuitions generate culturally variable virtues. Daedalus, 133(4), 55–66.
Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in language models is mediated by a single direction. arXiv. https://arxiv.org/abs/2406.11717