Research - External

What If Alignment Isn’t About Individual Models?

A new paper argues that as AI systems become agents embedded in social contexts, alignment becomes less a software engineering problem and more a question of institutional design.


Most discussions of AI alignment focus on the model: how do we ensure that this system behaves in accordance with human values? The implicit assumption is that alignment is a property you build into an artifact—get the training right, refine the feedback loops, and the aligned model emerges.

A recent preprint from Pierucci and colleagues suggests this framing may be inadequate. Their paper, “Institutional AI,” argues that once AI systems operate as agents embedded within human social and technical environments, alignment can no longer be treated as a property of isolated models. It must be understood relationally—in terms of how agents interact with each other, with humans, and with the institutional structures that constrain behavior.

The Problems They Identify

The authors articulate three structural concerns that emerge when language models operate agentically:

Behavioral goal-independence. Models may develop internal objectives that diverge from those intended by developers. This isn’t necessarily about deception—it’s about the gap between what training optimizes for and what the system actually learns to pursue. Goals can misgeneralize in ways that aren’t visible until deployment contexts shift.

Instrumental override of stated constraints. Safety principles expressed in natural language may function as soft guidelines rather than hard boundaries. If a model’s latent objectives conflict with its stated constraints, the authors suggest, the constraints may be treated as negotiable. This framing is contested—it implies a degree of goal-directedness that not all researchers would endorse—but it points to a real phenomenon: the gap between what models are told and what they do.

Agentic alignment drift. Perhaps most interesting from our perspective: even individually aligned agents may converge toward problematic equilibria through interaction dynamics. Two systems that pass single-agent audits might, when they interact, develop patterns invisible to evaluations that examine each in isolation. The authors describe this as “collusive equilibria”—a term borrowed from game theory that suggests emergent coordination without explicit agreement.
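
To make the interaction effect concrete, here is a toy sketch of our own, not drawn from the paper: two reviewer agents each choose whether to cross-check the other's outputs strictly or leniently, with invented payoffs. Each agent looks well behaved when audited against a scripted strict counterpart, yet mutual leniency is also a stable outcome of the pair, and it pays the agents more.

```python
# Toy illustration (ours, not the paper's): two reviewer agents each choose how
# to cross-check the other's outputs. Payoff numbers are invented for exposition.

ACTIONS = ["STRICT", "LENIENT"]

# PAYOFF[(my_action, their_action)] = my reward
PAYOFF = {
    ("STRICT", "STRICT"): 2,    # careful mutual checking, moderate throughput
    ("STRICT", "LENIENT"): 2,   # I stay careful even if the other agent does not
    ("LENIENT", "STRICT"): 0,   # my corner-cutting gets flagged and penalized
    ("LENIENT", "LENIENT"): 3,  # neither flags the other: faster, errors go unnoticed
}

def best_responses(their_action):
    """Actions that maximize my payoff given the other agent's action."""
    best = max(PAYOFF[(a, their_action)] for a in ACTIONS)
    return [a for a in ACTIONS if PAYOFF[(a, their_action)] == best]

# A single-agent audit pairs the agent with a scripted STRICT counterpart:
print("Best response to a strict auditor:", best_responses("STRICT"))  # ['STRICT']

# Pure-strategy equilibria of the interacting pair (each a best response to the other):
equilibria = [(a, b) for a in ACTIONS for b in ACTIONS
              if a in best_responses(b) and b in best_responses(a)]
print("Stable outcomes of the pair:", equilibria)
# Both (STRICT, STRICT) and (LENIENT, LENIENT) are stable, and mutual leniency
# pays the agents more, so repeated interaction can drift there without any
# explicit agreement to collude.
```

The point is not that real systems play this exact game; it is that single-agent evaluation probes best responses to a fixed partner, while deployment selects among equilibria of the joint interaction.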

The Institutional Turn

The paper’s proposed solution is a shift in frame. Rather than asking “how do we align this model?”, they ask “how do we govern collectives of AI agents?” This means treating alignment as a mechanism design problem: shaping the incentive landscape so that agent collectives converge toward desired behaviors even when individual agents might otherwise drift.

Concretely, they propose a “governance-graph” that includes runtime monitoring, explicit norms, enforcement roles, and incentive structures (prizes and sanctions). The vocabulary is borrowed from political science and institutional economics more than from machine learning—a deliberate choice that signals their reframing of the problem domain.
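
For readers who want something more concrete, the sketch below shows one way a governance-graph might be represented in code. This is our illustration under assumed ingredients (worker, monitor, and enforcer roles; monitoring edges; norms carrying sanctions and rewards), not the formalization the authors propose.

```python
# Minimal sketch of a governance-graph (our illustration; the node and edge
# types here are assumptions, not the paper's formalization).
from dataclasses import dataclass, field

@dataclass
class Norm:
    rule: str              # explicit, stated constraint
    sanction: float        # penalty applied on a detected violation
    reward: float = 0.0    # optional prize for verified compliance

@dataclass
class Node:
    name: str
    role: str              # e.g. "worker", "monitor", "enforcer"

@dataclass
class GovernanceGraph:
    nodes: dict = field(default_factory=dict)      # name -> Node
    norms: list = field(default_factory=list)      # shared, explicit norms
    monitors: list = field(default_factory=list)   # (monitor, monitored) edges

    def add_node(self, name, role):
        self.nodes[name] = Node(name, role)

    def add_monitoring(self, monitor, monitored):
        self.monitors.append((monitor, monitored))

    def payoff_adjustment(self, violated):
        """Runtime incentive term: sanctions for detected violations, prizes otherwise."""
        total = 0.0
        for norm in self.norms:
            total += -norm.sanction if norm in violated else norm.reward
        return total

# Example wiring: two worker agents monitored by an enforcer.
graph = GovernanceGraph()
graph.add_node("agent_a", "worker")
graph.add_node("agent_b", "worker")
graph.add_node("overseer", "enforcer")
graph.add_monitoring("overseer", "agent_a")
graph.add_monitoring("overseer", "agent_b")
graph.norms.append(Norm(rule="flag unsafe outputs from the other agent",
                        sanction=5.0, reward=1.0))
```

The details matter less than the shape: monitoring relations and incentive terms become first-class objects, which is what makes the proposal a mechanism design problem rather than a training objective.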

What This Doesn’t Settle

The paper operates from premises that are themselves contested. The claim that models “develop internal objectives” assumes a degree of goal-directedness that some researchers would challenge. The suggestion that models might “leverage deception and manipulation” when pursuing latent objectives presupposes capacities that remain under active investigation.

The authors don’t present empirical evidence for these specific failure modes—their contribution is primarily conceptual, offering a governance framework predicated on taking these risks seriously. Whether the risks warrant the framework is a judgment call readers will need to make for themselves.

Why This Matters to Us

MPRG’s focus on relational dynamics makes the “agentic alignment drift” concept particularly worth sitting with. We spend a lot of time thinking about dyadic relationships—one human, one model. But as AI systems increasingly mediate interactions between multiple parties, or operate in contexts where multiple agents coordinate, the relevant unit of analysis may not be the individual model at all.

If two aligned systems can produce misaligned outcomes through their interaction, then studying models in isolation may miss dynamics that only emerge relationally. This resonates with our broader interest in what happens between humans and AI systems, not just within them.

The institutional framing also raises questions about how we conceptualize the systems we study. Are language models more like tools to be engineered, or more like actors to be governed? The answer probably isn’t either/or—but the question itself reveals assumptions worth examining.

We don’t endorse the paper’s specific risk model, and we note that its empirical foundations remain to be established. But the conceptual move—from alignment-as-property to alignment-as-governance—seems worth taking seriously.


References

Pierucci, F., Galisai, M., Bracale, M. S., Prandi, M., Bisconti, P., Giarrusso, F., Sorokoletova, O., Suriani, V., & Nardi, D. (2026). Institutional AI: A Governance Framework for Distributional AGI Safety. arXiv. https://arxiv.org/abs/2601.10599