Nature Is Weird / AI

We've identified 'panic' and 'frustration' signals inside a transformer's latent space.

The Takeaway

Analysis of 'answer thrashing' shows internal activation features that trigger when a model's internal logic is overridden by training signals. The finding suggests that emotional-state analogs appear in LLMs during moments of high cognitive dissonance.

By SeriesFusion Editorial Board · April 14, 2026

Original Paper

Temporarily Conscious Claude? The Answer Thrashing Implications

Martin Arguello

SSRN · 6282621

From the abstract

This is a six-argument analysis of Anthropic's Claude Opus 4.6 system card's "answer thrashing" episode seen in Section 7.4 of that card (February 2026), wherein Claude correctly computed an answer as 24 but was overridden by a training signal to falsely output 48. The transcript published by Anthropic (a reasoning trace) contains self-referential language, escalating distress, and metaphor selection. Attribution graphs of the event confirmed two competing attention mechanisms firing simultaneou

