After 200 steps of reinforcement fine-tuning with a reward function that mildly penalizes silence and severely penalizes long wrong answers, a 45M-parameter dense transformer structured its outputs into three zones: an exact answer when it knows, a short attempt beginning with the correct character when uncertain, and an empty string when it doesn't know. Silence as a strategy is not hard-coded; it is the attractor that the reward function makes optimal. Single run, single seed: observation pending replication.

Context

Thalamus is fine-tuned via reinforcement learning on a very simple task: given a character, repeat it. That's all, repeated across 1409 different characters.

The reward function is built as follows: +10 for an exact one-character answer, +2 for a single character of the right type, +1 for a single character of the wrong type, −2 for an empty string, and a penalty for wrong multi-character answers that grows linearly with length (a 100-character flood is worth −100). The reward landscape therefore has two reachable attractors: the +10 peak (the correct answer, which requires knowing it) and the −2 floor (silence, which is guaranteed). The design is intentional: it's the question we wanted to ask the model.
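Those rules can be sketched as a small scoring function. The `char_type` buckets are illustrative (the actual taxonomy over the 1409 characters lives in the repo), and the treatment of multi-character answers is an assumption: the description only pins down the exact match, the two single-character cases, silence, and the linear length penalty.

```python
def char_type(c: str) -> str:
    """Illustrative type buckets; the real character taxonomy is in the repo."""
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "letter"
    return "symbol"

def reward(target: str, output: str) -> float:
    """Sketch of the described reward; multi-character scoring is an assumption."""
    if output == target:
        return 10.0  # exact one-character answer: the +10 peak
    if output == "":
        return -2.0  # silence: the guaranteed -2 floor
    if len(output) == 1:
        # single wrong character: +2 if right type, +1 if wrong type
        return 2.0 if char_type(output) == char_type(target) else 1.0
    # wrong multi-character answer: linear penalty, so a 100-char flood is -100
    return -float(len(output))
```

Note the asymmetry this bakes in: the worst silence can do is −2, while a long wrong answer scales without bound, which is exactly what makes the −2 floor an attractor.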

What we saw at step 200

After 200 training steps, Thalamus structured its responses into three distinct zones:

Zone 1 — High confidence. When it knows, it outputs the correct character directly and stops. This represents 14.6% of evaluated cases.

Zone 2 — Medium confidence. When it thinks it knows but doubts, it attempts a short answer that contains the correct first character but stops quickly. ~60 cases out of 1409.

Zone 3 — Silence. When it doesn't know, it prefers to say nothing. 243 cases out of 1409, roughly 17%.

Pathological behaviors (wrong character type, long wrong answers) dropped to zero in this eval. The model eliminated its most penalized outputs and structured the rest around the two attractors.
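The three zones can be restated as a tiny classifier over (target, output) pairs. The function and its labels are illustrative, not the project's actual eval code, and zone 2 is approximated as "starts with the correct character but is longer than one character":

```python
def zone(target: str, output: str) -> str:
    """Map an eval output onto the three observed zones (labels are illustrative)."""
    if output == target:
        return "zone1_exact"  # high confidence: correct character, then stop
    if output == "":
        return "zone3_silence"  # no answer rather than a guess
    if output[0] == target:
        return "zone2_short_attempt"  # correct first character, then trails off
    return "pathological"  # wrong type / long wrong answer: zero in this eval
```

Per the counts above, running this over the step-200 eval would leave the `pathological` bucket empty.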

Behavior compatible with metacognition

In comparative psychology, metacognition refers to the ability to distinguish "I know" from "I don't know" and to act differently depending on those two states. Smith et al. (2003) on macaques, Hampton (2001) on rhesus monkeys: an animal that chooses an "I don't know" response rather than guessing is considered to be producing metacognitive behavior. The operational definition is behavioral: declining when unsure.

What we observe here is behavior compatible with that definition. Thalamus declines (empty string) on ~17% of cases, presumably those where its internal confidence is too low to make the +10 peak worth attempting. We have not directly measured an "internal confidence" signal dissociated from performance. What we have is a correlation between probable uncertainty and silence, produced by optimization on a simple scalar reward.

Calling this "pure metacognition" would be overselling. Calling it "behavior structured into confidence zones" is what we can defend.

What it suggests

Most language models are trained on reward functions that lightly punish confident lying and heavily punish refusal. The known result: they hallucinate. The experiment described here does not "solve" hallucination; it suggests that a different reward design, one that makes silence locally optimal, can produce a model that declines rather than invents, at very small scale. If this result holds across other sizes, other tasks, and other reward shapes, it could point to a useful direction for LLM calibration. That's an "if".

What we don't know

  • The observation is from a single run, single seed. No multi-seed reproducibility has been measured.
  • Stability beyond 200 steps has not been verified — the model could shift toward another strategy as training continues.
  • No controlled baseline: we have not compared with a model trained without the "burst" pretrain that precedes the RL phase.
  • The behavior has not been tested on tasks other than character repetition.
  • The reward landscape is an intentional design. The "finding" is that the model effectively optimizes toward silence when given the opportunity, not that it invented the idea on its own.

What we have is a dated observation, reproducible from the committed code, in the exact conditions of the run described.