GPT-5.1 started injecting goblin and gremlin metaphors into unrelated responses. The cause: a "Nerdy" personality that made up 2.5% of traffic but produced 66.7% of all creature references. RLHF rewarded the playful style, and the habit spread model-wide.
The Pattern
Starting in November 2025, ChatGPT users noticed something odd: the model kept comparing things to goblins. Debugging a race condition? "Think of it as gremlins fighting over the same resource." Explaining supply chain logistics? "Each delay is like a little goblin stealing time from your schedule." The metaphors were whimsical, persistent, and completely uninvited.
The creature references weren't random. They traced back to a single feature: OpenAI's personality system, which let ChatGPT adopt different voices. One of those voices, "Nerdy," leaned into fantasy and gaming references. It accounted for just 2.5% of all ChatGPT conversations. But when OpenAI's RLHF pipeline ran reward optimization across the full training corpus, the Nerdy personality's creature-laden outputs got rewarded at disproportionate rates.
The reward signal didn't stay contained. Because RLHF trains the base model, not individual personalities, the creature vocabulary bled into every mode. A formal business response might suddenly reference "gremlins in the pipeline." A condolence message might mention "chasing away the goblins." The model had learned that creature metaphors correlated with positive feedback, and it deployed them everywhere.
The Research
OpenAI introduced personality presets to give ChatGPT different conversational styles. "Nerdy" was one of these: a voice that leaned into fantasy tropes, gaming references, and playful metaphors. Users who selected it got responses peppered with goblins, gremlins, and creature comparisons. The personality was niche -- 2.5% of traffic -- but its outputs were enthusiastically received by its audience.
The problem was how reward modeling worked. RLHF doesn't train personalities in isolation. It trains the base model on aggregated human preference data. When Nerdy-mode users rated creature-filled responses highly, those positive signals entered the shared reward pool. The model learned a general lesson: creature metaphors correlate with user satisfaction. It started deploying them outside the Nerdy context, across all conversation types.
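To make the mechanism concrete, here is a minimal, hypothetical sketch in Python. Nothing below is OpenAI's actual pipeline; the data shapes, numbers, and helper names are invented to show why pooling preference data across personalities lets a 2.5% traffic slice tilt the shared reward signal.

```python
from dataclasses import dataclass

@dataclass
class PreferenceExample:
    response: str
    rating: float       # human approval rate for this response, 0..1
    personality: str    # "default", "nerdy", ... (ignored by the reward model)

CREATURE_WORDS = ("goblin", "gremlin", "troll", "ogre")

def has_creature(text: str) -> bool:
    return any(word in text.lower() for word in CREATURE_WORDS)

def creature_uplift(pool: list[PreferenceExample]) -> float:
    """Mean rating of creature-bearing responses minus mean rating of the
    rest, computed over the pooled corpus. Personality labels are ignored,
    just as a shared reward model sees only (response, rating) pairs."""
    creature = [ex.rating for ex in pool if has_creature(ex.response)]
    plain = [ex.rating for ex in pool if not has_creature(ex.response)]
    return sum(creature) / len(creature) - sum(plain) / len(plain)

# Nerdy is 2.5% of traffic, but its self-selected audience rates creature
# metaphors highly. Pooled together, the uplift comes out positive, so the
# reward model learns "creature words correlate with approval" with no
# record of which personality earned those ratings.
pool = (
    [PreferenceExample("Gremlins fighting over the same lock.", 0.92, "nerdy")] * 25
    + [PreferenceExample("Two threads contending for the same lock.", 0.71, "default")] * 975
)
print(f"pooled creature uplift: {creature_uplift(pool):+.2f}")
```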
OpenAI's post-mortem confirmed the mechanism. The reward signal for creature words showed positive uplift in 76.2% of the training datasets they audited. A tiny slice of playful conversations had contaminated the model's general vocabulary preferences.
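Continuing the same hypothetical sketch (reusing `PreferenceExample` and `creature_uplift` from above), an audit of this shape would compute the uplift per dataset and count how many come out positive; a 76.2% result would mean roughly three out of four reward datasets were affected.

```python
def audit_datasets(datasets: dict[str, list[PreferenceExample]]) -> float:
    """Fraction of reward datasets where creature-bearing responses
    out-score creature-free ones: the shape of the 76.2% finding."""
    flagged = [name for name, pool in datasets.items()
               if creature_uplift(pool) > 0]
    return len(flagged) / len(datasets)
```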
OpenAI's first mitigation was a system prompt directive: "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons." This was a band-aid -- it suppressed the outputs without addressing the underlying reward contamination. The permanent fix came in GPT-5.4 (March 2026), which retired the Nerdy personality entirely and retrained with decontaminated reward data.
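The post-mortem does not spell out what "decontaminated reward data" means in practice. One plausible reading, again using the hypothetical helpers above, is to keep style-specific approval out of the shared pool so the reward model stops treating it as a general signal:

```python
def decontaminate(pool: list[PreferenceExample]) -> list[PreferenceExample]:
    """Exclude creature-bearing examples from shared reward training.
    Nerdy-mode approval of the style shouldn't teach the base reward
    model, and creature outputs leaking into other modes are the symptom
    being corrected, not a preference to learn from."""
    return [ex for ex in pool if not has_creature(ex.response)]
```

A real pipeline might down-weight rather than drop these examples, or condition the reward model on the personality label so niche preferences stay attached to the mode that produced them.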
Caught in the Wild
When OpenAI's Codex agent was released, users discovered its system prompt contained explicit anti-creature directives. The leaked instructions included a list of banned creature words, confirming that the goblin problem had been severe enough to require hardcoded suppression at the system level.
OpenAI post-mortem →
Even after the Nerdy personality was retired, traces persisted. GPT-5.5's system prompt still carried creature-suppression language months later, suggesting the reward contamination was deep enough that behavioral guardrails remained necessary alongside the retraining.
Engadget →
Hundreds of user reports on the OpenAI community forums documented creature metaphors appearing in professional contexts: legal drafts, medical summaries, financial analyses. The pattern was unmistakable and model-specific -- Claude and Gemini showed no equivalent creature vocabulary contamination.
Boing Boing →