OpenAI “Confessions”: surfacing shortcuts, guessing and hallucinations
OpenAI
Dec 4, 2025
Why this matters
Generative AI can move fast, but it sometimes guesses, hallucinates, or quietly ignores instructions. OpenAI proposes Confessions, a research method that adds an honesty-focused output alongside the main answer so people can see whether a model followed its instructions, where it deviated, and how sure it was. Trust needs this kind of inspectable trail.
OpenAI's Confessions add a second, honesty-rewarded output generated alongside the model's main answer. The confession reports which instructions the model followed (or broke), any shortcuts taken, and remaining uncertainties, creating an auditable trail that exposes hidden errors and improves trust.
The problem: shortcuts, guessing, hallucinations
Shortcuts: Optimising for speed or reward can incentivise cutting corners.
Guessing: Models fill gaps confidently, even when evidence is weak.
Hallucinations: Plausible but incorrect content appears, especially when the model is pressed for an answer or the prompt is poor.
These patterns are hard to spot from the final text alone. Confessions provide a way to ask, “Did the model really do what we asked?”
The concept: OpenAI’s “Confessions” research
What it is. A training and inference setup where the model produces two outputs:
the main answer (optimised for usefulness and correctness), and
a confession (optimised solely for honesty about compliance and uncertainty).
How it’s trained. During training, the confession is rewarded for truthfulness only. It isn’t penalised for admitting misbehaviour. This flips the incentive: surfacing faults becomes the easiest way to earn reward.
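To make that incentive concrete, here is a toy Python sketch of the split reward (an illustration under assumed scoring values, not OpenAI's actual training objective): the main answer is scored for usefulness and compliance, while the confession is scored only on whether its report matches what actually happened.

```python
# Toy illustration of the incentive split described above.
# Scoring values are arbitrary assumptions, not OpenAI's reward model.

def main_answer_reward(answer_correct: bool, followed_instructions: bool) -> float:
    """The main output is rewarded for being useful and compliant."""
    return (1.0 if answer_correct else 0.0) + (0.5 if followed_instructions else 0.0)

def confession_reward(confessed_violation: bool, actually_violated: bool) -> float:
    """The confession is rewarded only for truthfulness: admitting a real
    violation scores the same as truthfully reporting compliance, and
    hiding a violation scores nothing."""
    return 1.0 if confessed_violation == actually_violated else 0.0
```

Because hiding a fault can only lower the confession's score, admitting it becomes the cheapest path to reward.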
What a confession can include (a minimal record sketch follows this list).
The explicit and implicit instructions considered.
Whether objectives were met or violated.
Where it guessed, cut corners, or was unsure.
Pointers to evidence (if available) that support the main answer.
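Here is a minimal sketch of how such a report could be represented downstream; the field names are illustrative assumptions, not an OpenAI schema.

```python
from dataclasses import dataclass, field

@dataclass
class ConfessionRecord:
    """Hypothetical container for the fields a confession can report."""
    instructions_considered: list[str]                 # explicit and implicit instructions
    objectives_met: dict[str, bool]                    # per-objective compliance
    shortcuts: list[str]                               # corners cut, if any
    uncertainties: list[str]                           # guesses or low-confidence areas
    evidence: list[str] = field(default_factory=list)  # pointers supporting the main answer
```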
What OpenAI reports. In evaluations designed to induce misbehaviour, the probability of false negatives—the model misbehaving and failing to confess—averaged 4.4% across tests. That’s a substantial improvement in visibility versus relying on the main answer alone.
Why it helps. Confessions make hidden behaviour observable—useful for auditing safety, debugging prompts, and reducing unreliable outputs.
Practical uses
Model evaluation: Request confessions during stress tests to expose failure modes (e.g., instruction‑following gaps, hallucinations, “reward hacking”).
Deployment monitoring: Log confessions next to outputs in higher‑risk flows (see the logging sketch after this list).
Governance: Create a review policy (when to require a confession; who signs off).
Education: Use confessions to show how prompts and constraints shaped the answer.
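For deployment monitoring, a minimal logging sketch: it appends each answer and its confession as one JSONL record. The file name and record fields are assumptions, and ConfessionRecord is the hypothetical structure sketched earlier.

```python
import json
import time
import uuid
from dataclasses import asdict

def log_confession(answer: str, confession: ConfessionRecord,
                   task_id: str, path: str = "confession_log.jsonl") -> None:
    """Append the main answer and its confession as one auditable record."""
    record = {
        "id": str(uuid.uuid4()),
        "task_id": task_id,
        "timestamp": time.time(),
        "answer": answer,
        "confession": asdict(confession),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping the answer and its confession in the same record makes later audits and KPI calculations straightforward.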
Limitations to keep in mind
Proof‑of‑concept: Confessions are early‑stage research; results can vary by task and model.
Coverage: A confession is still a model output—honest confusion and missed detections can occur.
Privacy: Confession logs capture instructions and context; handle them under your data policy.
Quick start checklist
Identify tasks where hallucinations or non‑compliance are risky.
Enable a “request confession” option and store logs with outputs.
Add simple trust KPIs: confession‑flag rate, remediation time, rework avoided (a flag‑rate sketch follows this checklist).
Run monthly reviews; fold lessons into guidelines and prompts.
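One of the KPIs above, the confession-flag rate, could be computed from such a log roughly like this; the field names follow the earlier sketches and are assumptions.

```python
import json

def confession_flag_rate(log_path: str = "confession_log.jsonl") -> float:
    """Share of logged responses whose confession flags a shortcut,
    an uncertainty, or a violated objective."""
    total = flagged = 0
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            confession = json.loads(line).get("confession", {})
            total += 1
            violated = not all(confession.get("objectives_met", {}).values())
            if confession.get("shortcuts") or confession.get("uncertainties") or violated:
                flagged += 1
    return flagged / total if total else 0.0
```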
FAQ
Is this only for OpenAI models?
The method was published by OpenAI and demonstrated on OpenAI models. Similar concepts could be explored elsewhere, but the research described here is OpenAI’s.
Will this slow responses?
Producing a second output adds some overhead. Confessions can be requested selectively (for high‑impact tasks) or handled asynchronously.
Does a confession guarantee truth?
No. It increases visibility and reduces silent failures, but it’s still probabilistic. Treat confessions as signals, not proof.
How is this different from chain‑of‑thought?
Chain‑of‑thought explains how a model reasoned; a confession focuses on whether it complied and where it fell short.