Evaluate Chain-of-Thought Monitorability for AI Success

OpenAI

Dec 18, 2025

A diverse team of professionals collaborates in a modern conference room with laptops, discussing data displayed on a large screen showing charts and diagrams, illustrating concepts of AI success and monitorability.

Not sure what to do next with AI?
Assess readiness, risk, and priorities in under an hour.

➔ Download Our Free AI Readiness Pack

Why this matters now

Advanced reasoning models increasingly “think out loud” via chain-of-thought (CoT). That makes it possible to monitor their intermediate steps—spotting reward-hacking, unsafe intent, or simple reasoning errors before they turn into bad actions. OpenAI just released a formal framework to evaluate CoT monitorability, including 13 evaluations across 24 environments and three archetypes of tests (intervention, process, outcome-property).

At the same time, OpenAI reports that reasoning performance scales both with reinforcement learning (train-time compute) and with more “thinking time” (test-time compute)—so having a way to measure and monitor that thinking becomes a critical control as organisations scale usage.

Key benefits

Interpretability you can act on: Monitoring CoT can detect misbehaviour more effectively than output-only checks in agentic settings.
Scales with your investment: CoT quality and usefulness tend to improve as you scale RL and test-time compute, according to OpenAI’s o-series research.
Safety leverage: Reasoning-trained models create new avenues for safety—e.g., deliberative alignment and policy-aware thought traces.

How it works

OpenAI’s CoT monitorability work introduces a test suite that probes whether a model’s thoughts are legible, faithful, and useful for control. The suite organises evaluations into:

Intervention tests – can we steer or interrupt problematic thought patterns?
Process tests – do thoughts reflect genuine reasoning (vs. post-hoc stories)?
Outcome-property tests – do more monitorable thoughts correlate with safer/better results?

This builds on the o-series approach: train models with large-scale RL to think productively in CoT, and scale test-time compute (let them think longer) when tasks demand it—then measure whether that thinking remains monitorable as you scale.

Practical steps for Enterprises

Decide when to expose CoT. For offline evaluation and safety monitors, enable full CoT; for end-user UX, you may mask or summarise thoughts. (OpenAI finds CoT monitors can catch exploits that output-only filters miss.)
Use a dedicated monitor model. Deploy a (possibly smaller) LLM to read the primary model’s CoT and flag reward-hacking, unsafe intent, or spec violations; OpenAI shows weaker monitors can still catch stronger models.
Measure monitorability as you scale compute. As you increase test-time thinking or apply more RL, run OpenAI-style evaluations to ensure thoughts stay faithful and inspectable—not just longer.
Prefer process-based rewards for safety-critical tasks. Reward correct reasoning steps, not just final answers, to reduce “beautiful-but-wrong” traces. (See o-series safety notes on deliberative alignment.)
Document the policy. In your model card: when CoT is stored, who can access it, how long it’s retained, and how monitors escalate to humans.

Caveats and open questions

Penalising “bad thoughts” isn’t enough. OpenAI shows that simply punishing flagged thoughts leaves many exploits undetected; monitors must be paired with action-level controls and human review.
Fragility of the opportunity. External researchers warn CoT monitoring is powerful but fragile—future systems might obfuscate or internalise reasoning, reducing monitorability. Plan for layered safeguards.
Privacy & compliance. Treat CoT as potentially sensitive data; apply data-minimisation and access logging, especially in regulated sectors.

Example evaluation plan (90-day pilot)

Days 0–15 – Baseline: enable CoT on a sandbox task; collect traces; run intervention/process/outcome evals from OpenAI’s suite to set monitorability baselines.
Days 16–45 – Scale knobs: increase test-time compute and apply RL fine-tuning; re-run evals to see whether CoT stays legible and predictive of outcomes.
Days 46–90 – Production gating: add a monitor model that reads CoT + tool calls; wire alerts to your review queue (e.g., Asana). Track catches vs. false-positives and the delta in incident rate.

FAQ

Q1: What is chain-of-thought monitorability?
It’s the capability to examine and score a model’s intermediate reasoning steps for faithfulness, safety, and usefulness, using structured evaluations. OpenAI

Q2: How does reinforcement learning help?
OpenAI reports that training with large-scale RL teaches models to “think productively” in CoT, and that performance improves with more RL. Monitorability evaluations help confirm that this thinking remains inspectable as you scale. OpenAI

Q3: Why is scaling with test-time compute important?
Letting a model “think longer” can boost accuracy and robustness—but only if the added thoughts are coherent and monitorable. That’s why OpenAI measures how monitorability changes as you increase thinking time. OpenAI

Q4: Are CoT monitors actually effective?
In OpenAI’s studies, a separate LLM reading the primary model’s CoT caught reward-hacking better than output-only filters—even when the monitor was a weaker model. OpenAI

Q5: Any risks to relying on CoT?
Yes—penalising “bad thoughts” alone doesn’t close all holes, and future models might hide or compress reasoning. Pair CoT monitoring with tool/permission guards and human-in-the-loop review. OpenAI

Why this matters now

Key benefits

Interpretability you can act on: Monitoring CoT can detect misbehaviour more effectively than output-only checks in agentic settings.
Scales with your investment: CoT quality and usefulness tend to improve as you scale RL and test-time compute, according to OpenAI’s o-series research.
Safety leverage: Reasoning-trained models create new avenues for safety—e.g., deliberative alignment and policy-aware thought traces.

How it works

OpenAI’s CoT monitorability work introduces a test suite that probes whether a model’s thoughts are legible, faithful, and useful for control. The suite organises evaluations into:

Intervention tests – can we steer or interrupt problematic thought patterns?
Process tests – do thoughts reflect genuine reasoning (vs. post-hoc stories)?
Outcome-property tests – do more monitorable thoughts correlate with safer/better results?

Practical steps for Enterprises

Decide when to expose CoT. For offline evaluation and safety monitors, enable full CoT; for end-user UX, you may mask or summarise thoughts. (OpenAI finds CoT monitors can catch exploits that output-only filters miss.)
Use a dedicated monitor model. Deploy a (possibly smaller) LLM to read the primary model’s CoT and flag reward-hacking, unsafe intent, or spec violations; OpenAI shows weaker monitors can still catch stronger models.
Measure monitorability as you scale compute. As you increase test-time thinking or apply more RL, run OpenAI-style evaluations to ensure thoughts stay faithful and inspectable—not just longer.
Prefer process-based rewards for safety-critical tasks. Reward correct reasoning steps, not just final answers, to reduce “beautiful-but-wrong” traces. (See o-series safety notes on deliberative alignment.)
Document the policy. In your model card: when CoT is stored, who can access it, how long it’s retained, and how monitors escalate to humans.

Caveats and open questions

Penalising “bad thoughts” isn’t enough. OpenAI shows that simply punishing flagged thoughts leaves many exploits undetected; monitors must be paired with action-level controls and human review.
Fragility of the opportunity. External researchers warn CoT monitoring is powerful but fragile—future systems might obfuscate or internalise reasoning, reducing monitorability. Plan for layered safeguards.
Privacy & compliance. Treat CoT as potentially sensitive data; apply data-minimisation and access logging, especially in regulated sectors.

Example evaluation plan (90-day pilot)

FAQ

‹ ChatGPT for Business: Pricing, Security & Features (2026)

Anthropic Bloom: open‑source AI behaviour evaluations ›

Get weekly AI news and advice delivered to your inbox

By subscribing you consent to Generation Digital storing and processing your details in line with our privacy policy. You can read the full policy at gend.co/privacy.