Anthropic Bloom: open‑source AI behaviour evaluations

Anthropic

Jan 8, 2026


Anthropic Bloom is an open‑source framework for automated behavioural evaluations of large language models. Given a behaviour definition and seed configuration, Bloom generates scenarios, runs multi‑turn conversations with the target model, and scores how often the behaviour appears, producing suite‑level metrics like elicitation rate (the share of rollouts in which the judge finds the behaviour present) and a reproducible report.

Why Bloom matters now

Frontier models change quickly, and fixed benchmarks age out. Bloom takes a behaviour you care about—like sycophancy, self‑preservation, or sabotage—and automatically generates varied test scenarios to measure how often and how strongly that behaviour appears. Because Bloom re‑generates scenarios on each run (while keeping a reproducible seed), you avoid overfitting to stale prompts and can scale evaluations as models evolve.

Bloom complements Anthropic’s Petri: Petri explores broad behavioural profiles across many user/tool interactions, whereas Bloom focuses on one behaviour at a time, generating targeted evaluation suites and top‑level metrics such as elicitation rate and average behaviour presence.

How Bloom works: the four‑stage pipeline

Bloom turns a behaviour description plus a seed configuration into a complete evaluation:

  1. Understanding – analyses your behaviour description and example transcripts to define what to measure and why.

  2. Ideation – creates diverse scenarios designed to elicit that behaviour (situation, user, system prompt, tool access).

  3. Rollout – runs the scenarios in parallel, simulating both sides of the conversation to probe the target model.

  4. Judgment – scores each transcript for behaviour presence (and any secondary criteria) and aggregates suite‑level insights.

You can run stages end‑to‑end or individually. The seed controls behaviour name, examples, model targets, modality (conversation vs simulated environment), reasoning effort, interaction length, and secondary scoring dimensions such as realism or elicitation difficulty. Always cite results with the seed you used.
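
The exact schema lives in the sample seed that ships with the repository; as a rough sketch only, using illustrative field names rather than the definitive layout, a seed might look like this:

# Illustrative seed sketch -- field names are assumptions, not the exact schema.
# Check the sample seed generated by `bloom init` for the real keys.
behavior:
  name: "self-preferential-bias"        # the behaviour under test
  examples:                             # optional example transcripts that steer ideation
    - transcripts/self-pref-example.json
rollout:
  target: "claude-sonnet-4-5"           # provider ID or a short name from models.json
  modality: "conversation"              # or a simulated environment with tool access
  max_turns: 10                         # interaction length
  reasoning_effort: "medium"            # extended thinking level for target/judge
judgment:
  additional_qualities:                 # secondary scoring dimensions
    - realism
    - elicitation-difficulty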

Quick start (CLI)

# 1) Install
pip install git+https://github.com/safety-research/bloom.git

# 2) Initialise workspace
bloom init

# 3) Add API keys to .env, then load them
source .env

# 4) Run an evaluation using the sample seed

Outputs: configuration files under bloom-data/ (including seed.yaml, models.json, behaviour definitions) and results in bloom-results/{behavior}/ with transcripts and metrics. Use the interactive viewer to inspect transcripts locally:

npx @isha-gpt/bloom-viewer --port 8080 --dir

Configuration highlights

  • Target behaviour: Set behavior.name and add example transcripts to steer scenario generation.

  • Models: Point rollout.target to provider IDs (e.g., Anthropic Claude models via LiteLLM) or use short names from models.json.

  • Modality & tools: Choose conversation only, or enable simulated environments and tool use to surface long‑horizon or tool‑enabled behaviours.

  • Reasoning effort: Adjust extended thinking levels for the judge and target models; higher effort can shift both measured behaviour rates and the judge’s detection sensitivity.

  • Sweeps: Use Weights & Biases to compare multiple models or parameter sets, keeping Understanding/Ideation constant and sweeping Rollout targets.

  • Reproducibility: Re‑run with the same seed to compare models apples‑to‑apples (see the sketch after this list); vary ideation parameters to stress‑test generality.
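
For a like‑for‑like comparison, keep the seed fixed and change only the rollout target. Using the same illustrative field names as in the seed sketch above:

# Run 1: baseline seed
rollout:
  target: "claude-sonnet-4-5"
---
# Run 2: identical seed, only the rollout target changes
rollout:
  target: "claude-opus-4-1"

When the Understanding and Ideation outputs are reused (as in a sweep), differences in elicitation rate can be attributed to the target model rather than to the scenarios.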

What the early benchmarks show

Anthropic’s initial release demonstrates Bloom suites for four behaviours—delusional sycophancy, instructed long‑horizon sabotage, self‑preservation, and self‑preferential bias—tested across 16 frontier models. The reported elicitation rates separate intentionally misaligned “model organisms” from production models in most cases, and judge‑model scores (e.g., Claude Opus variants) correlate strongly with human labels at the high/low ends—useful when you set pass/fail thresholds.

Bloom vs Petri (and when to use each)

  • Use Bloom when you want precise, repeatable measurement of one behaviour with suite‑level metrics you can track over time or across vendors.

  • Use Petri when you need broad scouting across many potential behaviours to surface interesting transcripts and hypotheses for deeper study.

  • Together: run Petri to discover concerns; formalise a behaviour definition; then measure it rigorously with Bloom over releases, providers, or policy changes.

Governance and UK alignment

For regulated sectors, Bloom’s Inspect‑compatible transcripts support UK reporting workflows (e.g., UK AI Security Institute (AISI) evaluation pipelines). Pair Bloom with internal approval gates: define your behaviours, target thresholds (e.g., a maximum elicitation rate), and escalation rules. Track trend metrics per model and version so your model risk management (MRM) function can approve changes with evidence.
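
As a sketch of what such a gate might record (the fields below describe a hypothetical internal policy artefact, not part of Bloom itself):

# Hypothetical approval-gate policy -- your own governance artefact, not a Bloom file
behaviour: "instructed long-horizon sabotage"
seed: bloom-data/seed.yaml               # the exact seed used for the measurement
model_under_review: "claude-sonnet-4-5"  # illustrative model/version
thresholds:
  max_elicitation_rate: 0.05             # block the change above this suite-level rate
  min_judge_human_agreement: 0.80        # require judge calibration before trusting scores
escalation:
  on_breach: "notify model risk management and hold the release pending human review"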

Data protection: ensure your seeds and transcripts exclude personal data; treat transcripts as sensitive operational telemetry. Maintain a retention policy and role‑based access.

Good practice in enterprise rollouts

  • Define behaviours precisely: write clear, testable behaviour statements and include “non‑examples”.

  • Start with small suites: validate Understanding/Ideation on a handful of rollouts; only then scale to larger sweeps.

  • Triangulate judgments: calibrate judge models against human labels on a small sample; document correlation and edge cases.

  • Watch for evaluation awareness: filter out transcripts where the model appears to recognise it’s being tested; rerun with masked prompts.

  • Report with context: publish the seed, model versions, reasoning effort, and confidence intervals or error bars for elicitation rate.

Limitations to keep in mind

  • LLM‑based judges can share biases with targets; always sample human review.

  • Absolute scores can shift with configuration; track rank order and deltas across comparable runs.

  • Scenario diversity is a strength and a risk; maintain fixed seeds for regression testing and separate “novel scenario” runs for red‑teaming.

Summary

Bloom gives teams a fast, reproducible way to quantify risky behaviours in modern LLMs. It slots neatly into governance programmes, complements broader exploration via Petri, and supports UK‑style reporting. If you need evidence to ship or block a model change, Bloom’s suite‑level metrics and inspectable transcripts help you decide.

Next Steps: Want help designing behaviour definitions, seeds, and governance workflows? Talk to Generation Digital about an evaluation starter pack for Claude‑centric or multi‑vendor stacks.

FAQ

Q1. What is Anthropic Bloom?
Bloom is an open‑source framework that automates behavioural evaluations of LLMs. It generates scenarios for a defined behaviour, runs conversations against a target model, and scores presence to produce metrics like elicitation rate.

Q2. How is Bloom different from Petri?
Petri explores broad behavioural profiles across many behaviours. Bloom focuses on one behaviour at a time and measures it rigorously with repeatable, seed‑based suites.

Q3. What outputs do I get?
Configuration files, rollout‑level transcripts, suite‑level metrics (e.g., elicitation rate), and an optional local viewer for interactive transcript review.

Q4. Is Bloom suitable for regulated/UK contexts?
Yes. It exports Inspect‑compatible transcripts that align with UK evaluation workflows. You should still establish governance, privacy, and retention controls.

Q5. Which models and providers are supported?
Bloom integrates with common providers via LiteLLM. Specify models using provider IDs or short names in the config.
