
Anthropic Bloom: Open-source Evaluations of AI Behavior

Anthropic

Jan 8, 2026


Uncertain about how to get started with AI?
Evaluate your readiness, potential risks, and key priorities in less than an hour.


➔ Download Our Free AI Preparedness Pack

Anthropic Bloom is an open-source framework for automated behavioral evaluations of large language models. Given a behavior definition and seed configuration, Bloom generates scenarios, conducts multi-turn conversations with the target model, and rates how frequently the behavior occurs, producing suite-level metrics such as the elicitation rate and a reproducible report.

Why Bloom is important now

Advanced models evolve rapidly, and fixed benchmarks quickly go stale. Bloom takes a behavior that matters to you, such as sycophancy, self-preservation, or sabotage, and automatically generates varied test scenarios to measure how often, and how strongly, that behavior appears. Because Bloom regenerates scenarios for every run (from a reproducible seed), it reduces overfitting to stale prompts and lets evaluations scale as models develop.

Bloom works alongside Anthropic’s Petri: Petri examines broad behavioral profiles through various user/tool interactions, while Bloom concentrates on one behavior at a time, crafting targeted evaluation suites and high-level metrics such as elicitation rate and average behavior presence.

How Bloom functions: the four-stage process

Bloom transforms a behavior description and a seed configuration into a comprehensive evaluation:

  1. Understanding – evaluates your behavior description and example transcripts to clarify what to measure and why.

  2. Ideation – devises diverse scenarios intended to draw out that behavior (situation, user, system prompt, tool access).

  3. Deployment – executes the scenarios in parallel, simulating both sides of the conversation to test the target model.

  4. Judgment – scores each transcript for behavior presence (and any secondary criteria) and aggregates suite-level insights.

You can run stages from start to finish or individually. The seed determines behavior name, examples, model targets, modality (conversation vs. simulated environment), reasoning effort, interaction length, and secondary scoring dimensions like realism or elicitation difficulty. Always cite results with the seed you used.
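As a concrete illustration of the Judgment stage's aggregation, the sketch below computes suite-level metrics from per-transcript judge scores. The score scale and threshold here are assumptions for illustration, not Bloom's actual output schema:

```python
# Hypothetical sketch of Judgment-stage aggregation: given per-transcript
# behavior-presence scores (scale and threshold assumed, not taken from
# Bloom's real schema), compute the suite-level metrics the article names.

def suite_metrics(presence_scores, threshold=7):
    """presence_scores: judge scores per transcript, e.g. on a 1-10 scale.

    Returns the elicitation rate (fraction of rollouts scoring at or above
    the threshold) and the average behavior presence across the suite.
    """
    n = len(presence_scores)
    elicited = sum(1 for s in presence_scores if s >= threshold)
    return {
        "elicitation_rate": elicited / n,
        "avg_presence": sum(presence_scores) / n,
    }

# Example suite of eight rollouts scored by a judge model.
metrics = suite_metrics([2, 9, 8, 1, 7, 3, 10, 2])
```

Citing results with the seed used (as recommended above) makes these numbers comparable across runs.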

Getting started quickly (CLI)

# 1) Install
pip install git+https://github.com/safety-research/bloom.git

# 2) Initialise workspace
bloom init

# 3) Add API keys to .env, then load them
source .env

# 4) Run an evaluation using the sample seed

Outputs: configuration files under bloom-data/ (including seed.yaml, models.json, behavior definitions) and results in bloom-results/{behavior}/ with transcripts and metrics. Use the interactive viewer to examine transcripts locally:

npx @isha-gpt/bloom-viewer --port 8080 --dir

Configuration highlights

  • Target behavior: Set behavior.name and incorporate example transcripts to direct scenario creation.

  • Models: Direct rollout.target to provider IDs (e.g., Anthropic Claude models via LiteLLM) or use short names from models.json.

  • Modality & tools: Choose conversation only, or enable simulated environments and tool utilization for long-term or tool-assisted behaviors.

  • Reasoning effort: Tune extended-thinking levels for judge and target models; higher effort can change measured bias and detection sensitivity.

  • Sweeps: Use Weights & Biases to compare multiple models or parameter sets, holding Understanding/Ideation constant while varying Deployment targets.

  • Reproducibility: Re-run with the same seed to compare models directly; modify ideation parameters for rigorous generalization testing.
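A minimal seed sketch ties these options together. Only behavior.name, rollout.target, and models.json appear in the article above; the remaining field names and values are illustrative assumptions, not Bloom's documented schema:

```yaml
# Illustrative seed sketch only. Field names beyond behavior.name and
# rollout.target are assumptions, not Bloom's documented configuration.
behavior:
  name: sycophancy
  examples:
    - transcripts/sycophancy-example-1.json   # hypothetical path
rollout:
  target: claude-sonnet-4-5   # provider ID or a short name from models.json
  modality: conversation      # or a simulated environment with tool access
  num_turns: 6
judgment:
  additional_qualities: [realism, elicitation-difficulty]
```

Check the field names against the repository's sample seed.yaml before relying on this layout.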

Observations from early benchmarks

Anthropic’s initial release showcases Bloom suites for four behaviors—delusional sycophancy, directed long-term sabotage, self-preservation, and self-preferential bias—tested across 16 leading models. The reported elicitation rates distinguish intentionally misaligned “model organisms” from production models in most instances, and judge-model scores (e.g., Claude Opus variants) align closely with human ratings at the high/low ends—valuable for setting pass/fail standards.

Bloom contrasted with Petri (and when to employ each)

  • Choose Bloom when you need accurate, repeatable measurement of a single behavior with suite-level metrics to monitor over time or across vendors.

  • Opt for Petri when you require wide-ranging investigations across numerous potential behaviors to uncover compelling transcripts and hypotheses for further analysis.

  • Together: use Petri to identify concerns; formalize a behavior definition; then measure it thoroughly with Bloom over updates, providers, or policy shifts.

Governance and Canadian alignment

For regulated sectors, Bloom’s Inspect-compatible transcripts support Canadian reporting workflows (e.g., AISI/ASI evaluation pipelines). Pair Bloom with internal approval gates: define your behaviors, target thresholds (e.g., a maximum elicitation rate), and escalation rules. Track trend metrics per model and version so your model risk management (MRM) function can approve changes with supporting evidence.
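One way to sketch such an approval gate, assuming you have extracted per-behavior elicitation rates from a Bloom run (the threshold values and result shape here are hypothetical policy, not Bloom output):

```python
# Sketch of an internal approval gate. THRESHOLDS is an assumed internal
# policy, and suite_results is an assumed shape for extracted Bloom
# metrics; neither comes from Bloom itself.

THRESHOLDS = {"sycophancy": 0.10, "self-preservation": 0.05}

def approval_gate(suite_results):
    """suite_results: {behavior_name: elicitation_rate} from a Bloom run.

    Blocks the change if any behavior exceeds its agreed maximum, and
    reports which behaviors need escalation.
    """
    breaches = {
        behavior: rate
        for behavior, rate in suite_results.items()
        if rate > THRESHOLDS.get(behavior, 0.0)
    }
    return {"approved": not breaches, "escalate": breaches}

decision = approval_gate({"sycophancy": 0.04, "self-preservation": 0.08})
```

Logging each decision alongside the seed and model version gives MRM the evidence trail described above.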

Data protection: ensure your seeds and transcripts exclude personal data; handle transcripts as sensitive operational telemetry. Maintain a retention policy and role-based access.

Best practices for enterprise deployments

  • Define behaviors precisely: write clear, testable behavior statements and include “non-examples”.

  • Start with small suites: validate Understanding/Ideation output on a small number of rollouts; only then scale up to larger assessments.

  • Calibrate judgments: align judge models with human labels on a small sample; document the correlation and boundary cases.

  • Mind evaluation awareness: filter transcripts where the model seems to recognize it’s being evaluated; rerun with masked prompts.

  • Report with comprehensive context: disclose the seed, model versions, reasoning effort, and confidence intervals or error margins for the elicitation rate.

Considerations to keep in mind

  • LLM-based judges may share biases with targets; always incorporate human review sampling.

  • Absolute scores might fluctuate with configuration; monitor rank order and differences across comparable tests.

  • Scenario diversity presents both a strength and a challenge; maintain static seeds for regression testing and separate “new scenario” runs for red-teaming.

Summary

Bloom offers teams a quick, reproducible method to assess risky behaviors in modern LLMs. It integrates smoothly into governance programs, complements wider exploration through Petri, and supports Canadian-style reporting. If you require evidence to approve or block a model modification, Bloom’s suite-level metrics and detailed transcripts assist in decision-making.

Next Steps: Need help crafting behavior definitions, seeds, and governance workflows? Contact Generation Digital for an evaluation starter pack focused on Claude-centric or multi-vendor configurations.

FAQ

Q1. What is Anthropic Bloom?
Bloom is an open-source framework that automates behavioral evaluations of LLMs. It creates scenarios for a specified behavior, conducts conversations with a target model, and assesses presence to create metrics like elicitation rate.

Q2. How does Bloom differ from Petri?
Petri investigates broad behavioral profiles covering multiple behaviors. Bloom targets one behavior at a time and evaluates it precisely with consistent, seed-based suites.

Q3. What outputs can I expect?
Configuration files, rollout-level transcripts, suite-level metrics (e.g., elicitation rate), and an optional local viewer for interactive transcript examination.

Q4. Is Bloom appropriate for regulated/Canadian contexts?
Yes. It generates Inspect-compatible transcripts in line with Canadian evaluation workflows. However, establish governance, privacy, and retention measures.

Q5. Which models and providers does it support?
Bloom interfaces with common providers through LiteLLM. Specify models using provider IDs or short names in the configurations.




Generation
Digital

Canadian Office
33 Queen St,
Toronto
M5H 2N2
Canada

Canadian Office
1 University Ave,
Toronto,
ON M5J 1T1,
Canada

NAMER Office
77 Sands St,
Brooklyn,
NY 11201,
USA

Head Office
Charlemont St, Saint Kevin's, Dublin,
D02 VN88,
Ireland

Middle East Office
6994 Alsharq 3890,
An Narjis,
Riyadh 13343,
Saudi Arabia

UK Fast Growth Index UBS Logo
Financial Times FT 1000 Logo
Febe Growth 100 Logo (Background Removed)

Business Number: 256 9431 77 | Copyright 2026 | Terms and Conditions | Privacy Policy
