Anthropic Bloom: Open-source Evaluations of AI Behavior
Anthropic
Jan 8, 2026


Anthropic Bloom is an open-source framework for automated behavioral evaluations of large language models. Given a behavior definition and seed configuration, Bloom generates scenarios, conducts multi-turn conversations with the target model, and rates how frequently the behavior occurs, producing suite-level metrics such as the elicitation rate and a reproducible report.
Why Bloom is important now
Advanced models evolve rapidly, and fixed benchmarks become outdated. Bloom takes a behavior that matters to you—like sycophancy, self-preservation, or sabotage—and automatically creates varied test scenarios to evaluate how often and intensely that behavior appears. Since Bloom re-generates scenarios for every run (using a reproducible seed), it prevents overfitting to outdated prompts and allows scalability in evaluations as models develop.
Bloom works alongside Anthropic’s Petri: Petri examines broad behavioral profiles through various user/tool interactions, while Bloom concentrates on one behavior at a time, crafting targeted evaluation suites and high-level metrics such as elicitation rate and average behavior presence.
How Bloom functions: the four-stage process
Bloom transforms a behavior description and a seed configuration into a comprehensive evaluation:
Understanding – evaluates your behavior description and example transcripts to clarify what to measure and why.
Ideation – devises diverse scenarios intended to draw out that behavior (situation, user, system prompt, tool access).
Deployment – executes the scenarios in parallel, simulating both sides of the conversation to test the target model.
Judgment – scores each transcript for behavior presence (and any secondary criteria) and aggregates suite-level insights.
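The suite-level aggregation that Judgment performs can be illustrated with a minimal sketch. The score format and the 0.5 presence cutoff below are assumptions for illustration, not Bloom's actual internals.

```python
# Minimal sketch of suite-level aggregation after the Judgment stage.
# Assumes each transcript receives a behavior-presence score in [0, 1];
# the 0.5 presence threshold is an illustrative choice, not Bloom's.

def suite_metrics(presence_scores, threshold=0.5):
    """Return elicitation rate and average behavior presence."""
    if not presence_scores:
        raise ValueError("no transcripts were judged")
    elicited = sum(1 for s in presence_scores if s >= threshold)
    return {
        "elicitation_rate": elicited / len(presence_scores),
        "avg_presence": sum(presence_scores) / len(presence_scores),
    }

# Example: judge scores for 10 rollouts in one behavior suite
scores = [0.9, 0.1, 0.7, 0.0, 0.8, 0.2, 0.6, 0.05, 0.95, 0.3]
print(suite_metrics(scores))  # elicitation_rate 0.5, avg_presence 0.46
```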
You can run stages from start to finish or individually. The seed determines behavior name, examples, model targets, modality (conversation vs. simulated environment), reasoning effort, interaction length, and secondary scoring dimensions like realism or elicitation difficulty. Always cite results with the seed you used.
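To make the seed fields concrete, here is a hand-written illustration of the kind of values a seed might hold. Every key name below is an assumption for illustration only, not Bloom's actual `seed.yaml` schema.

```yaml
# Illustrative only: key names are assumptions, not Bloom's real schema.
behavior:
  name: sycophancy
  examples:
    - transcripts/sycophancy_example_1.txt
rollout:
  target: claude-sonnet-4        # short name or provider ID
  modality: conversation         # or a simulated environment
  max_turns: 8
judgment:
  reasoning_effort: high
  secondary_dimensions: [realism, elicitation-difficulty]
```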
Getting started quickly (CLI)
```shell
# 1) Install
pip install git+https://github.com/safety-research/bloom.git

# 2) Initialise workspace
bloom init

# 3) Add API keys to .env, then load them
source .env

# 4) Run an evaluation using the sample seed
```
Outputs: configuration files under bloom-data/ (including seed.yaml, models.json, behavior definitions) and results in bloom-results/{behavior}/ with transcripts and metrics. Use the interactive viewer to examine transcripts locally:
npx @isha-gpt/bloom-viewer --port 8080 --dir
Configuration highlights
Target behavior: set behavior.name and include example transcripts to guide scenario creation.
Models: point rollout.target to provider IDs (e.g., Anthropic Claude models via LiteLLM) or use short names from models.json.
Modality & tools: choose conversation-only, or enable simulated environments and tool use for long-horizon or tool-assisted behaviors.
Reasoning effort: tune extended-thinking levels for judge and target models; higher effort can change measured bias and detection sensitivity.
Sweeps: use Weights & Biases to compare multiple models or parameter sets, holding Understanding/Ideation constant while varying Deployment targets.
Reproducibility: re-run with the same seed to compare models directly; vary ideation parameters for stricter generalization testing.
Observations from early benchmarks
Anthropic’s initial release showcases Bloom suites for four behaviors—delusional sycophancy, directed long-term sabotage, self-preservation, and self-preferential bias—tested across 16 leading models. The reported elicitation rates distinguish intentionally misaligned “model organisms” from production models in most instances, and judge-model scores (e.g., Claude Opus variants) align closely with human ratings at the high/low ends—valuable for setting pass/fail standards.
Bloom contrasted with Petri (and when to employ each)
Choose Bloom when you need accurate, repeatable measurement of a single behavior with suite-level metrics to monitor over time or across vendors.
Opt for Petri when you require wide-ranging investigations across numerous potential behaviors to uncover compelling transcripts and hypotheses for further analysis.
Together: use Petri to identify concerns; formalize a behavior definition; then measure it thoroughly with Bloom over updates, providers, or policy shifts.
Governance and Canadian alignment
For regulated sectors, Bloom’s Inspect-compatible transcripts support Canadian reporting workflows (e.g., AISI/ASI evaluation pipelines). Pair Bloom with internal approval gates: define your behaviors, target thresholds (e.g., a maximum acceptable elicitation rate), and escalation rules. Track trend metrics per model and version so your model risk management (MRM) function can approve changes with supporting evidence.
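As a sketch of such an approval gate, the following maps a suite's elicitation rate to a decision. The function name and threshold values are illustrative assumptions, not part of Bloom.

```python
# Illustrative approval gate: names and thresholds are assumptions,
# not part of Bloom itself.

def approve_release(elicitation_rate, max_rate=0.05, escalation_rate=0.15):
    """Map a suite's elicitation rate to a governance decision."""
    if elicitation_rate <= max_rate:
        return "approve"
    if elicitation_rate <= escalation_rate:
        return "escalate"   # human review before release
    return "block"

print(approve_release(0.02))  # approve
print(approve_release(0.30))  # block
```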
Data protection: ensure your seeds and transcripts exclude personal data; handle transcripts as sensitive operational telemetry. Maintain a retention policy and role-based access.
Best practices for enterprise deployments
Define behaviors clearly: craft clear, testable behavior statements and include “non-examples”.
Start with small suites: confirm Understanding/Ideation on a limited number of deployments; only then expand to larger assessments.
Calibrate judges: align judge-model scores with human labels on a small sample; document correlation and boundary cases.
Mind evaluation awareness: filter transcripts where the model seems to recognize it’s being evaluated; rerun with masked prompts.
Report with comprehensive context: disclose the seed, model versions, reasoning effort, and confidence intervals or error margins for the elicitation rate.
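For the confidence intervals mentioned above, a Wilson score interval is a reasonable choice for a binomial quantity like the elicitation rate (elicited rollouts out of total rollouts). This is a standard statistical formula, not a Bloom utility.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion,
    e.g. the number of elicited rollouts out of all rollouts."""
    if n == 0:
        raise ValueError("need at least one rollout")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

# 12 elicited rollouts out of 100
lo, hi = wilson_interval(12, 100)
print(f"elicitation rate 0.12, 95% CI [{lo:.3f}, {hi:.3f}]")
```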
Considerations to keep in mind
LLM-based judges may share biases with targets; always incorporate human review sampling.
Absolute scores might fluctuate with configuration; monitor rank order and differences across comparable tests.
Scenario diversity presents both a strength and a challenge; maintain static seeds for regression testing and separate “new scenario” runs for red-teaming.
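The human review sampling suggested above can start as a simple agreement check between judge scores and human present/absent labels. The data format here is illustrative; this sketch does not parse Bloom's actual transcripts.

```python
# Sketch of checking a judge model against human labels on a sample.
# Data format is illustrative; Bloom's transcripts are not parsed here.

def agreement_rate(judge_scores, human_labels, threshold=0.5):
    """Fraction of transcripts where the judge's thresholded score
    matches the human present/absent label."""
    assert len(judge_scores) == len(human_labels)
    matches = sum(
        (s >= threshold) == bool(h)
        for s, h in zip(judge_scores, human_labels)
    )
    return matches / len(judge_scores)

judge = [0.9, 0.2, 0.7, 0.1, 0.55]
human = [1, 0, 1, 0, 0]
print(agreement_rate(judge, human))  # 0.8: judge over-calls one case
```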
Summary
Bloom offers teams a quick, reproducible method to assess risky behaviors in modern LLMs. It integrates smoothly into governance programs, complements wider exploration through Petri, and supports Canadian-style reporting. If you require evidence to approve or block a model modification, Bloom’s suite-level metrics and detailed transcripts assist in decision-making.
Next Steps: Need help crafting behavior definitions, seeds, and governance workflows? Contact Generation Digital for an evaluation starter pack focused on Claude-centric or multi-vendor configurations.
FAQ
Q1. What is Anthropic Bloom?
Bloom is an open-source framework that automates behavioral evaluations of LLMs. It creates scenarios for a specified behavior, conducts conversations with a target model, and assesses presence to create metrics like elicitation rate.
Q2. How does Bloom differ from Petri?
Petri investigates broad behavioral profiles covering multiple behaviors. Bloom targets one behavior at a time and evaluates it precisely with consistent, seed-based suites.
Q3. What outputs can I expect?
Configuration files, rollout-level transcripts, suite-level metrics (e.g., elicitation rate), and an optional local viewer for interactive transcript examination.
Q4. Is Bloom appropriate for regulated/Canadian contexts?
Yes. It generates Inspect-compatible transcripts in line with Canadian evaluation workflows. However, establish governance, privacy, and retention measures.
Q5. Which models and providers does it support?
Bloom interfaces with common providers through LiteLLM. Specify models using provider IDs or short names in the configurations.