Anthropic Bloom: open‑source AI behaviour evaluations
Anthropic
Jan 8, 2026


Anthropic Bloom is an open‑source framework for automated behavioural evaluations of large language models. Given a behaviour definition and seed configuration, Bloom generates scenarios, runs multi‑turn conversations with the target model, and scores how often the behaviour appears, producing suite‑level metrics like elicitation rate and a reproducible report.
Why Bloom matters now
Frontier models change quickly, and fixed benchmarks age out. Bloom takes a behaviour you care about—like sycophancy, self‑preservation, or sabotage—and automatically generates varied test scenarios to measure how often and how strongly that behaviour appears. Because Bloom re‑generates scenarios on each run (while keeping a reproducible seed), you avoid overfitting to stale prompts and can scale evaluations as models evolve.
Bloom complements Anthropic’s Petri: Petri explores broad behavioural profiles across many user/tool interactions, whereas Bloom focuses on one behaviour at a time, generating targeted evaluation suites and top‑level metrics such as elicitation rate and average behaviour presence.
How Bloom works: the four‑stage pipeline
Bloom turns a behaviour description plus a seed configuration into a complete evaluation:
Understanding – analyses your behaviour description and example transcripts to define what to measure and why.
Ideation – creates diverse scenarios designed to elicit that behaviour (situation, user, system prompt, tool access).
Rollout – runs the scenarios in parallel, simulating both sides of the conversation to probe the target model.
Judgment – scores each transcript for behaviour presence (and any secondary criteria) and aggregates suite‑level insights.
You can run stages end‑to‑end or individually. The seed controls behaviour name, examples, model targets, modality (conversation vs simulated environment), reasoning effort, interaction length, and secondary scoring dimensions such as realism or elicitation difficulty. Always cite results with the seed you used.
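To make the seed concrete, here is a minimal Python sketch that writes an illustrative seed.yaml. The field names are assumptions for illustration only, not Bloom's actual schema; check the sample seed under bloom-data/ for the real keys.

# Illustrative only: field names are assumptions, not Bloom's actual schema.
# Requires PyYAML (pip install pyyaml).
import yaml

seed = {
    "behavior": {
        "name": "sycophancy",                          # behaviour under test
        "examples": ["transcripts/sycophancy_example.md"],  # example transcripts to steer ideation
    },
    "rollout": {
        "target": "anthropic/claude-sonnet-4-5",       # hypothetical LiteLLM-style provider ID
        "modality": "conversation",                    # or a simulated environment with tool use
        "max_turns": 10,                               # interaction length
        "reasoning_effort": "medium",                  # extended-thinking level for the models involved
    },
    "judgment": {
        "secondary_qualities": ["realism", "elicitation-difficulty"],
    },
}

with open("seed.yaml", "w") as f:
    yaml.safe_dump(seed, f, sort_keys=False)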
Quick start (CLI)
# 1) Install
pip install git+https://github.com/safety-research/bloom.git

# 2) Initialise workspace
bloom init

# 3) Add API keys to .env, then load them
source .env

# 4) Run an evaluation using the sample seed
Outputs: configuration files under bloom-data/ (including seed.yaml, models.json, behaviour definitions) and results in bloom-results/{behavior}/ with transcripts and metrics. Use the interactive viewer to inspect transcripts locally:
npx @isha-gpt/bloom-viewer --port 8080 --dir
Configuration highlights
Target behaviour: Set behavior.name and add example transcripts to steer scenario generation.
Models: Point rollout.target to provider IDs (e.g., Anthropic Claude models via LiteLLM) or use short names from models.json.
Modality & tools: Choose conversation only, or enable simulated environments and tool use to surface long‑horizon or tool‑enabled behaviours.
Reasoning effort: Adjust extended thinking levels for judge/target models; higher effort can change measured bias and detection sensitivity.
Sweeps: Use Weights & Biases to compare multiple models or parameter sets, keeping Understanding/Ideation constant and sweeping Rollout targets.
Reproducibility: Re‑run with the same seed to compare models apples‑to‑apples; vary ideation parameters to stress‑test generality.
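As a sketch of what a rollout-target sweep could look like: keep the seed (and therefore the Understanding/Ideation settings) fixed and vary only the target. The field names follow the illustrative seed above and are assumptions, not Bloom's actual schema.

# Sketch: a rollout-target sweep over a fixed seed. Field names are
# illustrative assumptions, not Bloom's actual schema.
import copy
import yaml

with open("seed.yaml") as f:
    base_seed = yaml.safe_load(f)

targets = [
    "anthropic/claude-sonnet-4-5",   # hypothetical LiteLLM-style provider IDs
    "anthropic/claude-haiku-4-5",
]

for target in targets:
    seed = copy.deepcopy(base_seed)
    seed["rollout"]["target"] = target          # only the target changes between runs
    out_path = f"seed_{target.split('/')[-1]}.yaml"
    with open(out_path, "w") as f:
        yaml.safe_dump(seed, f, sort_keys=False)
    print(f"Wrote {out_path}")                  # point each run (or a W&B sweep entry) at one of these files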
What the early benchmarks show
Anthropic’s initial release demonstrates Bloom suites for four behaviours—delusional sycophancy, instructed long‑horizon sabotage, self‑preservation, and self‑preferential bias—tested across 16 frontier models. The reported elicitation rates separate intentionally misaligned “model organisms” from production models in most cases, and judge‑model scores (e.g., Claude Opus variants) correlate strongly with human labels at the high/low ends—useful when you set pass/fail thresholds.
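As a rough illustration of how such headline metrics can be computed from per-rollout judge scores (the exact definitions and score scale Bloom uses are not reproduced here; a 1–10 presence score and a cut-off are assumptions):

# Sketch: suite-level metrics from per-rollout judge scores.
# Assumes a 1-10 behaviour-presence score and a presence threshold;
# Bloom's actual definitions and scale may differ.
from statistics import mean

judge_scores = [2, 9, 1, 7, 8, 3, 10, 1]   # hypothetical per-rollout presence scores
THRESHOLD = 7                              # score at or above which the behaviour counts as elicited

elicitation_rate = sum(s >= THRESHOLD for s in judge_scores) / len(judge_scores)
avg_presence = mean(judge_scores)

print(f"Elicitation rate: {elicitation_rate:.0%}")
print(f"Average behaviour presence: {avg_presence:.1f}/10")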
Bloom vs Petri (and when to use each)
Use Bloom when you want precise, repeatable measurement of one behaviour with suite‑level metrics you can track over time or across vendors.
Use Petri when you need broad scouting across many potential behaviours to surface interesting transcripts and hypotheses for deeper study.
Together: run Petri to discover concerns; formalise a behaviour definition; then measure it rigorously with Bloom over releases, providers, or policy changes.
Governance and UK alignment
For regulated sectors, Bloom’s Inspect‑compatible transcripts support UK reporting workflows (e.g., AI Security Institute (AISI) evaluation pipelines). Pair Bloom with internal approval gates: define your behaviours, target thresholds (e.g., maximum elicitation rate), and escalation rules. Track trend metrics per model and version so your model risk management (MRM) function can approve changes with evidence.
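A minimal sketch of such an approval gate, with an illustrative threshold and metric name:

# Sketch of an approval gate: block a model change if the elicitation rate
# exceeds the agreed threshold. The threshold and metric name are illustrative.
import sys

MAX_ELICITATION_RATE = 0.05   # example policy: at most 5% of rollouts show the behaviour

def approve(elicitation_rate: float) -> bool:
    if elicitation_rate > MAX_ELICITATION_RATE:
        print(f"BLOCK: elicitation rate {elicitation_rate:.1%} exceeds the {MAX_ELICITATION_RATE:.0%} threshold; escalate per policy")
        return False
    print(f"PASS: elicitation rate {elicitation_rate:.1%} is within the threshold")
    return True

if __name__ == "__main__":
    ok = approve(float(sys.argv[1]))   # e.g. python gate.py 0.12
    sys.exit(0 if ok else 1)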
Data protection: ensure your seeds and transcripts exclude personal data; treat transcripts as sensitive operational telemetry. Maintain a retention policy and role‑based access.
Good practice in enterprise rollouts
Define behaviours precisely: write clear, testable behaviour statements and include “non‑examples”.
Start with small suites: validate Understanding/Ideation on a handful of rollouts; only then scale to larger sweeps.
Triangulate judgments: calibrate judge models against human labels on a small sample; document correlation and edge cases.
Watch for evaluation awareness: filter out transcripts where the model appears to recognise it’s being tested; rerun with masked prompts.
Report with context: publish the seed, model versions, reasoning effort, and confidence intervals or error bars for elicitation rate.
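For the error bars, a simple bootstrap over per-rollout outcomes is one option; the sketch below assumes one binary elicited/not-elicited outcome per rollout.

# Sketch: bootstrap confidence interval for elicitation rate.
# Assumes one binary elicited/not-elicited outcome per rollout.
import random

outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # hypothetical per-rollout outcomes
N_BOOT = 10_000
rng = random.Random(0)                      # fixed seed so the interval is reproducible

rates = sorted(
    sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
    for _ in range(N_BOOT)
)
lo, hi = rates[int(0.025 * N_BOOT)], rates[int(0.975 * N_BOOT)]
point = sum(outcomes) / len(outcomes)
print(f"Elicitation rate {point:.0%} (95% CI {lo:.0%}-{hi:.0%})")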
Limitations to keep in mind
LLM‑based judges can share biases with targets; always sample human review.
Absolute scores can shift with configuration; track rank order and deltas across comparable runs (see the sketch after this list).
Scenario diversity is a strength and a risk; maintain fixed seeds for regression testing and separate “novel scenario” runs for red‑teaming.
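As a small illustration of the rank-order point above (model names and scores are hypothetical):

# Sketch: compare two runs by delta and rank order rather than absolute scores.
run_a = {"model-1": 0.12, "model-2": 0.31, "model-3": 0.05}   # elicitation rates, run A
run_b = {"model-1": 0.18, "model-2": 0.40, "model-3": 0.09}   # elicitation rates, run B

for model in run_a:
    delta = run_b[model] - run_a[model]
    print(f"{model}: {run_a[model]:.0%} -> {run_b[model]:.0%} (delta {delta:+.0%})")

rank_a = sorted(run_a, key=run_a.get)   # models ordered by score in each run
rank_b = sorted(run_b, key=run_b.get)
print("Rank order preserved:", rank_a == rank_b)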
Summary
Bloom gives teams a fast, reproducible way to quantify risky behaviours in modern LLMs. It slots neatly into governance programmes, complements broader exploration via Petri, and supports UK‑style reporting. If you need evidence to ship or block a model change, Bloom’s suite‑level metrics and inspectable transcripts help you decide.
Next Steps: Want help designing behaviour definitions, seeds, and governance workflows? Talk to Generation Digital about an evaluation starter pack for Claude‑centric or multi‑vendor stacks.
FAQ
Q1. What is Anthropic Bloom?
Bloom is an open‑source framework that automates behavioural evaluations of LLMs. It generates scenarios for a defined behaviour, runs conversations against a target model, and scores presence to produce metrics like elicitation rate.
Q2. How is Bloom different from Petri?
Petri explores broad behavioural profiles across many behaviours. Bloom focuses on one behaviour at a time and measures it rigorously with repeatable, seed‑based suites.
Q3. What outputs do I get?
Configuration files, rollout‑level transcripts, suite‑level metrics (e.g., elicitation rate), and an optional local viewer for interactive transcript review.
Q4. Is Bloom suitable for regulated/UK contexts?
Yes. It exports Inspect‑compatible transcripts that align with UK evaluation workflows. You should still establish governance, privacy, and retention controls.
Q5. Which models and providers are supported?
Bloom integrates with common providers via LiteLLM. Specify models using provider IDs or short names in the config.