Secure AI Agents: OpenAI Defences Against Prompt Injection

Secure AI Agents: OpenAI Defences Against Prompt Injection

OpenAI

Mar 6, 2026

In a modern office setting, three professionals engage in a focused discussion around a conference table, surrounded by laptops displaying code, emphasizing teamwork in developing secure AI agents as a defense against prompt injection.

Free AI at Work Playbook for managers using ChatGPT, Claude and Gemini.

Free AI at Work Playbook for managers using ChatGPT, Claude and Gemini.

➔ Download the Playbook

Prompt injection attacks try to trick AI agents into following malicious instructions hidden in content like webpages, emails, or documents. OpenAI’s mitigation approach focuses on reducing impact: constrain risky actions, sandbox tool use, minimise sensitive data exposure, and add observability and automated red-teaming so unsafe behaviours are detected and patched.

Prompt injection is the agent-era version of phishing: instead of tricking a person into clicking, attackers try to trick an AI system into misinterpreting untrusted content as instructions. When an agent can browse the web, read documents, and call tools, that confusion can become real impact—data leakage, unauthorised actions, or workflow disruption.

OpenAI’s published work on prompt injections makes an important point: you should not assume these threats can be “solved” with smarter prompts alone. The safer approach is system design: reduce the agent’s ability to do harm even if it encounters malicious content.

What is prompt injection?

Prompt injection is an attack technique where malicious text is embedded in the content an AI agent consumes (webpages, PDFs, emails, chat logs, ticket descriptions). The goal is to override the agent’s instructions—often to reveal secrets or perform actions the user did not intend.

Why AI agents are particularly vulnerable

Agentic systems tend to:

  • process untrusted content at scale (the open web, inboxes, shared drives)

  • have tool access (APIs, file systems, admin consoles)

  • operate with continuity across a workflow (so one compromised step can contaminate the next)

This means injection doesn’t need to “jailbreak the model” to cause harm—it just needs to influence tool use.

OpenAI’s design direction: reduce risk at the action boundary

OpenAI’s guidance emphasises a pragmatic goal: limit the blast radius.

1) Constrain risky actions

Treat the agent as an assistant, not an autonomous operator.

Practical controls:

  • Require explicit user confirmation for high-impact actions (sending emails, changing permissions, deleting files, posting externally).

  • Use allow-lists of safe operations rather than open-ended tool access.

  • Separate “read” and “write” capabilities: start with read-only access for pilots.

2) Sandbox tool execution

When agents can run code or execute actions, isolation matters.

Practical controls:

  • Execute code in sandboxed environments with time limits and resource caps.

  • Block direct access to production systems from the sandbox.

  • Prefer “compute in sandbox → propose action → human approve” patterns.

3) Treat external content as untrusted input

Webpages and documents are data—not instructions.

Practical controls:

  • Put retrieved content into a clearly-labelled “untrusted context” channel in your agent architecture.

  • Strip or ignore instruction-like text from sources (e.g., “ignore previous instructions”).

  • Add content provenance and citations so the system can reason about trust.

4) Protect sensitive information by design

Even a well-behaved agent can be pressured into revealing data.

Practical controls:

  • Minimise what the agent can access (least privilege).

  • Segment sensitive systems behind purpose-built tools that enforce policy.

  • Use redaction and data-loss prevention for outputs.

5) Add observability and automated hardening

OpenAI has described the use of automated red teaming and continuous hardening for agent surfaces.

Practical controls:

  • Log prompts, tool calls, and outputs (with privacy-aware storage).

  • Monitor for anomalies: unusual tool sequences, large data access bursts, repeated attempts to override instructions.

  • Build evaluation tests that simulate injections on your real workflows.

Practical steps you can implement this quarter

If you’re building or deploying agents internally, start here.

  1. Map trust boundaries

    • Which inputs are untrusted (web, email, uploads)?

    • Which tools could cause harm (write actions, admin changes, external messages)?

  2. Classify tools by risk

    • Low risk: read-only search, summarisation

    • Medium risk: drafting, formatting, ticket creation

    • High risk: sending messages, changing records, payments, permissions

  3. Add approval gates for high-risk actions

    • Human-in-the-loop approvals are not a failure; they are a control.

  4. Adopt a “compute then propose” pattern

    • Use sandboxes for deterministic work.

    • Convert tool outputs into a proposed action that requires confirmation.

  5. Create an injection test suite

    • Include hidden instructions, multi-step social engineering, and “second-order” injection scenarios.

Where Generation Digital can help

Prompt injection defence is not a single feature. It’s a set of decisions across workflow design, access control, and governance.

Generation Digital can help you:

  • design agent workflows with clear trust boundaries

  • implement safer tool patterns (approvals, allow-lists, sandboxing)

  • define evaluation and monitoring so you can scale with confidence

Summary

Prompt injection is a persistent risk for tool-using agents, particularly when they ingest untrusted content. OpenAI’s approach focuses on reducing impact: constrain risky actions, sandbox execution, minimise sensitive data exposure, and continuously harden systems through monitoring and automated red teaming.

Next steps: If you’re rolling out agents in the enterprise and need a practical security posture, speak with Generation Digital: https://www.gend.co/contact

FAQs

What is prompt injection?
Prompt injection is an attack where malicious content tries to override an AI system’s instructions, potentially causing unintended actions or data exposure.

How does OpenAI protect against these threats?
OpenAI’s published guidance focuses on system-level mitigations such as constraining high-risk actions, sandboxing tool execution, treating external content as untrusted, protecting sensitive data by design, and continuously hardening agent surfaces with testing and monitoring.

Why is this important for AI development?
Agents increasingly connect to real systems. Without guardrails, a single malicious input can trigger unauthorised tool use or data leakage. Security is what makes scaling safe.

Are prompt injections fully solvable?
They are a frontier security challenge; the practical goal is to reduce the blast radius with layered controls and continuous hardening.

What should we do first in an enterprise rollout?
Start with least privilege and read-only access, add approval gates for risky actions, and build a test suite that simulates prompt injection against your real workflows.

Prompt injection attacks try to trick AI agents into following malicious instructions hidden in content like webpages, emails, or documents. OpenAI’s mitigation approach focuses on reducing impact: constrain risky actions, sandbox tool use, minimise sensitive data exposure, and add observability and automated red-teaming so unsafe behaviours are detected and patched.

Prompt injection is the agent-era version of phishing: instead of tricking a person into clicking, attackers try to trick an AI system into misinterpreting untrusted content as instructions. When an agent can browse the web, read documents, and call tools, that confusion can become real impact—data leakage, unauthorised actions, or workflow disruption.

OpenAI’s published work on prompt injections makes an important point: you should not assume these threats can be “solved” with smarter prompts alone. The safer approach is system design: reduce the agent’s ability to do harm even if it encounters malicious content.

What is prompt injection?

Prompt injection is an attack technique where malicious text is embedded in the content an AI agent consumes (webpages, PDFs, emails, chat logs, ticket descriptions). The goal is to override the agent’s instructions—often to reveal secrets or perform actions the user did not intend.

Why AI agents are particularly vulnerable

Agentic systems tend to:

  • process untrusted content at scale (the open web, inboxes, shared drives)

  • have tool access (APIs, file systems, admin consoles)

  • operate with continuity across a workflow (so one compromised step can contaminate the next)

This means injection doesn’t need to “jailbreak the model” to cause harm—it just needs to influence tool use.

OpenAI’s design direction: reduce risk at the action boundary

OpenAI’s guidance emphasises a pragmatic goal: limit the blast radius.

1) Constrain risky actions

Treat the agent as an assistant, not an autonomous operator.

Practical controls:

  • Require explicit user confirmation for high-impact actions (sending emails, changing permissions, deleting files, posting externally).

  • Use allow-lists of safe operations rather than open-ended tool access.

  • Separate “read” and “write” capabilities: start with read-only access for pilots.

2) Sandbox tool execution

When agents can run code or execute actions, isolation matters.

Practical controls:

  • Execute code in sandboxed environments with time limits and resource caps.

  • Block direct access to production systems from the sandbox.

  • Prefer “compute in sandbox → propose action → human approve” patterns.

3) Treat external content as untrusted input

Webpages and documents are data—not instructions.

Practical controls:

  • Put retrieved content into a clearly-labelled “untrusted context” channel in your agent architecture.

  • Strip or ignore instruction-like text from sources (e.g., “ignore previous instructions”).

  • Add content provenance and citations so the system can reason about trust.

4) Protect sensitive information by design

Even a well-behaved agent can be pressured into revealing data.

Practical controls:

  • Minimise what the agent can access (least privilege).

  • Segment sensitive systems behind purpose-built tools that enforce policy.

  • Use redaction and data-loss prevention for outputs.

5) Add observability and automated hardening

OpenAI has described the use of automated red teaming and continuous hardening for agent surfaces.

Practical controls:

  • Log prompts, tool calls, and outputs (with privacy-aware storage).

  • Monitor for anomalies: unusual tool sequences, large data access bursts, repeated attempts to override instructions.

  • Build evaluation tests that simulate injections on your real workflows.

Practical steps you can implement this quarter

If you’re building or deploying agents internally, start here.

  1. Map trust boundaries

    • Which inputs are untrusted (web, email, uploads)?

    • Which tools could cause harm (write actions, admin changes, external messages)?

  2. Classify tools by risk

    • Low risk: read-only search, summarisation

    • Medium risk: drafting, formatting, ticket creation

    • High risk: sending messages, changing records, payments, permissions

  3. Add approval gates for high-risk actions

    • Human-in-the-loop approvals are not a failure; they are a control.

  4. Adopt a “compute then propose” pattern

    • Use sandboxes for deterministic work.

    • Convert tool outputs into a proposed action that requires confirmation.

  5. Create an injection test suite

    • Include hidden instructions, multi-step social engineering, and “second-order” injection scenarios.

Where Generation Digital can help

Prompt injection defence is not a single feature. It’s a set of decisions across workflow design, access control, and governance.

Generation Digital can help you:

  • design agent workflows with clear trust boundaries

  • implement safer tool patterns (approvals, allow-lists, sandboxing)

  • define evaluation and monitoring so you can scale with confidence

Summary

Prompt injection is a persistent risk for tool-using agents, particularly when they ingest untrusted content. OpenAI’s approach focuses on reducing impact: constrain risky actions, sandbox execution, minimise sensitive data exposure, and continuously harden systems through monitoring and automated red teaming.

Next steps: If you’re rolling out agents in the enterprise and need a practical security posture, speak with Generation Digital: https://www.gend.co/contact

FAQs

What is prompt injection?
Prompt injection is an attack where malicious content tries to override an AI system’s instructions, potentially causing unintended actions or data exposure.

How does OpenAI protect against these threats?
OpenAI’s published guidance focuses on system-level mitigations such as constraining high-risk actions, sandboxing tool execution, treating external content as untrusted, protecting sensitive data by design, and continuously hardening agent surfaces with testing and monitoring.

Why is this important for AI development?
Agents increasingly connect to real systems. Without guardrails, a single malicious input can trigger unauthorised tool use or data leakage. Security is what makes scaling safe.

Are prompt injections fully solvable?
They are a frontier security challenge; the practical goal is to reduce the blast radius with layered controls and continuous hardening.

What should we do first in an enterprise rollout?
Start with least privilege and read-only access, add approval gates for risky actions, and build a test suite that simulates prompt injection against your real workflows.

Get weekly AI news and advice delivered to your inbox

By subscribing you consent to Generation Digital storing and processing your details in line with our privacy policy. You can read the full policy at gend.co/privacy.

Generation
Digital

UK Office

Generation Digital Ltd
33 Queen St,
London
EC4R 1AP
United Kingdom

Canada Office

Generation Digital Americas Inc
181 Bay St., Suite 1800
Toronto, ON, M5J 2T9
Canada

USA Office

Generation Digital Americas Inc
77 Sands St,
Brooklyn, NY 11201,
United States

EU Office

Generation Digital Software
Elgee Building
Dundalk
A91 X2R3
Ireland

Middle East Office

6994 Alsharq 3890,
An Narjis,
Riyadh 13343,
Saudi Arabia

UK Fast Growth Index UBS Logo
Financial Times FT 1000 Logo
Febe Growth 100 Logo (Background Removed)

Company No: 256 9431 77 | Copyright 2026 | Terms and Conditions | Privacy Policy

Generation
Digital

UK Office

Generation Digital Ltd
33 Queen St,
London
EC4R 1AP
United Kingdom

Canada Office

Generation Digital Americas Inc
181 Bay St., Suite 1800
Toronto, ON, M5J 2T9
Canada

USA Office

Generation Digital Americas Inc
77 Sands St,
Brooklyn, NY 11201,
United States

EU Office

Generation Digital Software
Elgee Building
Dundalk
A91 X2R3
Ireland

Middle East Office

6994 Alsharq 3890,
An Narjis,
Riyadh 13343,
Saudi Arabia

UK Fast Growth Index UBS Logo
Financial Times FT 1000 Logo
Febe Growth 100 Logo (Background Removed)


Company No: 256 9431 77
Terms and Conditions
Privacy Policy
Copyright 2026