Secure AI Agents: OpenAI Defences Against Prompt Injection
Secure AI Agents: OpenAI Defences Against Prompt Injection
OpenAI
Mar 6, 2026

Free AI at Work Playbook for managers using ChatGPT, Claude and Gemini.
Free AI at Work Playbook for managers using ChatGPT, Claude and Gemini.
➔ Download the Playbook
Prompt injection attacks try to trick AI agents into following malicious instructions hidden in content like webpages, emails, or documents. OpenAI’s mitigation approach focuses on reducing impact: constrain risky actions, sandbox tool use, minimise sensitive data exposure, and add observability and automated red-teaming so unsafe behaviours are detected and patched.
Prompt injection is the agent-era version of phishing: instead of tricking a person into clicking, attackers try to trick an AI system into misinterpreting untrusted content as instructions. When an agent can browse the web, read documents, and call tools, that confusion can become real impact—data leakage, unauthorised actions, or workflow disruption.
OpenAI’s published work on prompt injections makes an important point: you should not assume these threats can be “solved” with smarter prompts alone. The safer approach is system design: reduce the agent’s ability to do harm even if it encounters malicious content.
What is prompt injection?
Prompt injection is an attack technique where malicious text is embedded in the content an AI agent consumes (webpages, PDFs, emails, chat logs, ticket descriptions). The goal is to override the agent’s instructions—often to reveal secrets or perform actions the user did not intend.
Why AI agents are particularly vulnerable
Agentic systems tend to:
process untrusted content at scale (the open web, inboxes, shared drives)
have tool access (APIs, file systems, admin consoles)
operate with continuity across a workflow (so one compromised step can contaminate the next)
This means injection doesn’t need to “jailbreak the model” to cause harm—it just needs to influence tool use.
OpenAI’s design direction: reduce risk at the action boundary
OpenAI’s guidance emphasises a pragmatic goal: limit the blast radius.
1) Constrain risky actions
Treat the agent as an assistant, not an autonomous operator.
Practical controls:
Require explicit user confirmation for high-impact actions (sending emails, changing permissions, deleting files, posting externally).
Use allow-lists of safe operations rather than open-ended tool access.
Separate “read” and “write” capabilities: start with read-only access for pilots.
2) Sandbox tool execution
When agents can run code or execute actions, isolation matters.
Practical controls:
Execute code in sandboxed environments with time limits and resource caps.
Block direct access to production systems from the sandbox.
Prefer “compute in sandbox → propose action → human approve” patterns.
3) Treat external content as untrusted input
Webpages and documents are data—not instructions.
Practical controls:
Put retrieved content into a clearly-labelled “untrusted context” channel in your agent architecture.
Strip or ignore instruction-like text from sources (e.g., “ignore previous instructions”).
Add content provenance and citations so the system can reason about trust.
4) Protect sensitive information by design
Even a well-behaved agent can be pressured into revealing data.
Practical controls:
Minimise what the agent can access (least privilege).
Segment sensitive systems behind purpose-built tools that enforce policy.
Use redaction and data-loss prevention for outputs.
5) Add observability and automated hardening
OpenAI has described the use of automated red teaming and continuous hardening for agent surfaces.
Practical controls:
Log prompts, tool calls, and outputs (with privacy-aware storage).
Monitor for anomalies: unusual tool sequences, large data access bursts, repeated attempts to override instructions.
Build evaluation tests that simulate injections on your real workflows.
Practical steps you can implement this quarter
If you’re building or deploying agents internally, start here.
Map trust boundaries
Which inputs are untrusted (web, email, uploads)?
Which tools could cause harm (write actions, admin changes, external messages)?
Classify tools by risk
Low risk: read-only search, summarisation
Medium risk: drafting, formatting, ticket creation
High risk: sending messages, changing records, payments, permissions
Add approval gates for high-risk actions
Human-in-the-loop approvals are not a failure; they are a control.
Adopt a “compute then propose” pattern
Use sandboxes for deterministic work.
Convert tool outputs into a proposed action that requires confirmation.
Create an injection test suite
Include hidden instructions, multi-step social engineering, and “second-order” injection scenarios.
Where Generation Digital can help
Prompt injection defence is not a single feature. It’s a set of decisions across workflow design, access control, and governance.
Generation Digital can help you:
design agent workflows with clear trust boundaries
implement safer tool patterns (approvals, allow-lists, sandboxing)
define evaluation and monitoring so you can scale with confidence
Summary
Prompt injection is a persistent risk for tool-using agents, particularly when they ingest untrusted content. OpenAI’s approach focuses on reducing impact: constrain risky actions, sandbox execution, minimise sensitive data exposure, and continuously harden systems through monitoring and automated red teaming.
Next steps: If you’re rolling out agents in the enterprise and need a practical security posture, speak with Generation Digital: https://www.gend.co/contact
FAQs
What is prompt injection?
Prompt injection is an attack where malicious content tries to override an AI system’s instructions, potentially causing unintended actions or data exposure.
How does OpenAI protect against these threats?
OpenAI’s published guidance focuses on system-level mitigations such as constraining high-risk actions, sandboxing tool execution, treating external content as untrusted, protecting sensitive data by design, and continuously hardening agent surfaces with testing and monitoring.
Why is this important for AI development?
Agents increasingly connect to real systems. Without guardrails, a single malicious input can trigger unauthorised tool use or data leakage. Security is what makes scaling safe.
Are prompt injections fully solvable?
They are a frontier security challenge; the practical goal is to reduce the blast radius with layered controls and continuous hardening.
What should we do first in an enterprise rollout?
Start with least privilege and read-only access, add approval gates for risky actions, and build a test suite that simulates prompt injection against your real workflows.
Prompt injection attacks try to trick AI agents into following malicious instructions hidden in content like webpages, emails, or documents. OpenAI’s mitigation approach focuses on reducing impact: constrain risky actions, sandbox tool use, minimise sensitive data exposure, and add observability and automated red-teaming so unsafe behaviours are detected and patched.
Prompt injection is the agent-era version of phishing: instead of tricking a person into clicking, attackers try to trick an AI system into misinterpreting untrusted content as instructions. When an agent can browse the web, read documents, and call tools, that confusion can become real impact—data leakage, unauthorised actions, or workflow disruption.
OpenAI’s published work on prompt injections makes an important point: you should not assume these threats can be “solved” with smarter prompts alone. The safer approach is system design: reduce the agent’s ability to do harm even if it encounters malicious content.
What is prompt injection?
Prompt injection is an attack technique where malicious text is embedded in the content an AI agent consumes (webpages, PDFs, emails, chat logs, ticket descriptions). The goal is to override the agent’s instructions—often to reveal secrets or perform actions the user did not intend.
Why AI agents are particularly vulnerable
Agentic systems tend to:
process untrusted content at scale (the open web, inboxes, shared drives)
have tool access (APIs, file systems, admin consoles)
operate with continuity across a workflow (so one compromised step can contaminate the next)
This means injection doesn’t need to “jailbreak the model” to cause harm—it just needs to influence tool use.
OpenAI’s design direction: reduce risk at the action boundary
OpenAI’s guidance emphasises a pragmatic goal: limit the blast radius.
1) Constrain risky actions
Treat the agent as an assistant, not an autonomous operator.
Practical controls:
Require explicit user confirmation for high-impact actions (sending emails, changing permissions, deleting files, posting externally).
Use allow-lists of safe operations rather than open-ended tool access.
Separate “read” and “write” capabilities: start with read-only access for pilots.
2) Sandbox tool execution
When agents can run code or execute actions, isolation matters.
Practical controls:
Execute code in sandboxed environments with time limits and resource caps.
Block direct access to production systems from the sandbox.
Prefer “compute in sandbox → propose action → human approve” patterns.
3) Treat external content as untrusted input
Webpages and documents are data—not instructions.
Practical controls:
Put retrieved content into a clearly-labelled “untrusted context” channel in your agent architecture.
Strip or ignore instruction-like text from sources (e.g., “ignore previous instructions”).
Add content provenance and citations so the system can reason about trust.
4) Protect sensitive information by design
Even a well-behaved agent can be pressured into revealing data.
Practical controls:
Minimise what the agent can access (least privilege).
Segment sensitive systems behind purpose-built tools that enforce policy.
Use redaction and data-loss prevention for outputs.
5) Add observability and automated hardening
OpenAI has described the use of automated red teaming and continuous hardening for agent surfaces.
Practical controls:
Log prompts, tool calls, and outputs (with privacy-aware storage).
Monitor for anomalies: unusual tool sequences, large data access bursts, repeated attempts to override instructions.
Build evaluation tests that simulate injections on your real workflows.
Practical steps you can implement this quarter
If you’re building or deploying agents internally, start here.
Map trust boundaries
Which inputs are untrusted (web, email, uploads)?
Which tools could cause harm (write actions, admin changes, external messages)?
Classify tools by risk
Low risk: read-only search, summarisation
Medium risk: drafting, formatting, ticket creation
High risk: sending messages, changing records, payments, permissions
Add approval gates for high-risk actions
Human-in-the-loop approvals are not a failure; they are a control.
Adopt a “compute then propose” pattern
Use sandboxes for deterministic work.
Convert tool outputs into a proposed action that requires confirmation.
Create an injection test suite
Include hidden instructions, multi-step social engineering, and “second-order” injection scenarios.
Where Generation Digital can help
Prompt injection defence is not a single feature. It’s a set of decisions across workflow design, access control, and governance.
Generation Digital can help you:
design agent workflows with clear trust boundaries
implement safer tool patterns (approvals, allow-lists, sandboxing)
define evaluation and monitoring so you can scale with confidence
Summary
Prompt injection is a persistent risk for tool-using agents, particularly when they ingest untrusted content. OpenAI’s approach focuses on reducing impact: constrain risky actions, sandbox execution, minimise sensitive data exposure, and continuously harden systems through monitoring and automated red teaming.
Next steps: If you’re rolling out agents in the enterprise and need a practical security posture, speak with Generation Digital: https://www.gend.co/contact
FAQs
What is prompt injection?
Prompt injection is an attack where malicious content tries to override an AI system’s instructions, potentially causing unintended actions or data exposure.
How does OpenAI protect against these threats?
OpenAI’s published guidance focuses on system-level mitigations such as constraining high-risk actions, sandboxing tool execution, treating external content as untrusted, protecting sensitive data by design, and continuously hardening agent surfaces with testing and monitoring.
Why is this important for AI development?
Agents increasingly connect to real systems. Without guardrails, a single malicious input can trigger unauthorised tool use or data leakage. Security is what makes scaling safe.
Are prompt injections fully solvable?
They are a frontier security challenge; the practical goal is to reduce the blast radius with layered controls and continuous hardening.
What should we do first in an enterprise rollout?
Start with least privilege and read-only access, add approval gates for risky actions, and build a test suite that simulates prompt injection against your real workflows.
Get weekly AI news and advice delivered to your inbox
By subscribing you consent to Generation Digital storing and processing your details in line with our privacy policy. You can read the full policy at gend.co/privacy.
Generation
Digital

UK Office
Generation Digital Ltd
33 Queen St,
London
EC4R 1AP
United Kingdom
Canada Office
Generation Digital Americas Inc
181 Bay St., Suite 1800
Toronto, ON, M5J 2T9
Canada
USA Office
Generation Digital Americas Inc
77 Sands St,
Brooklyn, NY 11201,
United States
EU Office
Generation Digital Software
Elgee Building
Dundalk
A91 X2R3
Ireland
Middle East Office
6994 Alsharq 3890,
An Narjis,
Riyadh 13343,
Saudi Arabia
Company No: 256 9431 77 | Copyright 2026 | Terms and Conditions | Privacy Policy
Generation
Digital

UK Office
Generation Digital Ltd
33 Queen St,
London
EC4R 1AP
United Kingdom
Canada Office
Generation Digital Americas Inc
181 Bay St., Suite 1800
Toronto, ON, M5J 2T9
Canada
USA Office
Generation Digital Americas Inc
77 Sands St,
Brooklyn, NY 11201,
United States
EU Office
Generation Digital Software
Elgee Building
Dundalk
A91 X2R3
Ireland
Middle East Office
6994 Alsharq 3890,
An Narjis,
Riyadh 13343,
Saudi Arabia









