Secure AI Agents: OpenAI Defences Against Prompt Injection

Secure AI Agents: OpenAI Defences Against Prompt Injection

OpenAI

6 mar 2026

In a modern office setting, three professionals engage in a focused discussion around a conference table, surrounded by laptops displaying code, emphasizing teamwork in developing secure AI agents as a defense against prompt injection.

¿No sabes por dónde empezar con la IA?Evalúa preparación, riesgos y prioridades en menos de una hora.

¿No sabes por dónde empezar con la IA?Evalúa preparación, riesgos y prioridades en menos de una hora.

➔ Descarga nuestro paquete gratuito de preparación para IA

Prompt injection attacks try to trick AI agents into following malicious instructions hidden in content like webpages, emails, or documents. OpenAI’s mitigation approach focuses on reducing impact: constrain risky actions, sandbox tool use, minimise sensitive data exposure, and add observability and automated red-teaming so unsafe behaviours are detected and patched.

Prompt injection is the agent-era version of phishing: instead of tricking a person into clicking, attackers try to trick an AI system into misinterpreting untrusted content as instructions. When an agent can browse the web, read documents, and call tools, that confusion can become real impact—data leakage, unauthorised actions, or workflow disruption.

OpenAI’s published work on prompt injections makes an important point: you should not assume these threats can be “solved” with smarter prompts alone. The safer approach is system design: reduce the agent’s ability to do harm even if it encounters malicious content.

What is prompt injection?

Prompt injection is an attack technique where malicious text is embedded in the content an AI agent consumes (webpages, PDFs, emails, chat logs, ticket descriptions). The goal is to override the agent’s instructions—often to reveal secrets or perform actions the user did not intend.

Why AI agents are particularly vulnerable

Agentic systems tend to:

  • process untrusted content at scale (the open web, inboxes, shared drives)

  • have tool access (APIs, file systems, admin consoles)

  • operate with continuity across a workflow (so one compromised step can contaminate the next)

This means injection doesn’t need to “jailbreak the model” to cause harm—it just needs to influence tool use.

OpenAI’s design direction: reduce risk at the action boundary

OpenAI’s guidance emphasises a pragmatic goal: limit the blast radius.

1) Constrain risky actions

Treat the agent as an assistant, not an autonomous operator.

Practical controls:

  • Require explicit user confirmation for high-impact actions (sending emails, changing permissions, deleting files, posting externally).

  • Use allow-lists of safe operations rather than open-ended tool access.

  • Separate “read” and “write” capabilities: start with read-only access for pilots.

2) Sandbox tool execution

When agents can run code or execute actions, isolation matters.

Practical controls:

  • Execute code in sandboxed environments with time limits and resource caps.

  • Block direct access to production systems from the sandbox.

  • Prefer “compute in sandbox → propose action → human approve” patterns.

3) Treat external content as untrusted input

Webpages and documents are data—not instructions.

Practical controls:

  • Put retrieved content into a clearly-labelled “untrusted context” channel in your agent architecture.

  • Strip or ignore instruction-like text from sources (e.g., “ignore previous instructions”).

  • Add content provenance and citations so the system can reason about trust.

4) Protect sensitive information by design

Even a well-behaved agent can be pressured into revealing data.

Practical controls:

  • Minimise what the agent can access (least privilege).

  • Segment sensitive systems behind purpose-built tools that enforce policy.

  • Use redaction and data-loss prevention for outputs.

5) Add observability and automated hardening

OpenAI has described the use of automated red teaming and continuous hardening for agent surfaces.

Practical controls:

  • Log prompts, tool calls, and outputs (with privacy-aware storage).

  • Monitor for anomalies: unusual tool sequences, large data access bursts, repeated attempts to override instructions.

  • Build evaluation tests that simulate injections on your real workflows.

Practical steps you can implement this quarter

If you’re building or deploying agents internally, start here.

  1. Map trust boundaries

    • Which inputs are untrusted (web, email, uploads)?

    • Which tools could cause harm (write actions, admin changes, external messages)?

  2. Classify tools by risk

    • Low risk: read-only search, summarisation

    • Medium risk: drafting, formatting, ticket creation

    • High risk: sending messages, changing records, payments, permissions

  3. Add approval gates for high-risk actions

    • Human-in-the-loop approvals are not a failure; they are a control.

  4. Adopt a “compute then propose” pattern

    • Use sandboxes for deterministic work.

    • Convert tool outputs into a proposed action that requires confirmation.

  5. Create an injection test suite

    • Include hidden instructions, multi-step social engineering, and “second-order” injection scenarios.

Where Generation Digital can help

Prompt injection defence is not a single feature. It’s a set of decisions across workflow design, access control, and governance.

Generation Digital can help you:

  • design agent workflows with clear trust boundaries

  • implement safer tool patterns (approvals, allow-lists, sandboxing)

  • define evaluation and monitoring so you can scale with confidence

Summary

Prompt injection is a persistent risk for tool-using agents, particularly when they ingest untrusted content. OpenAI’s approach focuses on reducing impact: constrain risky actions, sandbox execution, minimise sensitive data exposure, and continuously harden systems through monitoring and automated red teaming.

Next steps: If you’re rolling out agents in the enterprise and need a practical security posture, speak with Generation Digital: https://www.gend.co/contact

FAQs

What is prompt injection?
Prompt injection is an attack where malicious content tries to override an AI system’s instructions, potentially causing unintended actions or data exposure.

How does OpenAI protect against these threats?
OpenAI’s published guidance focuses on system-level mitigations such as constraining high-risk actions, sandboxing tool execution, treating external content as untrusted, protecting sensitive data by design, and continuously hardening agent surfaces with testing and monitoring.

Why is this important for AI development?
Agents increasingly connect to real systems. Without guardrails, a single malicious input can trigger unauthorised tool use or data leakage. Security is what makes scaling safe.

Are prompt injections fully solvable?
They are a frontier security challenge; the practical goal is to reduce the blast radius with layered controls and continuous hardening.

What should we do first in an enterprise rollout?
Start with least privilege and read-only access, add approval gates for risky actions, and build a test suite that simulates prompt injection against your real workflows.

Prompt injection attacks try to trick AI agents into following malicious instructions hidden in content like webpages, emails, or documents. OpenAI’s mitigation approach focuses on reducing impact: constrain risky actions, sandbox tool use, minimise sensitive data exposure, and add observability and automated red-teaming so unsafe behaviours are detected and patched.

Prompt injection is the agent-era version of phishing: instead of tricking a person into clicking, attackers try to trick an AI system into misinterpreting untrusted content as instructions. When an agent can browse the web, read documents, and call tools, that confusion can become real impact—data leakage, unauthorised actions, or workflow disruption.

OpenAI’s published work on prompt injections makes an important point: you should not assume these threats can be “solved” with smarter prompts alone. The safer approach is system design: reduce the agent’s ability to do harm even if it encounters malicious content.

What is prompt injection?

Prompt injection is an attack technique where malicious text is embedded in the content an AI agent consumes (webpages, PDFs, emails, chat logs, ticket descriptions). The goal is to override the agent’s instructions—often to reveal secrets or perform actions the user did not intend.

Why AI agents are particularly vulnerable

Agentic systems tend to:

  • process untrusted content at scale (the open web, inboxes, shared drives)

  • have tool access (APIs, file systems, admin consoles)

  • operate with continuity across a workflow (so one compromised step can contaminate the next)

This means injection doesn’t need to “jailbreak the model” to cause harm—it just needs to influence tool use.

OpenAI’s design direction: reduce risk at the action boundary

OpenAI’s guidance emphasises a pragmatic goal: limit the blast radius.

1) Constrain risky actions

Treat the agent as an assistant, not an autonomous operator.

Practical controls:

  • Require explicit user confirmation for high-impact actions (sending emails, changing permissions, deleting files, posting externally).

  • Use allow-lists of safe operations rather than open-ended tool access.

  • Separate “read” and “write” capabilities: start with read-only access for pilots.

2) Sandbox tool execution

When agents can run code or execute actions, isolation matters.

Practical controls:

  • Execute code in sandboxed environments with time limits and resource caps.

  • Block direct access to production systems from the sandbox.

  • Prefer “compute in sandbox → propose action → human approve” patterns.

3) Treat external content as untrusted input

Webpages and documents are data—not instructions.

Practical controls:

  • Put retrieved content into a clearly-labelled “untrusted context” channel in your agent architecture.

  • Strip or ignore instruction-like text from sources (e.g., “ignore previous instructions”).

  • Add content provenance and citations so the system can reason about trust.

4) Protect sensitive information by design

Even a well-behaved agent can be pressured into revealing data.

Practical controls:

  • Minimise what the agent can access (least privilege).

  • Segment sensitive systems behind purpose-built tools that enforce policy.

  • Use redaction and data-loss prevention for outputs.

5) Add observability and automated hardening

OpenAI has described the use of automated red teaming and continuous hardening for agent surfaces.

Practical controls:

  • Log prompts, tool calls, and outputs (with privacy-aware storage).

  • Monitor for anomalies: unusual tool sequences, large data access bursts, repeated attempts to override instructions.

  • Build evaluation tests that simulate injections on your real workflows.

Practical steps you can implement this quarter

If you’re building or deploying agents internally, start here.

  1. Map trust boundaries

    • Which inputs are untrusted (web, email, uploads)?

    • Which tools could cause harm (write actions, admin changes, external messages)?

  2. Classify tools by risk

    • Low risk: read-only search, summarisation

    • Medium risk: drafting, formatting, ticket creation

    • High risk: sending messages, changing records, payments, permissions

  3. Add approval gates for high-risk actions

    • Human-in-the-loop approvals are not a failure; they are a control.

  4. Adopt a “compute then propose” pattern

    • Use sandboxes for deterministic work.

    • Convert tool outputs into a proposed action that requires confirmation.

  5. Create an injection test suite

    • Include hidden instructions, multi-step social engineering, and “second-order” injection scenarios.

Where Generation Digital can help

Prompt injection defence is not a single feature. It’s a set of decisions across workflow design, access control, and governance.

Generation Digital can help you:

  • design agent workflows with clear trust boundaries

  • implement safer tool patterns (approvals, allow-lists, sandboxing)

  • define evaluation and monitoring so you can scale with confidence

Summary

Prompt injection is a persistent risk for tool-using agents, particularly when they ingest untrusted content. OpenAI’s approach focuses on reducing impact: constrain risky actions, sandbox execution, minimise sensitive data exposure, and continuously harden systems through monitoring and automated red teaming.

Next steps: If you’re rolling out agents in the enterprise and need a practical security posture, speak with Generation Digital: https://www.gend.co/contact

FAQs

What is prompt injection?
Prompt injection is an attack where malicious content tries to override an AI system’s instructions, potentially causing unintended actions or data exposure.

How does OpenAI protect against these threats?
OpenAI’s published guidance focuses on system-level mitigations such as constraining high-risk actions, sandboxing tool execution, treating external content as untrusted, protecting sensitive data by design, and continuously hardening agent surfaces with testing and monitoring.

Why is this important for AI development?
Agents increasingly connect to real systems. Without guardrails, a single malicious input can trigger unauthorised tool use or data leakage. Security is what makes scaling safe.

Are prompt injections fully solvable?
They are a frontier security challenge; the practical goal is to reduce the blast radius with layered controls and continuous hardening.

What should we do first in an enterprise rollout?
Start with least privilege and read-only access, add approval gates for risky actions, and build a test suite that simulates prompt injection against your real workflows.

Recibe noticias y consejos sobre IA cada semana en tu bandeja de entrada

Al suscribirte, das tu consentimiento para que Generation Digital almacene y procese tus datos de acuerdo con nuestra política de privacidad. Puedes leer la política completa en gend.co/privacy.

Generación
Digital

Oficina en Reino Unido

Generation Digital Ltd
33 Queen St,
Londres
EC4R 1AP
Reino Unido

Oficina en Canadá

Generation Digital Americas Inc
181 Bay St., Suite 1800
Toronto, ON, M5J 2T9
Canadá

Oficina en EE. UU.

Generation Digital Américas Inc
77 Sands St,
Brooklyn, NY 11201,
Estados Unidos

Oficina de la UE

Software Generación Digital
Edificio Elgee
Dundalk
A91 X2R3
Irlanda

Oficina en Medio Oriente

6994 Alsharq 3890,
An Narjis,
Riad 13343,
Arabia Saudita

UK Fast Growth Index UBS Logo
Financial Times FT 1000 Logo
Febe Growth 100 Logo (Background Removed)

Número de la empresa: 256 9431 77 | Derechos de autor 2026 | Términos y Condiciones | Política de Privacidad

Generación
Digital

Oficina en Reino Unido

Generation Digital Ltd
33 Queen St,
Londres
EC4R 1AP
Reino Unido

Oficina en Canadá

Generation Digital Americas Inc
181 Bay St., Suite 1800
Toronto, ON, M5J 2T9
Canadá

Oficina en EE. UU.

Generation Digital Américas Inc
77 Sands St,
Brooklyn, NY 11201,
Estados Unidos

Oficina de la UE

Software Generación Digital
Edificio Elgee
Dundalk
A91 X2R3
Irlanda

Oficina en Medio Oriente

6994 Alsharq 3890,
An Narjis,
Riad 13343,
Arabia Saudita

UK Fast Growth Index UBS Logo
Financial Times FT 1000 Logo
Febe Growth 100 Logo (Background Removed)


Número de Empresa: 256 9431 77
Términos y Condiciones
Política de Privacidad
Derechos de Autor 2026