Scaling PostgreSQL for ChatGPT: Replicas, Cache, Guardrails

OpenAI

Jan 22, 2026


OpenAI scaled PostgreSQL for ChatGPT by keeping a single primary for writes and pushing reads to nearly 50 replicas across regions. They reduced pressure on the primary with caching and query optimisation, added rate limits and workload isolation to prevent “noisy neighbour” incidents, and enforced strict schema-change rules to protect reliability at massive scale.

Running a global product at ChatGPT’s scale creates a deceptively simple requirement: your database must behave like a utility. It has to stay fast under normal conditions, predictable under spikes, and recover quickly when something upstream goes wrong.

In January 2026, OpenAI shared how they’ve pushed PostgreSQL much further than many teams assume is possible—supporting 800 million users with a single primary Azure Database for PostgreSQL instance handling writes and nearly 50 read replicas spread across regions to handle read-heavy traffic.

The core architecture: one writer, many readers

OpenAI’s key bet is straightforward: don’t fight PostgreSQL’s write limits head-on if you can keep your workload predominantly read-heavy.

  • Single primary serves all writes (kept as calm as possible).

  • Read traffic is offloaded to replicas wherever possible, scaling out reads across geographies.

  • Write-heavy, shardable workloads are migrated away to sharded systems (they cite Azure Cosmos DB) to keep the primary stable.

The result, per OpenAI: millions of queries per second for read-heavy workloads, low double-digit millisecond p99 client-side latency, and five-nines availability in production.
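
To make the read/write split concrete, here is a minimal routing sketch in Python with psycopg2. This is not OpenAI's code; the connection strings, replica list, and table name are purely illustrative.

```python
# Minimal read/write routing sketch (illustrative, not OpenAI's implementation).
# One primary handles all writes; reads are spread across replica DSNs.
import random
import psycopg2

PRIMARY_DSN = "host=pg-primary dbname=app user=app"      # hypothetical
REPLICA_DSNS = [
    "host=pg-replica-eu dbname=app user=app",            # hypothetical
    "host=pg-replica-us dbname=app user=app",
]

def get_connection(readonly: bool):
    """Route reads to a randomly chosen replica and writes to the single primary."""
    dsn = random.choice(REPLICA_DSNS) if readonly else PRIMARY_DSN
    conn = psycopg2.connect(dsn)
    conn.set_session(readonly=readonly)  # guard against accidental writes on replicas
    return conn

# Usage: reads scale out across replicas; only writes ever touch the primary.
with get_connection(readonly=True) as conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM conversations")    # hypothetical table
    print(cur.fetchone())
```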

What made it work in practice (the playbook)

OpenAI are very clear that the architecture isn’t the magic part; the guardrails are.

1) Reduce load on the primary (ruthlessly)

They minimise both reads and writes on the primary. Reads are pushed to replicas unless they must run inside a write transaction; writes are reduced through bug fixes, “lazy writes” where appropriate, and migration of shardable, write-heavy systems elsewhere.
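
The post doesn't spell out how “lazy writes” are implemented; one common reading is that non-critical writes are buffered and flushed in batches rather than hitting the primary row by row. A minimal sketch under that assumption (the table and class names are hypothetical):

```python
# "Lazy write" buffer sketch: non-critical writes are batched and flushed
# periodically instead of reaching the primary one row at a time.
# This is one possible interpretation of the pattern, not OpenAI's implementation.
import time

class LazyWriter:
    def __init__(self, conn, flush_every=100, max_age_s=5.0):
        self.conn = conn
        self.buffer = []
        self.flush_every = flush_every
        self.max_age_s = max_age_s
        self.last_flush = time.monotonic()

    def record_event(self, user_id, event):
        """Queue a non-critical write; losing a handful on a crash is acceptable here."""
        self.buffer.append((user_id, event))
        too_many = len(self.buffer) >= self.flush_every
        too_old = time.monotonic() - self.last_flush >= self.max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with self.conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO usage_events (user_id, event) VALUES (%s, %s)",  # hypothetical table
                self.buffer,
            )
        self.conn.commit()
        self.buffer.clear()
        self.last_flush = time.monotonic()
```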

2) Cache as a reliability feature, not just a speed trick

A recurring failure pattern they describe is cache misses (or caching layer failures) triggering a sudden surge of database load, which then cascades into timeouts and retries. Treat caching as part of your stability model, not an optional optimisation.
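
The post doesn't describe OpenAI's caching layer in detail, but two patterns address exactly this failure mode: single-flight locking, so a popular cache miss triggers one database query instead of thousands, and a stale-on-error fallback, so a brief database hiccup doesn't cascade. A minimal in-process sketch (function and variable names are illustrative):

```python
# Cache-aside sketch with two stability features: single-flight reloads and
# serve-stale-on-error. Illustrative only; not OpenAI's implementation.
import threading
import time

_cache = {}               # key -> (value, expires_at)
_locks = {}               # key -> per-key lock for single-flight reloads
_locks_guard = threading.Lock()

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def cached_read(key, load_from_db, ttl_s=60):
    now = time.monotonic()
    entry = _cache.get(key)
    if entry and entry[1] > now:
        return entry[0]                        # fresh hit
    with _lock_for(key):                       # only one caller reloads each key
        entry = _cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                    # another caller reloaded while we waited
        try:
            value = load_from_db(key)
        except Exception:
            if entry:
                return entry[0]                # serve stale rather than hammer the DB
            raise
        _cache[key] = (value, time.monotonic() + ttl_s)
        return value
```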

3) Query optimisation (and ORM discipline)

They call out expensive multi-table joins as a recurring risk (including an incident involving a 12-table join). Their approach: continuously hunt down and fix costly query patterns, review ORM-generated SQL carefully, and use timeouts (e.g., an idle-in-transaction session timeout) to avoid operational issues such as blocking autovacuum.
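
The timeouts themselves are standard PostgreSQL settings: statement_timeout caps any single query, and idle_in_transaction_session_timeout terminates sessions that sit in an open transaction holding locks and blocking autovacuum. A short sketch (the specific values and DSN are illustrative, not OpenAI's):

```python
# Session-level timeouts: statement_timeout caps individual queries;
# idle_in_transaction_session_timeout kills sessions stuck in an open
# transaction. The settings are standard PostgreSQL; values here are illustrative.
import psycopg2

conn = psycopg2.connect("host=pg-primary dbname=app user=app")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '2s'")
    cur.execute("SET idle_in_transaction_session_timeout = '10s'")
conn.commit()

# The same limits can be set per role so every new connection inherits them, e.g.:
#   ALTER ROLE app_readonly SET statement_timeout = '2s';
```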

4) Workload isolation: stop “noisy neighbours”

OpenAI split requests into high-priority vs low-priority tiers routed to separate instances, and apply similar separation across products so one feature or product can’t degrade the rest of the platform.
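
A minimal sketch of tiered routing (tier names and DSNs are hypothetical; OpenAI's actual topology isn't described at this level):

```python
# Priority-tier routing sketch: high- and low-priority traffic target separate
# instances, so a surge of background work can't starve interactive requests.
import psycopg2

TIER_DSNS = {
    "high": "host=pg-replica-interactive dbname=app user=app",  # user-facing reads
    "low":  "host=pg-replica-batch dbname=app user=app",        # batch/offline reads
}

def connect_for(priority: str):
    """Route each request class to its own instance; default to the low tier."""
    return psycopg2.connect(TIER_DSNS.get(priority, TIER_DSNS["low"]))
```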

5) Connection pooling to prevent storms

They highlight Azure PostgreSQL connection limits (5,000 per instance) and past incidents caused by connection storms. Connection pooling is treated as core infrastructure, not an afterthought.
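
As a simple illustration of capping connections at the application side, psycopg2 ships an in-process pool; many teams also run a server-side pooler such as PgBouncer in front of PostgreSQL, though the post doesn't detail OpenAI's exact setup. The DSN and limits below are illustrative.

```python
# In-process pooling sketch: the pool caps how many backends this service can
# open, so a traffic spike queues locally instead of opening thousands of new
# PostgreSQL connections.
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(
    minconn=2,
    maxconn=20,                                    # hard cap per process (illustrative)
    dsn="host=pg-replica-eu dbname=app user=app",  # hypothetical DSN
)

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    pool.putconn(conn)                             # always return connections to the pool
```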

6) Rate limiting at multiple layers (and targeted load shedding)

Their rate limiting spans the application, connection-pooler, proxy, and query layers. They also mention the ability to block specific query digests when necessary—useful for rapid recovery during surges of expensive queries.
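
Most of those layers are operational controls (pooler, proxy, query-digest blocking), but the application layer is easy to sketch. A token bucket per caller sheds excess load before it reaches the database; the parameters and names below are illustrative, not OpenAI's.

```python
# Application-layer token bucket: each identity gets a refilling request budget,
# and anything beyond it is rejected early instead of reaching the database.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # identity -> TokenBucket

def admit(identity: str) -> bool:
    bucket = buckets.setdefault(identity, TokenBucket(rate_per_s=5, burst=10))
    return bucket.allow()   # callers respond with "try again later" when this is False
```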

7) Schema management: “schema changes are production events”

They avoid schema changes that trigger full table rewrites, enforce a strict 5-second timeout on schema changes, and restrict new tables in that PostgreSQL deployment (new workloads go to sharded systems).
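
In PostgreSQL terms, that discipline maps to standard settings: lock_timeout stops DDL from queueing behind live traffic, statement_timeout bounds how long it may run (the post cites a 5-second limit), and change types that avoid full table rewrites are preferred. A sketch under those assumptions (the DSN and object names are hypothetical):

```python
# Schema-change guardrails sketch: fail fast instead of blocking production
# traffic, and prefer metadata-only changes (ADD COLUMN with a constant default
# does not rewrite the table in modern PostgreSQL).
import psycopg2

conn = psycopg2.connect("host=pg-primary dbname=app user=app")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SET lock_timeout = '5s'")        # give up rather than queue behind traffic
    cur.execute("SET statement_timeout = '5s'")   # abort DDL that runs longer than expected
    cur.execute(
        "ALTER TABLE messages "
        "ADD COLUMN IF NOT EXISTS archived boolean DEFAULT false"
    )
```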

What this means for your team

Most teams won’t ever hit “OpenAI scale”, but the failure modes are the same—just smaller:

  • a launch creates an unexpected write storm

  • a cache layer fails and the DB takes the hit

  • one endpoint’s slow query eats your CPU budget

  • retries turn a hiccup into an outage

The transferable lesson is this: scaling PostgreSQL is often less about exotic distributed databases and more about disciplined constraints, guardrails, and workload design.

Practical steps you can implement next

  1. Separate reads and writes intentionally (and decide what must hit the primary).

  2. Make caching observable (cache health should be a first-class alerting signal; see the sketch after this list).

  3. Set query and transaction timeouts and actively police ORM output.

  4. Introduce workload tiers (high vs low priority) and isolate them.

  5. Rate limit at multiple layers and plan for targeted load shedding.

  6. Treat schema changes as risky operations with strict rules and timeouts.
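
Picking up step 2, the simplest version is exporting cache hits and misses as first-class metrics so a falling hit ratio pages someone before the database feels it. This sketch uses prometheus_client as one common option; the metric and function names are illustrative.

```python
# Cache observability sketch: count hits and misses so the hit ratio can drive
# alerts. A sudden rise in misses is an early warning of database overload.
from prometheus_client import Counter

CACHE_HITS = Counter("cache_hits_total", "Cache lookups served from cache")
CACHE_MISSES = Counter("cache_misses_total", "Cache lookups that fell through to the DB")

def instrumented_get(cache: dict, key, load_from_db):
    value = cache.get(key)
    if value is not None:
        CACHE_HITS.inc()
        return value
    CACHE_MISSES.inc()                # alert when the miss rate climbs abnormally
    value = load_from_db(key)
    cache[key] = value
    return value
```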

Where Generation Digital fits

Scaling is never just a database problem—it's a ways-of-working problem. We help teams align product, engineering, and operations around the operating model that makes these controls stick.

And when the same “scale and sprawl” issues show up in knowledge and delivery workflows, we often recommend Notion as the foundation for structured documentation, standards, and repeatable templates—supported by tools like Asana (execution and governance) and Miro (alignment and design). The goal is the same: reduce friction, prevent bottlenecks, and make the system resilient as demand grows.

FAQs

How does PostgreSQL handle high query volumes?
By scaling reads horizontally with replicas, reducing primary pressure with caching and query optimisation, and enforcing guardrails (timeouts, pooling, rate limiting) that prevent overload and cascading retries.

What is workload isolation in PostgreSQL?
It’s separating different classes of traffic so one workload can’t starve another—for example, routing high-priority requests to dedicated instances while low-priority workloads run elsewhere.

Why is rate limiting important?
Rate limiting prevents sudden spikes, expensive-query surges, or retry storms from exhausting shared resources (CPU, I/O, connections), helping systems recover quickly without widespread degradation.
