Scaling PostgreSQL for ChatGPT: Replicas, Cache, Guardrails

OpenAI

Jan 22, 2026


OpenAI scaled PostgreSQL for ChatGPT by keeping a single primary for writes and pushing reads to nearly 50 replicas across regions. They reduced pressure on the primary with caching and query optimisation, added rate limits and workload isolation to prevent “noisy neighbour” incidents, and enforced strict schema-change rules to protect reliability at massive scale.

Running a global product at ChatGPT’s scale creates a deceptively simple requirement: your database must behave like a utility. It has to stay fast under normal conditions, predictable under spikes, and recover quickly when something upstream goes wrong.

In January 2026, OpenAI shared how they’ve pushed PostgreSQL much further than many teams assume is possible—supporting 800 million users with a single primary Azure Database for PostgreSQL instance handling writes and nearly 50 read replicas spread across regions to handle read-heavy traffic.

The core architecture: one writer, many readers

OpenAI’s key bet is straightforward: don’t fight PostgreSQL’s write limits head-on if you can keep your workload predominantly read-heavy.

  • Single primary serves all writes (kept as calm as possible).

  • Read traffic is offloaded to replicas wherever possible, scaling out reads across geographies.

  • Write-heavy, shardable workloads are migrated away to sharded systems (they cite Azure Cosmos DB) to keep the primary stable.

The result, per OpenAI: millions of queries per second for read-heavy workloads, low double-digit millisecond p99 client-side latency, and five-nines availability in production.
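
To make the read/write split concrete, here is a minimal routing sketch in Python with psycopg2. This is not OpenAI's code; the connection strings, replica list, and table name are purely illustrative.

```python
# Minimal read/write routing sketch (illustrative, not OpenAI's implementation).
# One primary handles all writes; reads are spread across replica DSNs.
import random
import psycopg2

PRIMARY_DSN = "host=pg-primary dbname=app user=app"      # hypothetical
REPLICA_DSNS = [
    "host=pg-replica-eu dbname=app user=app",            # hypothetical
    "host=pg-replica-us dbname=app user=app",
]

def get_connection(readonly: bool):
    """Route reads to a randomly chosen replica and writes to the single primary."""
    dsn = random.choice(REPLICA_DSNS) if readonly else PRIMARY_DSN
    conn = psycopg2.connect(dsn)
    conn.set_session(readonly=readonly)  # guard against accidental writes on replicas
    return conn

# Usage: reads scale out across replicas; only writes ever touch the primary.
with get_connection(readonly=True) as conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM conversations")    # hypothetical table
    print(cur.fetchone())
```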

What made it work in practice (the playbook)

OpenAI are very clear that the architecture isn’t the magic part; the guardrails are.

1) Reduce load on the primary (ruthlessly)

They minimise both reads and writes on the primary. Reads are pushed to replicas unless they must run inside a write transaction; writes are reduced through bug fixes, “lazy writes” where appropriate, and migration of shardable, write-heavy systems elsewhere.
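
The post doesn't spell out how “lazy writes” are implemented; one common reading is that non-critical writes are buffered and flushed in batches rather than hitting the primary row by row. A minimal sketch under that assumption (the table and class names are hypothetical):

```python
# "Lazy write" buffer sketch: non-critical writes are batched and flushed
# periodically instead of reaching the primary one row at a time.
# This is one possible interpretation of the pattern, not OpenAI's implementation.
import time

class LazyWriter:
    def __init__(self, conn, flush_every=100, max_age_s=5.0):
        self.conn = conn
        self.buffer = []
        self.flush_every = flush_every
        self.max_age_s = max_age_s
        self.last_flush = time.monotonic()

    def record_event(self, user_id, event):
        """Queue a non-critical write; losing a handful on a crash is acceptable here."""
        self.buffer.append((user_id, event))
        too_many = len(self.buffer) >= self.flush_every
        too_old = time.monotonic() - self.last_flush >= self.max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with self.conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO usage_events (user_id, event) VALUES (%s, %s)",  # hypothetical table
                self.buffer,
            )
        self.conn.commit()
        self.buffer.clear()
        self.last_flush = time.monotonic()
```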

2) Cache as a reliability feature, not just a speed trick

A recurring failure pattern they describe is cache misses (or caching layer failures) triggering a sudden surge of database load, which then cascades into timeouts and retries. Treat caching as part of your stability model, not an optional optimisation.
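
The post doesn't describe OpenAI's caching layer in detail, but two patterns address exactly this failure mode: single-flight locking, so a popular cache miss triggers one database query instead of thousands, and a stale-on-error fallback, so a brief database hiccup doesn't cascade. A minimal in-process sketch (function and variable names are illustrative):

```python
# Cache-aside sketch with two stability features: single-flight reloads and
# serve-stale-on-error. Illustrative only; not OpenAI's implementation.
import threading
import time

_cache = {}               # key -> (value, expires_at)
_locks = {}               # key -> per-key lock for single-flight reloads
_locks_guard = threading.Lock()

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def cached_read(key, load_from_db, ttl_s=60):
    now = time.monotonic()
    entry = _cache.get(key)
    if entry and entry[1] > now:
        return entry[0]                        # fresh hit
    with _lock_for(key):                       # only one caller reloads each key
        entry = _cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                    # another caller reloaded while we waited
        try:
            value = load_from_db(key)
        except Exception:
            if entry:
                return entry[0]                # serve stale rather than hammer the DB
            raise
        _cache[key] = (value, time.monotonic() + ttl_s)
        return value
```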

3) Query optimisation (and ORM discipline)

They call out expensive multi-table joins as a recurring risk (including an incident involving a 12-table join). Their approach: continuously hunt down and fix costly query patterns, review ORM-generated SQL carefully, and use timeouts (e.g., an idle-in-transaction session timeout) to avoid operational issues such as blocking autovacuum.
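
The timeouts themselves are standard PostgreSQL settings: statement_timeout caps any single query, and idle_in_transaction_session_timeout terminates sessions that sit in an open transaction holding locks and blocking autovacuum. A short sketch (the specific values and DSN are illustrative, not OpenAI's):

```python
# Session-level timeouts: statement_timeout caps individual queries;
# idle_in_transaction_session_timeout kills sessions stuck in an open
# transaction. The settings are standard PostgreSQL; values here are illustrative.
import psycopg2

conn = psycopg2.connect("host=pg-primary dbname=app user=app")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '2s'")
    cur.execute("SET idle_in_transaction_session_timeout = '10s'")
conn.commit()

# The same limits can be set per role so every new connection inherits them, e.g.:
#   ALTER ROLE app_readonly SET statement_timeout = '2s';
```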

4) Workload isolation: stop “noisy neighbours”

OpenAI split requests into high-priority vs low-priority tiers routed to separate instances, and apply similar separation across products so one feature or product can’t degrade the rest of the platform.
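
A minimal sketch of tiered routing (tier names and DSNs are hypothetical; OpenAI's actual topology isn't described at this level):

```python
# Priority-tier routing sketch: high- and low-priority traffic target separate
# instances, so a surge of background work can't starve interactive requests.
import psycopg2

TIER_DSNS = {
    "high": "host=pg-replica-interactive dbname=app user=app",  # user-facing reads
    "low":  "host=pg-replica-batch dbname=app user=app",        # batch/offline reads
}

def connect_for(priority: str):
    """Route each request class to its own instance; default to the low tier."""
    return psycopg2.connect(TIER_DSNS.get(priority, TIER_DSNS["low"]))
```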

5) Connection pooling to prevent storms

They highlight Azure PostgreSQL connection limits (5,000 per instance) and past incidents caused by connection storms. Connection pooling is treated as core infrastructure, not an afterthought.
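
As a simple illustration of capping connections at the application side, psycopg2 ships an in-process pool; many teams also run a server-side pooler such as PgBouncer in front of PostgreSQL, though the post doesn't detail OpenAI's exact setup. The DSN and limits below are illustrative.

```python
# In-process pooling sketch: the pool caps how many backends this service can
# open, so a traffic spike queues locally instead of opening thousands of new
# PostgreSQL connections.
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(
    minconn=2,
    maxconn=20,                                    # hard cap per process (illustrative)
    dsn="host=pg-replica-eu dbname=app user=app",  # hypothetical DSN
)

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    pool.putconn(conn)                             # always return connections to the pool
```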

6) Rate limiting at multiple layers (and targeted load shedding)

Their rate limiting spans the application, connection-pooler, proxy, and query layers. They also mention the ability to block specific query digests when necessary—useful for rapid recovery during surges of expensive queries.
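
Most of those layers are operational controls (pooler, proxy, query-digest blocking), but the application layer is easy to sketch. A token bucket per caller sheds excess load before it reaches the database; the parameters and names below are illustrative, not OpenAI's.

```python
# Application-layer token bucket: each identity gets a refilling request budget,
# and anything beyond it is rejected early instead of reaching the database.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # identity -> TokenBucket

def admit(identity: str) -> bool:
    bucket = buckets.setdefault(identity, TokenBucket(rate_per_s=5, burst=10))
    return bucket.allow()   # callers respond with "try again later" when this is False
```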

7) Schema management: “schema changes are production events”

They avoid schema changes that trigger full table rewrites, enforce a strict 5-second timeout on schema changes, and restrict new tables in that PostgreSQL deployment (new workloads go to sharded systems).
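
In PostgreSQL terms, that discipline maps to standard settings: lock_timeout stops DDL from queueing behind live traffic, statement_timeout bounds how long it may run (the post cites a 5-second limit), and change types that avoid full table rewrites are preferred. A sketch under those assumptions (the DSN and object names are hypothetical):

```python
# Schema-change guardrails sketch: fail fast instead of blocking production
# traffic, and prefer metadata-only changes (ADD COLUMN with a constant default
# does not rewrite the table in modern PostgreSQL).
import psycopg2

conn = psycopg2.connect("host=pg-primary dbname=app user=app")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SET lock_timeout = '5s'")        # give up rather than queue behind traffic
    cur.execute("SET statement_timeout = '5s'")   # abort DDL that runs longer than expected
    cur.execute(
        "ALTER TABLE messages "
        "ADD COLUMN IF NOT EXISTS archived boolean DEFAULT false"
    )
```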

What this means for your team

Most teams won’t ever hit “OpenAI scale”, but the failure modes are the same—just smaller:

  • a launch creates an unexpected write storm

  • a cache layer fails and the DB takes the hit

  • one endpoint’s slow query eats your CPU budget

  • retries turn a hiccup into an outage

The transferable lesson is this: scaling PostgreSQL is often less about exotic distributed databases and more about disciplined constraints, guardrails, and workload design.

Practical steps you can implement next

  1. Separate reads and writes intentionally (and decide what must hit the primary).

  2. Make caching observable (cache health should be a first-class alerting signal; see the sketch after this list).

  3. Set query and transaction timeouts and actively police ORM output.

  4. Introduce workload tiers (high vs low priority) and isolate them.

  5. Rate limit at multiple layers and plan for targeted load shedding.

  6. Treat schema changes as risky operations with strict rules and timeouts.
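
Picking up step 2, the simplest version is exporting cache hits and misses as first-class metrics so a falling hit ratio pages someone before the database feels it. This sketch uses prometheus_client as one common option; the metric and function names are illustrative.

```python
# Cache observability sketch: count hits and misses so the hit ratio can drive
# alerts. A sudden rise in misses is an early warning of database overload.
from prometheus_client import Counter

CACHE_HITS = Counter("cache_hits_total", "Cache lookups served from cache")
CACHE_MISSES = Counter("cache_misses_total", "Cache lookups that fell through to the DB")

def instrumented_get(cache: dict, key, load_from_db):
    value = cache.get(key)
    if value is not None:
        CACHE_HITS.inc()
        return value
    CACHE_MISSES.inc()                # alert when the miss rate climbs abnormally
    value = load_from_db(key)
    cache[key] = value
    return value
```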

Where Generation Digital fits

Scaling is never just a database problem—it's a ways-of-working problem. We help teams align product, engineering, and operations around the operating model that makes these controls stick.

And when the same “scale and sprawl” issues show up in knowledge and delivery workflows, we often recommend Notion as the foundation for structured documentation, standards, and repeatable templates—supported by tools like Asana (execution and governance) and Miro (alignment and design). The goal is the same: reduce friction, prevent bottlenecks, and make the system resilient as demand grows.

FAQs

How does PostgreSQL handle high query volumes?
By scaling reads horizontally with replicas, reducing primary pressure with caching and query optimisation, and enforcing guardrails (timeouts, pooling, rate limiting) that prevent overload and cascading retries.

What is workload isolation in PostgreSQL?
It’s separating different classes of traffic so one workload can’t starve another—for example, routing high-priority requests to dedicated instances while low-priority workloads run elsewhere.

Why is rate limiting important?
Rate limiting prevents sudden spikes, expensive-query surges, or retry storms from exhausting shared resources (CPU, I/O, connections), helping systems recover quickly without widespread degradation.
