Scaling PostgreSQL for ChatGPT: Replicas, Cache, Guardrails
OpenAI
Jan 22, 2026


OpenAI scaled PostgreSQL for ChatGPT by keeping a single primary for writes and pushing reads to nearly 50 replicas across regions. They reduced pressure on the primary with caching and query optimisation, added rate limits and workload isolation to prevent “noisy neighbour” incidents, and enforced strict schema-change rules to protect reliability at massive scale.
Running a global product at ChatGPT’s scale creates a deceptively simple requirement: your database must behave like a utility. It has to stay fast under normal conditions, predictable under spikes, and recover quickly when something upstream goes wrong.
In January 2026, OpenAI shared how they’ve pushed PostgreSQL much further than many teams assume is possible—supporting 800 million users with a single primary Azure Database for PostgreSQL instance handling writes and nearly 50 read replicas spread across regions to handle read-heavy traffic.
The core architecture: one writer, many readers
OpenAI’s key bet is straightforward: don’t fight PostgreSQL’s write limits head-on if you can keep your workload predominantly read-heavy.
Single primary serves all writes (kept as calm as possible).
Read traffic is offloaded to replicas wherever possible, scaling out reads across geographies.
Write-heavy, shardable workloads are migrated away to sharded systems (they cite Azure Cosmos DB) to keep the primary stable.
The result, per OpenAI: millions of queries per second for read-heavy workloads, low double-digit millisecond p99 client-side latency, and five-nines availability in production.
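To make the pattern concrete, here is a minimal read/write routing sketch in Python with psycopg2. The DSNs, the random replica choice, and the run_read/run_write split are illustrative assumptions; OpenAI has not published its actual routing layer.

```python
import random
import psycopg2

# Hypothetical DSNs; real deployments would use a connection pool (see the
# pooling section below) rather than opening a fresh connection per query.
PRIMARY_DSN = "host=pg-primary.internal dbname=app user=app"
REPLICA_DSNS = [
    "host=pg-replica-us.internal dbname=app user=app",
    "host=pg-replica-eu.internal dbname=app user=app",
]

def run_write(sql: str, params: tuple = ()) -> None:
    """All writes go to the single primary, and only the primary."""
    conn = psycopg2.connect(PRIMARY_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
        conn.commit()
    finally:
        conn.close()

def run_read(sql: str, params: tuple = ()):
    """Reads that can tolerate replication lag go to any replica."""
    conn = psycopg2.connect(random.choice(REPLICA_DSNS))
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()
```

In practice the routing usually lives in a shared data-access layer or proxy, so individual services cannot accidentally send reads to the primary.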
What made it work in practice (the playbook)
OpenAI are very clear that the architecture isn’t the magic part; the guardrails are.
1) Reduce load on the primary (ruthlessly)
They minimise both reads and writes on the primary. Reads are pushed to replicas unless they genuinely have to run inside a write transaction; writes are reduced through bug fixes, “lazy writes” where appropriate, and migration of shardable, write-heavy systems elsewhere.
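OpenAI doesn’t detail what “lazy writes” look like internally. One plausible reading is coalescing low-value updates in memory and flushing them in periodic batches, as in this sketch; the users table, last_seen_at column, and psycopg2 usage are assumptions for illustration.

```python
import threading
import time
from psycopg2.extras import execute_batch

_pending: dict[int, float] = {}     # user_id -> most recent "seen" timestamp
_lock = threading.Lock()

def record_seen(user_id: int) -> None:
    """Cheap in-memory write; the database sees nothing yet."""
    with _lock:
        _pending[user_id] = time.time()

def flush(conn) -> None:
    """Called every few seconds: push the accumulated updates in batches."""
    with _lock:
        batch = list(_pending.items())
        _pending.clear()
    if not batch:
        return
    with conn.cursor() as cur:
        # One batched round of UPDATEs instead of one per user request.
        execute_batch(
            cur,
            "UPDATE users SET last_seen_at = to_timestamp(%s) WHERE id = %s",
            [(ts, uid) for uid, ts in batch],
        )
    conn.commit()
```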
2) Cache as a reliability feature, not just a speed trick
A recurring failure pattern they describe is cache misses (or caching layer failures) triggering a sudden surge of database load, which then cascades into timeouts and retries. Treat caching as part of your stability model, not an optional optimisation.
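A sketch of that mindset, assuming a generic cache client with get/set(key, value, ttl) methods: serve stale data when the database struggles, and let only one caller per key refresh the cache so a wave of misses doesn’t become a wave of identical queries. None of this is OpenAI’s actual caching code.

```python
import threading

_refresh_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def _lock_for(key: str) -> threading.Lock:
    with _locks_guard:
        return _refresh_locks.setdefault(key, threading.Lock())

def cached_read(cache, key, load_from_db, ttl=60, stale_ttl=3600):
    value = cache.get(key)
    if value is not None:
        return value
    lock = _lock_for(key)
    if not lock.acquire(blocking=False):
        # Someone else is already refreshing this key; fall back to a stale
        # copy (possibly None) rather than piling onto the database.
        return cache.get("stale:" + key)
    try:
        value = load_from_db()                 # the only DB hit for this key
        cache.set(key, value, ttl)
        cache.set("stale:" + key, value, stale_ttl)
        return value
    except Exception:
        # Database error or timeout: degrade to stale data if we have it.
        return cache.get("stale:" + key)
    finally:
        lock.release()
```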
3) Query optimisation (and ORM discipline)
They call out expensive multi-table joins as a recurring risk (including an incident involving a 12-table join). Their approach: continuously hunt down and fix costly query patterns, review ORM-generated SQL carefully, and use timeouts (for example, idle-in-transaction session timeouts) to avoid operational issues such as long-open transactions blocking autovacuum.
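The timeout side of this is easy to enforce with standard PostgreSQL settings. The sketch below sets statement_timeout and idle_in_transaction_session_timeout per session via psycopg2; the specific values are illustrative, not OpenAI’s.

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")   # hypothetical DSN
with conn.cursor() as cur:
    # Cancel any single statement that runs longer than 2 seconds.
    cur.execute("SET statement_timeout = '2s'")
    # Terminate sessions that sit idle inside an open transaction -- these
    # are the ones that hold locks and get in the way of autovacuum.
    cur.execute("SET idle_in_transaction_session_timeout = '30s'")
conn.commit()
```

Most drivers also accept these settings at connect time (for example through libpq’s options parameter), which makes it harder for any single service to forget them.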
4) Workload isolation: stop “noisy neighbours”
OpenAI split requests into high-priority vs low-priority tiers routed to separate instances, and apply similar separation across products so one feature or product can’t degrade the rest of the platform.
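A minimal version of that routing decision, with hypothetical tier names and DSNs:

```python
# Route traffic classes to separate instances so batch jobs and analytics
# cannot starve latency-sensitive, user-facing requests.
TIER_DSNS = {
    "high": "host=pg-high-priority.internal dbname=app",
    "low":  "host=pg-low-priority.internal dbname=app",
}

def dsn_for(request_tier: str) -> str:
    # Unknown callers default to the low-priority tier, so a new batch job
    # cannot accidentally land on the latency-sensitive instance.
    return TIER_DSNS.get(request_tier, TIER_DSNS["low"])
```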
5) Connection pooling to prevent storms
They highlight Azure PostgreSQL connection limits (5,000 per instance) and past incidents caused by connection storms. Connection pooling is treated as core infrastructure, not an afterthought.
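As a sketch, psycopg2’s built-in ThreadedConnectionPool shows the application-side half of the picture; in practice a dedicated pooler such as PgBouncer often sits in front of PostgreSQL as well. The pool sizes and DSN here are illustrative.

```python
from contextlib import contextmanager
from psycopg2.pool import ThreadedConnectionPool

# A bounded pool keeps the number of server connections predictable even
# when application traffic spikes.
pool = ThreadedConnectionPool(minconn=5, maxconn=50,
                              dsn="dbname=app user=app")  # hypothetical DSN

@contextmanager
def db_connection():
    conn = pool.getconn()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        pool.putconn(conn)   # always return the connection to the pool
```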
6) Rate limiting at multiple layers (and targeted load shedding)
Their rate limiting spans the application, connection-pooler, proxy, and query layers. They also mention the ability to block specific query digests when necessary, which is useful for rapid recovery during surges of expensive queries.
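At the application layer, even a simple token bucket goes a long way. The sketch below is generic; the per-digest bucket at the end hints at the kind of targeted shedding described above, and all the numbers are illustrative.

```python
import threading
import time

class TokenBucket:
    """Classic token bucket: refill at a steady rate, spend tokens per request."""
    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self, cost: float = 1.0) -> bool:
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False   # caller should shed, queue, or degrade this request

# e.g. one bucket per expensive query digest, so only that pattern is shed
expensive_report_bucket = TokenBucket(rate_per_sec=5, capacity=10)
```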
7) Schema management: “schema changes are production events”
They avoid schema changes that trigger full table rewrites, enforce a strict 5-second timeout on schema changes, and restrict new tables in that PostgreSQL deployment (new workloads go to sharded systems).
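In practice that policy can be encoded in the migration tooling itself. The sketch below runs DDL with tight lock and statement timeouts and sticks to a metadata-only change; the 5-second value mirrors the rule above, while the table and column names are hypothetical.

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")    # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("SET lock_timeout = '5s'")        # give up fast if the table is busy
    cur.execute("SET statement_timeout = '5s'")   # a metadata-only change should be near-instant
    # Adding a column with a constant default is metadata-only in PostgreSQL 11+,
    # so it does not trigger a full table rewrite.
    cur.execute(
        "ALTER TABLE orders ADD COLUMN IF NOT EXISTS priority text DEFAULT 'normal'"
    )
conn.commit()
```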
What this means for your team
Most teams won’t ever hit “OpenAI scale”, but the failure modes are the same—just smaller:
a launch creates an unexpected write storm
a cache layer fails and the DB takes the hit
one endpoint’s slow query eats your CPU budget
retries turn a hiccup into an outage
The transferable lesson is this: scaling PostgreSQL is often less about exotic distributed databases and more about disciplined constraints, guardrails, and workload design.
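The retry point deserves special attention, because naive retries amplify exactly the load that caused the problem. A small guard, capped exponential backoff with jitter and a hard attempt limit, keeps a brief blip from turning into a self-inflicted spike; the values here are illustrative.

```python
import random
import time

def with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Run operation(), retrying transient failures with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                      # stop amplifying load; surface the error
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```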
Practical steps you can implement next
Separate reads and writes intentionally (and decide what must hit the primary).
Make caching observable (cache health should be a first-class alerting signal).
Set query and transaction timeouts and actively police ORM output.
Introduce workload tiers (high vs low priority) and isolate them.
Rate limit at multiple layers and plan for targeted load shedding.
Treat schema changes as risky operations with strict rules and timeouts.
Where Generation Digital fits
Scaling is never just a database problem—it's a ways-of-working problem. We help teams align product, engineering, and operations around the operating model that makes these controls stick.
And when the same “scale and sprawl” issues show up in knowledge and delivery workflows, we often recommend Notion as the foundation for structured documentation, standards, and repeatable templates—supported by tools like Asana (execution and governance) and Miro (alignment and design). The goal is the same: reduce friction, prevent bottlenecks, and make the system resilient as demand grows.
FAQs
How does PostgreSQL handle high query volumes?
By scaling reads horizontally with replicas, reducing primary pressure with caching and query optimisation, and enforcing guardrails (timeouts, pooling, rate limiting) that prevent overload and cascading retries.
What is workload isolation in PostgreSQL?
It’s separating different classes of traffic so one workload can’t starve another—for example, routing high-priority requests to dedicated instances while low-priority workloads run elsewhere.
Why is rate limiting important?
Rate limiting prevents sudden spikes, expensive-query surges, or retry storms from exhausting shared resources (CPU, I/O, connections), helping systems recover quickly without widespread degradation.