Notion Vector Search: 10x Scale at 1/10th the Cost

Notion

Oct 9, 2023

A man and woman in a modern office discuss data visualization on a large monitor displaying a complex network graph, with keywords 'Notion Vector Search: 10x Scale at 1/10th the Cost' highlighted on a chart, emphasizing scalability and cost efficiency.


Notion scaled its vector search infrastructure 10x over two years while reducing costs by 90% by redesigning both indexing and storage. Key changes included serverless indices that decouple storage from compute, a migration to turbopuffer’s object-storage architecture, a Page State system that avoids re-embedding unchanged text, and a move from Spark to Ray for embeddings.

Vector search has quietly become one of the most expensive parts of AI features in production.

It isn’t just the vector database bill. It’s the ingestion pipeline, the embedding generation, the churn from “small edits”, and the operational load of keeping indices fresh across millions of tenants.

In a technical write-up published 19 February 2026, Notion shared how it scaled its vector search infrastructure 10x while reducing costs by 90% over two years — a story that starts with a launch crunch and ends with a calmer, more efficient architecture.

This article breaks down what they changed, why it worked, and what you can borrow if you’re building RAG or semantic search at enterprise scale.

Why vector search matters (and why it gets pricey)

Traditional keyword search matches the words you type. Vector search matches the meaning, by embedding text into a high-dimensional vector space. That’s why AI Q&A can answer “team meeting notes” even when the content is titled “group stand-up summary”.
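At its core, "matching meaning" just means comparing vectors. Below is a minimal sketch of similarity-based retrieval using toy 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions, and the example values are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": similar meanings land close together in vector space.
query = [0.9, 0.1, 0.3]   # e.g. "team meeting notes"
doc_a = [0.8, 0.2, 0.4]   # e.g. "group stand-up summary"
doc_b = [0.1, 0.9, 0.1]   # e.g. an unrelated page

# doc_a scores higher than doc_b despite sharing no keywords with the query.
assert cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```

The retrieval layer ranks stored chunks by this kind of score against the embedded question, which is why the query itself must be embedded before any lookup can happen.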

Notion uses vector search as the retrieval layer for Notion AI, pulling relevant workspace content (and connected sources like Slack and Google Drive) before the model generates an answer.

The catch is that “semantic” comes with a bill: every chunk you embed must be stored and kept up to date, and every query has to embed the question before it can retrieve.

The early architecture: fast onboarding, then immediate scaling pain

When Notion AI Q&A launched in November 2023, they used a dual ingestion pipeline:

  • an offline batch path on Apache Spark to chunk documents, generate embeddings via API, and bulk-load vectors

  • an online path using Kafka consumers to process page edits in near real time

They also used a multi-tenant sharding approach that routed workspaces to indices using workspace ID.
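Notion hasn't published the routing code, but hash-based tenant routing typically looks something like this sketch (the shard count and index naming are hypothetical):

```python
import hashlib

NUM_INDICES = 16  # hypothetical shard count; Notion's actual layout isn't public

def index_for_workspace(workspace_id: str) -> str:
    """Deterministically route a workspace to one of N indices by hashing its ID."""
    digest = hashlib.sha256(workspace_id.encode()).hexdigest()
    shard = int(digest, 16) % NUM_INDICES
    return f"vector-index-{shard:02d}"
```

The same workspace always lands on the same index, which keeps reads and writes consistent. The downside is exactly what Notion hit next: changing `NUM_INDICES` moves existing tenants, so growing capacity means re-sharding.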

The problem appeared quickly: within a month, their original indices were close to capacity. Re-sharding would have slowed onboarding, and over-provisioning was expensive because their provider charged for uptime.

Notion’s workaround was pragmatic: instead of reshaping existing indices, they created new generations of indices and routed new workspaces to the new generation, keeping reads and writes directed by a “generation ID”. It avoided repeated re-shard operations and kept onboarding moving.
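The generation scheme can be sketched as a small routing table: each workspace is pinned to the generation it was onboarded into, so existing tenants never move (the table structure and names here are illustrative, not Notion's actual implementation):

```python
# Hypothetical generation table: workspace_id -> generation ID.
workspace_generation: dict[str, int] = {}
current_generation = 1

def onboard(workspace_id: str) -> None:
    """Pin a new workspace to the current generation; existing pins never change."""
    workspace_generation.setdefault(workspace_id, current_generation)

def open_new_generation() -> None:
    """Called when the current generation's indices near capacity."""
    global current_generation
    current_generation += 1

def index_for(workspace_id: str) -> str:
    """Reads and writes resolve the generation first, then the index within it."""
    gen = workspace_generation[workspace_id]
    return f"gen{gen}-index-for-{workspace_id}"
```

Capacity grows by opening a new generation instead of reshaping old indices, at the cost of an extra lookup hop on every read and write.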

The result was hypergrowth capacity: daily onboarding capacity grew 600x from a few hundred workspaces per day, clearing the multi-million waitlist by April 2024.

How Notion cut costs: the three biggest levers

Notion’s cost reduction wasn’t one magic trick. It was a sequence of improvements that removed the biggest structural waste.

1) Move to serverless indices that decouple storage and compute

In May 2024, Notion migrated its embeddings workload from dedicated “pod” clusters (coupled compute and storage, charged by uptime) to a serverless architecture that charges by usage.

They report an immediate 50% cost reduction relative to peak usage, plus operational benefits: no hard storage-capacity planning and fewer manual provisioning chores.

2) Migrate to turbopuffer (object-storage-native search)

In parallel, they evaluated alternative engines and selected turbopuffer, built on object storage for cost efficiency.

The migration (late 2024 into early 2025) was also used as a clean-up moment:

  • full reindexing with higher write throughput

  • an embeddings model upgrade

  • simplified indexing (turbopuffer namespaces as independent indices — no sharding/generation routing)

  • gradual cutover, validating correctness generation by generation

Notion reports outcomes including:

  • 60% reduction in search engine spend

  • 35% reduction in AWS EMR compute costs

  • improved p50 query latency from 70–100ms to 50–70ms
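The simplification in the third bullet is worth dwelling on. With a namespace per workspace, routing collapses from "generation table, then shard hash" to a direct mapping; a sketch of what that might look like (the naming convention is an assumption, not turbopuffer's or Notion's documented scheme):

```python
def namespace_for(workspace_id: str) -> str:
    """One namespace per workspace: no shard math, no generation table."""
    return f"workspace-{workspace_id}"
```

Because each namespace is an independent index, capacity planning per tenant disappears along with the routing layers.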

3) Stop re-embedding the world when one character changes (Page State)

In July 2025, Notion tackled a core inefficiency: any change to a page previously triggered re-chunking, re-embedding, and re-uploading every span, even if only a tiny part changed.

Their Page State approach stores two hashes per span — one for span text, one for metadata — using 64-bit xxHash, and caches per-page span state in DynamoDB.

That enables two important optimisations:

  • if only some spans change, re-embed and reload only those spans

  • if only metadata (like permissions) changes, skip embedding entirely and issue a cheaper metadata update operation

Notion reports a 70% reduction in data volume, saving on embedding costs and vector DB write costs.
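The diffing logic behind Page State can be sketched as follows. Notion describes 64-bit xxHash and DynamoDB; this stand-in uses Python's stdlib `blake2b` (truncated to 8 bytes) and an in-memory dict, and the span/metadata shapes are invented for illustration:

```python
import hashlib
from dataclasses import dataclass

def h64(text: str) -> str:
    """Stand-in for the 64-bit xxHash Notion describes; any stable hash works here."""
    return hashlib.blake2b(text.encode(), digest_size=8).hexdigest()

@dataclass
class SpanState:
    text_hash: str
    meta_hash: str

def diff_spans(cached: dict[str, SpanState], spans: dict[str, tuple[str, str]]):
    """Compare incoming spans (id -> (text, metadata)) against cached state.

    Returns (re_embed, meta_only): span IDs needing full re-embedding vs a
    cheaper metadata-only update. The cache is updated in place.
    """
    re_embed, meta_only = [], []
    for span_id, (text, meta) in spans.items():
        new = SpanState(h64(text), h64(meta))
        old = cached.get(span_id)
        if old is None or old.text_hash != new.text_hash:
            re_embed.append(span_id)     # text changed: chunk + embed + upload
        elif old.meta_hash != new.meta_hash:
            meta_only.append(span_id)    # e.g. permissions changed: metadata patch
        cached[span_id] = new
    return re_embed, meta_only
```

Editing one character in one span now touches exactly one embedding, and a permissions change touches none, which is where the reported 70% data-volume reduction comes from.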

The next wave: moving embeddings from Spark to Ray

Notion’s later-stage work focuses on embeddings generation and serving, where costs can balloon and reliability can suffer if you rely heavily on third-party embedding APIs.

In July 2025, they began migrating near real-time embeddings to Ray on Anyscale. The motivations were practical:

  • eliminate a “double compute” pattern (Spark preprocessing on EMR plus per-token embedding API costs)

  • reduce dependency on external API stability

  • simplify pipelining

  • self-host open-source embedding models for faster iteration

Notion notes that this is still rolling out but expects a 90%+ reduction in embeddings infrastructure costs, with early results described as promising.
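The write-up doesn't include Ray code, but the shape of the consolidation is a single compute layer that batches text and runs the model in-process. Here is a minimal sketch using stdlib threads as a stand-in for Ray tasks, with a fake embedding function standing in for a self-hosted model:

```python
from concurrent.futures import ThreadPoolExecutor

def embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in for a self-hosted embedding model; in Notion's setup this would
    run on Ray workers instead of calling a per-token paid API."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]

def embed_all(texts: list[str], batch_size: int = 32, workers: int = 4):
    """Preprocess (batching) and inference in one layer: no separate Spark stage
    on EMR feeding an external API, hence no 'double compute'."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)  # order-preserving
    return [vec for batch in results for vec in batch]
```

The economic point is independent of the framework: once preprocessing and inference share one layer, you pay for GPU/CPU time once per chunk instead of paying EMR for preprocessing and an API per token.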

What you can learn from this

Notion’s story is useful because it’s not theoretical. It’s the sequence most teams experience once retrieval moves from prototype to production.

1) Separate “scale” problems from “cost” problems

Early on, Notion solved “don’t run out of space” with generation-based routing. Later, they solved unit economics with serverless indices, object-storage-native search, and reduced re-embedding.

2) Make freshness cheap

If your pipeline re-embeds unchanged text, you’ll pay forever. Page State style hashing and differential updates are one of the highest-leverage improvements for update-heavy products.

3) Treat migrations as simplification opportunities

Notion used provider changes to remove complexity (shards/generations) and upgrade models in the process.

4) Optimise the embedding pipeline separately

Vector DB spend is visible. Embedding generation can quietly rival it. Consolidating preprocessing and inference on a single compute layer (Ray) is one way to take cost and reliability back under your control.

Where Generation Digital helps

If you’re implementing AI features at scale, retrieval quality and unit economics are the difference between a promising pilot and a sustainable product.

Generation Digital supports teams with:

  • RAG and enterprise search architecture reviews

  • governance and security guardrails for AI features

  • operational metrics that link quality, latency and cost to business outcomes

Summary

Notion’s 10x scale and 90% cost reduction came from redesigning the whole retrieval stack: serverless indices that decouple storage and compute, a migration to turbopuffer’s object-storage-native search, a Page State system that avoids re-embedding unchanged text, and a move towards Ray/Anyscale for embeddings. The lesson is simple: at scale, the cheapest vector is the one you don’t regenerate.

Next steps

  1. Audit your vector costs: DB, embedding generation, and indexing churn.

  2. Implement differential indexing (hashing + partial updates) before you scale.

  3. Revisit your vector store economics (serverless vs provisioned vs object storage).

  4. If you want help designing a scalable retrieval stack, contact Generation Digital.

FAQs

Q1: How did Notion achieve 10x scale in vector search?
A: Notion scaled by improving onboarding throughput, routing new workspaces to new index “generations” when capacity filled, and later simplifying the architecture by moving to turbopuffer namespaces without sharding or generation routing.

Q2: What cost reduction did Notion achieve?
A: Notion reports cutting vector search costs by 90% overall over two years. The steps included a serverless migration with a 50% reduction, a turbopuffer migration with a 60% reduction in search engine spend, and a Page State optimisation that reduced indexed data volume by 70%.

Q3: Why is vector search important for Notion?
A: Vector search enables semantic retrieval, helping Notion AI find relevant content by meaning rather than exact keywords — which improves AI Q&A and enterprise search experiences.

Q4: What is the Page State Project?
A: It’s Notion’s optimisation that stores per-span hashes for text and metadata, allowing the pipeline to re-embed only changed spans and to update metadata without re-embedding, cutting data volume and write costs.

Q5: Why move embeddings from Spark to Ray?
A: Notion cites avoiding “double compute”, improving reliability, simplifying pipelining, and enabling self-hosted open-source embedding models; they expect a 90%+ reduction in embeddings infrastructure costs.

Generation
Digital

UK Office

Generation Digital Ltd
33 Queen St,
London
EC4R 1AP
United Kingdom

Canada Office

Generation Digital Americas Inc
181 Bay St., Suite 1800
Toronto, ON, M5J 2T9
Canada

USA Office

Generation Digital Americas Inc
77 Sands St,
Brooklyn, NY 11201,
United States

EU Office

Generation Digital Software
Elgee Building
Dundalk
A91 X2R3
Ireland

Middle East Office

6994 Alsharq 3890,
An Narjis,
Riyadh 13343,
Saudi Arabia

UK Fast Growth Index UBS Logo
Financial Times FT 1000 Logo
Febe Growth 100 Logo (Background Removed)

Company No: 256 9431 77 | Copyright 2026 | Terms and Conditions | Privacy Policy
