Notion Vector Search: 10x Scale at 1/10th the Cost
Conceptual
Oct 9, 2023


Notion scaled its vector search infrastructure 10x over two years while reducing costs by 90% by redesigning both indexing and storage. Key changes included serverless indices that decouple storage from compute, a migration to turbopuffer’s object-storage architecture, a Page State system that avoids re-embedding unchanged text, and a move from Spark to Ray for embeddings.
Vector search has quietly become one of the most expensive parts of AI features in production.
It isn’t just the vector database bill. It’s the ingestion pipeline, the embedding generation, the churn from “small edits”, and the operational load of keeping indices fresh across millions of tenants.
In a technical write-up published 19 February 2026, Notion shared how it scaled its vector search infrastructure 10x while reducing costs by 90% over two years — a story that starts with a launch crunch and ends with a calmer, more efficient architecture.
This article breaks down what they changed, why it worked, and what you can borrow if you’re building RAG or semantic search at enterprise scale.
Why vector search matters (and why it gets pricey)
Traditional keyword search matches the words you type. Vector search matches the meaning, by embedding text into a high-dimensional vector space. That’s why AI Q&A can answer “team meeting notes” even when the content is titled “group stand-up summary”.
Notion uses vector search as the retrieval layer for Notion AI, pulling relevant workspace content (and connected sources like Slack and Google Drive) before the model generates an answer.
The catch is that “semantic” comes with a bill: every chunk you embed must be stored and kept up to date, and every query has to embed the question before it can retrieve.
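To make the mechanics concrete, here is a minimal sketch (not Notion's implementation) of what the retrieval step boils down to: embed the query, then rank stored chunk vectors by similarity. The model name and example strings are placeholders.

```python
# Minimal sketch of semantic retrieval: embed the query, then rank stored chunk
# vectors by cosine similarity. Model and example strings are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text-embedding model works

chunks = ["group stand-up summary", "quarterly budget review", "API migration runbook"]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

query_vec = model.encode(["team meeting notes"], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec          # dot product == cosine similarity (vectors are normalised)
print(chunks[int(np.argmax(scores))])    # likely "group stand-up summary": a match by meaning, not keywords
```

Every one of those chunk vectors has to be produced, stored, and refreshed when the underlying text changes, which is where the spend accumulates.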
The early architecture: fast onboarding, then immediate scaling pain
When Notion AI Q&A launched in November 2023, they used a dual ingestion pipeline:
an offline batch path on Apache Spark to chunk documents, generate embeddings via API, and bulk-load vectors
an online path using Kafka consumers to process page edits in near real time (the per-event work is sketched below)
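As a rough sketch of what the online path does per edit event, the handler below re-chunks the page, re-embeds its spans, and upserts the vectors. The event shape and the embed/upsert callables are illustrative assumptions, not Notion's code; in production a Kafka consumer would call this for every edit event.

```python
# Sketch of the per-event work on the online ingestion path. The page/event
# shape and the embed/upsert callables are hypothetical stand-ins.
from typing import Callable

def handle_page_edit(page: dict,
                     embed: Callable[[list[str]], list[list[float]]],
                     upsert: Callable[..., None]) -> None:
    # Naive chunking by block; real pipelines split by token budget and structure.
    spans = [{"id": block["id"], "text": block["text"], "metadata": page["metadata"]}
             for block in page["blocks"]]
    vectors = embed([span["text"] for span in spans])
    upsert(workspace_id=page["workspace_id"],
           items=[{"id": s["id"], "vector": v, "metadata": s["metadata"]}
                  for s, v in zip(spans, vectors)])
```

Note that in this naive form, any edit re-embeds the whole page; that inefficiency is exactly what the Page State work later removes.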
They also used a multi-tenant sharding approach that routed workspaces to indices using workspace ID.
The problem appeared quickly: within a month, their original indices were close to capacity. Re-sharding would have slowed onboarding, and over-provisioning was expensive because their provider charged for uptime.
Notion’s workaround was pragmatic: instead of reshaping existing indices, they created new generations of indices and routed new workspaces to the new generation, keeping reads and writes directed by a “generation ID”. It avoided repeated re-shard operations and kept onboarding moving.
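In code terms, generation-based routing can be as simple as a persisted mapping from workspace to generation ID. The sketch below uses made-up index names and capacity numbers purely to show the idea.

```python
# Sketch of generation-based routing: new workspaces are assigned to the newest
# index generation, and reads/writes resolve the index via the workspace's
# stored generation ID. Index names and thresholds are illustrative only.
GENERATIONS = ["embeddings-gen-1", "embeddings-gen-2", "embeddings-gen-3"]
CAPACITY_PER_GENERATION = 100_000          # hypothetical workspace limit per generation

workspace_generation: dict[str, int] = {}  # persisted mapping in practice
generation_counts = [0] * len(GENERATIONS)

def assign_workspace(workspace_id: str) -> str:
    """Route a newly onboarded workspace to the latest generation with room."""
    gen = len(GENERATIONS) - 1
    if generation_counts[gen] >= CAPACITY_PER_GENERATION:
        raise RuntimeError("provision a new generation before onboarding more workspaces")
    workspace_generation[workspace_id] = gen
    generation_counts[gen] += 1
    return GENERATIONS[gen]

def index_for(workspace_id: str) -> str:
    """Reads and writes look up the index through the workspace's generation ID."""
    return GENERATIONS[workspace_generation[workspace_id]]
```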
The result was hypergrowth capacity: daily onboarding went from a few hundred workspaces to roughly 600x that, clearing a waitlist that ran into the millions by April 2024.
How Notion cut costs: the three biggest levers
Notion’s cost reduction wasn’t one magic trick. It was a sequence of improvements that removed the biggest structural waste.
1) Move to serverless indices that decouple storage and compute
In May 2024, Notion migrated its embeddings workload from dedicated “pod” clusters (coupled compute and storage, charged by uptime) to a serverless architecture that charges by usage.
They report an immediate 50% cost reduction relative to peak usage, plus operational benefits: no hard storage capacity planning and fewer manual provisioning chores.
2) Migrate to turbopuffer (object-storage-native search)
In parallel, they evaluated alternative engines and selected turbopuffer, built on object storage for cost efficiency.
The migration (late 2024 into early 2025) was also used as a clean-up moment:
full reindexing with higher write throughput
an embeddings model upgrade
simplified indexing (turbopuffer namespaces as independent indices — no sharding/generation routing)
gradual cutover, validating correctness generation by generation (one way to check result parity is sketched below)
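Notion hasn't published its validation harness, but a common way to check a cutover like this is to dual-read: send the same query to the old and new engines and compare the returned IDs. The query functions themselves are omitted here; the overlap check is generic.

```python
# Sketch of dual-read validation during a gradual cutover: compare the top-k
# result IDs from the old and new engines for the same query.
def result_overlap(old_ids: list[str], new_ids: list[str], k: int = 10) -> float:
    """Fraction of the old engine's top-k results that the new engine also returns."""
    old_top, new_top = set(old_ids[:k]), set(new_ids[:k])
    return len(old_top & new_top) / max(len(old_top), 1)

# Example: 8 of the old engine's top-10 IDs also appear in the new engine's top-10.
old = [f"span-{i}" for i in range(10)]
new = [f"span-{i}" for i in [0, 1, 2, 3, 4, 5, 6, 7, 11, 12]]
assert result_overlap(old, new) == 0.8
```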
Notion reports outcomes including:
60% reduction in search engine spend
35% reduction in AWS EMR compute costs
improved p50 query latency from 70–100ms to 50–70ms
3) Stop re-embedding the world when one character changes (Page State)
In July 2025, Notion tackled a core inefficiency: any change to a page previously triggered re-chunking, re-embedding, and re-uploading every span, even if only a tiny part changed.
Their Page State approach stores two hashes per span — one for span text, one for metadata — using 64-bit xxHash, and caches per-page span state in DynamoDB.
That enables two important optimisations:
if only some spans change, re-embed and reload only those spans
if only metadata (like permissions) changes, skip embedding entirely and issue a cheaper metadata update operation (see the sketch below)
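A minimal sketch of that differential-update logic, using 64-bit xxHash as described; the span shape and the in-memory dict standing in for the per-page DynamoDB record are assumptions.

```python
# Sketch of Page State-style differential indexing: hash each span's text and
# metadata separately, compare against the cached state, and only do the work
# that is actually needed. The cache dict stands in for the DynamoDB record.
import xxhash

def span_hashes(span: dict) -> tuple[int, int]:
    text_hash = xxhash.xxh64(span["text"]).intdigest()
    meta_hash = xxhash.xxh64(repr(sorted(span["metadata"].items()))).intdigest()
    return text_hash, meta_hash

def plan_updates(spans: list[dict], cached: dict[str, tuple[int, int]]) -> dict:
    plan = {"re_embed": [], "metadata_only": [], "unchanged": []}
    for span in spans:
        text_hash, meta_hash = span_hashes(span)
        old = cached.get(span["id"])
        if old is None or old[0] != text_hash:
            plan["re_embed"].append(span["id"])       # text changed: re-embed and re-upload
        elif old[1] != meta_hash:
            plan["metadata_only"].append(span["id"])  # e.g. permissions changed: cheap metadata write
        else:
            plan["unchanged"].append(span["id"])      # skip entirely
        cached[span["id"]] = (text_hash, meta_hash)
    return plan
```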
Notion reports a 70% reduction in data volume, saving on embedding costs and vector DB write costs.
The next wave: moving embeddings from Spark to Ray
Notion’s later-stage work focuses on embeddings generation and serving, where costs can balloon and reliability can suffer if you rely heavily on third-party embedding APIs.
In July 2025, they began migrating near real-time embeddings to Ray on Anyscale. The motivations were practical:
eliminate a “double compute” pattern (Spark preprocessing on EMR plus per-token embedding API costs)
reduce dependency on external API stability
simplify pipelining
self-host open-source embedding models for faster iteration
Notion notes that this is still rolling out but expects a 90%+ reduction in embeddings infrastructure costs, with early results described as promising.
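Notion hasn't shared its Ray code, but the general shape of self-hosted embedding inference on Ray looks something like the sketch below: a pool of actors keeps the model in memory and serves batches, replacing per-token API calls. The model name, actor count, and batch size are assumptions, not Notion's choices.

```python
# Sketch of self-hosted embedding inference on Ray: an actor pool holds the
# model in memory and embeds batches of chunk text. Model name, actor count,
# and batch size are illustrative assumptions.
import ray
from sentence_transformers import SentenceTransformer

@ray.remote
class Embedder:
    def __init__(self, model_name: str = "BAAI/bge-small-en-v1.5"):
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts, batch_size=64, normalize_embeddings=True).tolist()

ray.init()
workers = [Embedder.remote() for _ in range(4)]  # scale out across the cluster

# Round-robin batches of chunk text across the actor pool.
batches = [["chunk one", "chunk two"], ["chunk three"]]
futures = [workers[i % len(workers)].embed.remote(batch) for i, batch in enumerate(batches)]
vectors = ray.get(futures)
```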
What you can learn from this
Notion’s story is useful because it’s not theoretical. It’s the sequence most teams experience once retrieval moves from prototype to production.
1) Separate “scale” problems from “cost” problems
Early on, Notion solved “don’t run out of space” with generation-based routing. Later, they solved unit economics with serverless indices, object-storage-native search, and reduced re-embedding.
2) Make freshness cheap
If your pipeline re-embeds unchanged text, you’ll pay for it forever. Page State-style hashing and differential updates are among the highest-leverage improvements for update-heavy products.
3) Treat migrations as simplification opportunities
Notion used provider changes to remove complexity (shards/generations) and upgrade models in the process.
4) Optimise the embedding pipeline separately
Vector DB spend is visible. Embedding generation can quietly rival it. Consolidating preprocessing and inference on a single compute layer (Ray) is one way to take cost and reliability back under your control.
Where Generation Digital helps
If you’re implementing AI features at scale, retrieval quality and unit economics are the difference between a promising pilot and a sustainable product.
Generation Digital supports teams with:
RAG and enterprise search architecture reviews
governance and security guardrails for AI features
operational metrics that link quality, latency and cost to business outcomes
Summary
Notion’s 10x scale and 90% cost reduction came from redesigning the whole retrieval stack: serverless indices that decouple storage and compute, a migration to turbopuffer’s object-storage-native search, a Page State system that avoids re-embedding unchanged text, and a move towards Ray/Anyscale for embeddings. The lesson is simple: at scale, the cheapest vector is the one you don’t regenerate.
Next steps
Audit your vector costs: DB, embedding generation, and indexing churn.
Implement differential indexing (hashing + partial updates) before you scale.
Revisit your vector store economics (serverless vs provisioned vs object storage).
If you want help designing a scalable retrieval stack, contact Generation Digital.
FAQs
Q1: How did Notion achieve 10x scale in vector search?
A: Notion scaled by improving onboarding throughput, routing new workspaces to new index “generations” when capacity filled, and later simplifying the architecture by moving to turbopuffer namespaces without sharding or generation routing.
Q2: What cost reduction did Notion achieve?
A: Notion reports cutting vector search costs by 90% overall over two years. The steps included a serverless migration with a 50% reduction, a turbopuffer migration with a 60% reduction in search engine spend, and a Page State optimisation that reduced indexed data volume by 70%.
Q3: Why is vector search important for Notion?
A: Vector search enables semantic retrieval, helping Notion AI find relevant content by meaning rather than exact keywords — which improves AI Q&A and enterprise search experiences.
Q4: What is the Page State Project?
A: It’s Notion’s optimisation that stores per-span hashes for text and metadata, allowing the pipeline to re-embed only changed spans and to update metadata without re-embedding, cutting data volume and write costs.
Q5: Why move embeddings from Spark to Ray?
A: Notion cites avoiding “double compute”, improving reliability, simplifying pipelining, and enabling self-hosted open-source embedding models; they expect a 90%+ reduction in embeddings infrastructure costs.