Notion Vector Search: 10x Scale at 1/10th the Cost

Notion

Oct 9, 2023

A man and woman in a modern office discuss data visualization on a large monitor displaying a complex network graph, with keywords 'Notion Vector Search: 10x Scale at 1/10th the Cost' highlighted on a chart, emphasizing scalability and cost efficiency.


Notion scaled its vector search infrastructure 10x over two years while reducing costs by 90% by redesigning both indexing and storage. Key changes included serverless indices that decouple storage from compute, a migration to turbopuffer’s object-storage architecture, a Page State system that avoids re-embedding unchanged text, and a move from Spark to Ray for embeddings.

Vector search has quietly become one of the most expensive parts of AI features in production.

It isn’t just the vector database bill. It’s the ingestion pipeline, the embedding generation, the churn from “small edits”, and the operational load of keeping indices fresh across millions of tenants.

In a technical write-up published 19 February 2026, Notion shared how it scaled its vector search infrastructure 10x while reducing costs by 90% over two years — a story that starts with a launch crunch and ends with a calmer, more efficient architecture.

This article breaks down what they changed, why it worked, and what you can borrow if you’re building RAG or semantic search at enterprise scale.

Why vector search matters (and why it gets pricey)

Traditional keyword search matches the words you type. Vector search matches the meaning, by embedding text into a high-dimensional vector space. That’s why AI Q&A can answer “team meeting notes” even when the content is titled “group stand-up summary”.
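At its core, "matching meaning" just means comparing vectors. Below is a minimal sketch of similarity-based retrieval using toy 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions, and the example values are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": similar meanings land close together in vector space.
query = [0.9, 0.1, 0.3]   # e.g. "team meeting notes"
doc_a = [0.8, 0.2, 0.4]   # e.g. "group stand-up summary"
doc_b = [0.1, 0.9, 0.1]   # e.g. an unrelated page

# doc_a scores higher than doc_b despite sharing no keywords with the query.
assert cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```

The retrieval layer ranks stored chunks by this kind of score against the embedded question, which is why the query itself must be embedded before any lookup can happen.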

Notion uses vector search as the retrieval layer for Notion AI, pulling relevant workspace content (and connected sources like Slack and Google Drive) before the model generates an answer.

The catch is that “semantic” comes with a bill: every chunk you embed must be stored and kept up to date, and every query has to embed the question before it can retrieve.

The early architecture: fast onboarding, then immediate scaling pain

When Notion AI Q&A launched in November 2023, they used a dual ingestion pipeline:

  • an offline batch path on Apache Spark to chunk documents, generate embeddings via API, and bulk-load vectors

  • an online path using Kafka consumers to process page edits in near real time

They also used a multi-tenant sharding approach that routed workspaces to indices using workspace ID.
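Notion hasn't published the routing code, but hash-based tenant routing typically looks something like this sketch (the shard count and index naming are hypothetical):

```python
import hashlib

NUM_INDICES = 16  # hypothetical shard count; Notion's actual layout isn't public

def index_for_workspace(workspace_id: str) -> str:
    """Deterministically route a workspace to one of N indices by hashing its ID."""
    digest = hashlib.sha256(workspace_id.encode()).hexdigest()
    shard = int(digest, 16) % NUM_INDICES
    return f"vector-index-{shard:02d}"
```

The same workspace always lands on the same index, which keeps reads and writes consistent. The downside is exactly what Notion hit next: changing `NUM_INDICES` moves existing tenants, so growing capacity means re-sharding.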

The problem appeared quickly: within a month, their original indices were close to capacity. Re-sharding would have slowed onboarding, and over-provisioning was expensive because their provider charged for uptime.

Notion’s workaround was pragmatic: instead of reshaping existing indices, they created new generations of indices and routed new workspaces to the new generation, keeping reads and writes directed by a “generation ID”. It avoided repeated re-shard operations and kept onboarding moving.
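The generation scheme can be sketched as a small routing table: each workspace is pinned to the generation it was onboarded into, so existing tenants never move (the table structure and names here are illustrative, not Notion's actual implementation):

```python
# Hypothetical generation table: workspace_id -> generation ID.
workspace_generation: dict[str, int] = {}
current_generation = 1

def onboard(workspace_id: str) -> None:
    """Pin a new workspace to the current generation; existing pins never change."""
    workspace_generation.setdefault(workspace_id, current_generation)

def open_new_generation() -> None:
    """Called when the current generation's indices near capacity."""
    global current_generation
    current_generation += 1

def index_for(workspace_id: str) -> str:
    """Reads and writes resolve the generation first, then the index within it."""
    gen = workspace_generation[workspace_id]
    return f"gen{gen}-index-for-{workspace_id}"
```

Capacity grows by opening a new generation instead of reshaping old indices, at the cost of an extra lookup hop on every read and write.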

The result was hypergrowth capacity: daily onboarding capacity grew 600x from a few hundred workspaces per day, clearing the multi-million waitlist by April 2024.

How Notion cut costs: the three biggest levers

Notion’s cost reduction wasn’t one magic trick. It was a sequence of improvements that removed the biggest structural waste.

1) Move to serverless indices that decouple storage and compute

In May 2024, Notion migrated its embeddings workload from dedicated “pod” clusters (coupled compute and storage, charged by uptime) to a serverless architecture that charges by usage.

They report an immediate 50% cost reduction relative to peak usage, plus operational benefits: no hard storage-capacity planning and fewer manual provisioning chores.

2) Migrate to turbopuffer (object-storage-native search)

In parallel, they evaluated alternative engines and selected turbopuffer, built on object storage for cost efficiency.

The migration (late 2024 into early 2025) was also used as a clean-up moment:

  • full reindexing with higher write throughput

  • an embeddings model upgrade

  • simplified indexing (turbopuffer namespaces as independent indices — no sharding/generation routing)

  • gradual cutover, validating correctness generation by generation

Notion reports outcomes including:

  • 60% reduction in search engine spend

  • 35% reduction in AWS EMR compute costs

  • improved p50 query latency from 70–100ms to 50–70ms
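The simplification in the third bullet is worth dwelling on. With a namespace per workspace, routing collapses from "generation table, then shard hash" to a direct mapping; a sketch of what that might look like (the naming convention is an assumption, not turbopuffer's or Notion's documented scheme):

```python
def namespace_for(workspace_id: str) -> str:
    """One namespace per workspace: no shard math, no generation table."""
    return f"workspace-{workspace_id}"
```

Because each namespace is an independent index, capacity planning per tenant disappears along with the routing layers.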

3) Stop re-embedding the world when one character changes (Page State)

In July 2025, Notion tackled a core inefficiency: any change to a page previously triggered re-chunking, re-embedding, and re-uploading every span, even if only a tiny part changed.

Their Page State approach stores two hashes per span — one for span text, one for metadata — using 64-bit xxHash, and caches per-page span state in DynamoDB.

That enables two important optimisations:

  • if only some spans change, re-embed and reload only those spans

  • if only metadata (like permissions) changes, skip embedding entirely and issue a cheaper metadata update operation

Notion reports a 70% reduction in data volume, saving on embedding costs and vector DB write costs.
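The diffing logic behind Page State can be sketched as follows. Notion describes 64-bit xxHash and DynamoDB; this stand-in uses Python's stdlib `blake2b` (truncated to 8 bytes) and an in-memory dict, and the span/metadata shapes are invented for illustration:

```python
import hashlib
from dataclasses import dataclass

def h64(text: str) -> str:
    """Stand-in for the 64-bit xxHash Notion describes; any stable hash works here."""
    return hashlib.blake2b(text.encode(), digest_size=8).hexdigest()

@dataclass
class SpanState:
    text_hash: str
    meta_hash: str

def diff_spans(cached: dict[str, SpanState], spans: dict[str, tuple[str, str]]):
    """Compare incoming spans (id -> (text, metadata)) against cached state.

    Returns (re_embed, meta_only): span IDs needing full re-embedding vs a
    cheaper metadata-only update. The cache is updated in place.
    """
    re_embed, meta_only = [], []
    for span_id, (text, meta) in spans.items():
        new = SpanState(h64(text), h64(meta))
        old = cached.get(span_id)
        if old is None or old.text_hash != new.text_hash:
            re_embed.append(span_id)     # text changed: chunk + embed + upload
        elif old.meta_hash != new.meta_hash:
            meta_only.append(span_id)    # e.g. permissions changed: metadata patch
        cached[span_id] = new
    return re_embed, meta_only
```

Editing one character in one span now touches exactly one embedding, and a permissions change touches none, which is where the reported 70% data-volume reduction comes from.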

The next wave: moving embeddings from Spark to Ray

Notion’s later-stage work focuses on embeddings generation and serving, where costs can balloon and reliability can suffer if you rely heavily on third-party embedding APIs.

In July 2025, they began migrating near real-time embeddings to Ray on Anyscale. The motivations were practical:

  • eliminate a “double compute” pattern (Spark preprocessing on EMR plus per-token embedding API costs)

  • reduce dependency on external API stability

  • simplify pipelining

  • self-host open-source embedding models for faster iteration

Notion notes that this is still rolling out but expects a 90%+ reduction in embeddings infrastructure costs, with early results described as promising.
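The write-up doesn't include Ray code, but the shape of the consolidation is a single compute layer that batches text and runs the model in-process. Here is a minimal sketch using stdlib threads as a stand-in for Ray tasks, with a fake embedding function standing in for a self-hosted model:

```python
from concurrent.futures import ThreadPoolExecutor

def embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in for a self-hosted embedding model; in Notion's setup this would
    run on Ray workers instead of calling a per-token paid API."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]

def embed_all(texts: list[str], batch_size: int = 32, workers: int = 4):
    """Preprocess (batching) and inference in one layer: no separate Spark stage
    on EMR feeding an external API, hence no 'double compute'."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)  # order-preserving
    return [vec for batch in results for vec in batch]
```

The economic point is independent of the framework: once preprocessing and inference share one layer, you pay for GPU/CPU time once per chunk instead of paying EMR for preprocessing and an API per token.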

What you can learn from this

Notion’s story is useful because it’s not theoretical. It’s the sequence most teams experience once retrieval moves from prototype to production.

1) Separate “scale” problems from “cost” problems

Early on, Notion solved “don’t run out of space” with generation-based routing. Later, they solved unit economics with serverless indices, object-storage-native search, and reduced re-embedding.

2) Make freshness cheap

If your pipeline re-embeds unchanged text, you’ll pay forever. Page State style hashing and differential updates are one of the highest-leverage improvements for update-heavy products.

3) Treat migrations as simplification opportunities

Notion used provider changes to remove complexity (shards/generations) and upgrade models in the process.

4) Optimise the embedding pipeline separately

Vector DB spend is visible. Embedding generation can quietly rival it. Consolidating preprocessing and inference on a single compute layer (Ray) is one way to take cost and reliability back under your control.

Where Generation Digital helps

If you’re implementing AI features at scale, retrieval quality and unit economics are the difference between a promising pilot and a sustainable product.

Generation Digital supports teams with:

  • RAG and enterprise search architecture reviews

  • governance and security guardrails for AI features

  • operational metrics that link quality, latency and cost to business outcomes

Summary

Notion’s 10x scale and 90% cost reduction came from redesigning the whole retrieval stack: serverless indices that decouple storage and compute, a migration to turbopuffer’s object-storage-native search, a Page State system that avoids re-embedding unchanged text, and a move towards Ray/Anyscale for embeddings. The lesson is simple: at scale, the cheapest vector is the one you don’t regenerate.

Next steps

  1. Audit your vector costs: DB, embedding generation, and indexing churn.

  2. Implement differential indexing (hashing + partial updates) before you scale.

  3. Revisit your vector store economics (serverless vs provisioned vs object storage).

  4. If you want help designing a scalable retrieval stack, contact Generation Digital.

FAQs

Q1: How did Notion achieve 10x scale in vector search?
A: Notion scaled by improving onboarding throughput, routing new workspaces to new index “generations” when capacity filled, and later simplifying the architecture by moving to turbopuffer namespaces without sharding or generation routing.

Q2: What cost reduction did Notion achieve?
A: Notion reports cutting vector search costs by 90% overall over two years. The steps included a serverless migration with a 50% reduction, a turbopuffer migration with a 60% reduction in search engine spend, and a Page State optimisation that reduced indexed data volume by 70%.

Q3: Why is vector search important for Notion?
A: Vector search enables semantic retrieval, helping Notion AI find relevant content by meaning rather than exact keywords — which improves AI Q&A and enterprise search experiences.

Q4: What is the Page State Project?
A: It’s Notion’s optimisation that stores per-span hashes for text and metadata, allowing the pipeline to re-embed only changed spans and to update metadata without re-embedding, cutting data volume and write costs.

Q5: Why move embeddings from Spark to Ray?
A: Notion cites avoiding “double compute”, improving reliability, simplifying pipelining, and enabling self-hosted open-source embedding models; they expect a 90%+ reduction in embeddings infrastructure costs.

Generation
Digital

UK Office

Generation Digital Ltd
33 Queen St,
London
EC4R 1AP
United Kingdom

Canada Office

Generation Digital Americas Inc
181 Bay St., Suite 1800
Toronto, ON, M5J 2T9
Canada

USA Office

Generation Digital Americas Inc
77 Sands St,
Brooklyn, NY 11201,
United States

EU Office

Generation Digital Software
Elgee Building
Dundalk
A91 X2R3
Ireland

Middle East Office

6994 Alsharq 3890,
An Narjis,
Riyadh 13343,
Saudi Arabia

UK Fast Growth Index UBS Logo
Financial Times FT 1000 Logo
Febe Growth 100 Logo (Background Removed)

Company No: 256 9431 77 | Copyright 2026 | Terms and Conditions | Privacy Policy
