Notion Vector Search: 10x Scale at 1/10th the Cost

Notion

9 October 2023


Notion scaled its vector search infrastructure 10x over two years while reducing costs by 90% by redesigning both indexing and storage. Key changes included serverless indices that decouple storage from compute, a migration to turbopuffer’s object-storage architecture, a Page State system that avoids re-embedding unchanged text, and a move from Spark to Ray for embeddings.

Vector search has quietly become one of the most expensive parts of AI features in production.

It isn’t just the vector database bill. It’s the ingestion pipeline, the embedding generation, the churn from “small edits”, and the operational load of keeping indices fresh across millions of tenants.

In a technical write-up published 19 February 2026, Notion shared how it scaled its vector search infrastructure 10x while reducing costs by 90% over two years — a story that starts with a launch crunch and ends with a calmer, more efficient architecture.

This article breaks down what they changed, why it worked, and what you can borrow if you’re building RAG or semantic search at enterprise scale.

Why vector search matters (and why it gets pricey)

Traditional keyword search matches the words you type. Vector search matches the meaning, by embedding text into a high-dimensional vector space. That’s why AI Q&A can answer “team meeting notes” even when the content is titled “group stand-up summary”.
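The "meaning, not words" idea comes down to comparing vectors, usually with cosine similarity. A minimal sketch, using toy hand-written vectors in place of a real embedding model's output (in production these would come from an embedding API or a self-hosted model):

```python
import math

# Toy 4-dimensional "embeddings" standing in for a real model's output.
EMBEDDINGS = {
    "team meeting notes":     [0.9, 0.8, 0.1, 0.0],
    "group stand-up summary": [0.85, 0.75, 0.15, 0.05],
    "quarterly sales report": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A query about "team meeting notes" ranks the stand-up summary highest,
# even though the two strings share no keywords.
query_vec = EMBEDDINGS["team meeting notes"]
scores = {doc: cosine_similarity(query_vec, vec)
          for doc, vec in EMBEDDINGS.items() if doc != "team meeting notes"}
best_match = max(scores, key=scores.get)
```

Keyword matching would score "group stand-up summary" at zero for that query; the vector comparison ranks it first, which is the whole point of the semantic layer.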

Notion uses vector search as the retrieval layer for Notion AI, pulling relevant workspace content (and connected sources like Slack and Google Drive) before the model generates an answer.

The catch is that “semantic” comes with a bill: every chunk you embed must be stored and kept up to date, and every query has to embed the question before it can retrieve.

The early architecture: fast onboarding, then immediate scaling pain

When Notion AI Q&A launched in November 2023, they used a dual ingestion pipeline:

  • an offline batch path on Apache Spark to chunk documents, generate embeddings via API, and bulk-load vectors

  • an online path using Kafka consumers to process page edits in near real time

They also used a multi-tenant sharding approach that routed workspaces to indices using workspace ID.

The problem appeared quickly: within a month, their original indices were close to capacity. Re-sharding would have slowed onboarding, and over-provisioning was expensive because their provider charged for uptime.

Notion’s workaround was pragmatic: instead of reshaping existing indices, they created new generations of indices and routed new workspaces to the new generation, keeping reads and writes directed by a “generation ID”. It avoided repeated re-shard operations and kept onboarding moving.
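The generation trick is essentially append-only routing: existing workspaces keep their assigned generation forever, and only new workspaces land on the newest one. A minimal sketch of that pattern (the capacity constant and class name are illustrative, not Notion's actual implementation):

```python
from dataclasses import dataclass, field

CAPACITY_PER_GENERATION = 3  # tiny limit for illustration; real limits track index size

@dataclass
class GenerationRouter:
    """Routes new workspaces to the newest index generation, opening a
    fresh generation when the current one fills, instead of re-sharding."""
    assignments: dict = field(default_factory=dict)  # workspace_id -> generation_id
    current_generation: int = 0
    current_count: int = 0

    def assign(self, workspace_id: str) -> int:
        # Existing workspaces stay put: reads and writes keep going to the
        # generation they were originally assigned to.
        if workspace_id in self.assignments:
            return self.assignments[workspace_id]
        # Current generation full? Open a new one rather than reshaping old indices.
        if self.current_count >= CAPACITY_PER_GENERATION:
            self.current_generation += 1
            self.current_count = 0
        self.assignments[workspace_id] = self.current_generation
        self.current_count += 1
        return self.current_generation

router = GenerationRouter()
generations = [router.assign(f"ws{i}") for i in range(4)]  # 4th workspace opens gen 1
```

The trade-off is extra routing state (every query must look up its generation), which is exactly the complexity the later turbopuffer migration removed.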

The result was hypergrowth capacity: daily onboarding capacity grew 600x from a few hundred workspaces per day, clearing the multi-million-workspace waitlist by April 2024.

How Notion cut costs: the three biggest levers

Notion’s cost reduction wasn’t one magic trick. It was a sequence of improvements that removed the biggest structural waste.

1) Move to serverless indices that decouple storage and compute

In May 2024, Notion migrated its embeddings workload from dedicated “pod” clusters (coupled compute and storage, charged by uptime) to a serverless architecture that charges by usage.

They report an immediate 50% cost reduction from peak usage, plus operational benefits: no hard storage capacity planning and fewer manual provisioning chores.

2) Migrate to turbopuffer (object-storage-native search)

In parallel, they evaluated alternative engines and selected turbopuffer, built on object storage for cost efficiency.

The migration (late 2024 into early 2025) was also used as a clean-up moment:

  • full reindexing with higher write throughput

  • an embeddings model upgrade

  • simplified indexing (turbopuffer namespaces as independent indices — no sharding/generation routing)

  • gradual cutover, validating correctness generation by generation

Notion reports outcomes including:

  • 60% reduction in search engine spend

  • 35% reduction in AWS EMR compute costs

  • improved p50 query latency from 70–100ms to 50–70ms

3) Stop re-embedding the world when one character changes (Page State)

In July 2025, Notion tackled a core inefficiency: any change to a page previously triggered re-chunking, re-embedding, and re-uploading every span, even if only a tiny part changed.

Their Page State approach stores two hashes per span — one for span text, one for metadata — using 64-bit xxHash, and caches per-page span state in DynamoDB.

That enables two important optimisations:

  • if only some spans change, re-embed and reload only those spans

  • if only metadata (like permissions) changes, skip embedding entirely and issue a cheaper metadata update operation
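The diffing logic above can be sketched as follows. This is an illustrative reconstruction, not Notion's code: the stdlib `blake2b` with an 8-byte digest stands in for the 64-bit xxHash they describe, and the `cache` dict stands in for the per-page span state they keep in DynamoDB:

```python
import hashlib

def span_hash(value: str) -> str:
    # 8-byte (64-bit) digest as a stand-in for 64-bit xxHash.
    return hashlib.blake2b(value.encode(), digest_size=8).hexdigest()

def diff_spans(cache: dict, current_spans: list) -> tuple:
    """Compare cached (text_hash, metadata_hash) pairs against the current
    page, deciding per span: re-embed, metadata-only update, or skip."""
    re_embed, metadata_only = [], []
    for span_id, text, metadata in current_spans:
        text_h, meta_h = span_hash(text), span_hash(metadata)
        old = cache.get(span_id)
        if old is None or old[0] != text_h:
            re_embed.append(span_id)       # text changed: chunk + embed + upsert
        elif old[1] != meta_h:
            metadata_only.append(span_id)  # e.g. permissions changed: cheap metadata patch
        cache[span_id] = (text_h, meta_h)  # refresh cached state
    return re_embed, metadata_only

cache = {}
page_v1 = [("s1", "hello", "perm:a"), ("s2", "world", "perm:a")]
diff_spans(cache, page_v1)  # first pass: everything needs embedding
# Only s1's metadata changes; no text changed, so nothing is re-embedded.
page_v2 = [("s1", "hello", "perm:b"), ("s2", "world", "perm:a")]
changed, meta_updates = diff_spans(cache, page_v2)
```

Two hashes per span is the key design choice: a single combined hash could detect *that* something changed, but not *whether* the change is cheap (metadata) or expensive (text).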

Notion reports a 70% reduction in data volume, saving on embedding costs and vector DB write costs.

The next wave: moving embeddings from Spark to Ray

Notion’s later-stage work focuses on embeddings generation and serving, where costs can balloon and reliability can suffer if you rely heavily on third-party embedding APIs.

In July 2025, they began migrating near real-time embeddings to Ray on Anyscale. The motivations were practical:

  • eliminate a “double compute” pattern (Spark preprocessing on EMR plus per-token embedding API costs)

  • reduce dependency on external API stability

  • simplify pipelining

  • self-host open-source embedding models for faster iteration
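The "double compute" fix amounts to keeping a chunk inside one worker from preprocessing through inference, rather than preprocessing on Spark and then paying a second time per token at an external API. A plain-Python sketch of that single-pass shape (Ray would distribute `process_document` across workers; `fake_embed` is a placeholder for a self-hosted embedding model, and the naive chunker is illustrative):

```python
def chunk(text: str, size: int = 20) -> list:
    """Naive fixed-size chunker standing in for real preprocessing."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def fake_embed(batch: list) -> list:
    """Placeholder for a self-hosted embedding model; returns toy 2-d vectors."""
    return [[len(c) / 100.0, c.count(" ") / 10.0] for c in batch]

def process_document(text: str) -> list:
    # Chunking and embedding run in the same worker process, so a chunk
    # never makes a second hop to an external API between the two steps.
    chunks = chunk(text)
    vectors = fake_embed(chunks)
    return list(zip(chunks, vectors))

pairs = process_document("a" * 45)  # 45 chars -> chunks of 20, 20, 5
```

Beyond cost, co-locating both steps removes a failure mode: an external embedding API outage stalls the whole freshness pipeline, whereas a self-hosted model scales with the same workers doing the preprocessing.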

Notion notes that this is still rolling out but expects a 90%+ reduction in embeddings infrastructure costs, with early results described as promising.

What you can learn from this

Notion’s story is useful because it’s not theoretical. It’s the sequence most teams experience once retrieval moves from prototype to production.

1) Separate “scale” problems from “cost” problems

Early on, Notion solved “don’t run out of space” with generation-based routing. Later, they solved unit economics with serverless indices, object-storage-native search, and reduced re-embedding.

2) Make freshness cheap

If your pipeline re-embeds unchanged text, you’ll pay for it forever. Page State-style hashing and differential updates are among the highest-leverage improvements for update-heavy products.

3) Treat migrations as simplification opportunities

Notion used provider changes to remove complexity (shards/generations) and upgrade models in the process.

4) Optimise the embedding pipeline separately

Vector DB spend is visible. Embedding generation can quietly rival it. Consolidating preprocessing and inference on a single compute layer (Ray) is one way to take cost and reliability back under your control.

Where Generation Digital helps

If you’re implementing AI features at scale, retrieval quality and unit economics are the difference between a promising pilot and a sustainable product.

Generation Digital supports teams with:

  • RAG and enterprise search architecture reviews

  • governance and security guardrails for AI features

  • operational metrics that link quality, latency and cost to business outcomes

Summary

Notion’s 10x scale and 90% cost reduction came from redesigning the whole retrieval stack: serverless indices that decouple storage and compute, a migration to turbopuffer’s object-storage-native search, a Page State system that avoids re-embedding unchanged text, and a move towards Ray/Anyscale for embeddings. The lesson is simple: at scale, the cheapest vector is the one you don’t regenerate.

Next steps

  1. Audit your vector costs: DB, embedding generation, and indexing churn.

  2. Implement differential indexing (hashing + partial updates) before you scale.

  3. Revisit your vector store economics (serverless vs provisioned vs object storage).

  4. If you want help designing a scalable retrieval stack, contact Generation Digital.

FAQs

Q1: How did Notion achieve a 10x scale in vector search?
A: Notion scaled by improving onboarding throughput, routing new workspaces to new index “generations” when capacity filled, and later simplifying the architecture by moving to turbopuffer namespaces without sharding or generation routing.

Q2: What cost reduction did Notion achieve?
A: Notion reports cutting vector search costs by 90% overall over two years. The steps included a serverless migration with a 50% reduction, a turbopuffer migration with a 60% reduction in search engine spend, and a Page State optimisation that reduced indexed data volume by 70%.

Q3: Why is vector search important for Notion?
A: Vector search enables semantic retrieval, helping Notion AI find relevant content by meaning rather than exact keywords — which improves AI Q&A and enterprise search experiences.

Q4: What is the Page State Project?
A: It’s Notion’s optimisation that stores per-span hashes for text and metadata, allowing the pipeline to re-embed only changed spans and to update metadata without re-embedding, cutting data volume and write costs.

Q5: Why move embeddings from Spark to Ray?
A: Notion cites avoiding “double compute”, improving reliability, simplifying pipelining, and enabling self-hosted open-source embedding models; they expect a 90%+ reduction in embeddings infrastructure costs.
