CHAPTER 10Advanced ~90 min

RAG and Vector Databases

AI agents that work with your own data: document chunking, embeddings, vector search and cited answers.

In this chapter

RAG (Retrieval-Augmented Generation) is the technique of injecting external knowledge into an LLM so it can answer organisation-specific questions correctly. A bare LLM has two issues: (1) it doesn't know anything after its training cutoff, (2) it has never seen your internal PDFs, contracts or product docs. RAG closes the gap: first retrieve the 3-5 chunks closest to the question from a vector DB, then tell the LLM to 'answer only based on these sources.' In this chapter you'll build a RAG assistant from a PDF using n8n's Vector Store, Embeddings, Document Loader and Text Splitter nodes.

Topics

What embeddings are and which model to pick
Vector DB options: Pinecone, Qdrant, Supabase pgvector
Document Loader: PDF, web, Drive ingest
Text Splitter chunking strategies
Retrieve-and-Generate (RAG) flow
Citing sources and reducing hallucinations

RAG's two legs: Ingestion and Retrieval

Every RAG system has two separate workflows. Ingestion: PDF/Drive/web page → split into chunks → generate an embedding per chunk → write to vector DB. This only runs when a new document arrives. Retrieval: user question arrives → embed the question → find the top-N closest chunks in the vector DB → pass them to the LLM as context → produce a cited answer. In n8n these are usually two separate workflows sharing the same vector DB.

Drive Trigger

PDF Loader

Text Splitter

Embeddings OpenAI

Vector Store Insert

Embeddings: the engine that turns text into numbers

An embedding turns a piece of text into a 1536 (or 3072) dimensional vector of numbers. Semantically close texts produce close vectors — that is all the magic of RAG. Practical choices: OpenAI text-embedding-3-small (cheap, fast, fits most cases), text-embedding-3-large (more accurate, more expensive), Cohere embed-multilingual-v3 (great for non-English), Ollama nomic-embed-text (local, free, private). Critical: ingestion and retrieval MUST use the same model; otherwise vectors aren't comparable.

Vector DB choice: Pinecone, Qdrant, Supabase pgvector

Pinecone: managed SaaS, zero setup, low latency at millions of vectors, the free tier covers small projects. Qdrant: open source, cloud or self-host, very strong metadata filtering ('search only docs from the last 30 days'). Supabase pgvector: if you already use Postgres, no new tool is needed; ideal for small-to-mid projects. Rule of thumb: prototype on Supabase pgvector, switch to Pinecone or Qdrant in production at million+ vectors.

Document Loader: PDF, web, Drive ingestion

n8n's 'Default Data Loader' pulls file/text content into the RAG flow. Common sources: PDF (Drive/S3/local), Google Docs (via Docs node), web pages (HTTP Request + HTML Extract), Notion (via Notion node). For PDF use n8n's PDF Loader, or fetch via HTTP and convert to text with pdf-parse in a Code node. Recommended order: file → text → clean (extra \n\n, page numbers) → then splitter.

Text Splitter: chunking strategies

You can't embed a whole PDF at once — the model has token limits and search becomes too vague. You break the document into 'chunks.' Strategy 1 — Recursive Character Splitter: splits at paragraph → sentence → character (the default, most common). Strategy 2 — Token Splitter: splits by model token count (token-aware flows). Practical values: chunk size 800-1200 tokens, overlap 100-200 tokens (without overlap context across boundaries is lost). Too-small chunks: not enough context. Too-large chunks: bad retrieval precision. Tune per document type.

Retrieval flow: finding the closest chunks

The query workflow: Webhook receives the question → embed with the same embedding model → pass to Vector Store node in 'Retrieve' mode → pull top_k=5 nearest chunks → feed them as 'context' into the AI Agent's system prompt → agent answers. In n8n the 'Vector Store Retrieve as Tool' mode lets the agent itself query the vector DB as a tool — even more powerful because the agent can re-query when needed.

Webhook

Embeddings (Query)

Vector Store Retrieve

AI Agent

Respond

Citing sources and reducing hallucinations

RAG's biggest weakness: the model tends to answer from its 'own knowledge' rather than the retrieved source. Fixes: (1) write the system prompt to say 'answer only from <context>; if not present, say I don't know.' (2) Load each chunk with metadata like { source: 'contract-v3.pdf', page: 12 } and force the answer to print sources. (3) Fall back to 'no relevant information found' when the similarity score is low (e.g. < 0.7). (4) Bind to Structured Output { answer, sources[], confidence } — the model can't leave it empty if it can't say 'no source.'

Cost and speed: practical tips

Embedding a 1,000-page corporate archive with text-embedding-3-small costs ~$1-2; ~$0.0001 per retrieval. The biggest hidden cost is the LLM's reply: too large a top_k and you stuff 20-30K tokens into the prompt. Practical rule: top_k=5 and chunk_size=1,000 tokens are safe defaults. For speed, embed in batches (Split In Batches → 100 chunks at a time). To avoid re-ingesting the same document, write the chunk's hash into the metadata and check before ingest.

This chapter's workflow (n8n editor view)

Webhook

Vector Store Retrieve

AI Agent

Respond

Next chapter