
03 INDEXING

TL;DR


LLM x RAG systems have 2 sources of memory:

  • Parametric: learned during initial LLM training.

  • Non-parametric: info stored in our KB

    • Indexing pipeline: creates KB (THIS DOC)

    • Generation pipeline: retrieves from KB

Indexing Pipeline steps to create the KB:

  1. #1-Load information from source systems

  2. #2-Chunk into smaller pieces

  3. #3-Embed into vectors to enable similarity search

  4. #4-Store in a vector DB
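The four steps above can be sketched end to end. This is a minimal toy (all function names, the character-chunking, and the stand-in embedding are hypothetical — a real pipeline calls an embedding model and a vector DB):

```python
# Minimal sketch of the four-step indexing pipeline (names are illustrative).

def load(source: list[str]) -> list[str]:
    """Step 1: pretend these raw strings came from a source system."""
    return [doc.strip() for doc in source]

def chunk(doc: str, size: int = 20) -> list[str]:
    """Step 2: naive fixed-size chunking by character count."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def embed(text: str) -> list[float]:
    """Step 3: stand-in embedding -- real systems call an embedding model."""
    return [float(ord(c)) for c in text[:4]]

store: dict[int, tuple[str, list[float]]] = {}

def index(docs: list[str]) -> None:
    """Step 4: store (chunk, vector) pairs in an in-memory 'vector DB'."""
    i = 0
    for doc in load(docs):
        for piece in chunk(doc):
            store[i] = (piece, embed(piece))
            i += 1

index(["Retrieval-augmented generation pairs an LLM with a knowledge base."])
```

The generation pipeline would later embed the user's query the same way and look up the nearest stored vectors.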

Indexing Pipeline


1-Load


steps:

  1. Connect to source

  2. Extract & parse

  3. Metadata review

  4. Transform and clean

Source data can live in many formats and systems: Markdown files / data lakes / data warehouses / the internet
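The load steps can be sketched as one function (record keys like `body` and `source` are assumptions, not a real connector API):

```python
# Hypothetical load step: extract text, attach metadata, transform & clean.

def load_documents(raw_records: list[dict]) -> list[dict]:
    docs = []
    for rec in raw_records:
        text = rec.get("body", "")                        # extract & parse
        meta = {"source": rec.get("source", "unknown")}   # metadata review
        cleaned = " ".join(text.split())                  # clean whitespace
        if cleaned:                                       # drop empty records
            docs.append({"text": cleaned, "meta": meta})
    return docs

docs = load_documents([{"body": "  Hello\n\nworld ", "source": "wiki"}])
```

Keeping the metadata alongside the text pays off later: it can be stored with each chunk and used to filter retrieval.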

2-Chunk


steps:

  1. Divide long text ⟶ compact units

  2. Merge units ⟶ larger chunks

  3. Overlap chunks to maintain context continuity
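A sketch of steps 1–3 with fixed-size character windows and overlap (the sizes are illustrative; production systems usually count tokens, not characters):

```python
# Fixed-size chunking with overlap: consecutive chunks share `overlap`
# characters so context is not cut off mid-thought at a boundary.

def chunk_with_overlap(text: str, size: int = 10, overlap: int = 3) -> list[str]:
    step = size - overlap  # advance less than `size` to create the overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_with_overlap("abcdefghijklmnop", size=10, overlap=3)
# The tail of each chunk repeats as the head of the next one.
```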


advantages | why chunking helps LLMs:

  • Context window: LLMs ignore content beyond token limit.

  • Lost-in-the-middle problem: LLMs struggle with relevant info in middle of prompts.

  • Search: LLMs struggle when searching over large text.


methods:

  • Fixed-size: split at a fixed count of some unit (eg characters / tokens / sentences / paragraphs)

  • Specialized: based on file structure (eg HTML heading tags / key-value pairs)

  • Semantic: sentence groups are based on semantic similarity

method considerations: Nature of source content / use case / embedding model

3-Embed

(See 00-Building-Blocks#Embedding for context.)


steps:

  1. Convert chunks ⟶ numerical vectors

  2. Normalize

  3. Validate vector quality
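These three steps can be sketched with a toy embedding (the `toy_embed` function is a placeholder for a real embedding model; normalization and validation are real, standard steps):

```python
import math

# Embed step sketch: convert -> normalize -> validate.

def toy_embed(text: str, dim: int = 4) -> list[float]:
    """Hypothetical embedding: character-code sums per bucket. NOT semantic --
    a real system calls a model such as a sentence transformer here."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    return vec

def normalize(vec: list[float]) -> list[float]:
    """L2-normalize so cosine similarity reduces to a plain dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def validate(vec: list[float]) -> bool:
    """Basic quality check: all values finite and vector has unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return all(math.isfinite(x) for x in vec) and abs(length - 1.0) < 1e-9

v = normalize(toy_embed("hello chunk"))
```

Normalizing up front is a common design choice: it lets the vector DB use the cheaper dot product at query time while returning the same ranking as cosine similarity.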


advantages | why embeddings help LLMs:

  • Semantics: better than just keywords

  • Vector similarity: rank docs by contextual relevance, send best to LLM (via cosine similarity or dot product)

  • Scalability: turns text into numeric vectors, making search and comparison fast.

  • Cross-modal alignment: compare text / images / etc under a shared representation space.
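The vector-similarity point above is easy to see concretely. Cosine similarity compares the direction of two vectors regardless of their length (the 2-d vectors below are made-up toys, not real embeddings):

```python
import math

# Rank candidate chunks against a query by cosine similarity.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query = [1.0, 0.0]
chunks = {"about cats": [0.9, 0.1], "about stocks": [0.1, 0.9]}
ranked = sorted(chunks, key=lambda k: cosine(query, chunks[k]), reverse=True)
# The chunk whose vector points closest to the query's direction ranks first.
```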


HF MTEB (Massive Text Embedding Benchmark) Leaderboard

| Embedding algo | Team | Note |
| --- | --- | --- |
| Word2Vec | Google | shallow NN |
| GloVe | Stanford | unsupervised learning |
| FastText | Meta | shallow NN, extends Word2Vec |
| ELMo | Allen Institute | for Q&A and sentiment |
| BERT (Transformer) | Google | contextualized word embeddings via bidirectional context |

4-Store


Non-vector DB types: Relational, NoSQL, Graph

Vector DBs:

  • store & retrieve vector data

  • index & store vector embeddings for semantic search & retrieval


vector DB categories & providers:

| Category | Core focus | Traditional DB features | Providers |
| --- | --- | --- | --- |
| Vector indexes | Index & search | N | FAISS, NMSLIB, ANNOY, ScaNN |
| Specialized vector DBs | Index & search | Y | Pinecone, ChromaDB, Milvus, Qdrant, Weaviate, Vald |
| Search platforms | Full-text search & vector similarity search | Y | Solr, Elasticsearch, OpenSearch, Apache Lucene |
| SQL databases | Add-on vector capability | Y | Azure SQL, Postgres, SingleStore, CloudSQL |
| NoSQL databases | Add-on vector capability | Y | MongoDB |
| Graph databases | Add-on vector capability | Y | Neo4j |
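At small scale, what every store in the table does can be sketched as brute-force nearest-neighbor search (the class below is a toy; real vector DBs use approximate indexes such as HNSW or IVF to stay fast at millions of vectors):

```python
import math

# Toy in-memory vector store with brute-force top-k similarity search.

class TinyVectorStore:
    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vec: list[float]) -> None:
        self.items.append((text, vec))

    def search(self, query: list[float], k: int = 2) -> list[str]:
        def cos(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self.items, key=lambda it: cos(query, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

db = TinyVectorStore()
db.add("dogs", [1.0, 0.0])
db.add("finance", [0.0, 1.0])
top = db.search([0.9, 0.1], k=1)
```

The accuracy-vs-speed trade-off below is exactly the choice between this exact brute-force scan and an approximate index.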


vector DB choice considerations:

  • Accuracy vs. speed

  • Flexibility vs. performance: customizations add overhead

  • Local vs. cloud storage: local (storage speed, access) vs cloud (security, redundancy, scalability)

  • Direct access vs. API: need tight control via direct libraries? or are ease-of-use abstractions like APIs better?

  • Advanced features: do we need extras such as metadata filtering or hybrid (keyword + vector) search?

  • Cost