03 INDEXING
TL;DR
Sam
LLM x RAG systems have 2 sources of memory:
- Parametric: learned during initial LLM training.
- Non-parametric: info stored in our KB
  - Indexing pipeline: creates the KB (THIS DOC)
  - Generation pipeline: retrieves from the KB

Indexing pipeline steps to create the KB:
- #1-Load information from source systems
- #2-Chunk into smaller pieces
- #3-Embed into vectors to enable similarity search
- #4-Store in a vector DB
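The four steps above can be sketched end-to-end as a toy pipeline. Every helper here is a simplified stand-in (a letter-frequency "embedding", an in-memory list as the vector DB), not a real library API:

```python
def load(sources):
    # Step 1: a real loader would connect to files, DBs, or APIs;
    # here we just accept raw strings.
    return list(sources)

def chunk(text, size=50):
    # Step 2: naive fixed-size chunking by character count.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text):
    # Step 3: stand-in "embedding" -- a 26-dim letter-frequency vector.
    # A real system would call an embedding model instead.
    vec = [0.0] * 26
    for ch in chunk_text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def index(sources):
    # Step 4: "store" each (chunk, vector) pair in an in-memory list,
    # standing in for a vector DB.
    kb = []
    for doc in load(sources):
        for piece in chunk(doc):
            kb.append((piece, embed(piece)))
    return kb
```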

1-Load
steps:
- Connect to source
- Extract & parse
- Metadata review
- Transform and clean

Source data could be in many formats: markdown / data lakes / data warehouses / internet
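A minimal sketch of the extract/parse/metadata/clean steps for one markdown source; the `load_markdown` helper and its output shape are hypothetical, not from any loader library:

```python
import re

def load_markdown(raw, source):
    # Extract & parse: pull the first heading as a title (metadata review).
    lines = raw.splitlines()
    title = next((ln.lstrip("#").strip() for ln in lines if ln.startswith("#")), None)
    # Transform & clean: drop heading markers, collapse whitespace.
    body = re.sub(r"(?m)^#+\s*", "", raw)
    body = re.sub(r"\s+", " ", body).strip()
    return {"source": source, "title": title, "text": body}
```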
2-Chunk
steps:
- Divide long text ⟶ compact units
- Merge units ⟶ larger chunks
- Overlap chunks to maintain context continuity
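The divide/merge/overlap steps can be sketched as a sliding window; sizes here are in characters for simplicity, though token counts are more common in practice:

```python
def chunk_with_overlap(text, size=100, overlap=20):
    # Slide a window of `size` characters, stepping by size - overlap so
    # each chunk shares its tail with the head of the next one.
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last window already reached the end
    return chunks
```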
advantages | why chunking helps LLMs:
- Context window: LLMs ignore content beyond the token limit.
- Lost-in-the-middle problem: LLMs struggle with relevant info in the middle of prompts.
- Search: LLMs struggle when searching over large text.
methods:
- Fixed-size: split at a fixed count of units (eg characters / tokens / sentences / paragraphs)
- Specialized: based on file structure (eg h tags / key-value pairs)
- Semantic: sentence groups are based on semantic similarity

method considerations: nature of source content / use case / embedding model
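As an example of specialized (structure-based) chunking, a markdown file can be split at its heading lines so each chunk maps to one section; a minimal sketch:

```python
import re

def chunk_by_headings(md_text):
    # Split before every line that starts with one or more "#" characters,
    # so each chunk is a heading plus its section body.
    sections = re.split(r"(?m)^(?=#+ )", md_text)
    return [s.strip() for s in sections if s.strip()]
```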
3-Embed
(See 00-Building-Blocks#Embedding for context.)
steps:
- Convert chunks ⟶ numerical vectors
- Normalize
- Validate vector quality
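The normalize step usually means L2-normalizing each vector so that a dot product between unit vectors equals their cosine similarity; a sketch:

```python
import math

def l2_normalize(vec):
    # Scale the vector to unit length; after this, dot product == cosine.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # zero vector: nothing to scale
    return [x / norm for x in vec]
```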
advantages | why embeddings help LLMs:
- Semantics: better than just keywords
- Vector similarity: rank docs by context relevance, send the best to the LLM (via cosine similarity or dot-product distance)
- Scalability: turns text into numeric vectors, making search and comparison fast.
- Cross-modal alignment: compare text / images / etc under a shared representation space.
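Ranking docs by cosine similarity, as mentioned above, reduces to a dot product scaled by vector magnitudes; a small sketch:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_by_similarity(query_vec, doc_vecs):
    # Return document indices ordered best-match first.
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine(query_vec, doc_vecs[i]),
                  reverse=True)
```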
| Embedding algos | Team | Note |
|---|---|---|
| Word2Vec | Google | shallow NN |
| GloVe | Stanford | unsupervised learning |
| FastText | Meta | shallow NN, extends Word2Vec |
| ELMo | Allen Institute | for Q&A and sentiment |
| BERT (Transformer) | Google | contextualized word embeddings via bidirectional attention |
4-Store
Non-vector DB types: Relational, NoSQL, Graph

Vector DBs:
- store & retrieve vector data
- index & store vector embeddings for semantic search & retrieval
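A vector DB's core contract (store vectors, answer nearest-neighbor queries) can be sketched as a brute-force in-memory class; real systems replace the full scan with ANN indexes such as HNSW or IVF. The class and method names here are hypothetical, not any real product's API:

```python
import math

class ToyVectorStore:
    # Hypothetical stand-in for a vector DB.
    def __init__(self):
        self._items = []  # list of (id, vector) pairs

    def add(self, item_id, vector):
        self._items.append((item_id, list(vector)))

    def search(self, query, k=1):
        # Brute-force scan: score every stored vector by cosine similarity.
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a)) or 1.0
            nb = math.sqrt(sum(y * y for y in b)) or 1.0
            return dot / (na * nb)
        ranked = sorted(self._items, key=lambda it: cos(query, it[1]), reverse=True)
        return [item_id for item_id, _ in ranked[:k]]
```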
vector DB categories & providers:

| category | core focus | traditional features | providers |
|---|---|---|---|
| Vector indexes | index & search | N | FAISS, NMSLIB, ANNOY, ScaNN |
| Specialized vector DBs | index & search | Y | Pinecone, ChromaDB, Milvus, Qdrant, Weaviate, Vald |
| Search platforms | full-text search & vector similarity search | Y | Solr, Elasticsearch, OpenSearch, Apache Lucene |
| SQL databases | add-on vector capability | Y | Azure SQL, Postgres, SingleStore, CloudSQL |
| NoSQL databases | add-on vector capability | Y | MongoDB |
| Graph databases | add-on vector capability | Y | Neo4j |
vector DB choice considerations:
- Accuracy vs. speed
- Flexibility vs. performance: customizations add overhead
- Local vs. cloud storage: local (storage speed, access) vs cloud (security, redundancy, scalability)
- Direct access vs. API: need tight control via direct libraries? or are ease-of-use abstractions like APIs better?
- Advanced features: how advanced do we need to be?
- Cost