
03 INDEXING

TL;DR


LLM x RAG systems have 2 sources of memory:

  • Parametric: learned during initial LLM training.

  • Non-parametric: info stored in our KB

    • Indexing pipeline: creates KB (THIS DOC)

    • Generation pipeline: retrieves from KB

Indexing Pipeline steps to create the KB:

  1. #1-Load information from source systems

  2. #2-Chunk into smaller pieces

  3. #3-Embed into vectors to enable similarity search

  4. #4-Store in a vector DB
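The four steps above can be sketched end to end. This is a minimal toy (all function names, the character-chunking, and the stand-in embedding are hypothetical — a real pipeline calls an embedding model and a vector DB):

```python
# Minimal sketch of the four-step indexing pipeline (names are illustrative).

def load(source: list[str]) -> list[str]:
    """Step 1: pretend these raw strings came from a source system."""
    return [doc.strip() for doc in source]

def chunk(doc: str, size: int = 20) -> list[str]:
    """Step 2: naive fixed-size chunking by character count."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def embed(text: str) -> list[float]:
    """Step 3: stand-in embedding -- real systems call an embedding model."""
    return [float(ord(c)) for c in text[:4]]

store: dict[int, tuple[str, list[float]]] = {}

def index(docs: list[str]) -> None:
    """Step 4: store (chunk, vector) pairs in an in-memory 'vector DB'."""
    i = 0
    for doc in load(docs):
        for piece in chunk(doc):
            store[i] = (piece, embed(piece))
            i += 1

index(["Retrieval-augmented generation pairs an LLM with a knowledge base."])
```

The generation pipeline would later embed the user's query the same way and look up the nearest stored vectors.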

Indexing Pipeline


1-Load


steps:

  1. Connect to source

  2. Extract & parse

  3. Metadata review

  4. Transform and clean

Source data can live in many formats and systems: Markdown files / data lakes / data warehouses / the internet
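The load steps can be sketched as one function (record keys like `body` and `source` are assumptions, not a real connector API):

```python
# Hypothetical load step: extract text, attach metadata, transform & clean.

def load_documents(raw_records: list[dict]) -> list[dict]:
    docs = []
    for rec in raw_records:
        text = rec.get("body", "")                        # extract & parse
        meta = {"source": rec.get("source", "unknown")}   # metadata review
        cleaned = " ".join(text.split())                  # clean whitespace
        if cleaned:                                       # drop empty records
            docs.append({"text": cleaned, "meta": meta})
    return docs

docs = load_documents([{"body": "  Hello\n\nworld ", "source": "wiki"}])
```

Keeping the metadata alongside the text pays off later: it can be stored with each chunk and used to filter retrieval.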

2-Chunk


steps:

  1. Divide long text ⟶ compact units

  2. Merge units ⟶ larger chunks

  3. Overlap chunks to maintain context continuity
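A sketch of steps 1–3 with fixed-size character windows and overlap (the sizes are illustrative; production systems usually count tokens, not characters):

```python
# Fixed-size chunking with overlap: consecutive chunks share `overlap`
# characters so context is not cut off mid-thought at a boundary.

def chunk_with_overlap(text: str, size: int = 10, overlap: int = 3) -> list[str]:
    step = size - overlap  # advance less than `size` to create the overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_with_overlap("abcdefghijklmnop", size=10, overlap=3)
# The tail of each chunk repeats as the head of the next one.
```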


advantages | why chunking helps LLMs:

  • Context window: LLMs ignore content beyond token limit.

  • Lost-in-the-middle problem: LLMs struggle with relevant info in middle of prompts.

  • Search: LLMs struggle when searching over large text.


methods:

  • Fixed-size: split at a fixed count of some unit (eg characters / tokens / sentences / paragraphs)

  • Specialized: based on file structure (eg HTML heading tags / key-value pairs)

  • Semantic: sentence groups are based on semantic similarity

method considerations: Nature of source content / use case / embedding model

3-Embed

(See 00-Building-Blocks#Embedding for context.)


steps:

  1. Convert chunks ⟶ numerical vectors

  2. Normalize

  3. Validate vector quality
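These three steps can be sketched with a toy embedding (the `toy_embed` function is a placeholder for a real embedding model; normalization and validation are real, standard steps):

```python
import math

# Embed step sketch: convert -> normalize -> validate.

def toy_embed(text: str, dim: int = 4) -> list[float]:
    """Hypothetical embedding: character-code sums per bucket. NOT semantic --
    a real system calls a model such as a sentence transformer here."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    return vec

def normalize(vec: list[float]) -> list[float]:
    """L2-normalize so cosine similarity reduces to a plain dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def validate(vec: list[float]) -> bool:
    """Basic quality check: all values finite and vector has unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return all(math.isfinite(x) for x in vec) and abs(length - 1.0) < 1e-9

v = normalize(toy_embed("hello chunk"))
```

Normalizing up front is a common design choice: it lets the vector DB use the cheaper dot product at query time while returning the same ranking as cosine similarity.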


advantages | why embeddings help LLMs:

  • Semantics: better than just keywords

  • Vector similarity: rank docs by contextual relevance, send best to LLM (via cosine similarity or dot product)

  • Scalability: turns text into numeric vectors, making search and comparison fast.

  • Cross-modal alignment: compare text / images / etc under a shared representation space.
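The vector-similarity point above is easy to see concretely. Cosine similarity compares the direction of two vectors regardless of their length (the 2-d vectors below are made-up toys, not real embeddings):

```python
import math

# Rank candidate chunks against a query by cosine similarity.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query = [1.0, 0.0]
chunks = {"about cats": [0.9, 0.1], "about stocks": [0.1, 0.9]}
ranked = sorted(chunks, key=lambda k: cosine(query, chunks[k]), reverse=True)
# The chunk whose vector points closest to the query's direction ranks first.
```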


HF MTEB (Massive Text Embedding Benchmark) Leaderboard

| Embedding algo | Team | Note |
| --- | --- | --- |
| Word2Vec | Google | shallow NN |
| GloVe | Stanford | unsupervised learning |
| FastText | Meta | shallow NN, extends Word2Vec |
| ELMo | Allen Institute | for Q&A and sentiment |
| BERT (Transformer) | Google | contextualized word embeddings via bidirectional context |

4-Store


Non-vector DB types: Relational, NoSQL, Graph

Vector DBs:

  • store & retrieve vector data

  • index & store vector embeddings for semantic search & retrieval


vector DB categories & providers:

| Category | Core focus | Traditional DB features | Providers |
| --- | --- | --- | --- |
| Vector indexes | Index & search | N | FAISS, NMSLIB, ANNOY, ScaNN |
| Specialized vector DBs | Index & search | Y | Pinecone, ChromaDB, Milvus, Qdrant, Weaviate, Vald |
| Search platforms | Full-text search & vector similarity search | Y | Solr, Elasticsearch, OpenSearch, Apache Lucene |
| SQL databases | Add-on vector capability | Y | Azure SQL, Postgres, SingleStore, CloudSQL |
| NoSQL databases | Add-on vector capability | Y | MongoDB |
| Graph databases | Add-on vector capability | Y | Neo4j |
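At small scale, what every store in the table does can be sketched as brute-force nearest-neighbor search (the class below is a toy; real vector DBs use approximate indexes such as HNSW or IVF to stay fast at millions of vectors):

```python
import math

# Toy in-memory vector store with brute-force top-k similarity search.

class TinyVectorStore:
    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vec: list[float]) -> None:
        self.items.append((text, vec))

    def search(self, query: list[float], k: int = 2) -> list[str]:
        def cos(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self.items, key=lambda it: cos(query, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

db = TinyVectorStore()
db.add("dogs", [1.0, 0.0])
db.add("finance", [0.0, 1.0])
top = db.search([0.9, 0.1], k=1)
```

The accuracy-vs-speed trade-off below is exactly the choice between this exact brute-force scan and an approximate index.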


vector DB choice considerations:

  • Accuracy vs. speed

  • Flexibility vs. performance: customizations add overhead

  • Local vs. cloud storage: local (storage speed, access) vs cloud (security, redundancy, scalability)

  • Direct access vs. API: need tight control via direct libraries? or are ease-of-use abstractions like APIs better?

  • Advanced features: do we need extras such as metadata filtering or hybrid (keyword + vector) search?

  • Cost