00 Building Blocks
This doc contains foundational concepts applied in the rest of the docs.
Sources
Sam
Tokens
- are: the fundamental semantic units used in NLP
- 1 token ~ 4 characters (OpenAI suggestion)
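The ~4-characters heuristic can be sketched in a few lines (the helper name is hypothetical; real tokenizers such as tiktoken give exact counts):

```python
def estimate_tokens(text: str) -> int:
    # Rough estimate using the ~4 characters per token rule of thumb.
    # A real tokenizer would return the exact count.
    return max(1, round(len(text) / 4))

print(estimate_tokens("Tokens are the fundamental semantic units used in NLP"))
```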
LLMs (as a type of model)
- are: self-supervised probabilistic next-token models
- do: look at text, find statistical patterns, then estimate a distribution over the next token and generate from it
- how: pretraining paradigms of CLM (next-token) or MLM (fill-in-the-blank)
- challenges: knowledge cut-off, data/compute limits, hallucinations, bias, limited context length
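The "estimate a token distribution and generate" step can be sketched with a hand-written toy distribution standing in for the model (all names and probabilities here are made up):

```python
import random

# Toy distribution a model might estimate for the token after "The dog":
next_token_probs = {"barked": 0.6, "ran": 0.3, "flew": 0.1}

def sample_next_token(probs, seed=None):
    # Generation = sampling from the estimated next-token distribution.
    rng = random.Random(seed)
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(next_token_probs, seed=0))
```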
Memory: LLM x RAG models use memory for generation
- Parametric: info learned during LLM training, stored in the model's weights
- Non-parametric: info added afterwards, retrieved from an external source via RAG
Pipeline of LLM x RAG
- Train LLM ⟶ get parametric memory
- I | Create non-parametric memory (external KB)
- R | Fetch info from KB
- A | Add KB info to the prompt, send to LLM
- G | LLM generates the response
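The pipeline above can be sketched end-to-end as a toy; the keyword-overlap retriever and the stub `generate` function are stand-ins for a real vector store and a real model call:

```python
def words(s):
    # Crude tokenizer: lowercase and strip basic punctuation.
    return set(s.lower().replace("?", " ").replace(".", " ").split())

# I | non-parametric memory: an external knowledge base
knowledge_base = [
    "LangChain is a modular framework for building LLM apps.",
    "Transformers are a neural network architecture based on attention.",
]

def retrieve(query):
    # R | fetch the KB entry with the most word overlap with the query
    return max(knowledge_base, key=lambda doc: len(words(query) & words(doc)))

def augment(query, context):
    # A | add KB info to the prompt
    return f"Context: {context}\nQuestion: {query}"

def generate(prompt):
    # G | stand-in for the actual LLM call
    return "(LLM answer conditioned on) " + prompt

query = "What is LangChain?"
print(generate(augment(query, retrieve(query))))
```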
Transformers
- are: an NN architecture based on attention mechanisms
- do: let LLMs store & present knowledge
LangChain
LangChain
- is: a modular framework
- purpose: helps build LLM apps
- languages: Python, JavaScript
- other uses: chatbots, document summarizers, synthetic data generation
- integrates with:
  - LLM providers
  - vector store providers
  - cloud storage systems: e.g. AWS, SQL & NoSQL databases
  - APIs: e.g. news, weather
LangChain
- is: a modular framework
- does: provides building blocks to implement LLM apps
LangGraph
- is: a graph-based orchestration engine
- does: enables complex workflows for LLM-powered systems
LangSmith
- is: a platform
- does: enables observability, debugging, evaluation, and monitoring
Analogy:
- LangChain: individual workers doing straightforward tasks
- LangGraph: a coordinated team with a manager who oversees complex workflows
- LangSmith: quality control
Similarity
Similar pieces of text lie close to each other in vector space.
Similarity calculations | common measures
- cosine similarity: uses the angle between vectors (0° = similar, 90° = unrelated, 180° = opposite)
- euclidean distance: uses the straight-line distance between vectors (smaller = more similar)
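Both measures can be computed directly; a minimal sketch with vectors as plain tuples:

```python
import math

def cosine_similarity(a, b):
    # cos(angle): 1.0 at 0° (similar), 0.0 at 90° (unrelated), -1.0 at 180° (opposite)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    # Straight-line distance between the two points; smaller = more similar.
    return math.hypot(*(x - y for x, y in zip(a, b)))

print(cosine_similarity((1, 0), (0, 1)))   # vectors 90° apart
print(euclidean_distance((1, 0), (0, 1)))
```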
Embedding
embedding (process): converting raw data (chunks) ⟶ numerical vectors. Enables similarity search based on semantics, not just keywords.
textbook example: We have 3 words. We want to find their similarities in 2D space.
data: dog, bark, fly
similarities (2D):
- x-axis: grammatically ⟶ fly & bark are close (verbs)
- y-axis: contextually ⟶ dog & bark are close
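The textbook example can be sketched with hypothetical 2D coordinates (the numbers are invented; only the relative positions matter):

```python
import math

# Hypothetical embeddings: x ~ grammatical role (verb-ness), y ~ "dog" context.
embeddings = {"dog": (0.0, 1.0), "bark": (1.0, 1.0), "fly": (1.0, 0.0)}

def distance(w1, w2):
    (x1, y1), (x2, y2) = embeddings[w1], embeddings[w2]
    return math.hypot(x1 - x2, y1 - y2)

print(distance("fly", "bark"))  # close: both verbs (x-axis)
print(distance("dog", "bark"))  # close: shared context (y-axis)
print(distance("dog", "fly"))   # farther: neither axis relates them
```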
