00 Building Blocks
This doc contains foundational concepts applied in the rest of the docs.
Sources
Sam
Tokens
- are: the fundamental semantic units used in NLP
- 1 token ~ 4 characters (OpenAI suggestion)
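The ~4-characters heuristic can be sketched in a few lines (the helper name is hypothetical; real tokenizers such as tiktoken give exact counts):

```python
def estimate_tokens(text: str) -> int:
    # Rough estimate using the ~4 characters per token rule of thumb.
    # A real tokenizer would return the exact count.
    return max(1, round(len(text) / 4))

print(estimate_tokens("Tokens are the fundamental semantic units used in NLP"))
```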
LLMs (as a type of model)
- are: self-supervised probabilistic next-token models
- do: look at text, find statistical patterns, then estimate a distribution over the next token and generate from it
- how: pretraining paradigms of CLM (next-token) or MLM (fill-in-the-blank)
- challenges: knowledge cut-off, data/compute limits, hallucinations, bias, limited context length
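The "estimate a token distribution and generate" step can be sketched with a hand-written toy distribution standing in for the model (all names and probabilities here are made up):

```python
import random

# Toy distribution a model might estimate for the token after "The dog":
next_token_probs = {"barked": 0.6, "ran": 0.3, "flew": 0.1}

def sample_next_token(probs, seed=None):
    # Generation = sampling from the estimated next-token distribution.
    rng = random.Random(seed)
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(next_token_probs, seed=0))
```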
Memory: LLM x RAG models use memory for generation
- Parametric: info learned during LLM training, stored in the model's weights
- Non-parametric: info added afterwards, retrieved from an external source via RAG
Pipeline of LLM x RAG
- Train LLM ⟶ get parametric memory
- I | Create non-parametric memory (external KB)
- R | Fetch info from KB
- A | Add KB info to the prompt, send to LLM
- G | LLM generates the response
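The pipeline above can be sketched end-to-end as a toy; the keyword-overlap retriever and the stub `generate` function are stand-ins for a real vector store and a real model call:

```python
def words(s):
    # Crude tokenizer: lowercase and strip basic punctuation.
    return set(s.lower().replace("?", " ").replace(".", " ").split())

# I | non-parametric memory: an external knowledge base
knowledge_base = [
    "LangChain is a modular framework for building LLM apps.",
    "Transformers are a neural network architecture based on attention.",
]

def retrieve(query):
    # R | fetch the KB entry with the most word overlap with the query
    return max(knowledge_base, key=lambda doc: len(words(query) & words(doc)))

def augment(query, context):
    # A | add KB info to the prompt
    return f"Context: {context}\nQuestion: {query}"

def generate(prompt):
    # G | stand-in for the actual LLM call
    return "(LLM answer conditioned on) " + prompt

query = "What is LangChain?"
print(generate(augment(query, retrieve(query))))
```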
Transformers
- are: an NN architecture based on attention mechanisms
- do: let LLMs store & present knowledge
LangChain
LangChain
- is: a modular framework
- purpose: helps build LLM apps
- languages: Python, JavaScript
- other uses: chatbots, document summarizers, synthetic data generation
- integrates with:
  - LLM providers
  - vector store providers
  - cloud storage systems: e.g. AWS, SQL & NoSQL databases
  - APIs: e.g. news, weather
LangChain
- is: a modular framework
- does: provides building blocks to implement LLM apps
LangGraph
- is: a graph-based orchestration engine
- does: enables complex workflows for LLM-powered systems
LangSmith
- is: a platform
- does: enables observability, debugging, evaluation, and monitoring
Analogy:
- LangChain: individual workers doing straightforward tasks
- LangGraph: a coordinated team with a manager who oversees complex workflows
- LangSmith: quality control
Similarity
Similar pieces of text lie close to each other in vector space.
Similarity calculations | common measures
- cosine similarity: uses the angle between vectors (0° = similar, 90° = unrelated, 180° = opposite)
- euclidean distance: uses the straight-line distance between vectors (smaller = more similar)
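Both measures can be computed directly; a minimal sketch with vectors as plain tuples:

```python
import math

def cosine_similarity(a, b):
    # cos(angle): 1.0 at 0° (similar), 0.0 at 90° (unrelated), -1.0 at 180° (opposite)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    # Straight-line distance between the two points; smaller = more similar.
    return math.hypot(*(x - y for x, y in zip(a, b)))

print(cosine_similarity((1, 0), (0, 1)))   # vectors 90° apart
print(euclidean_distance((1, 0), (0, 1)))
```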
Embedding
embedding (process): converting raw data (chunks) ⟶ numerical vectors. Enables similarity search based on semantics, not just keywords.
textbook example: We have 3 words. We want to find their similarities in 2D space.
data: dog, bark, fly
similarities (2D):
- x-axis: grammatically ⟶ fly & bark are close (verbs)
- y-axis: contextually ⟶ dog & bark are close
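The textbook example can be sketched with hypothetical 2D coordinates (the numbers are invented; only the relative positions matter):

```python
import math

# Hypothetical embeddings: x ~ grammatical role (verb-ness), y ~ "dog" context.
embeddings = {"dog": (0.0, 1.0), "bark": (1.0, 1.0), "fly": (1.0, 0.0)}

def distance(w1, w2):
    (x1, y1), (x2, y2) = embeddings[w1], embeddings[w2]
    return math.hypot(x1 - x2, y1 - y2)

print(distance("fly", "bark"))  # close: both verbs (x-axis)
print(distance("dog", "bark"))  # close: shared context (y-axis)
print(distance("dog", "fly"))   # farther: neither axis relates them
```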
