
00 Building Blocks

This doc contains foundational concepts applied in the rest of the docs.

Sources


Sam

Tokens

  • are: the fundamental units of text (words, subwords, or characters) that NLP models operate on

  • 1 token ≈ 4 characters of English text (OpenAI's rule of thumb)
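The ~4 characters/token rule of thumb can be turned into a quick estimator; a minimal sketch (the function name and the rounding choice are my own, not from any tokenizer library):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters/token rule of thumb."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("a" * 40))  # 40 chars / 4 ⟶ ~10 tokens
```

Real tokenizers (e.g. BPE-based ones) split on learned subword boundaries, so actual counts vary by text and model; this is only a capacity estimate.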


LLMs (as a type of model)

  • are: probabilistic models trained with self-supervision to predict tokens

  • do: look at text, learn statistical patterns, then estimate a distribution over the next token and sample from it to generate

  • how: pretraining paradigms of CLM (causal language modeling ⟶ next-token prediction) or MLM (masked language modeling ⟶ fill-in-the-blank)

  • challenges: knowledge cut-off, data/compute limits, hallucinations, bias, limited context length
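"Estimate a token distribution and generate" can be sketched in a few lines: softmax turns raw scores into a probability distribution over the vocabulary, and the next token is sampled from it. The vocabulary and logit values below are made up for illustration:

```python
import math
import random

def softmax(logits):
    # Convert raw scores into a probability distribution over tokens.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits for some context (illustrative numbers).
vocab = ["barks", "flies", "sleeps"]
logits = [2.0, -1.0, 1.0]

probs = softmax(logits)          # probabilities sum to 1
random.seed(0)
next_token = random.choices(vocab, weights=probs)[0]
print(next_token)
```

A real LLM does the same thing at scale: the transformer produces logits over a vocabulary of tens of thousands of tokens, and a sampling strategy (greedy, temperature, top-p, ...) picks the next one.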


Memory: LLM x RAG systems draw on two kinds of memory for generation

  • Parametric: information stored in the model's weights during training

  • Non-parametric: information retrieved at query time from an external source (RAG)


Pipeline of LLM x RAG

  1. Train the LLM ⟶ parametric memory

  2. I | Index: create non-parametric memory (external KB)

  3. R | Retrieve: fetch relevant info from the KB

  4. A | Augment: add the retrieved info to the prompt, send to the LLM

  5. G | Generate: the LLM produces the response
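The I/R/A/G steps above can be sketched end-to-end as a toy pipeline. Everything here is a stand-in: retrieval uses simple word overlap instead of embeddings, and the "LLM" is a stub function, not a real model call:

```python
# Toy sketch of the Index / Retrieve / Augment / Generate steps.

def index(docs):
    # I | Build the non-parametric memory: here, just tokenized documents.
    return [(doc, set(doc.lower().split())) for doc in docs]

def retrieve(kb, query, k=1):
    # R | Score documents by word overlap with the query (a stand-in for
    #     embedding similarity search).
    q = set(query.lower().split())
    ranked = sorted(kb, key=lambda item: len(item[1] & q), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def augment(query, contexts):
    # A | Add the retrieved context to the prompt.
    return f"Context: {' '.join(contexts)}\nQuestion: {query}"

def generate(prompt):
    # G | Stand-in for the LLM call; a real system would send `prompt` to a model.
    return f"[LLM answer based on: {prompt!r}]"

kb = index(["dogs bark loudly", "planes fly high"])
prompt = augment("why do dogs bark", retrieve(kb, "why do dogs bark"))
print(generate(prompt))
```

A production pipeline swaps each stub for the real thing: an embedding model + vector store for steps I/R, a prompt template for A, and an LLM API call for G.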


Transformers

  • are: a neural-network architecture built on attention mechanisms

  • do: let models weigh how tokens relate to each other in context; this is how LLMs store & present knowledge
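The core attention mechanism is small enough to sketch directly: each query scores every key, the scores are softmaxed into weights, and the output is a weighted average of the values. This is a single-query, pure-Python illustration, not a full multi-head implementation:

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (pure-Python sketch)."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax the scores into attention weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Output is the weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key, so the output leans toward the first value.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(out)
```

In a transformer this runs for every token position at once (queries, keys, and values are all derived from the token embeddings), which is what lets each token attend to the rest of the context.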

LangChain


LangChain

  • is: a modular framework

  • purpose: helps build LLM apps

  • languages: Python, JavaScript

  • example uses: chatbots, document summarizers, synthetic data generation

  • integrates with:

    • LLM providers

    • vector store providers

    • cloud storage & database systems: e.g. AWS, SQL & NoSQL databases

    • APIs: e.g. news, weather

LangChain

  • is: a modular framework

  • does: provides building blocks to implement LLM apps

LangGraph

  • is: a graph-based orchestration engine

  • does: enables complex workflows for LLM-powered systems

LangSmith

  • is: a platform

  • does: enables observability, debugging, evaluation, and monitoring


Analogy:

  • LangChain: individual workers doing straightforward tasks

  • LangGraph: a coordinated team with a manager who oversees complex workflows

  • LangSmith: quality control

Similarity


Similar pieces of text lie close to each other in embedding space.

similarity calculations | common measures

  • cosine similarity: compares the angle between vectors (0° ⟶ 1, similar; 90° ⟶ 0, unrelated; 180° ⟶ −1, opposite)

  • Euclidean distance: straight-line distance between vectors (smaller ⟶ more similar)
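Both measures are one-liners over vector components; a minimal pure-Python sketch:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: dot product over the
    # product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Straight-line distance between the vector endpoints.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (0°, same direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (90°, unrelated)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (180°, opposite)
```

In practice these run over embedding vectors with hundreds or thousands of dimensions; cosine similarity is the usual default for text because it ignores vector magnitude.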

Embedding


embedding (process): converting raw data (chunks) ⟶ numerical vectors. Enables similarity search based on semantics, not just keywords.

textbook example: We have 3 words. We want to find their similarities in 2D space.

data: dog, bark, fly

similarities (2D):

  • x-axis: grammatical similarity ⟶ fly & bark are close (both verbs)

  • y-axis: contextual similarity ⟶ dog & bark are close
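The textbook example above can be made concrete with hand-picked 2D coordinates (the numbers are invented for illustration, not output of a real embedding model):

```python
import math

# Hypothetical 2D embeddings: x = grammatical role, y = "dog" context.
embeddings = {
    "dog":  (0.1, 0.9),   # noun, dog context
    "bark": (0.9, 0.8),   # verb, dog context
    "fly":  (0.9, 0.1),   # verb, unrelated context
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# "bark" sits near "fly" on the grammar axis and near "dog" on the context
# axis, so "dog" and "fly" end up as the most distant pair.
print(euclidean(embeddings["bark"], embeddings["fly"]))
print(euclidean(embeddings["bark"], embeddings["dog"]))
print(euclidean(embeddings["dog"],  embeddings["fly"]))
```

Real embedding models do the same thing in hundreds of dimensions, where different directions can capture many more axes of similarity at once.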