04 GENERATING
TL;DR¶
Sam
LLM x RAG systems have 2 sources of memory available.
- Parametric: learned during initial LLM training.
- Non-parametric: info stored in our KB
  - Indexing pipeline: creates KB
  - Generation pipeline: retrieves from KB (THIS DOC)

Generation Pipeline: input Q ⟶ respond with LLM x RAG:
1. Retrieval: Retrieve info from KB based on Q.
2. Augmentation: Augment Q with fetched info, create prompt for LLM.
3. Generation: Generate response via LLM.
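The three steps above can be sketched end-to-end. This is a toy illustration, not a real implementation: the retriever is naive keyword overlap, and `generate` is a stub standing in for an actual LLM call.

```python
# Minimal sketch of the three-step generation pipeline.
# KB contents and the `generate` stub are made up for illustration.

KB = {
    "doc1": "RAG combines retrieval with generation.",
    "doc2": "FAISS stores dense vector embeddings.",
}

def retrieve(q, kb, k=1):
    # 1. Retrieval: naive keyword-overlap scoring over the KB.
    scored = sorted(
        kb.items(),
        key=lambda kv: len(set(q.lower().split()) & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def augment(q, contexts):
    # 2. Augmentation: build the prompt from Q + retrieved info.
    ctx = "\n".join(contexts)
    return f"Answer based only on the context below.\n\nContext:\n{ctx}\n\nQuestion: {q}"

def generate(prompt):
    # 3. Generation: placeholder for a real LLM call.
    return f"[LLM response to {len(prompt)}-char prompt]"

q = "What does RAG combine?"
answer = generate(augment(q, retrieve(q, KB)))
```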

1. Retrieval¶
Process:
- Input Q
- Search KB for matching docs (stored embeddings)
- Fetch info
- Output list
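Matching against stored embeddings can be sketched as cosine similarity between a query vector and each doc vector. The vectors below are made up; a real system would produce them with an embedding model.

```python
import math

# Toy embedding-based retrieval: the KB stores one vector per doc,
# and the query vector is matched by cosine similarity.
# These 3-dim vectors are invented for illustration.

kb_embeddings = {
    "intro.md":  [0.9, 0.1, 0.0],
    "faiss.md":  [0.1, 0.8, 0.3],
    "prompt.md": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, k=2):
    # Rank docs by similarity to the query vector, output the top-k list.
    ranked = sorted(kb_embeddings,
                    key=lambda d: cosine(query_vec, kb_embeddings[d]),
                    reverse=True)
    return ranked[:k]
```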
Retrieval methods¶
LangChain has abstracted these algorithms ⟶ retrievers.
- TF-IDF: keyword-based, uses TF and IDF to score words.
- BM25: probabilistic variant of TF-IDF. Adds length normalization & saturation effects so longer documents aren’t unfairly favored.
- Static Word Embeddings: vector-based semantics (fixed meaning per word)
  - Represents words as dense vectors (e.g., Word2Vec, GloVe)
- Contextual Embeddings: context-aware semantics (meanings shift with context)
  - Handles polysemy & nuanced meanings
  - Embeddings from models (e.g., BERT, GPT)
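The BM25 length-normalization and saturation terms can be seen in a hand-rolled scorer (a sketch of the Okapi BM25 formula with the usual default parameters, not a production retriever):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            n_t = sum(1 for d in tokenized if term in d)  # docs containing term
            idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # never negative
            f = tf[term]
            # k1 caps the benefit of repeated terms (saturation);
            # b penalizes docs longer than average (length normalization).
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "dogs and cats living together",
    "quarterly revenue report",
]
```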
Other popular retrievers¶
- Vector stores and DBs:
  - Combine FAISS with a contextual embedding model
  - Pinecone / Milvus / Weaviate combine dense retrieval methods ⟶ provide hybrid search functionality.
- Cloud providers: Include infrastructure, APIs, and tools for info retrieval
- Web info: Connect to Wikipedia / Arxiv / AskNews / etc. See the LangChain docs.
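One common way hybrid search merges a keyword ranking with a dense-vector ranking is Reciprocal Rank Fusion (RRF). A minimal sketch, with made-up doc IDs and rankings:

```python
# Reciprocal Rank Fusion: each ranking contributes 1/(k + rank) per doc,
# so docs ranked well by BOTH retrievers rise to the top.
# k=60 is a conventional default; doc IDs here are invented.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d3", "d1", "d7"]   # e.g. from a BM25 retriever
dense_hits   = ["d1", "d5", "d3"]   # e.g. from a vector store
fused = rrf([keyword_hits, dense_hits])
```

Here "d1" wins the fused ranking because it places near the top of both lists.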
2. Augmentation¶
Apply prompt engineering. The goal is to combine the Q and retrieved info into the best possible prompt for the LLM.
| Prompting Technique | Description |
|---|---|
| Contextual | “Answer based on only the context provided below.” |
| Controlled generation | "Say 'I don’t know' when provided context doesn't have needed info." |
| Few-shot | Provide examples in prompt |
| Chain-of-thought (CoT) | Provide intermediate reasoning steps |
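The techniques in the table compose into a single prompt. A hypothetical template (the helper name and example Q/A pairs are invented) combining the contextual, controlled-generation, and few-shot techniques:

```python
# Hypothetical prompt builder stacking three techniques from the table:
# contextual grounding, controlled generation, and few-shot examples.

def build_prompt(question, contexts, examples):
    ctx = "\n".join(f"- {c}" for c in contexts)
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (
        "Answer based only on the context provided below.\n"          # contextual
        "Say 'I don't know' when the context doesn't have the needed info.\n\n"  # controlled
        f"Context:\n{ctx}\n\n"
        f"Examples:\n{shots}\n\n"                                     # few-shot
        f"Q: {question}\nA:"
    )

prompt = build_prompt(
    "Who wrote the design doc?",
    ["The design doc was authored by the platform team."],
    [("What is the KB?", "The knowledge base of indexed documents.")],
)
```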
3. Generation¶
Key question: Which LLM to use?
Consider these 3 major themes.
3.1 Foundation v fine-tuned¶
Foundation models: massive pre-trained LLMs.
- are: autoregressive next-token prediction models
- how: trained via unsupervised learning
- benefits: Deployment speed, resource efficiency
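"Autoregressive next-token prediction" can be illustrated with a toy bigram count model that greedily emits the most frequent next token. Real foundation models do this with transformers over subword tokens; the corpus here is made up.

```python
from collections import Counter, defaultdict

# Toy autoregressive generation: count bigram transitions ("training"),
# then repeatedly predict the most frequent next token given the last one.

corpus = "the model predicts the next token given the tokens so far".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1          # "training": count bigram transitions

def generate(token, steps=3):
    out = [token]
    for _ in range(steps):
        if token not in counts:     # no known continuation
            break
        token = counts[token].most_common(1)[0][0]  # greedy next token
        out.append(token)
    return out
```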
SFT (supervised fine-tuning):
- is: a process to adjust a foundation model's weights for specific tasks
- how: start with a pre-trained model ⟶ prepare labelled dataset ⟶ train model. This adjusts the model parameters to perform better on the given task.
- benefits: Domain specialization, retrieval integration w KB, response customization, output control
3.2 Open source v proprietary¶
Open source: more flexible, but needs infrastructure and maintenance.

Criteria:
- Customization: Open source allows (1) deep integration with custom retrievers, (2) control over fine-tuning
- Ease of use: Open source is more difficult. Proprietary can offer prebuilt RAG solutions.
- Deployment flexibility: Open source is customizable (private cloud, on-premises)
- Cost: Open source has higher up-front fixed costs, lower variable costs over time.
3.3 Model size¶
Small models:

pros:
- Face fewer resource constraints
- Are easier to deploy

cons:
- Have limited reasoning capability (rely heavily on KB)
- Could struggle with context windows & diverse queries.