Géron, chapters 1-2.

Data Mining Process

Sam

ML lifecycle: steps for transforming data ⟶ actionable insights

  1. Business Understanding: Define the problem & success criteria.

  2. Data Understanding: How was it collected? Any implicit biases?

  3. Data Preparation: Clean and transform the data.

  4. Modeling

    1. Return to Data Preparation (possibly)

    2. Iterate on Modeling (possibly)

  5. Evaluation: Does the model meet the success criteria?

  6. Deployment

Types of Systems

Broad categories are based on:

  • Are they trained with human supervision? (Paradigms)

    • Supervised: learns from labeled data

    • Unsupervised: finds structure in unlabeled data.

    • Semisupervised: uses a mix of labeled + unlabeled data.

    • Reinforcement: learns via rewards/penalties from interactions with an environment.

  • Can they learn incrementally on the fly?

    • Online: Yes

    • Batch: No

  • How do they generalize?

    • Instance-based: memorizes known data points, then compares new points to them

    • Model-based: builds a model from the training data, then uses it to predict

Paradigms | Core 4

Classify according to amount & type of supervision the system receives.

  • Supervised Learning

    • Goal: given input features, predict target values

    • Data: labeled dataset (X, y)

    • Tasks: classification & regression

      • Algorithms: linear models, SVMs, gradient boosting (XGBoost)

  • Unsupervised Learning

    • Goal: find structure or patterns in unlabeled data.

    • Data: unlabeled dataset

    • Tasks: clustering, dimensionality reduction, anomaly detection, association rules

      • Algorithms: DBSCAN, PCA, autoencoders

  • Semi-Supervised Learning

    • Goal: see supervised

    • Data: some labeled (X, y), most unlabeled

    • Tasks: see supervised, but where labeling is expensive

      • Algorithms: deep belief networks (DBNs), restricted Boltzmann machines (RBMs)

  • Reinforcement Learning

    • Goal: learn a policy that selects actions to maximize long-term reward

    • Data: experience tuples (state, action, reward, next_state) gathered through interaction with an environment

    • Tasks: sequential decision-making, control, planning

      • Algorithms: Q-learning, SARSA, Policy Gradient
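
The supervised/unsupervised split above can be sketched on the same toy data. A minimal sketch, assuming scikit-learn is available; the two-cluster data and model choices are illustrative:

```python
# Contrast two paradigms on the same toy data:
# supervised learning uses labels y, unsupervised learning ignores them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated blobs of 50 points each
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Supervised: learn from (X, y), then predict labels for new inputs
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0, 0], [5, 5]]))

# Unsupervised: find structure in X alone (no labels given)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])
```

Note that the clustering recovers two groups without ever seeing `y`; its cluster IDs are arbitrary and need not match the class labels.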

Learning Type | Batch & Online

Does the system learn incrementally from a stream of incoming data?

  • Batch: Static, train on the entire dataset at once

  • Online: Streaming, model updates as new data points are received. (Learning rate is key.)
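
A hedged sketch of online learning with scikit-learn's `SGDRegressor`: `partial_fit` updates the model one mini-batch at a time, so each batch can be discarded after use. The simulated stream and the constant learning rate are illustrative choices:

```python
# Online (incremental) learning: the model is updated batch by batch.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

for _ in range(200):                      # simulate a stream of mini-batches
    X_batch = rng.uniform(0, 1, (32, 1))
    y_batch = 3 * X_batch.ravel() + 1     # true relation: y = 3x + 1
    model.partial_fit(X_batch, y_batch)   # incremental update; batch can be dropped

print(model.coef_, model.intercept_)      # should approach [3] and [1]
```

The learning rate is key: too high and the model reacts wildly to each batch, too low and it adapts too slowly to new data.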

Generalization Type | Model & Instance

How does the ML system generalize to new data?

Model-Based Learning

  • goal: use training data to build a model, then extrapolate.

  • data: full dataset

  • learning approach: train once → discard data → use learned model to predict

  • prediction/generalization mechanism: new inputs → learned model → outputs

  • use when: generalization matters most

Instance-Based Learning

  • goal: memorize training instances → compare new inputs to them

  • data: training instances kept in memory (or efficiently indexed)

  • learning approach: measure similarity to stored instances → predict

  • prediction/generalization mechanism: find NN or do weighted vote/average

  • use when: local relationships matter most
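
The two generalization styles can be contrasted on the same 1-D data. A minimal sketch, assuming scikit-learn; the linear data and the choice of `KNeighborsRegressor` as the instance-based learner are illustrative:

```python
# Model-based vs instance-based generalization on y = 2x + 1.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

# Model-based: fit parameters; the training data could then be discarded
model = LinearRegression().fit(X, y)

# Instance-based: keep the training points; predict from nearest neighbors
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# Inside the training range both agree (≈ 10.0)...
print(model.predict([[4.5]])[0], knn.predict([[4.5]])[0])
# ...but far outside it only the model extrapolates (41.0 vs 18.0)
print(model.predict([[20.0]])[0], knn.predict([[20.0]])[0])
```

The k-NN prediction at x = 20 is stuck at the average of the nearest stored instances (x = 8, 9), while the fitted line extrapolates the global trend.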

Challenges of ML

Two challenges: “bad algorithm” and “bad data”.

Data Issues

  1. Quantity: Simple problems typically need thousands of examples; complex ones (e.g., image recognition) may need millions.

  2. Quality: May contain missing values, errors, or noise from poor collection

  3. Non-representative: Training data does not reflect the cases the model must generalize to. Sources:

    1. Sampling noise: The sample is too small (chance variation).

    2. Sampling bias: The sampling method is flawed.

  4. Irrelevant features. Solutions:

    1. Feature selection: Select only most useful features.

    2. Feature extraction: Combine existing features to produce meaningful ones.

    3. New features: Use external sources to create new features.
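
Feature selection (solution 1 above) can be sketched with scikit-learn's `SelectKBest`, which scores each feature against the target and keeps only the top k. The synthetic data, with four noise columns and one signal column, is an illustrative assumption:

```python
# Feature selection: keep only features that are informative about the target.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 4))       # 4 irrelevant features
x_signal = rng.normal(size=(200, 1))      # 1 relevant feature
X = np.hstack([X_noise, x_signal])        # column 4 carries the signal
y = 3 * x_signal.ravel() + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
print(selector.get_support())             # boolean mask: only the signal column survives
```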

Algorithm Issues

Overfitting solutions:

  • Select a model with fewer parameters

  • Feature reduction

  • Constrain the model (regularization)

  • Gather more data

  • Reduce noise in training data

Underfitting solutions:

  • Select a more powerful model, with more parameters

  • Feature engineering

  • Reduce model constraints (eg reduce regularization)
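
Constraining a model via regularization can be sketched with Ridge regression: a larger `alpha` shrinks the coefficients, trading flexibility for robustness to overfitting. The synthetic data and alpha values below are illustrative:

```python
# Regularization demo: coefficient magnitudes shrink as alpha grows.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 2.0, 3.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=50)

norms = []
for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.abs(coef).sum())      # total coefficient magnitude

print(norms)                              # decreasing: stronger constraint, smaller weights
```

Conversely, reducing `alpha` (less regularization) is one of the underfitting fixes listed above.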

Code Process

1. Data Prep

  • Create test set

  • Build preprocessing pipeline

2. Baseline

  • Baseline model & sanity check (cross_val_score)

  • Learning & validation curves (bias/variance)

3. Hyperparameter Search

  • RandomizedSearchCV broadly

  • GridSearchCV refinements

4. Model Iteration

  • Error analysis (confusion matrices, residual plots)

  • Feature & pipeline improvements

  • Ensembles (bagging, boosting, stacking)

5. Estimate Performance

  • Nested k-fold cross-validation (inner CV = tuning, outer CV = generalization)

6. Deployment

  • Retrain on all available training data with chosen hyperparameters

  • Save pipeline, set up monitoring
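
Steps 1-2 of the process above can be sketched end to end. A minimal sketch, assuming scikit-learn and using its bundled iris dataset as a stand-in for real project data:

```python
# 1. Data prep (test split + preprocessing pipeline), 2. baseline sanity check.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 1. Create the test set first, then build a preprocessing pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 2. Baseline model, validated with cross_val_score on the training set only
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(scores.mean())   # the baseline accuracy to beat in later iterations
```

Keeping the scaler inside the pipeline ensures it is refit on each CV training fold, so no test-fold information leaks into preprocessing.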

Hyperparameter Optimization

Model-type details:

  • Parametric: Fixed number of parameters; less complex. Hyperparameters include (1) regularization terms, (2) learning rate, (3) other model-specific settings.

  • Non-parametric: Unconstrained number of parameters; more complex. Hyperparameters mainly control complexity (e.g., tree depth).

Fine-tuning techniques:

  • Grid Search: Exhaustive search over parameter combinations.

  • Random Search: Randomly sample parameters to find optimal settings.

  • Bayesian Optimization: Use probabilistic models to select parameters.
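
The usual pairing of the first two techniques (random search broadly, then grid search to refine) can be sketched as follows. Assumes scikit-learn; the decision tree, its `max_depth` ranges, and the iris data are illustrative:

```python
# Random search samples broadly; grid search then refines around the winner.
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random search: sample a few candidates from a wide range
rand = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": list(range(1, 20))},
    n_iter=5, cv=3, random_state=0)
rand.fit(X, y)

# Grid search: exhaustively try a narrow grid around the random-search best
best_depth = rand.best_params_["max_depth"]
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [max(1, best_depth - 1), best_depth, best_depth + 1]},
    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```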

Cross-Validation

Methods include:

  • k-Fold: Divides data into \(k\) subsets for training and testing.

  • Nested: Addresses hyperparameter overfitting by adding an outer validation loop.

  • Nested k-Fold: Removes overfit "leak" from evaluating on train set. Estimates generalization error of the underlying model & hyperparameters.

    • Inner loop: Fits the model on each training split, then selects hyperparameters on the validation split

    • Outer loop: Estimates generalization error by averaging test set scores over several dataset splits
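
The inner/outer structure falls out naturally in scikit-learn by passing a `GridSearchCV` object to `cross_val_score`. A minimal sketch; the SVC model, its `C` grid, and the iris data are illustrative:

```python
# Nested CV: inner loop tunes hyperparameters, outer loop estimates generalization.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: GridSearchCV selects C on each outer training split
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: score the entire tuning procedure on held-out folds
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())   # generalization estimate for model + tuning combined
```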

Code

from sklearn.model_selection import GridSearchCV

# Grid search over the hyperparameter grid (3-fold CV)
search = GridSearchCV(estimator=rf_classifier, param_grid=parameters, cv=3)

# Fit on the training set
search.fit(x_train, y_train)
print(search.best_params_)

# Evaluate the best combination on the held-out test set
best = search.best_estimator_
accuracy = best.score(x_test, y_test)

Data Mining Tasks

  1. Classification

  2. Regression

  3. Causal modeling

  4. Data reduction

  5. Clustering

  6. Co-occurrence (market basket analysis)

  7. Similarity matching: X bought from us. Who else is likely to?

  8. Profiling: What is the typical behavior of this segment?

  9. Link prediction: You and X share 10 friends; X likes this person, so you probably will too.