Notes on Géron, chapters 1-2.

## Data Mining Process

Sam

ML lifecycle: steps for transforming data ⟶ actionable insights

- Business Understanding: Define the problem & success criteria.
- Data Understanding: How was it collected? Any implicit biases?
- Data Preparation
- Modeling
- Data Preparation (revisit as needed)
- Modeling (revisit as needed)
- Evaluation
- Deployment
## Types of Systems

Broad categories are based on:

- Are they trained with human supervision? (Paradigms)
  - Supervised: learns from labeled data.
  - Unsupervised: finds structure in unlabeled data.
  - Semisupervised: uses a mix of labeled + unlabeled data.
  - Reinforcement: learns via rewards/penalties from interactions with an environment.
- Can they learn incrementally on the fly?
  - Online: yes
  - Batch: no
- How do they generalize?
  - Instance-based: keep known data points ⟶ compare new points against them
  - Model-based: train a model on the data to detect patterns, then predict from the model
## Paradigms | Core 4

Classify according to the amount & type of supervision the system receives.

- Supervised Learning
  - Goal: given input features, predict target values
  - Data: labeled dataset (X, y)
  - Tasks: classification & regression
  - Algorithms: linear models, SVMs, XGBoost
- Unsupervised Learning
  - Goal: find structure or patterns in unlabeled data
  - Data: unlabeled dataset
  - Tasks: clustering, dimensionality reduction, anomaly detection, association rules
  - Algorithms: DBSCAN, PCA, autoencoders
- Semi-Supervised Learning
  - Goal: see supervised
  - Data: some labeled (X, y), most unlabeled
  - Tasks: see supervised, but where labeling is expensive
  - Algorithms: DBNs, RBMs
- Reinforcement Learning
  - Goal: learn a policy that selects actions to maximize long-term reward
  - Data: experience tuples (state, action, reward, next_state) gathered through interaction with an environment
  - Tasks: sequential decision-making, control, planning
  - Algorithms: Q-learning, SARSA, policy gradients
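The contrast between the first two paradigms can be sketched in scikit-learn. This is an illustrative example, not from the notes: the dataset (iris) and models (logistic regression, k-means) are my choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn a mapping from features X to labels y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
preds = clf.predict(X[:5])

# Unsupervised: ignore y entirely and look for structure in X alone.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
clusters = km.labels_
```

Note the only difference in the API: `fit(X, y)` for the supervised model versus `fit(X)` for the unsupervised one.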
## Learning Type | Batch & Online

Does the system learn incrementally from a stream of incoming data?

- Batch: static; train on the entire dataset at once
- Online: streaming; the model updates as new data points are received. (The learning rate is key.)
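A minimal online-learning sketch, assuming scikit-learn's `SGDRegressor` (one of the estimators that supports incremental updates via `partial_fit`). The simulated stream and the target function y = 3x + noise are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

# Simulate mini-batches arriving over time from the stream: y = 3x + noise.
for _ in range(200):
    X_batch = rng.uniform(-1, 1, size=(32, 1))
    y_batch = 3 * X_batch.ravel() + rng.normal(0, 0.1, size=32)
    model.partial_fit(X_batch, y_batch)  # incremental update; old batches can be discarded

print(model.coef_[0])  # should approach the true slope of 3
```

Here `eta0` is the learning rate: too high and the model chases noise, too low and it adapts slowly.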
## Generalization Type | Model & Instance

How does the ML system generalize to new data?

Model-Based Learning

- Goal: use training data to build a model, then extrapolate.
- Data: full dataset.
- Learning approach: train once → discard data → use learned model to predict.
- Prediction/generalization mechanism: new inputs → learned model → outputs.
- Use when: generalization matters most.

Instance-Based Learning

- Goal: memorize training instances → compare new inputs to them.
- Data: training instances kept in memory (or efficiently indexed).
- Learning approach: measure similarity to stored instances → predict.
- Prediction/generalization mechanism: find nearest neighbors, then take a weighted vote/average.
- Use when: local relationships matter most.
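The two styles can be contrasted in a short sketch; the toy data (y = 2x + 1) and the specific estimators are illustrative choices, not from the notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() + 1

# Model-based: fit parameters (slope, intercept); the data could then be discarded.
lin = LinearRegression().fit(X, y)

# Instance-based: "training" just stores the data; prediction averages the
# k nearest stored neighbors of the query point.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

x_new = np.array([[4.2]])
print(lin.predict(x_new)[0], knn.predict(x_new)[0])
```

Both predict well here, but for different reasons: the linear model extrapolates from two learned parameters, while k-NN interpolates from the three stored points nearest to 4.2.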
## Challenges of ML

Two challenges: "bad algorithm" and "bad data".

### Data Issues

- Quantity: typically need thousands of examples.
- Quality: might have too much missing info, or data could be poorly collected.
- Non-representative: the training data no longer reflects the cases you want to predict. Sources:
  - Sampling noise: the sample is too small.
  - Sampling bias: the sampling method is flawed.
- Irrelevant features. Solutions:
  - Feature selection: select only the most useful features.
  - Feature extraction: combine existing features to produce more meaningful ones.
  - New features: use external sources to create new features.
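Feature selection can be sketched with scikit-learn's `SelectKBest`, which scores each feature against the target and keeps the top k. The synthetic data (2 informative features padded with 8 noise columns) is invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(42)
n = 200
informative = rng.normal(size=(n, 2))
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)
noise = rng.normal(size=(n, 8))          # irrelevant features
X = np.hstack([informative, noise])      # 10 features total, only 2 useful

# Score features with the ANOVA F-test and keep the 2 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support(indices=True))  # indices of the kept features
```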
### Algorithm Issues

Overfitting solutions:

- Select a model with fewer parameters
- Feature reduction
- Constrain the model (regularization)
- Gather more data
- Reduce noise in training data

Underfitting solutions:

- Select a more powerful model, with more parameters
- Feature engineering
- Reduce model constraints (e.g. reduce regularization)
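Regularization as an overfitting fix can be sketched by comparing an unconstrained high-degree polynomial fit with a Ridge-constrained one. The setup (quadratic ground truth, degree-15 features, `alpha=1.0`) is illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 30)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.5, 30)

X_test = np.sort(rng.uniform(-3, 3, 100)).reshape(-1, 1)
y_test = 0.5 * X_test.ravel() ** 2 + rng.normal(0, 0.5, 100)

# Same degree-15 feature expansion; the only difference is the L2 penalty.
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
ridged = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0)).fit(X, y)

# The constrained model typically generalizes better on held-out data.
print(overfit.score(X_test, y_test), ridged.score(X_test, y_test))
```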
## Code Process

1. Data Prep
   - Create test set
   - Build preprocessing pipeline
2. Baseline
   - Baseline model & sanity check (`cross_val_score`)
   - Learning & validation curves (bias/variance)
3. Hyperparameter Search
   - `RandomizedSearchCV` broadly
   - `GridSearchCV` for refinements
4. Model Iteration
   - Error analysis (confusion matrices, residual plots)
   - Feature & pipeline improvements
   - Ensembles (bagging, boosting, stacking)
5. Estimate Performance
   - Nested k-fold cross-validation (inner CV = tuning, outer CV = generalization)
6. Deployment
   - Retrain on all available training data with chosen hyperparameters
   - Save pipeline, set up monitoring
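The steps above can be compressed into one sketch on a toy dataset. All concrete choices (breast-cancer data, random forest, the tiny `n_estimators` search space) are illustrative stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Data prep: hold out a test set; preprocessing lives inside the pipeline.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", RandomForestClassifier(random_state=42))])

# 2. Baseline sanity check via cross-validation on the training set only.
baseline = cross_val_score(pipe, X_train, y_train, cv=3).mean()

# 3. Broad random search, then a narrow grid refinement around the winner.
broad = RandomizedSearchCV(pipe, {"clf__n_estimators": [50, 100, 200]},
                           n_iter=3, cv=3, random_state=42).fit(X_train, y_train)
best_n = broad.best_params_["clf__n_estimators"]
narrow = GridSearchCV(pipe, {"clf__n_estimators": [best_n]}, cv=3).fit(X_train, y_train)

# 6. The search refits on all training data; touch the test set exactly once.
print(baseline, narrow.best_estimator_.score(X_test, y_test))
```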
## Hyperparameter Optimization

Model-type details:

- Parametric: fixed # of parameters, less complex. Hyperparameters include (1) regularization terms, (2) learning rate, (3) other key model parameters.
- Non-parametric: unconstrained # of parameters, more complex. Hyperparameters include selecting the best complexity (e.g. tree depth).

Fine-tuning techniques:

- Grid Search: exhaustive search over parameter combinations.
- Random Search: randomly sample parameters to find good settings.
- Bayesian Optimization: use probabilistic models to select promising parameters.
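Random search is often run over distributions rather than fixed lists, so each draw can land anywhere in the range. A small sketch, with illustrative distributions and an illustrative model choice:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample integer hyperparameters from ranges instead of enumerating a grid.
dists = {"n_estimators": randint(50, 300), "max_depth": randint(2, 10)}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=dists, n_iter=10,
                            cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

With the same budget (`n_iter`), random search explores more distinct values per hyperparameter than a grid, which is why it is the usual first pass.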
## Cross-Validation

Methods include:

- k-Fold: divides data into \(k\) subsets; each subset serves as the test set once while the remaining folds train the model.
- Nested: addresses hyperparameter overfitting by adding an outer validation loop.
- Nested k-Fold: removes the optimistic "leak" that comes from selecting hyperparameters on the same data used for evaluation. Estimates the generalization error of the underlying model & hyperparameter-search procedure.
  - Inner loop: fits the model on each training set, then selects hyperparameters over the validation set.
  - Outer loop: estimates generalization error by averaging test-set scores over several dataset splits.
### Code

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumes x_train, y_train, x_test, y_test already exist from an earlier split.
rf_classifier = RandomForestClassifier(random_state=42)
parameters = {"n_estimators": [100, 200], "max_depth": [None, 10]}

# Grid search over all parameter combinations with 3-fold CV
search = GridSearchCV(estimator=rf_classifier, param_grid=parameters, cv=3)

# Apply to training data
search.fit(x_train, y_train)
search.best_params_  # best combo

# Evaluate the refit best estimator on the held-out test set
best = search.best_estimator_
accuracy = best.score(x_test, y_test)
```
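The nested variant described above amounts to wrapping the grid search in an outer `cross_val_score`. A minimal sketch, with an illustrative dataset and estimator:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: tune C on each outer-training fold via its own 3-fold CV.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: score the whole tuning procedure on 5 held-out folds.
outer_scores = cross_val_score(inner, X, y, cv=5)

print(outer_scores.mean())  # estimate of the tuned model's generalization
```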
## Data Mining Tasks

- Classification
- Regression
- Causal modeling
- Data reduction
- Clustering
- Co-occurrence (market basket analysis)
- Similarity matching: X bought from us. Who else is likely to?
- Profiling: What is the typical behavior of this segment?
- Link prediction: You and X share 10 friends. X likes this person, so you probably will too.
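The co-occurrence task reduces to counting which item pairs appear together across baskets. A toy sketch with invented baskets:

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    # Count every unordered item pair within each basket.
    pair_counts.update(combinations(sorted(basket), 2))

print(pair_counts.most_common(3))  # most frequently co-purchased pairs
```

Real market basket analysis (e.g. Apriori) adds support/confidence thresholds to prune the pair (and triple, etc.) counts, but the core signal is this co-occurrence tally.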