Géron, chapters 1-2.

Data Mining Process

Sam

ML lifecycle: steps for transforming data ⟶ actionable insights

  1. Business Understanding: Define the problem & success criteria.

  2. Data Understanding: How was it collected? Any implicit biases?

  3. Data Preparation: Clean and transform the data.

  4. Modeling

    1. Return to Data Preparation (possibly)

    2. Iterate on Modeling (possibly)

  5. Evaluation: Does the model meet the success criteria?

  6. Deployment

Types of Systems

Broad categories are based on:

  • Are they trained with human supervision? (Paradigms)

    • Supervised: learns from labeled data

    • Unsupervised: finds structure in unlabeled data.

    • Semisupervised: uses a mix of labeled + unlabeled data.

    • Reinforcement: learns via rewards/penalties from interactions with an environment.

  • Can they learn incrementally on the fly?

    • Online: Yes

    • Batch: No

  • How do they generalize?

    • Instance-based: memorizes known data points, then compares new points to them

    • Model-based: builds a model from the training data, then uses it to predict

Paradigms | Core 4

Classify according to amount & type of supervision the system receives.

  • Supervised Learning

    • Goal: given input features, predict target values

    • Data: labeled dataset (X, y)

    • Tasks: classification & regression

      • Algorithms: linear models, SVMs, gradient boosting (XGBoost)

  • Unsupervised Learning

    • Goal: find structure or patterns in unlabeled data.

    • Data: unlabeled dataset

    • Tasks: clustering, dimensionality reduction, anomaly detection, association rules

      • Algorithms: DBSCAN, PCA, autoencoders

  • Semi-Supervised Learning

    • Goal: see supervised

    • Data: some labeled (X, y), most unlabeled

    • Tasks: see supervised, but where labeling is expensive

      • Algorithms: deep belief networks (DBNs), restricted Boltzmann machines (RBMs)

  • Reinforcement Learning

    • Goal: learn a policy that selects actions to maximize long-term reward

    • Data: experience tuples (state, action, reward, next_state) gathered through interaction with an environment

    • Tasks: sequential decision-making, control, planning

      • Algorithms: Q-learning, SARSA, Policy Gradient
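
The supervised/unsupervised split above can be sketched on the same toy data. A minimal sketch, assuming scikit-learn is available; the two-cluster data and model choices are illustrative:

```python
# Contrast two paradigms on the same toy data:
# supervised learning uses labels y, unsupervised learning ignores them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated blobs of 50 points each
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Supervised: learn from (X, y), then predict labels for new inputs
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0, 0], [5, 5]]))

# Unsupervised: find structure in X alone (no labels given)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])
```

Note that the clustering recovers two groups without ever seeing `y`; its cluster IDs are arbitrary and need not match the class labels.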

Learning Type | Batch & Online

Does the system learn incrementally from a stream of incoming data?

  • Batch: Static, train on the entire dataset at once

  • Online: Streaming, model updates as new data points are received. (Learning rate is key.)
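
A hedged sketch of online learning with scikit-learn's `SGDRegressor`: `partial_fit` updates the model one mini-batch at a time, so each batch can be discarded after use. The simulated stream and the constant learning rate are illustrative choices:

```python
# Online (incremental) learning: the model is updated batch by batch.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

for _ in range(200):                      # simulate a stream of mini-batches
    X_batch = rng.uniform(0, 1, (32, 1))
    y_batch = 3 * X_batch.ravel() + 1     # true relation: y = 3x + 1
    model.partial_fit(X_batch, y_batch)   # incremental update; batch can be dropped

print(model.coef_, model.intercept_)      # should approach [3] and [1]
```

The learning rate is key: too high and the model reacts wildly to each batch, too low and it adapts too slowly to new data.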

Generalization Type | Model & Instance

How does the ML system generalize to new data?

Model-Based Learning

  • goal: use training data to build a model, then extrapolate.

  • data: full dataset

  • learning approach: train once → discard data → use learned model to predict

  • prediction/generalization mechanism: new inputs → learned model → outputs

  • use when: generalization matters most

Instance-Based Learning

  • goal: memorize training instances → compare new inputs to them

  • data: training instances kept in memory (or efficiently indexed)

  • learning approach: measure similarity to stored instances → predict

  • prediction/generalization mechanism: find NN or do weighted vote/average

  • use when: local relationships matter most
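
The two generalization styles can be contrasted on the same 1-D data. A minimal sketch, assuming scikit-learn; the linear data and the choice of `KNeighborsRegressor` as the instance-based learner are illustrative:

```python
# Model-based vs instance-based generalization on y = 2x + 1.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

# Model-based: fit parameters; the training data could then be discarded
model = LinearRegression().fit(X, y)

# Instance-based: keep the training points; predict from nearest neighbors
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# Inside the training range both agree (≈ 10.0)...
print(model.predict([[4.5]])[0], knn.predict([[4.5]])[0])
# ...but far outside it only the model extrapolates (41.0 vs 18.0)
print(model.predict([[20.0]])[0], knn.predict([[20.0]])[0])
```

The k-NN prediction at x = 20 is stuck at the average of the nearest stored instances (x = 8, 9), while the fitted line extrapolates the global trend.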

Challenges of ML

Two challenges: “bad algorithm” and “bad data”.

Data Issues

  1. Quantity: Simple problems typically need thousands of examples; complex ones (e.g., image recognition) may need millions.

  2. Quality: May contain missing values, errors, or noise from poor collection

  3. Non-representative: Training data does not reflect the cases the model must generalize to. Sources:

    1. Sampling noise: The sample is too small (chance variation).

    2. Sampling bias: The sampling method is flawed.

  4. Irrelevant features. Solutions:

    1. Feature selection: Select only most useful features.

    2. Feature extraction: Combine existing features to produce meaningful ones.

    3. New features: Use external sources to create new features.
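
Feature selection (solution 1 above) can be sketched with scikit-learn's `SelectKBest`, which scores each feature against the target and keeps only the top k. The synthetic data, with four noise columns and one signal column, is an illustrative assumption:

```python
# Feature selection: keep only features that are informative about the target.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 4))       # 4 irrelevant features
x_signal = rng.normal(size=(200, 1))      # 1 relevant feature
X = np.hstack([X_noise, x_signal])        # column 4 carries the signal
y = 3 * x_signal.ravel() + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
print(selector.get_support())             # boolean mask: only the signal column survives
```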

Algorithm Issues

Overfitting solutions:

  • Select a model with fewer parameters

  • Feature reduction

  • Constrain the model (regularization)

  • Gather more data

  • Reduce noise in training data

Underfitting solutions:

  • Select a more powerful model, with more parameters

  • Feature engineering

  • Reduce model constraints (eg reduce regularization)
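
Constraining a model via regularization can be sketched with Ridge regression: a larger `alpha` shrinks the coefficients, trading flexibility for robustness to overfitting. The synthetic data and alpha values below are illustrative:

```python
# Regularization demo: coefficient magnitudes shrink as alpha grows.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 2.0, 3.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=50)

norms = []
for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.abs(coef).sum())      # total coefficient magnitude

print(norms)                              # decreasing: stronger constraint, smaller weights
```

Conversely, reducing `alpha` (less regularization) is one of the underfitting fixes listed above.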

Code Process

1. Data Prep

  • Create test set

  • Build preprocessing pipeline

2. Baseline

  • Baseline model & sanity check (cross_val_score)

  • Learning & validation curves (bias/variance)

3. Hyperparameter Search

  • RandomizedSearchCV broadly

  • GridSearchCV refinements

4. Model Iteration

  • Error analysis (confusion matrices, residual plots)

  • Feature & pipeline improvements

  • Ensembles (bagging, boosting, stacking)

5. Estimate Performance

  • Nested k-fold cross-validation (inner CV = tuning, outer CV = generalization)

6. Deployment

  • Retrain on all available training data with chosen hyperparameters

  • Save pipeline, set up monitoring
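
Steps 1-2 of the process above can be sketched end to end. A minimal sketch, assuming scikit-learn and using its bundled iris dataset as a stand-in for real project data:

```python
# 1. Data prep (test split + preprocessing pipeline), 2. baseline sanity check.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 1. Create the test set first, then build a preprocessing pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 2. Baseline model, validated with cross_val_score on the training set only
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(scores.mean())   # the baseline accuracy to beat in later iterations
```

Keeping the scaler inside the pipeline ensures it is refit on each CV training fold, so no test-fold information leaks into preprocessing.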

Hyperparameter Optimization

Model-type details:

  • Parametric: Fixed number of parameters; less complex. Hyperparameters include (1) regularization terms, (2) learning rate, (3) other model-specific settings.

  • Non-parametric: Unconstrained number of parameters; more complex. Hyperparameters mainly control complexity (e.g., tree depth).

Fine-tuning techniques:

  • Grid Search: Exhaustive search over parameter combinations.

  • Random Search: Randomly sample parameters to find optimal settings.

  • Bayesian Optimization: Use probabilistic models to select parameters.
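
The usual pairing of the first two techniques (random search broadly, then grid search to refine) can be sketched as follows. Assumes scikit-learn; the decision tree, its `max_depth` ranges, and the iris data are illustrative:

```python
# Random search samples broadly; grid search then refines around the winner.
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random search: sample a few candidates from a wide range
rand = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": list(range(1, 20))},
    n_iter=5, cv=3, random_state=0)
rand.fit(X, y)

# Grid search: exhaustively try a narrow grid around the random-search best
best_depth = rand.best_params_["max_depth"]
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [max(1, best_depth - 1), best_depth, best_depth + 1]},
    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```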

Cross-Validation

Methods include:

  • k-Fold: Divides data into \(k\) subsets for training and testing.

  • Nested: Addresses hyperparameter overfitting by adding an outer validation loop.

  • Nested k-Fold: Removes overfit "leak" from evaluating on train set. Estimates generalization error of the underlying model & hyperparameters.

    • Inner loop: Fits the model on each training split, then selects hyperparameters on the validation split

    • Outer loop: Estimates generalization error by averaging test set scores over several dataset splits
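
The inner/outer structure falls out naturally in scikit-learn by passing a `GridSearchCV` object to `cross_val_score`. A minimal sketch; the SVC model, its `C` grid, and the iris data are illustrative:

```python
# Nested CV: inner loop tunes hyperparameters, outer loop estimates generalization.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: GridSearchCV selects C on each outer training split
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: score the entire tuning procedure on held-out folds
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())   # generalization estimate for model + tuning combined
```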

Code

from sklearn.model_selection import GridSearchCV

# Grid search over the hyperparameter grid (3-fold CV)
search = GridSearchCV(estimator=rf_classifier, param_grid=parameters, cv=3)

# Fit on the training set
search.fit(x_train, y_train)
print(search.best_params_)

# Evaluate the best combination on the held-out test set
best = search.best_estimator_
accuracy = best.score(x_test, y_test)

Data Mining Tasks

  1. Classification

  2. Regression

  3. Causal modeling

  4. Data reduction

  5. Clustering

  6. Co-occurrence (market basket analysis)

  7. Similarity matching: X bought from us. Who else is likely to?

  8. Profiling: What is the typical behavior of this segment?

  9. Link prediction: You and X share 10 friends; X likes this person, so you probably will too.