
ML 03 Classification

Algorithm Summary

  • k-NN is intuitive and simple but can be slow for large datasets.

  • Naive Bayes is fast and works well with high-dimensional data but assumes independent features.

  • Logistic Regression is a well-understood parametric approach, ideal for linearly separable data, and can handle regularization elegantly.

  • Decision Trees are highly interpretable but can overfit if not carefully regularized (via max depth, minimum samples per leaf, etc.).

| Category | Core Idea | Model Type | Key Hyperparameters | Strengths | Weaknesses | Regularization |
|---|---|---|---|---|---|---|
| k-NN | Find the k closest neighbors, then take a majority vote (or average). | Non-parametric | k (number of neighbors); distance metric | Simple to implement; minimal statistical assumptions; models complex decision boundaries | Slow for large data; poor in high-dimensional space; needs domain knowledge for a good distance metric | Not typically applied; dimensionality reduction can help |
| Naive Bayes | Applies Bayes’ Theorem with a naive assumption of conditional independence among features. | Probabilistic | Distribution choice (e.g., Multinomial) | Fast; robust with many features; works well with small datasets | Poor if features are highly correlated | Smoothing to avoid zero probabilities (e.g., Laplace smoothing) |
| Logistic Regression | Estimates the probability of each class via a linear combination of features passed through a sigmoid (or softmax) function. | Parametric | C (inverse regularization strength); multi_class (one-vs-rest or multinomial); regularization type | Interpretable coefficients; incorporates regularization elegantly; good when data is linearly (or log-linearly) separable | May struggle with highly non-linear relationships; sensitive to outliers without regularization; assumes a linear (or log-linear) decision boundary | L1 (Lasso); L2 (Ridge); Elastic Net (combination of L1 and L2) |
| Decision Trees | Splits data into hierarchical branches based on feature values, aiming to maximize “purity” at each split. | Non-parametric | max_depth (maximum tree depth); min_samples_leaf (minimum samples per leaf); splitting criterion (e.g., Gini or entropy) | Highly interpretable; handles numeric & categorical features; no scaling or dummy variables needed; handles multi-output problems | Overfits if grown without constraints; greedy splitting may not yield a global optimum; sensitive to class imbalance; can struggle with certain complex interactions | Indirect, via max_depth, min_samples_leaf, etc.; pruning can reduce overfitting |
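For a concrete feel for these trade-offs, a minimal sketch fitting all four algorithms on a synthetic dataset (scikit-learn assumed; the dataset and hyperparameter values are illustrative, not tuned):

# Compare the four classifiers on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}
for name, model in models.items():
    # Fit on the training split, report test accuracy.
    score = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {score:.3f}")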

Logistic


Target

  • Binary (Logit): Two classes (0 or 1).

  • Softmax Regression: Multiple classes (unordered); picks the class with the highest probability.

  • Ordered Logit: Multiple ordered classes.

\(P(y = 1 \mid x) \: = \: \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}\)

[Image: Logistic Regression Sigmoid Curve]
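To make the formula concrete, a small sketch (NumPy assumed; the intercept, weights, and observation are made up) that turns a linear score into a probability:

import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted model: intercept -1.0, weights [0.8, -0.5].
beta0, beta = -1.0, np.array([0.8, -0.5])
x = np.array([2.0, 1.0])          # one observation
z = beta0 + beta @ x              # linear score: -1.0 + 1.6 - 0.5 = 0.1
print(sigmoid(z))                 # ~0.525 -> slightly favors class 1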


Steps

  1. Compute linear score: For each observation, add up the weighted contributions of the variables (i.e., the model’s raw score).

  2. Convert to probability: Pass that score through the S-shaped logistic curve to get a probability between 0 and 1.

  3. Compare to reality: Check how far those probabilities are from the actual group labels (0/1).

  4. Fit the model: Adjust the weights so the probabilities match reality as closely as possible across all observations.

  5. Regularize (if needed): Add a penalty for overly large weights so the model stays simple and generalizes better.

  6. Final model: Use the adjusted weights to make predictions on new data.
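A from-scratch sketch of steps 1–5 using plain gradient descent on the log loss (NumPy assumed; the learning rate, iteration count, and L2 strength are illustrative):

import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=1000, l2=0.0):
    w = np.zeros(X.shape[1])                  # weights (bias folded into X)
    for _ in range(n_iter):
        z = X @ w                             # step 1: linear score
        p = 1.0 / (1.0 + np.exp(-z))          # step 2: probability
        grad = X.T @ (p - y) / len(y)         # steps 3-4: log-loss gradient
        grad += l2 * w                        # step 5: L2 penalty term
        w -= lr * grad                        # adjust weights
    return w

# Usage: prepend a column of ones so w[0] acts as the intercept.
# (For simplicity this sketch also penalizes the intercept; a fuller
# version would skip w[0] in the penalty.)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)
print(fit_logistic(X, y, l2=0.01))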

Regularization

Helps prevent overfitting by penalizing large coefficients.

| Type | Penalty | Key Characteristics |
|---|---|---|
| L1 (Lasso) | Sum of absolute values of weights | Encourages sparsity (some coefficients may become exactly zero); can be unstable with highly correlated features; avoids using all features when many are redundant |
| L2 (Ridge) | Sum of squared weights | Shrinks coefficients but rarely sets any to zero; more stable in the presence of correlated features |
| Elastic Net (L1 + L2) | Combination of the L1 and L2 penalties | Useful when multiple correlated features are suspected; retains L1’s feature selection while benefiting from L2’s stability |
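In scikit-learn, these penalties map onto LogisticRegression arguments; a sketch (the saga solver is chosen because it supports all three penalty types; the C and l1_ratio values are illustrative):

from sklearn.linear_model import LogisticRegression

# C is the INVERSE of regularization strength: smaller C = stronger penalty.
lasso   = LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=5000)
ridge   = LogisticRegression(penalty='l2', C=1.0)
elastic = LogisticRegression(penalty='elasticnet', l1_ratio=0.5, C=1.0,
                             solver='saga', max_iter=5000)
# Each is an unfitted estimator; fit as usual, e.g. lasso.fit(X, y).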

DT Purity

When deciding how to split a node, decision tree algorithms use measures like Gini Impurity or Entropy to assess how "pure" the resulting child nodes are.

| Measure | Range (Binary Setting) | Calculation | Characteristics |
|---|---|---|---|
| Gini Impurity | 0 (pure) to 0.5 (maximally impure) | Uses squares of class probabilities | Slightly faster to compute; tends to isolate the most frequent class |
| Entropy | 0 (pure) to 1 (maximally impure) | Uses logs of class probabilities | Tends to produce more balanced splits |
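A small sketch computing both measures for a node’s class distribution (NumPy assumed):

import numpy as np

def gini(p):
    # p: array of class probabilities within a node
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

p = np.array([0.5, 0.5])              # maximally impure binary node
print(gini(p), entropy(p))            # 0.5, 1.0
p = np.array([0.9, 0.1])              # mostly one class
print(gini(p), entropy(p))            # 0.18, ~0.469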

Evaluation

Accuracy can be misleading for two primary reasons:

  1. Imbalanced Class Distributions: When one class dominates, accuracy may inflate how well the model performs.

  2. Ignoring Economic Costs/Benefits: Use a cost/benefit matrix to maximize profit.
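A sketch of the first problem (NumPy/scikit-learn assumed; the 95/5 class ratio is made up): a classifier that always predicts the majority class still scores 95% accuracy.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)   # 95% negative, 5% positive
y_pred = np.zeros(100, dtype=int)       # always predict the majority class

print(accuracy_score(y_true, y_pred))                    # 0.95 - looks great
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0  - never finds a positive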

Cost-Benefit Approach

  1. Construct a “cost/benefit” matrix detailing the financial impact of each type of prediction: TP & TN represent revenue or benefits; FP & FN represent costs or losses.

  2. Multiply your confusion matrix element-wise by the cost/benefit matrix and sum the result to calculate expected profit (or cost), and use this to guide decisions.
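A sketch of the final step (the confusion-matrix counts and payoff values are made up):

import numpy as np

# Confusion matrix laid out as [[TN, FP], [FN, TP]] (sklearn's convention).
conf = np.array([[900,  50],
                 [ 30,  20]])

# Hypothetical payoffs per outcome, same layout: benefits positive, costs negative.
payoff = np.array([[  0, -10],
                   [-50, 200]])

# Element-wise multiply, then sum over all four cells.
expected_profit = np.sum(conf * payoff)
print(expected_profit)   # 900*0 + 50*(-10) + 30*(-50) + 20*200 = 2000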

Formulas (TP, FP, TN, FN)

| Metric | Formula |
|---|---|
| True Positive Rate (TPR) / Recall | \(\frac{TP}{TP + FN}\) |
| False Positive Rate (FPR) | \(\frac{FP}{FP + TN}\) |
| Precision | \(\frac{TP}{TP + FP}\) |
| Recall (same as TPR) | \(\frac{TP}{TP + FN}\) |
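A sketch computing these metrics from raw counts (the counts are made up):

TP, FP, TN, FN = 40, 10, 930, 20

tpr = TP / (TP + FN)          # recall / true positive rate
fpr = FP / (FP + TN)          # false positive rate
precision = TP / (TP + FP)

print(f"TPR={tpr:.3f}  FPR={fpr:.3f}  Precision={precision:.3f}")
# TPR=0.667  FPR=0.011  Precision=0.800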

Model Evaluation Techniques

Key “curves” and model evaluation techniques commonly used in classification:

Scope

  • Within: Evaluate a single model (diagnose overfitting, tune thresholds, handle class imbalance).

  • Across: Compare multiple models, or compare a model against a baseline.

  • Either: Useful in both settings.

| Scope | Evaluation Technique | What | Why | Imbalanced-Data Suitability |
|---|---|---|---|---|
| Within | Confusion Matrix | Shows counts of TP, TN, FP, FN. | Basis for deriving performance metrics. | - |
| Within | ROC Curve (Receiver Operating Characteristic) | Plots TPR vs. FPR at different probability thresholds. | Offers insight into the trade-off between true positives and false positives. | Bad. When the negative class is large, the FPR remains deceptively low, which makes the ROC curve look overly optimistic. |
| Either | AUC (Area Under the ROC Curve) | A single-number summary (the area under the ROC curve). | \(\text{AUC} = 1\) indicates a perfect model; \(\text{AUC} = 0.5\) indicates a model with no discriminative power. | - |
| Either | Precision-Recall Curve | Plots precision vs. recall as the decision threshold varies. | Especially useful for imbalanced datasets, or when false positives and false negatives incur high costs. | Good. Focuses on the minority class, where precision and recall are most critical. |
| Across | Lift Chart | Compares the model’s performance against a random baseline. | Shows how many more positives the model identifies compared to random selection. | Good. Especially relevant when trying to identify a small minority class more effectively than chance. |
| Across | Gain Chart | Displays cumulative gain (the fraction of positives identified) as you move through the sorted predictions. | Like Lift, it shows the improvement gained by the model over random selection. | Good. Like the Lift chart, it highlights model performance on minority classes. |
| Across | Cumulative Response Curve | Shows the proportion of positive instances captured as you move through the ranked predictions. | Common in marketing and lead generation to understand how quickly you capture most of the “yes” cases. | - |
| Within | Validation Curve | Plots training and validation scores across levels of model complexity (e.g., varying a hyperparameter). | Diagnoses overfitting or underfitting by showing whether performance is improving or plateauing. | - |
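A sketch drawing the ROC and precision-recall curves for one model (scikit-learn ≥ 1.0 and matplotlib assumed; the dataset is synthetic and deliberately imbalanced):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay
from sklearn.model_selection import train_test_split

# ~90% negatives, so the PR curve is more informative than the ROC curve.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_estimator(clf, X_test, y_test, ax=axes[0])        # TPR vs. FPR
PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, ax=axes[1]) # P vs. R
plt.show()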

Single-Value Measures

| Metric | What | Why | Imbalanced-Data Suitability |
|---|---|---|---|
| F-Measure (F1 Score) | The harmonic mean of precision and recall: \(F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\) | Combines precision and recall into a single metric, weighting them equally. | Good. Highlights performance on the minority class, where both precision and recall can be low. |
| Matthews Correlation Coefficient (MCC) | A correlation coefficient between observed and predicted classifications: \(\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\) | Accounts for all four quadrants (TP, TN, FP, FN) and stays balanced even when the classes are of very different sizes. | Good. Often more informative than accuracy and works well with imbalanced classes. |
| Cohen’s Kappa | Measures agreement between the model’s predictions and the true labels, adjusted for chance agreement. | A model might appear good by always guessing the majority class; Kappa corrects for that chance agreement. | Mostly good. While it adjusts for chance, it can still be influenced by highly imbalanced distributions. |
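All three are single calls in scikit-learn; a sketch with made-up labels:

from sklearn.metrics import f1_score, matthews_corrcoef, cohen_kappa_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]   # one FP, one FN

print("F1:   ", f1_score(y_true, y_pred))            # 0.75
print("MCC:  ", matthews_corrcoef(y_true, y_pred))   # ~0.583
print("Kappa:", cohen_kappa_score(y_true, y_pred))   # ~0.583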

Classification Code

Curves (Matrix, Precision/Recall, ROC)

# k-NN: tune the neighbor count and weighting scheme
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = dict(n_neighbors=list(range(1, 31)),
                  weights=["uniform", "distance"])
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)

# Tree: constrain depth and leaf size to regularize
from sklearn import tree

param_grid = dict(criterion=["gini", "entropy"],
                  max_depth=range(2, 10),
                  min_samples_leaf=range(2, 8),
                  min_impurity_decrease=[0, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4])
grid_tree_clf = GridSearchCV(tree.DecisionTreeClassifier(random_state=45),
                             param_grid, cv=5)

# Logistic: liblinear supports both L1 and L2 penalties
from sklearn.linear_model import LogisticRegression

param_grid = dict(penalty=['l1', 'l2'],
                  C=range(1, 10))
grid_log = GridSearchCV(LogisticRegression(solver='liblinear'),
                        param_grid, cv=5)

# SVM with a polynomial kernel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

c = 5          # reduce if overfitting
degrees = 3    # polynomial degree
influence = 1  # coef0: weight of high- vs. low-degree terms
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=degrees, coef0=influence, C=c))
])
poly_kernel_svm_clf.fit(X, y)
# In text: print single-value metrics plus the full report
from sklearn.metrics import accuracy_score, classification_report
from sklearn import metrics

target_names = ['malignant', 'benign']
y_true = y_test

print(target_names)
print("Accuracy: {0:.2%}".format(accuracy_score(y_true, y_pred)))
print("Precision: {0:.2%}".format(metrics.precision_score(y_true, y_pred)))
print("Recall: {0:.2%}".format(metrics.recall_score(y_true, y_pred)))
print("F1: {0:.2%}".format(metrics.f1_score(y_true, y_pred)))
print('-------------------------------------')
print(classification_report(y_true, y_pred))
# Visually - https://i.imgur.com/PExd8UC.png
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(ytest, yfit, target_names=faces.target_names))

# Heatmap of the confusion matrix (transposed: rows = predicted label)
mat = confusion_matrix(ytest, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=faces.target_names,
            yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');