ML 03 Classification
Algorithm Summary¶
- k-NN is intuitive and simple but can be slow for large datasets.
- Naive Bayes is fast and works well with high-dimensional data but assumes independent features.
- Logistic Regression is a well-understood parametric approach, ideal for linearly separable data, and can handle regularization elegantly.
- Decision Trees are highly interpretable but can overfit if not carefully regularized (via max depth, minimum samples per leaf, etc.).
| CATEGORY | CORE IDEA | MODEL TYPE | KEY HYPERPARAMETERS | STRENGTHS | WEAKNESSES | REGULARIZATION |
|---|---|---|---|---|---|---|
| k-NN | 1. Find k closest neighbors. 2. Majority vote or average. | Non-parametric | k: number of neighbors. Distance metric. | Simple to implement. Minimal statistical assumptions. Models complex decision boundaries. | Slow for large data. Poor in high-dimensional space. Needs domain knowledge for a good distance metric. | Not typically applied; could use dimensionality reduction. |
| Naive Bayes | Applies Bayes’ Theorem with a naive assumption of conditional independence among features. | Probabilistic | Distribution choice (e.g., Multinomial). | Fast. Robust with many features. Works well with small datasets. | Poor if features are highly correlated. | Smoothing techniques to avoid zero probabilities (e.g., Laplace smoothing). |
| Logistic Regression | Estimates the probability of each class via a linear combination of features passed through a sigmoid (or softmax) function. | Parametric | C: inverse regularization strength. multi_class: one-vs-rest or multinomial. Regularization type. | Interpretable coefficients. Can incorporate regularization. Good when data is linearly (or log-linearly) separable. | May struggle with highly non-linear relationships. Sensitive to outliers if regularization is not used properly. Assumes a linear (or log-linear) decision boundary. | L1 (Lasso). L2 (Ridge). Elastic Net: combination of L1 and L2. |
| Decision Trees | Splits data into hierarchical branches based on feature values, aiming to maximize “purity” at each split. | Non-parametric | max_depth: maximum tree depth. min_samples_leaf: minimum samples required in a leaf node. Splitting criterion (e.g., Gini or Entropy). | Highly interpretable. Handles numeric and categorical features. No need for scaling or dummy variables. Handles multi-output problems. | Overfits if grown without constraints. Greedy splitting may not yield a global optimum. Sensitive to data imbalance. Can struggle with certain complex interactions. | Indirect, via max_depth, min_samples_leaf, etc. Pruning can reduce overfitting. |
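For quick reference, here is a minimal scikit-learn sketch instantiating each of the four algorithms above with the hyperparameters named in the table (the specific values are illustrative assumptions, not recommendations):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# k-NN: k and the distance metric are the main knobs
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski")

# Naive Bayes: the distribution choice is the model class itself;
# alpha is Laplace smoothing to avoid zero probabilities
nb = MultinomialNB(alpha=1.0)

# Logistic Regression: C is inverse regularization strength
logit = LogisticRegression(C=1.0, penalty="l2")

# Decision Tree: depth and leaf size act as indirect regularizers
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=4, criterion="gini")

# All four expose the same interface:
# model.fit(X_train, y_train); model.predict(X_test)
```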
Logistic¶
Target
- Binary (Logit): Two classes (0 or 1).
- Softmax Regression: Multiple classes (unordered); picks the class with the highest probability.
- Ordered Logit: Multiple ordered classes.
\(p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}\), i.e., the regression model’s linear score passed through the sigmoid.
Steps (sketched in code after this list)
- Compute linear score: Per observation, add up the weighted contribution of each variable (i.e., the model’s raw score).
- Convert to probability: Pass that score through the S-shaped logistic curve to get a probability between 0 and 1.
- Compare to reality: Check how far those probabilities are from the actual group labels (0/1).
- Fit the model: Adjust the weights so the probabilities match reality as closely as possible across all observations.
- Regularize (if needed): Add a penalty for overly large weights so the model stays simple and generalizes better.
- Final model: Use the adjusted weights to make predictions on new data.
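A minimal NumPy sketch of those steps, with hypothetical toy data `X`, `y` and plain gradient descent standing in for a library solver:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical toy data: 4 observations, 2 features
X = np.array([[0.5, 1.2], [1.5, 0.3], [3.0, 2.2], [2.5, 3.1]])
y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1])  # weights
b = 0.0                   # intercept
lr, lam = 0.1, 0.01       # learning rate, L2 penalty strength

for _ in range(1000):
    z = X @ w + b                            # 1. linear score per observation
    p = sigmoid(z)                           # 2. convert to probability
    error = p - y                            # 3. compare to actual labels
    grad_w = X.T @ error / len(y) + lam * w  # 5. L2 regularization term
    grad_b = error.mean()
    w -= lr * grad_w                         # 4. adjust weights to fit
    b -= lr * grad_b

print((sigmoid(X @ w + b) >= 0.5).astype(int))  # 6. predict with fitted weights
```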
Regularization¶
Helps prevent overfitting by penalizing large coefficients.
| Type | Penalty | Key Characteristics |
|---|---|---|
| L1 (Lasso) | Sum of absolute values of weights | - Encourages sparsity (some coefficients may become zero) - Can be unstable with highly correlated features - Avoids using all features if many are redundant |
| L2 (Ridge) | Sum of squared values of weights | - Tends to shrink coefficients but rarely sets any to zero - More stable in the presence of correlated features |
| Elastic Net (L1 + L2) | Combination of L1 and L2 penalties | - Useful when multiple correlated features are suspected - Retains feature selection from L1 while benefiting from L2’s stability |
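In scikit-learn these correspond to LogisticRegression’s `penalty` options; a brief sketch (note the solver choice matters: `liblinear` or `saga` supports L1, and only `saga` supports elastic net):

```python
from sklearn.linear_model import LogisticRegression

# L1: sparse coefficients (some become exactly zero)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# L2: shrinks coefficients toward zero without eliminating them
ridge = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0)

# Elastic Net: l1_ratio blends the two penalties (1.0 = pure L1)
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0)
```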
DT Purity¶
When deciding how to split a node, decision tree algorithms use measures like Gini Impurity or Entropy to assess how "pure" the resulting child nodes are.
| Measure | Range (Binary Setting) | Calculation | Characteristics |
|---|---|---|---|
| Gini Impurity | 0 (pure) to 0.5 (impure) | Uses squares of class probabilities | - Slightly faster to compute - Tends to isolate the most frequent class |
| Entropy | 0 (pure) to 1 (impure) | Uses logs of class probabilities | - Tends to produce more balanced splits |
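A small sketch of both measures for a single node, computed from class probabilities (standard formulas: Gini \(= 1 - \sum_i p_i^2\), entropy \(= -\sum_i p_i \log_2 p_i\)):

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum(p_i * log2(p_i)), treating 0*log(0) as 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5, 1.0 (maximally impure)
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0, 0.0 (pure)
```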
Evaluation¶
Accuracy can be misleading, for two primary reasons:
- Imbalanced Class Distributions: When one class dominates, accuracy may inflate how well the model performs.
- Ignoring Economic Costs/Benefits: Use a cost/benefit matrix to maximize profit.
Cost-Benefit Approach (sketched in code after this list)
- Construct a “cost/benefit” matrix detailing the financial impact of each type of prediction:
  - TP & TN: Represent revenue or benefits.
  - FP & FN: Represent costs or losses.
- Multiply each cell of the confusion matrix by the corresponding cost/benefit value and sum the results to calculate expected profit (or cost), and use this to guide decisions.
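A minimal sketch of that calculation, using hypothetical counts and dollar values:

```python
import numpy as np

# confusion matrix counts: [[TN, FP], [FN, TP]] (sklearn's layout)
confusion = np.array([[850, 50],
                      [30, 70]])

# hypothetical cost/benefit per outcome, in the same layout:
# TN costs nothing, FP costs $10, FN costs $50, TP earns $100
cost_benefit = np.array([[0, -10],
                         [-50, 100]])

expected_profit = np.sum(confusion * cost_benefit)
print(expected_profit)  # 850*0 + 50*-10 + 30*-50 + 70*100 = 5000
```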
Formulas (TP, FP, TN, FN)¶
| Metric | Formula |
|---|---|
| True Positive Rate (TPR) / Recall | \(\frac{TP}{TP + FN}\) |
| False Positive Rate (FPR) | \(\frac{FP}{FP + TN}\) |
| Precision | \(\frac{TP}{TP + FP}\) |
| Recall (Same as TPR) | \(\frac{TP}{TP + FN}\) |
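These follow directly from the four counts; a quick sketch with hypothetical numbers:

```python
TP, FP, TN, FN = 70, 50, 850, 30  # hypothetical counts

tpr = TP / (TP + FN)        # recall / sensitivity
fpr = FP / (FP + TN)        # false positive rate
precision = TP / (TP + FP)

print(f"TPR/Recall: {tpr:.2f}, FPR: {fpr:.2f}, Precision: {precision:.2f}")
```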
Model Evaluation Techniques¶
Key “curves” and model evaluation techniques commonly used in classification:
Scope
- Within: Evaluate a single model (diagnose overfitting, threshold tuning, and class imbalance).
- Across: Compare multiple models, or compare a model against a baseline.
- Either: Useful in both settings.
| Scope | Evaluation Technique | What | Why | Imbalanced Data Suitability |
|---|---|---|---|---|
| Within | Confusion Matrix | Shows counts of TP, TN, FP, FN. | Derive performance metrics. | - |
| Within | ROC Curve (Receiver Operating Characteristic) | Plots TPR vs. FPR at different probability thresholds. | Offers insight into the trade-off between TP & FP. | Bad. When the negative class is large, the FPR remains deceptively low, which makes the ROC curve look overly optimistic. |
| Either | AUC (Area Under the ROC Curve) | A single-number summary (the area under the ROC curve). | \(\text{AUC} = 1\) indicates a perfect model; \(\text{AUC} = 0.5\) indicates a model with no discriminative power. | - |
| Either | Precision-Recall Curve | Plots precision vs. recall as the decision threshold varies. | Especially useful for imbalanced datasets, or when false positives and false negatives incur high costs. | Good. Focuses on the minority class, where precision and recall are most critical. |
| Across | Lift Chart | Compares the model’s performance against a random baseline. | Shows how many more positives are identified by the model compared to random selection. | Good. Especially relevant if you’re trying to identify a small minority class more effectively than chance. |
| Across | Gain Chart | Displays cumulative gain (the fraction of positives identified) as you move through the sorted predictions. | Similar to Lift, it shows the improvement gained by the model over random selection. | Good. Like the Lift chart, it highlights model performance on minority classes. |
| Across | Cumulative Response Curve | Shows the proportion of positive instances captured as you move through the ranked predictions. | Commonly used in marketing and lead-generation applications to understand how quickly you capture most of the “yes” cases. | - |
| Within | Validation Curve | Plots the training and validation scores across different levels of model complexity (e.g., varying hyperparameters). | Helps diagnose overfitting or underfitting by showing whether model performance is improving or plateauing. | - |
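As a sketch, the ROC and Precision-Recall curves can be plotted with scikit-learn, assuming a fitted binary classifier `clf` with `predict_proba` and a held-out `X_test`/`y_test` (hypothetical names):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score

# probability scores for the positive class
scores = clf.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)
prec, rec, _ = precision_recall_curve(y_test, scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, scores):.2f}")
ax1.plot([0, 1], [0, 1], "k--")  # random-guess baseline
ax1.set(xlabel="FPR", ylabel="TPR", title="ROC Curve")
ax1.legend()
ax2.plot(rec, prec)
ax2.set(xlabel="Recall", ylabel="Precision", title="Precision-Recall Curve")
plt.show()
```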
Single-Value Measures¶
| Metric | What | Why | Imbalanced Data Suitability |
|---|---|---|---|
| F-Measure (F1 Score) | The harmonic mean of precision and recall: \(F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\) | Combines precision and recall into a single metric, weighting them equally. | Good. Highlights performance on the minority class, where both precision and recall can be low. |
| Matthews Correlation Coefficient (MCC) | A correlation coefficient between observed and predicted classifications: \(\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\) | Accounts for all four quadrants (\(TP, TN, FP, FN\)) and provides a balanced measure even if the classes are of very different sizes. | Good. MCC is often more informative than accuracy and works well with imbalanced classes. |
| Cohen’s Kappa | Measures agreement between the model’s predictions and the true labels, adjusted for chance agreement. | In imbalanced scenarios, a model might appear good by randomly guessing the majority class. Kappa accounts for this chance agreement. | Mostly good. While it adjusts for chance, it can still be influenced by highly imbalanced distributions. |
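All three are available in `sklearn.metrics`; a quick sketch with hypothetical label arrays:

```python
from sklearn.metrics import f1_score, matthews_corrcoef, cohen_kappa_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]  # hypothetical labels
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1]

print("F1:   ", f1_score(y_true, y_pred))
print("MCC:  ", matthews_corrcoef(y_true, y_pred))
print("Kappa:", cohen_kappa_score(y_true, y_pred))
```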
Classification Code¶
Curves (Matrix, Precision/Recall, ROC)
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.linear_model import LogisticRegression

# k-NN: search over neighborhood size and weighting scheme
param_grid = dict(n_neighbors=list(range(1, 31)),
                  weights=["uniform", "distance"])
knn = KNeighborsClassifier()
grid_knn = GridSearchCV(knn, param_grid)

# Decision Tree: search over split criterion and size constraints
param_grid = dict(criterion=["gini", "entropy"],
                  max_depth=range(2, 10),
                  min_samples_leaf=range(2, 8),
                  min_impurity_decrease=[0, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4])
grid_tree_clf = tree.DecisionTreeClassifier(random_state=45)
grid_tree = GridSearchCV(grid_tree_clf, param_grid)

# Logistic Regression: penalty type and regularization strength
# ('l1' requires solver='liblinear' or 'saga')
param_grid = dict(penalty=["l1", "l2"],
                  C=range(1, 10))
grid_logit = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid)
# SVM with a polynomial kernel; scale first, since SVMs are scale-sensitive
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

c = 5          # regularization strength; reduce if overfitting
degrees = 3    # polynomial degree
influence = 1  # coef0: how much high-degree terms influence the kernel
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=degrees, coef0=influence, C=c))
])
poly_kernel_svm_clf.fit(X, y)
# In text
from sklearn.metrics import accuracy_score, classification_report
from sklearn import metrics

target_names = ['malignant', 'benign']
y_true = y_test  # true labels from the held-out test set
print(target_names)
print("Accuracy: {0:.2%}".format(accuracy_score(y_true, y_pred)))
print("Precision: {0:.2%}".format(metrics.precision_score(y_true, y_pred)))
print("Recall: {0:.2%}".format(metrics.recall_score(y_true, y_pred)))
print("F1: {0:.2%}".format(metrics.f1_score(y_true, y_pred)))
print('-------------------------------------')
print(classification_report(y_true, y_pred, target_names=target_names))
# Visually - https://i.imgur.com/PExd8UC.png
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(ytest, yfit, target_names=faces.target_names))
mat = confusion_matrix(ytest, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=faces.target_names,
            yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');