Anomaly Detection

Anomaly vs Outlier

Anomaly: comes from a new (different) generating process
Outlier: comes from the same generating process as the rest of the data, but is extreme (usually removed before modeling)

ChatGPT's Taxonomy¶

Anomaly Detection can be categorized using various criteria based on data characteristics, methods used, or application domains.

Here is a taxonomy of anomaly detection, based on:

Background¶

1. Data Type¶

Univariate vs. Multivariate
Static vs. Dynamic (time series, sequential)
Structured vs. Unstructured (database tables vs. data like images/text)

2. Nature of Anomalies¶

Point: Individual point differs from norm.
Contextual: Data points that are anomalous only in a specific context (e.g., a low temperature in summer).
Collective: Groups of related data points differ from norm

3. Learning Paradigm¶

Supervised: Using labeled data
Semi-supervised: Train on normal data, find new points that differ (e.g., autoencoders trained on normal data)
Unsupervised: Use clustering, density estimation, or outlier scoring.

4. Detection Methods¶

Statistical: Assumes data follows a known statistical distribution; anomalies are deviations from this distribution.
- Z-score, Gaussian Mixture Models (GMM).
Machine Learning: Uses algorithms to learn patterns and identify anomalies.
- Clustering: DBSCAN, k-means.
- Classification: SVM, Random Forest.
- Deep Learning: Autoencoders, Variational Autoencoders (VAEs), RNNs
Proximity-Based: Use distance or similarity.
- k-Nearest Neighbors (k-NN)
- Local Outlier Factor (LOF)

Background¶

1. Data Type¶

Univariate vs. Multivariate

Category	Description	Example	Common Methods
Univariate	1 variable	Daily temperature	Z-score (Grubbs’ test); GESD; simple parametric or histogram-based methods
Multivariate	2+ variables	Temperature + humidity	Mahalanobis distance; multivariate Gaussians; clustering

Static vs. Dynamic

Category	Description	Example	Common Methods
Static	Data with no inherent ordering	Random samples in a dataset	k-NN; clustering; one-class SVM; isolation forest
Dynamic / Time Series	Sequentially-ordered data	Sensor readings over time	Moving average; STL decomposition; RNN/LSTM-based models

Structured vs. Unstructured

Category	Description	Example	Common Methods
Structured	Data in well-defined schemas (relational database, tabular)	Relational tables in a database	SQL-based queries; Bayesian networks; standard ML
Unstructured	Data without a strict schema (text, images, audio)	Image anomaly detection, log-text analysis	CNNs for images; transformer-based text models

2. Nature of Anomalies¶

Type	Definition	Example	Notes
Point / Global	A single data point that is far from the rest of the distribution	A single extremely high transaction value	Often flagged via simple statistical methods (e.g., Z-score)
Contextual	A data point that is only anomalous in a given context (time, location, etc.)	75°F in Canadian winter	Must model both context attributes (e.g., location/time) and behavioral attributes (e.g., temp)
Collective	A set of related data points that jointly deviate from normal patterns	A group of transactions that spike at once	Individual points might appear normal, but their group behavior is anomalous
Distributional Shift	The overall distribution changes (concept drift, new patterns emerging)	A sudden change in average temperature or user behavior	May require updating the model or using methods that adapt over time
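
To make the point vs. contextual distinction concrete, here is a minimal sketch that scores each value against its own context group instead of against the whole dataset. The temperature readings, the season labels, and the threshold of 2.0 are all made-up assumptions for illustration, and a plain z-score is only one of several ways to score within a context.

```python
from statistics import mean, stdev

def contextual_zscores(records):
    """Return (context, value, z), with z computed within each context group."""
    groups = {}
    for context, value in records:
        groups.setdefault(context, []).append(value)
    # Per-context mean and sample standard deviation.
    stats = {c: (mean(v), stdev(v)) for c, v in groups.items()}
    scored = []
    for context, value in records:
        mu, sigma = stats[context]
        scored.append((context, value, (value - mu) / sigma))
    return scored

# Hypothetical temperatures in degrees F, tagged by season (the context).
records = [("winter", t) for t in (20, 25, 18, 22, 24, 19, 21, 23, 75)]
records += [("summer", t) for t in (74, 78, 76, 75, 77, 73, 79, 76, 75)]

# 75 appears in both seasons; only the winter reading stands out in context.
flagged = [(c, v) for c, v, z in contextual_zscores(records) if abs(z) > 2.0]
print(flagged)
```

Because the anomalous reading is included when the winter mean and standard deviation are estimated, it inflates both and partially masks itself, which is why the threshold here is fairly low. Robust statistics (median, MAD) are a common alternative.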

3. Learning Paradigm¶

Paradigm	Key Idea	Typical Methods	Pros / Cons
Supervised	Train a model with both normal and anomalous labeled data	SVM, Random Forest, Logistic Regression	Pros: Accurate if labeled data is representative; Cons: Requires labeled anomalies, which can be expensive or rare
Semi-Supervised	Train primarily on normal data; anomalies deviate significantly from learned “normal”	Autoencoders (reconstruction error), One-Class SVM	Pros: Easier to obtain normal data; Cons: Might miss anomalies that look somewhat “normal”
Unsupervised	No labels; rely on inherent data structure to find outliers	Clustering (DBSCAN, k-means), LOF, isolation forest	Pros: No labels needed; Cons: Often requires careful parameter tuning; might flag too many or too few points

4. Detection Methods¶

4.1 Statistical Methods¶

Approach	Assumption	Pros / Cons
Parametric	Data follows a known distribution (e.g., Gaussian)	Pros: Straightforward if the correct distribution is known; Cons: Sensitive to distribution mismatch
Non-Parametric	No assumptions on data distribution	Pros: Flexible; Cons: Can be more computationally intensive

	Univariate	Multivariate
Parametric	Maximum Likelihood Estimation (MLE): fit a chosen distribution (often normal) by estimating its parameters (e.g., mean and std. dev.); points far in the tails may be flagged as outliers. Grubbs’ Test (Z-score): identifies a single outlier by comparing Z-scores to a threshold. GESD (Generalized Extreme Studentized Deviate): iteratively detects multiple outliers.	Mahalanobis Distance: assumes data follows a multivariate normal distribution; uses mean & covariance to measure distance. Chi-squared Test: often used if variables are assumed jointly normal; large values indicate a potential outlier. Expectation-Maximization (EM): fits a Gaussian Mixture Model (or other distributions); points with low likelihood are flagged as anomalies.
Non-Parametric	Histogram-Based: estimate density via binning; points in sparse bins have high outlier scores. Rank-Based Tests: compare point ranks to expected distributions (e.g., Wilcoxon-type tests).	Kernel Density Estimation (KDE): estimates the multivariate density; points in low-density regions may be anomalies. Non-Parametric Distance Methods: similar to proximity-based methods, but viewed from a statistical density perspective.

Notes

Parametric (Univariate)
Maximum Likelihood Estimation (MLE) is used to find parameters (e.g., mean \(\mu\), std. dev. \(\sigma\)) for a hypothesized distribution.
Grubbs’ Test and GESD rely on these parameters to flag outliers.
Parametric (Multivariate)
Mahalanobis Distance and Chi-squared tests assume a multivariate Gaussian distribution.
Expectation-Maximization (EM) can model more complex or mixed distributions (e.g., Gaussian Mixture Models).
Non-Parametric (Univariate / Multivariate)
Do not assume a specific distribution.
Rely on data-driven density estimation (e.g., histograms, kernel density) or rank-based methodologies.
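
As a rough illustration of the parametric multivariate case, the sketch below fits a mean and covariance to reference data assumed to be normal, then compares the squared Mahalanobis distance of new points to a chi-squared cutoff. It is restricted to two variables so it needs only the standard library: with 2 degrees of freedom the chi-squared quantile has the closed form \(-2\ln(1-p)\). The temperature/humidity values and the function names are hypothetical, and with estimated parameters the chi-squared cutoff is only approximate.

```python
import math

def fit_gaussian_2d(points):
    """Estimate the mean and sample covariance of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return mx, my, sxx, sxy, syy

def mahalanobis_sq(point, params):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    mx, my, sxx, sxy, syy = params
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2  # zero if the two variables are perfectly collinear
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def chi2_quantile_2dof(p):
    """Quantile of the chi-squared distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)

# Hypothetical (temperature, humidity) pairs: humidity falls as temperature rises.
reference = [(20, 61), (22, 55), (24, 53), (26, 47),
             (21, 57), (23, 55), (25, 49), (27, 47)]
params = fit_gaussian_2d(reference)
cutoff = chi2_quantile_2dof(0.99)  # about 9.21

typical = mahalanobis_sq((24, 52), params)   # follows the usual relationship
odd_pair = mahalanobis_sq((26, 58), params)  # each value is in range, the pair is not
flags = [d2 > cutoff for d2 in (typical, odd_pair)]
print(flags)
```

The second point is worth noting: 26 and 58 are each unremarkable on their own, so a univariate check on either variable would pass it. Only the covariance term exposes it.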

4.2 Machine Learning Methods¶

Method	Key Idea	Examples	Notes
Clustering	Normal data form clusters; outliers do not fit well	k-means, DBSCAN	Points far from cluster centroids or in sparse clusters are flagged
Classification	Labeled normal/anomalous classes	SVM, Random Forest	Requires sufficient labeled anomalies, which may be rare
Deep Learning	Learn representations or patterns in high-dimensional data	Autoencoders, VAEs, RNNs	Autoencoder reconstruction error is a common anomaly signal

Cluster-based checks for a point
- Does it belong to a cluster? If no --> outlier
- Is it far away from its cluster? If yes --> outlier
- Is the cluster small or sparse? If yes --> outlier

4.3 Proximity-Based Methods¶

Method	Definition	Examples	Calculation / Notes
Distance-Based	Outliers are “far” from neighbors	k-NN outlier detection	Threshold-based or top-N distance rankings
Density-Based	Outliers appear in low-density regions	DBSCAN, LOF	Compare local density with that of neighbors
Local Outlier Factor (LOF)	Ratio of the average density of a point's neighbors to the density of the point itself	LOF algorithm	1. Find the k nearest neighbors 2. Compute local reachability density 3. LOF well above 1 indicates an anomaly

LOF: larger value = more anomalous. Steps, using 3 nearest neighbors (3 nn) as an example:

Step 1: Find distance from me to 3 nn
Step 2: Take the avg
Step 3: Local reach density = 1 / that avg
Step 4: LOF = Avg local reach density of neighbors / my local reach density
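
Steps 1 to 4 can be sketched directly in a few lines. This follows the simplified recipe in these notes (plain average distance to the k nearest neighbors), not the full LOF algorithm, which uses a "reachability distance" in place of the raw distance. The function name and the sample points are assumptions for illustration.

```python
import math

def simplified_lof(points, k=3):
    """Simplified LOF score per point; values well above 1 suggest an anomaly."""
    def knn(i):
        # (distance, index) pairs for the k nearest other points.
        dists = sorted(
            (math.dist(points[i], points[j]), j)
            for j in range(len(points)) if j != i
        )
        return dists[:k]

    neighbors = [knn(i) for i in range(len(points))]
    # Steps 1-3: local reach density = 1 / average distance to the k neighbors.
    # (Duplicate points would give an average of 0 and need special handling.)
    lrd = [1.0 / (sum(d for d, _ in nb) / k) for nb in neighbors]
    # Step 4: average density of the neighbors divided by own density.
    return [
        (sum(lrd[j] for _, j in nb) / k) / lrd[i]
        for i, nb in enumerate(neighbors)
    ]

# Four points on a unit square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = simplified_lof(points, k=3)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
print(most_anomalous, [round(s, 2) for s in scores])
```

The four square corners have identical neighborhoods, so their score is 1 (same density as their neighbors), while the isolated point sits in a much sparser region than its neighbors and scores far above 1.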

4.4 Subset Scanning¶

Approach	Key Idea	Examples	Notes
WSARE (What's Strange About Recent Events) + Chi-squared	Identify a rule or subset that is not independent of time	Binning data by features, seeing if some combination spikes	Commonly used for disease outbreak detection; uses stats to see if patterns deviate in time
Bayesian Networks	Check if a row/subset fits a “normal” category	Causal or hierarchical relationships	Learns the joint probability of features; flags subsets that break expected dependencies
Predictive Models	Build a forecast or expected behavior, compare to actual	Regressions, ARIMA, machine learning	Subsets that deviate significantly from predictions are flagged
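
The "not independent of time" idea in the first row can be illustrated with a single 2x2 chi-squared test: count how many records match a candidate rule today versus in a baseline period, and test whether the match rate depends on the period. This is only a sketch of that one test. It does not reproduce the rule search over feature combinations, and it ignores the multiple-comparison problem that arises when many rules are tested. The counts below are invented.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: records matching a rule vs. not, today vs. baseline.
today_match, today_other = 30, 70
base_match, base_other = 100, 900

stat = chi2_2x2(today_match, today_other, base_match, base_other)

CRITICAL_1DOF_05 = 3.841  # chi-squared critical value for 1 dof at alpha = 0.05
rule_depends_on_time = stat > CRITICAL_1DOF_05
print(round(stat, 2), rule_depends_on_time)
```

Here 30% of today's records match the rule against 10% in the baseline, so the statistic is far above the critical value. If today's rate equalled the baseline rate, the statistic would be 0.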

Time-Dependent Anomaly Detection¶

\(\text{Temporal data} = \text{seasonal patterns} + \text{overall trend} + \text{irregular (i.e., noise)}\). We need to account for the seasonal and trend components (for example, by removing them) so that the remaining data is stationary.

Aspect	Description	Techniques	Notes
Seasonality & Trends	Data may exhibit daily/weekly/annual cycles and overall trends	STL decomposition; moving average/residual analysis	Helps isolate the “irregular” component where anomalies may be found
Moving Average	Uses recent history to estimate today’s expected value	Compare actual vs. expected; flag large residuals	Assumes short-term stationarity; statistical tests (e.g., GESD) can be applied to the residuals
STL (Seasonal and Trend decomposition using Loess)	Decomposes time series into seasonal, trend, and remainder components	Loess-based smoothing	The remainder (irregular) component can highlight anomalies after accounting for seasonality and trend. Good for nonlinear relationships.
Sequential / RNN-based	Models temporal dependencies (e.g., LSTM)	Recurrent Neural Networks	Learns normal temporal patterns; flags unusual sequences or hidden state transitions
Concept Drift / Distribution Shift	Distribution may change over time, invalidating older models	Online learning; adaptive algorithms	Requires continuous updating of model parameters to adapt to new normal patterns

Moving Average: assumes recent data is indicative of today. Steps:

1. Compute the residual: what we expect (the average of the most recent values) minus what we got.
2. Residuals are expected to be approximately normally distributed (CLT).
3. Use GESD on the residuals to flag values that differ from what the most recent data would lead us to expect.
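
A minimal sketch of the moving-average approach follows. One substitution to note: GESD needs t-distribution critical values, so to stay within the standard library this example flags residuals with a simple z-score threshold instead. The window size, the threshold, and the readings are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def flag_by_moving_average(series, window=3, threshold=2.5):
    """Return indices whose value deviates strongly from the trailing average."""
    # Residual = expected (mean of the previous `window` values) - actual.
    residuals = [
        mean(series[t - window:t]) - series[t]
        for t in range(window, len(series))
    ]
    mu, sigma = mean(residuals), stdev(residuals)
    # Simple z-score test on the residuals (a stand-in for GESD).
    return [
        i + window  # shift back to an index into the original series
        for i, r in enumerate(residuals)
        if abs(r - mu) / sigma > threshold
    ]

# Hypothetical daily readings with one spike at index 10.
series = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 25, 11, 10, 12, 11, 10]
flagged = flag_by_moving_average(series)
print(flagged)
```

One side effect is visible in this data: the spike also enters the trailing window for the next three days, inflating their expected values and residuals. They stay below the threshold here, but a longer window or a median-based expectation would reduce that contamination.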