Anomaly Detection
Anomaly vs Outlier
-
Anomaly: new generating process
-
Outlier: same generating process (get rid of for modeling)
ChatGPT's Taxonomy¶
Sam
Anomaly Detection can be categorized using various criteria based on data characteristics, methods used, or application domains.
Here is a taxonomy of anomaly detection, based on..
Background¶
1. Data Type¶
-
Univariate vs. Multivariate
-
Static vs. Dynamic (times series, sequential)
-
Structured vs. Unstructured (database vs data like images/text)
2. Nature of Anomalies¶
-
Point: Individual point differs from norm.
-
Contextual: Data points that are anomalous only in a specific context (e.g., a low temperature in summer).
-
Collective: Groups of related data points differ from norm
3. Learning Paradigm¶
-
Supervised: Using labeled data
-
Semi-supervised: Train on normal data, find new points that differ (eg, Autoencoders trained on normal data)
-
Unsupervised: Use clustering, density estimation, or outlier scoring.
4. Detection Methods¶
-
Statistical: Assumes data follows a known statistical distribution; anomalies are deviations from this distribution.
- Z-score, Gaussian Mixture Models (GMM).
-
Machine Learning: Uses algorithms to learn patterns and identify anomalies.
-
Clustering: DBSCAN, k-means.
-
Classification: SVM, Random Forest.
-
Deep Learning: Autoencoders, Variational Autoencoders (VAEs), RNNs
-
-
Proximity-Based: Use distance or similarity.
-
k-Nearest Neighbors (k-NN)
-
Local Outlier Factor (LOF)
-
Background¶
1. Data Type¶
Univariate vs. Multivariate
Sam
| Category | Description | Example | Common Methods |
|---|---|---|---|
| Univariate | 1 variable | Daily temperature | - Z-score (Grubbs’ test) - GESD - simple parametric or histogram-based |
| Multivariate | 2+ variables | Temperature + humidity | - Mahalanobis distance, - multivariate Gaussians, - clustering |
Static vs. Dynamic
Sam
| Category | Description | Example | Common Methods |
|---|---|---|---|
| Static | Random samples in a dataset | - k-NN - clustering - one-class SVM - isolation forest |
|
| Dynamic / Time Series | Sequentially-ordered data | Sensor readings over time | - Moving average -STL decomposition -RNN/LSTM-based |
Structured vs. Unstructured
Sam
| Category | Description | Example | Common Methods |
|---|---|---|---|
| Structured | Data in well-defined schemas (relational database, tabular) | Relational tables in a database | - SQL-based queries - Bayesian networks - standard ML |
| Unstructured | Data without a strict schema (text, images, audio) | Image anomaly detection, log-text analysis | - CNNs for images - transformer-based text models |
2. Nature of Anomalies¶
| Type | Definition | Example | Notes |
|---|---|---|---|
| Point / Global | A single data point that is far from the rest of the distribution | A single extremely high transaction value | Often flagged via simple statistical methods (e.g., Z-score) |
| Contextual | A data point that is only anomalous in a given context (time, location, etc.) | 75°F in Canadian winter | Must model both context attributes (e.g., location/time) and behavioral attributes (e.g., temp) |
| Collective | A set of related data points that jointly deviate from normal patterns | A group of transactions that spike at once | Individual points might appear normal, but their group behavior is anomalous |
| Distributional Shift | The overall distribution changes (concept drift, new patterns emerging) | A sudden change in average temperature or user behavior | May require updating the model or using methods that adapt over time |
3. Learning Paradigm¶
| Paradigm | Key Idea | Typical Methods | Pros / Cons |
|---|---|---|---|
| Supervised | Train a model with both normal and anomalous labeled data | SVM, Random Forest, Logistic Regression | Pros: Accurate if labeled data is representative; Cons: Requires labeled anomalies, which can be expensive or rare |
| Semi-Supervised | Train primarily on normal data; anomalies deviate significantly from learned “normal” | Autoencoders (reconstruction error), One-Class SVM | Pros: Easier to obtain normal data; Cons: Might miss anomalies that look somewhat “normal” |
| Unsupervised | No labels; rely on inherent data structure to find outliers | Clustering (DBSCAN, k-means), LOF, isolation forest | Pros: No labels needed; Cons: Often requires careful parameter tuning; might flag too many or too few points |
4. Detection Methods¶
4.1 Statistical Methods¶
| Approach | Assumption | Pros / Cons |
|---|---|---|
| Parametric | Data follows a known distribution (e.g., Gaussian) | Pros: Straightforward if correct distribution known Cons: Sensitive to distribution mismatch |
| Non-Parametric | No assumptions on data distribution | Pros: Flexible Cons: Can be more computationally intensive |
| Univariate | Multivariate | |
|---|---|---|
| Parametric | Maximum Likelihood Estimation (MLE): \(\rightarrow\) Fit a chosen distribution (often normal) by estimating parameters (e.g., mean and std. dev.) \(\rightarrow\) Points far in the tails may be flagged as outliers Grubbs’ Test (Z-score): \(\rightarrow\) Identifies a single outlier by comparing Z-scores to a threshold GESD (Generalized Extreme Studentized Deviate): \(\rightarrow\) Iteratively detects multiple outliers |
Mahalanobis Distance: \(\rightarrow\) Assumes data follows a multivariate normal distribution; uses mean & covariance to measure distance Chi-squared Test: \(\rightarrow\) Often used if variables are assumed jointly normal; large values indicate a potential outlier Expectation-Maximization (EM): \(\rightarrow\) Fits a Gaussian Mixture Model (or other distributions); points with low likelihood are flagged as anomalies |
| Non-Parametric | - Histogram-Based: \(\rightarrow\) Estimate density via binning; points in sparse bins have high outlier scores - Rank-Based Tests: \(\rightarrow\) Compare point ranks to expected distributions (e.g., Wilcoxon-type tests) |
Kernel Density Estimation (KDE): \(\rightarrow\) Estimates the multivariate density; points in low-density regions may be anomalies Non-Parametric Distance Methods: \(\rightarrow\) Similar to proximity-based methods, but viewed from a statistical density perspective |
Notes
-
Parametric (Univariate)
-
Maximum Likelihood Estimation (MLE) is used to find parameters (e.g., mean \mu, std. dev. \sigma) for a hypothesized distribution.
-
Grubbs’ Test and GESD rely on these parameters to flag outliers.
-
Parametric (Multivariate)
-
Mahalanobis Distance and Chi-squared tests assume a multivariate Gaussian distribution.
-
Expectation-Maximization (EM) can model more complex or mixed distributions (e.g., Gaussian Mixture Models).
-
Non-Parametric (Univariate / Multivariate)
-
Do not assume a specific distribution.
-
Rely on data-driven density estimation (e.g., histograms, kernel density) or rank-based methodologies.
4.2 Machine Learning Methods¶
| Method | Key Idea | Examples | Notes |
|---|---|---|---|
| Clustering | Normal data form clusters; outliers do not fit well | k-means, DBSCAN | Points far from cluster centroids or in sparse clusters are flagged |
| Classification | Labeled normal/anomalous classes | SVM, Random Forest | Requires sufficient labeled anomalies, which may be rare |
| Deep Learning | Learn representations or patterns in high-dimensional data | Autoencoders, VAEs, RNNs | Autoencoder reconstruction error is a common anomaly signal |
-
Clusters
-
Does it belong to a cluster? If no --> outlier
-
Is it far away from its cluster? If yes --> outlier
-
Is the cluster small or sparse? if yes --> outlier
-
4.3 Proximity-Based Methods¶
| Method | Definition | Examples | Calculation / Notes |
|---|---|---|---|
| Distance-Based | Outliers are “far” from neighbors | k-NN outlier detection | Threshold-based or top-N distance rankings |
| Density-Based | Outliers appear in low-density regions | DBSCAN, LOF | Compare local density with that of neighbors |
| Local Outlier Factor (LOF) | Ratio of the density of a point vs. the density of its neighbors | LOF algorithm | 1. Find kkk-nearest neighbors 2. Compute local reachability density 3. LOF > 1 indicates anomaly |
LOF: larger value = more anomalous Steps using 3 nn as an example
-
Step 1: Find distance from me to 3 nn
-
Step 2: Take the avg
-
Step 3: Local reach density = 1 / that avg
-
Step 4: LOF = Avg local reach density of neighbors / my local reach density
4.4 Subset Scanning¶
| Approach | Key Idea | Examples | Notes |
|---|---|---|---|
| WSARE + Chi-squared | Identify a rule or subset that is not independent of time | Binning data by features, seeing if some combination spikes | Commonly used for disease outbreak detection; uses stats to see if patterns deviate in time |
| Bayesian Networks | Check if a row/subset fits a “normal” category | Causal or hierarchical relationships | Learns the joint probability of features; flags subsets that break expected dependencies |
| Predictive Models | Build a forecast or expected behavior, compare to actual | Regressions, ARIMA, machine learning | Subsets that deviate significantly from predictions are flagged |
Time-Dependent Anomaly Detection¶
\(\text{Temporal data = seasonal patterns + overall trend + irregular (ie noise})\). We need to take these into consideration to make our data stationary.
| Aspect | Description | Techniques | Notes |
|---|---|---|---|
| Seasonality & Trends | Data may exhibit daily/weekly/annual cycles and overall trends | - STL decomposition - Moving average/residual analysis |
Helps isolate the “irregular” component where anomalies may be found |
| Moving Average | Uses recent history to estimate today’s expected value | - Compare actual vs. expected - Residuals flagged if large |
Assumes short-term stationarity, can apply statistical tests (e.g., GESD) on residuals |
| STL (Seasonal and Trend decomposition using Loess) | Decomposes time series into seasonal, trend, and remainder components | - Loess-based smoothing | The remainder (irregular) component can highlight anomalies after accounting for seasonality and trend. Good for nonlinear relationships. |
| Sequential / RNN-based | Models temporal dependencies (e.g., LSTM) | - Recurrent Neural Networks | Learns normal temporal patterns; flags unusual sequences or hidden state transitions |
| Concept Drift / Distribution Shift | Distribution may change over time, invalidating older models | - Online learning - Adaptive algorithms |
Requires continuous updating of model parameters to adapt to new normal patterns |
Moving Average: assuming recent data is indicative of today Steps
-
Take difference from what we expect - what we got
- Residuals expected to be normally distributed (CLT)
-
Flag the values that are different than what we would expect from the most recent with GESD