Clustering
Overview¶
| Method | Characteristics | Advantages | Limitations | Use Cases |
|---|---|---|---|---|
| Partitioning | Distance-based clustering; clusters are mutually exclusive and collectively exhaustive (MECE) spheres. | Simple to understand and implement; effective for small to medium datasets. | Assumes spherical clusters; sensitive to initial conditions and outliers. | Clustering small to medium-sized datasets (<1M rows). |
| Hierarchical | Multi-level cluster structure. | Builds a hierarchy for better visualization; no need to pre-specify the number of clusters. | Cannot correct wrong merges or splits; computationally expensive for large datasets. | Visualizing nested clusters in small datasets. |
| Density-based | Identifies dense regions of points. | Handles arbitrary cluster shapes; can filter out noise and outliers. | Struggles with varying densities; sensitive to parameter tuning. | Clustering with noise or irregular cluster shapes. |
| Grid-based | Grid-based data partitioning. | Fast processing regardless of dataset size; scalable for large datasets. | Resolution depends on grid size; struggles with very high-dimensional data. | Fast clustering of large-scale datasets. |
Partitioning¶
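K-means
- High influence of outliers
  - Could use k-medoids instead
- Only works for continuous data
  - K-modes for categorical data (Hamming distance)
  - K-prototypes: k-means + k-modes for mixed data (a combined numeric and categorical dissimilarity, similar in spirit to Gower distance)
- No hierarchy provided
  - Use hierarchical clustering first, then partition clustering
- Biased towards circular (spherical) clusters
  - Use DBSCAN instead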
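
To make the k-means idea concrete, here is a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points, and the choice to pass the initial centroids in by hand are all assumptions for illustration; real implementations pick initial centroids more carefully, which is exactly the "sensitive to initial conditions" caveat.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two well-separated blobs.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)     # cluster index per point
print(centroids)  # final cluster means
```

Because each centroid is a mean, a single extreme point can drag it away from the bulk of its cluster, which is why a medoid (an actual data point) is the more robust choice when outliers are present.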
Hierarchical¶
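- Agglomerative (bottom-up): start with each point as its own cluster and merge the closest clusters; the opposite of divisive
  - AGNES (AGglomerative NESting)
- Divisive (top-down): start with 1 cluster and split it apart
  - DIANA (DIvisive ANAlysis)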
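
A rough sketch of the agglomerative (bottom-up) direction, in plain Python. It assumes single linkage (cluster distance = distance between the closest pair of members), which is only one of several possible linkage choices; the function name `agglomerative` and the toy data are hypothetical.

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage (an assumed linkage choice)."""
    clusters = [[i] for i in range(len(points))]  # start: every point is its own cluster
    merges = []  # record of merges; this list is the hierarchy
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # a merge is never undone
        del clusters[b]
    return clusters, merges

# Toy 1-D data (hypothetical).
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)  # point indices grouped by cluster
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The `merges` list is what a dendrogram draws. The pairwise search on every pass is also why this approach gets expensive on large datasets.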

Density¶
Motivation: What if the clusters are not circular?
- General idea of density-based (DB) clustering: continue growing a cluster as long as the neighborhood meets some density threshold (a minimum number of data points)

DBSCAN:
- Epsilon: the radius of the neighborhood drawn around each data point, with that point at the center
- Density of neighborhood: number of data points inside that region
- MinPts: minimum number of points for a neighborhood to be considered dense
- Core point: a data point whose neighborhood contains at least MinPts points
- Directly density-reachable: a point is directly density-reachable from a core point if it lies within that core point's neighborhood
- Density-reachable: a point is density-reachable from a core point if a chain of points links them, each one directly density-reachable from the previous one
- Density-connected: two points are density-connected if both are density-reachable from a common core point
- Density-based cluster: a maximal group of density-connected points
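
The DBSCAN definitions above can be sketched in plain Python as follows. This is a simplified illustration, not a reference implementation: the function name `dbscan`, the brute-force neighbor search, and the toy data are assumptions, and `-1` is used here as the noise label.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Epsilon-neighborhood of point i (includes i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1            # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)     # points directly density-reachable from i
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core too, so keep growing
    return labels

# Toy data (hypothetical): two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Nothing in the loop assumes a cluster shape: a cluster keeps growing through any chain of core points, which is what lets DBSCAN follow non-circular regions and leave isolated points labeled as noise.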
Mixture model: start with the data and estimate the underlying distribution that generated it (each cluster corresponds to one component of the mixture)
Measuring Performance¶
- Silhouette: maximize homogeneity within clusters and heterogeneity between clusters
  - Are individual points correctly assigned to their clusters?
  - Coefficient between -1 and 1

\[\text{Silhouette coef} = \frac{b - a}{\max(a, b)}\]
a = Avg distance from the point to all other points in its own cluster
b = Avg distance from the point to all points in the nearest other cluster
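
As a worked example, the sketch below computes the silhouette coefficient per point using the standard form s = (b - a) / max(a, b). The function name `silhouette`, the toy data, and the convention of returning 0 for a point in a singleton cluster are assumptions for illustration; it also assumes the points are distinct so that max(a, b) is never zero.

```python
import math

def silhouette(points, labels, i):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for point i."""
    def mean_dist(cluster):
        # Average distance from point i to the members of `cluster`, excluding i.
        others = [p for j, (p, lab) in enumerate(zip(points, labels))
                  if lab == cluster and j != i]
        return sum(math.dist(points[i], p) for p in others) / len(others)

    own = labels[i]
    if labels.count(own) == 1:
        return 0.0  # assumed convention for singleton clusters
    a = mean_dist(own)                                       # cohesion
    b = min(mean_dist(c) for c in set(labels) if c != own)   # separation
    return (b - a) / max(a, b)

# Toy 1-D data (hypothetical): two tight, well-separated pairs.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]

good_scores = [silhouette(points, good, i) for i in range(len(points))]
bad_scores = [silhouette(points, bad, i) for i in range(len(points))]
print(sum(good_scores) / len(good_scores))  # close to 1: points sit in the right clusters
print(sum(bad_scores) / len(bad_scores))    # negative: points are closer to the other cluster
```

Averaging the per-point values gives a single score for the whole clustering, which can be compared across different numbers of clusters.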

- Gap statistic: does an additional cluster add anything meaningful? It compares the WSS to what would be expected if the data had no cluster structure. The WSS itself is visualized with the elbow plot.
  - x = Number of clusters
  - y = WSS (Within-Cluster Sum of Squared Errors)

\[\text{WSS} = \sum (\text{each point} - \text{cluster mean})^2\]
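
A small sketch of the y-axis of the elbow plot: WSS computed for several candidate numbers of clusters. The function name `wss` and the toy data are assumptions, and the candidate labelings are hand-picked for illustration rather than produced by running a clustering algorithm.

```python
def wss(points, labels):
    """Within-cluster sum of squared distances to each cluster mean."""
    total = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, mean))
    return total

# Toy 1-D data (hypothetical) with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]

# Hand-picked labelings for k = 1..4 (assumed, not the output of k-means).
candidate_labelings = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
    4: [0, 1, 2, 3],
}
curve = {k: wss(points, labs) for k, labs in candidate_labelings.items()}
print(curve)  # → {1: 101.0, 2: 1.0, 3: 0.5, 4: 0.0}
```

WSS always falls as clusters are added, so the question is where the drop flattens out. Here the large fall from k = 1 to k = 2, followed by almost no improvement, marks the elbow at k = 2.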
