
Clustering

Overview

Image: Same as table below.

| Method | Characteristics | Advantages | Limitations | Use Cases |
|---|---|---|---|---|
| Partitioning | Distance-based clustering; MECE spheres. | Simple to understand and implement. Effective for small to medium datasets. | Assumes spherical clusters. Sensitive to initial conditions and outliers. | Clustering small to medium-sized datasets (<1M rows). |
| Hierarchical | Multi-level cluster structure. | Builds a hierarchy for better visualization. No need to pre-specify the number of clusters. | Cannot correct wrong merges or splits. Computationally expensive for large datasets. | Visualizing nested clusters in small datasets. |
| Density-based | Identifies dense regions of points. | Handles arbitrary cluster shapes. Can filter out noise and outliers. | Struggles with varying densities. Sensitive to parameter tuning. | Clustering with noise or irregular cluster shapes. |
| Grid-based | Grid-based data partitioning. | Fast processing regardless of dataset size. Scalable for large datasets. | Resolution depends on grid size. Struggles with very high-dimensional data. | Fast clustering of large-scale datasets. |

Partitioning

K-means

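  • Highly influenced by outliers: could use k-medoids instead, where each cluster center is an actual data point

  • Only works for continuous variables: for categorical data use k-modes (Hamming distance); for mixed data use k-prototypes (k-means + k-modes, with a mixed-type dissimilarity in the spirit of Gower distance)

  • No hierarchy provided: run hierarchical clustering first, then partition the resulting clusters

  • Biased towards circular (spherical) clusters: use DBSCAN instead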
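Two of the points above can be illustrated with a small sketch: why a medoid resists outliers better than a mean, and how Hamming distance compares categorical records. The data values and the helper names `medoid` and `hamming` are made up for illustration and are not taken from any library.

```python
from statistics import mean

def medoid(points):
    """Return the member of `points` with the smallest total distance to all members."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

def hamming(a, b):
    """Count the positions at which two equal-length categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 1-D cluster containing one extreme outlier (100.0).
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]
centroid = mean(cluster)   # pulled far towards the outlier
center = medoid(cluster)   # stays on a typical data point

# Hypothetical categorical records: they differ only in the second field.
d = hamming(("red", "small", "round"), ("red", "large", "round"))
```

Here the mean lands at 22.0, far from every typical point, while the medoid remains 3.0. This is the intuition behind swapping k-means for k-medoids when outliers are present.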

Hierarchical

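  1. Agglomerative (bottom-up): start with every point as its own cluster and repeatedly merge the closest clusters; the opposite of divisive. Example: AGNES (AGglomerative NESting).

  2. Divisive (top-down): start with one cluster containing all points and repeatedly split it apart. Example: DIANA (DIvisive ANAlysis).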
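The agglomerative direction can be sketched in a few lines. This is a minimal illustration on 1-D data, assuming single linkage (distance between the closest members of two clusters); the notes do not specify a linkage, so that choice and the function names are assumptions.

```python
def single_linkage(c1, c2):
    """Distance between two clusters = distance between their closest members."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, n_clusters):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    merges = []  # record of (cluster_a, cluster_b, distance), i.e. the hierarchy
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical data, cut into two clusters.
clusters, merges = agglomerate([1.0, 1.5, 5.0, 5.2, 9.0], n_clusters=2)
```

The `merges` list is what a dendrogram would draw. Note that a merge is never revisited once made, which is the "cannot correct wrong merges" limitation from the overview table.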

Density

Motivation: what if the clusters are not circular?

  • General idea of density-based clustering: keep growing a cluster as long as the neighborhood density meets some threshold (a minimum number of data points)

DBSCAN (density-based spatial clustering of applications with noise) is defined by the following terms:

  1. Epsilon (ε): the radius of the neighborhood around each data point, with that point at the center

  2. Density of neighborhood: the number of data points inside the ε-neighborhood

  3. MinPts: the minimum number of points a neighborhood needs to be considered dense

  4. Core point: a data point whose ε-neighborhood contains at least MinPts points

  5. Directly density-reachable: a point is directly density-reachable from a core point if it lies within that core point's ε-neighborhood

  6. Density-reachable: a point is density-reachable from a core point if a chain of points connects them, each directly density-reachable from the previous one

  7. Density-connected: two points are density-connected if both are density-reachable from some common core point

  8. Density-based cluster: a maximal group of density-connected points
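The definitions above can be turned into a compact sketch of the DBSCAN procedure. This is an illustrative pure-Python version, not a library implementation; it assumes Euclidean distance, counts a point as part of its own neighborhood, and uses a brute-force neighbor search. The names `region_query`, `dbscan`, and `NOISE` are chosen here for illustration.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None = not visited yet
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:  # grow the cluster through density-reachable points
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these parameters the first four points form one cluster, the next three form a second, and the isolated point is labeled as noise. Changing `eps` or `min_pts` can change the result substantially, which reflects the parameter sensitivity noted in the overview.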

Mixture model: start with the data and estimate the underlying probability distributions (the mixture components) assumed to have generated it


Measuring Performance

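  1. Silhouette: maximize homogeneity within clusters and heterogeneity between clusters

  2. It answers: are individual points correctly assigned to their clusters?

  3. The coefficient lies between -1 and 1; values near 1 indicate a well-assigned point, values near -1 a likely misassignment

  4. \[\text{Silhouette coef} = \frac{b - a}{\max(a, b)}\]

a = average distance from the point to all other points in its own cluster

b = average distance from the point to all points in the nearest neighboring cluster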
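As a worked illustration, the coefficient for a single point can be computed directly from the definitions of a and b, using the standard form (b - a) / max(a, b). The sketch below assumes Euclidean distance and returns 0 for a point that is alone in its cluster (a common convention, stated here as an assumption); the data and labels are hypothetical.

```python
from math import dist
from statistics import mean

def silhouette(i, points, labels):
    """Silhouette coefficient of points[i] given cluster labels for all points."""
    own = labels[i]
    same = [dist(points[i], points[j])
            for j in range(len(points)) if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumed convention for a singleton cluster
    a = mean(same)  # average distance within the point's own cluster
    b = min(        # average distance to the nearest other cluster
        mean([dist(points[i], points[j])
              for j in range(len(points)) if labels[j] == other])
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

points = [(0.0,), (1.0,), (10.0,), (11.0,)]

# Sensible assignment: the point at 0.0 sits with its close neighbor.
good = silhouette(0, points, [0, 0, 1, 1])

# Poor assignment: the point at 10.0 is grouped with the far-away points.
bad = silhouette(2, points, [0, 0, 0, 1])
```

For the well-assigned point, a = 1 and b = 10.5, giving a coefficient close to 1. For the misassigned point, a = 9.5 and b = 1, giving a negative value. Averaging the coefficient over all points yields a single score for the whole clustering.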

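  1. Gap statistic: does an additional cluster add anything meaningful? It compares the within-cluster dispersion to what would be expected from data with no cluster structure. The related elbow plot visualizes how the dispersion falls as clusters are added:

  2. x = number of clusters

  3. y = WSS (within-cluster sum of squared errors)

\[\text{WSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2\]

where \(C_k\) is the set of points in cluster \(k\) and \(\mu_k\) is its mean.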
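The values behind an elbow plot can be produced by clustering the same data for several values of k and recording the WSS each time. The sketch below uses a small pure-Python k-means with a deterministic initialization picked purely for reproducibility; that initialization, the function names, and the data are assumptions for illustration, and the gap statistic's comparison against reference data is not included.

```python
from math import dist

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def assign(points, centers):
    """Group points by their nearest center (ties go to the lower index)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations; initial centers are evenly spaced picks from the sorted data."""
    pts = sorted(points)
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = assign(points, centers)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, assign(points, centers)

def wss(centers, clusters):
    """Within-cluster sum of squared distances to each cluster mean."""
    return sum(dist(p, center) ** 2
               for center, cluster in zip(centers, clusters) for p in cluster)

# Hypothetical 1-D data with three visible groups.
points = [(x,) for x in (1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0)]
elbow = {k: wss(*kmeans(points, k)) for k in range(1, 5)}  # x = k, y = WSS
```

On this data the WSS drops sharply up to k = 3 and barely improves at k = 4, so the elbow suggests three clusters. WSS always decreases as k grows, which is why the bend in the curve matters more than the minimum.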