Clustering

Method	Characteristics	Advantages	Limitations	Use Cases
Partitioning	Distance-based clustering; clusters are mutually exclusive, collectively exhaustive (MECE) spheres.	- Simple to understand and implement. - Effective for small to medium datasets.	- Assumes spherical clusters. - Sensitive to initial conditions and outliers.	Clustering small to medium-sized datasets (<1M rows).
Hierarchical	Multi-level cluster structure.	- Builds a hierarchy for better visualization. - No need to pre-specify the number of clusters.	- Cannot correct wrong merges or splits. - Computationally expensive for large datasets.	Visualizing nested clusters in small datasets.
Density-based	Identifies dense regions of points.	- Handles arbitrary cluster shapes. - Can filter out noise and outliers.	- Struggles with varying densities. - Sensitive to parameter tuning.	Clustering with noise or irregular cluster shapes.
Grid-based	Quantizes the data space into a finite grid of cells and clusters the cells.	- Processing time depends mainly on the number of grid cells rather than the number of points. - Scalable for large datasets.	- Resolution depends on grid size. - Struggles with very high-dimensional data.	Fast clustering of large-scale datasets.

K-means
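
A minimal sketch of the usual K-means loop (Lloyd's algorithm): assign each point to its nearest center, then recompute each center as the mean of its points. The function name `kmeans`, the `iters` cap, and the "first k points" initialization are assumptions made for illustration; real implementations typically use a smarter initialization.

```python
from math import dist

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    # Assumption: naive deterministic init using the first k points.
    centers = [points[i] for i in range(k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:   # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:            # leave an empty cluster's center in place
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)   # -> [0, 1, 0, 0, 1, 1]
```

Because every point goes to the nearest center, the resulting clusters are roughly spherical, which is the limitation listed in the table.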

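Motivation: what if the clusters are not circular? K-means assumes roughly spherical clusters, which motivates density-based (DB) methods.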

General idea of density-based (DB) clustering: continue growing a cluster as long as the neighborhood density meets some threshold (a minimum number of data points).

DBSCAN: a density-based algorithm that makes this idea precise using the following terms.

Epsilon (ε): radius of the neighborhood centered on a data point
Density of neighborhood: number of data points inside that ε-neighborhood
MinPts: minimum number of points for a neighborhood to be considered dense
Core point: a point whose ε-neighborhood contains at least MinPts points
Directly density-reachable: q is directly density-reachable from p if p is a core point and q lies within p's ε-neighborhood
Density-reachable: q is density-reachable from p if there is a chain of points from p to q in which each point is directly density-reachable from the previous one
Density-connected: p and q are density-connected if both are density-reachable from some common point o
Density-based cluster: a maximal set of density-connected points; points that belong to no cluster are noise
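
The definitions above can be sketched as a small DBSCAN in plain Python. This is an illustrative, O(n²) version under stated assumptions: the names `region_query`, `dbscan`, and `NOISE` are made up here, the distance is Euclidean, and a point counts as its own neighbor.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)          # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # not a core point
            labels[i] = NOISE              # may be claimed later as a border point
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)            # points directly density-reachable from i
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:         # border point: reachable but not core
                labels[j] = cluster
            if labels[j] is not None:      # already assigned to a cluster
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:    # j is also core: keep growing
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
# -> [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (50, 50) has too few neighbors to be a core point and is not reachable from any core point, so it is labeled as noise.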

Mixture model: start with the data and infer the underlying distribution, assumed to be a mixture of component distributions (one per cluster)
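
One common way to fit a mixture model is expectation-maximization (EM). The sketch below assumes a one-dimensional mixture of exactly two Gaussians; the notes do not specify the component family or the fitting method, so both are assumptions, as are the names `normal_pdf` and `fit_two_gaussians` and the min/max initialization.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def fit_two_gaussians(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative only)."""
    # Assumption: start the two means at the extremes of the data.
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [weight[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(spread, 1e-6)     # floor keeps the variance positive
    return weight, mu, var

xs = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
weight, mu, var = fit_two_gaussians(xs)
print(weight, mu, var)   # means land close to 0 and 10 for this data
```

Unlike K-means, each point gets a soft (probabilistic) membership in every component rather than a single hard label.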

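Silhouette coefficient: how well a point fits its own cluster compared with the nearest other cluster. Ranges from -1 to 1; higher is better.

a = Avg distance from the point to all other points in its own cluster

b = Avg distance from the point to all points in the nearest other cluster

$$s = \frac{b - a}{\max(a, b)}$$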
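
As a minimal sketch, a and b can be combined into a per-point silhouette score, (b - a) / max(a, b). The function name `silhouette`, the Euclidean distance, and returning 0.0 for the undefined cases (singleton cluster, only one cluster, or all distances zero) are assumptions made for illustration.

```python
from math import dist
from statistics import mean

def silhouette(points, labels, i):
    """Silhouette score of points[i] given cluster labels for all points."""
    own = labels[i]
    # Distances from point i to every other point, grouped by cluster.
    by_cluster = {}
    for j, (p, c) in enumerate(zip(points, labels)):
        if j != i:
            by_cluster.setdefault(c, []).append(dist(points[i], p))
    # Assumption: score 0.0 when a or b is undefined.
    if own not in by_cluster or len(by_cluster) < 2:
        return 0.0
    a = mean(by_cluster[own])                       # own-cluster average
    b = min(mean(d) for c, d in by_cluster.items()  # nearest other cluster
            if c != own)
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom

points = [(0,), (1,), (10,), (11,)]
labels = [0, 0, 1, 1]
print(silhouette(points, labels, 0))   # a = 1, b = 10.5, so s = 9.5 / 10.5
```

A score near 1 means the point sits well inside its own cluster; a score near 0 or below suggests it lies between clusters or is assigned to the wrong one.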

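Gap statistic: does the additional cluster add anything meaningful? It compares the WSS against the value expected for reference data with no cluster structure. The informal version is the elbow plot, where the "elbow" marks the point past which extra clusters stop reducing WSS much:
x = Number of clusters
y = WSS (within-cluster sum of squared errors)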

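$$\text{WSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $C_k$ is the set of points in cluster $k$ and $\mu_k$ is that cluster's mean.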
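
A small sketch of the WSS computation, evaluated for two hand-picked labelings of the same points to show how the y-value of the elbow plot drops as clusters are added. The function name `wss` and the toy data are assumptions for illustration.

```python
def wss(points, labels):
    """Within-cluster sum of squared distances to each cluster's mean."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for members in clusters.values():
        # Cluster mean, computed coordinate by coordinate.
        center = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, center))
                     for p in members)
    return total

points = [(0,), (2,), (10,), (12,)]
print(wss(points, [0, 0, 0, 0]))   # one cluster  -> 104.0
print(wss(points, [0, 0, 1, 1]))   # two clusters -> 4.0
```

Going from one to two clusters cuts WSS from 104 to 4; a further split could lower it by at most 4, so the bend at two clusters is where the elbow sits.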