Dimensionality Reduction

Curse of dimensionality

More dimensions = sparser data (with a fixed number of samples, the points spread thinner across the feature space every time we add a feature)
Additional features can throw in extra noise
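Distance metrics become less informative in high dimensions: the nearest and farthest neighbors end up at nearly the same distance
Probability distributions are hard to estimate in high dimensions: the number of samples needed for density estimation grows roughly exponentially with the number of dimensions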
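
A minimal sketch of the distance problem, assuming points drawn uniformly from the unit hypercube and Euclidean distance. The function name distance_contrast and the sample sizes are made up for illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest; the value should shrink as dimensions are added.

```python
import numpy as np

def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Relative gap between the farthest and nearest point to one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = rng.random(n_dims)               # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# The contrast is large in low dimensions and small in high dimensions.
for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(500, d), 3))
```

When the contrast is close to zero, "nearest neighbor" carries very little information, which is why distance-based methods degrade.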

PCA

Remove overlap among variables while retaining variation
Max variance, min reconstruction error
Transform correlated features into uncorrelated features (i.e. principal components)

Procedure

Normalize the input data (z-score recommended, so features on larger scales don't dominate).
PCA computes a new, smaller basis from our features (i.e. the principal components). Each principal component is an eigenvector of the data's covariance matrix; its eigenvalue is the variance captured along that component (its importance).
Sort the principal components by eigenvalue, most important on top. The 1st axis captures the most variance in the data, the 2nd captures the next most, etc.
Eliminate the least important components. Use an elbow plot (scree plot) of the eigenvalues or the cumulative explained variance to see how many principal components we need.
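
The procedure above can be sketched in NumPy roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the function name pca, the toy data, and the choice of 2 components are all made up, and it assumes no feature is constant (otherwise the z-score divides by zero).

```python
import numpy as np

def pca(X: np.ndarray, n_components: int):
    """Project X (n_samples x n_features) onto its leading principal components."""
    # Step 1: z-score each feature (assumes no constant columns).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance of the standardized data, then its eigendecomposition.
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending eigenvalues
    # Step 3: sort components from most to least variance.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: keep the leading components and project the data onto them.
    explained_ratio = eigvals / eigvals.sum()
    components = eigvecs[:, :n_components]
    return Z @ components, components, explained_ratio

# Toy data: two strongly correlated features plus one independent noise feature.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

scores, components, ratio = pca(X, n_components=2)
# Cumulative explained variance: the quantity an elbow/scree plot is built from.
print(np.cumsum(ratio))
```

Plotting ratio (or its cumulative sum) against the component index gives the elbow plot; the point where the curve flattens suggests how many components to keep.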

Break the covariance matrix into its eigendecomposition (eigenvectors = principal components, eigenvalues = variance along each)
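
In symbols, the decomposition might be written as below. The notation is assumed, not taken from these notes: X is the centered (or z-scored) n x p data matrix, C its covariance matrix, V the matrix whose columns are the eigenvectors, and V_k the first k columns of V.

```latex
C = \frac{1}{n-1} X^\top X = V \Lambda V^\top, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p), \quad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0

Z_k = X V_k, \qquad
\text{fraction of variance kept} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i}
```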

(PCA, Factor Analysis, t-SNE, UMAP)

Dimensionality reduction: reduce number of variables

PCA combines columns, removes old ones
Attribute subset selection removes unneeded columns, typically using a test of statistical significance to decide which ones to drop
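
One possible form of attribute subset selection, as a rough sketch: keep only the columns whose Pearson correlation with a target passes a t-test cutoff. The choice of test, the function name select_by_correlation_test, the toy data, and the 1.96 cutoff (roughly the two-sided 5% level for large samples) are all assumptions for illustration; the notes do not say which test is meant.

```python
import numpy as np

def select_by_correlation_test(X: np.ndarray, y: np.ndarray, t_threshold: float = 1.96):
    """Return indices of columns whose correlation with y passes the cutoff."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of every column with the target.
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    # t statistic for H0: correlation = 0, with n - 2 degrees of freedom.
    t = r * np.sqrt((n - 2) / (1.0 - r**2))
    return np.flatnonzero(np.abs(t) > t_threshold), t

# Toy data: the target depends on columns 0 and 2 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=300)

kept, t_stats = select_by_correlation_test(X, y)
print(kept)  # column indices that survive the test
```

Unlike PCA, the surviving columns are original features, so they stay interpretable. Testing each column on its own ignores redundancy between columns, which stepwise selection methods try to address.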

Numerosity reduction: replace data with smaller form
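
A small sketch of the idea, assuming a histogram as the smaller form (one non-parametric option; fitting a model and keeping only its parameters would be a parametric one). The data and the bin count of 20 are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(loc=50.0, scale=10.0, size=10_000)

# Reduced form: 20 bin counts and 21 bin edges instead of 10,000 raw values.
counts, edges = np.histogram(values, bins=20)

# Summary statistics can still be approximated from the reduced form,
# here the mean, using the midpoint of each bin.
midpoints = (edges[:-1] + edges[1:]) / 2
approx_mean = float((counts * midpoints).sum() / counts.sum())

print(counts.size + edges.size, approx_mean, float(values.mean()))
```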

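Data compression: encode the data in a reduced form from which the dataset can be reconstructed (exactly if lossless, approximately if lossy)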

Discrete wavelet transform (DWT) turns a column into wavelet coefficients, then drops the coefficients that are not significant (e.g. the smallest in magnitude), keeping only the strongest few
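
A hand-rolled sketch of this idea using the Haar wavelet, the simplest wavelet family. The choice of Haar, the helper names haar_dwt and haar_idwt, the sample column, and keeping k = 2 coefficients are assumptions for illustration. The column length must be a power of two.

```python
import numpy as np

def haar_dwt(signal) -> np.ndarray:
    """Full orthonormal Haar transform; len(signal) must be a power of two."""
    coeffs = np.asarray(signal, dtype=float).copy()
    n = coeffs.size
    while n > 1:
        half = n // 2
        pairs = coeffs[:n].reshape(half, 2)
        avg = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # smooth part
        diff = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # detail part
        coeffs[:half], coeffs[half:n] = avg, diff
        n = half
    return coeffs

def haar_idwt(coeffs) -> np.ndarray:
    """Inverse of haar_dwt."""
    out = np.asarray(coeffs, dtype=float).copy()
    n = 1
    while n < out.size:
        avg, diff = out[:n].copy(), out[n:2 * n].copy()
        out[0:2 * n:2] = (avg + diff) / np.sqrt(2)
        out[1:2 * n:2] = (avg - diff) / np.sqrt(2)
        n *= 2
    return out

column = np.array([2.0, 2.1, 1.9, 2.0, 8.0, 8.2, 7.9, 8.1])
coeffs = haar_dwt(column)

# Keep only the k largest-magnitude coefficients and zero out the rest.
k = 2
compressed = coeffs.copy()
compressed[np.argsort(np.abs(coeffs))[:-k]] = 0.0

approx = haar_idwt(compressed)
print(np.count_nonzero(compressed), float(np.abs(approx - column).max()))
```

Only the nonzero coefficients (and their positions) need to be stored, and the inverse transform gives back an approximation of the column. This makes it a lossy form of the data compression described above.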