Dimensionality Reduction
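Curse of dimensionality
- More dimensions = sparser data: with a fixed number of samples, every added feature spreads the points over a larger space, so any given region holds fewer of them
- Additional features can add noise
- Distance metrics become less informative in high dimensions: the nearest and farthest points end up at nearly the same distance
- Probability densities are hard to estimate in high dimensions: the number of samples needed grows rapidly with the number of features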
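The distance point can be checked with a small experiment. This is only a sketch: the uniform random data, the sample size, and the helper name `distance_contrast` are assumptions made for illustration, not part of the notes.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Relative gap between the farthest and nearest point from a random query.

    A large value means "near" and "far" are clearly different;
    a value close to 0 means all points look about equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```

With points drawn this way, the contrast should fall steadily as `dim` grows, which is why nearest-neighbour style methods tend to degrade on wide datasets.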
PCA
- Remove overlap (redundancy) among variables while retaining as much variation as possible
- Maximizing the retained variance is equivalent to minimizing the reconstruction error
- Transform correlated features into uncorrelated features (i.e. principal components)
Procedure
- Normalize the input data (z-score recommended)
- PCA computes a new basis from our features (i.e. the principal components). Each principal component is an eigenvector of the covariance matrix of the normalized data; its eigenvalue is the variance captured along it (its importance).
- Sort the principal components by eigenvalue, most important first. The 1st axis captures the most variance in the data, the 2nd the next most, etc.
- Eliminate the least important components. Use an elbow (scree) plot of variance explained per component to see how many principal components we need.
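The procedure above can be sketched with NumPy. This is a minimal illustration, not a reference implementation: the function name `pca`, the synthetic data, and the choice of two components are all assumptions.

```python
import numpy as np


def pca(X, n_components):
    """Minimal PCA via eigendecomposition of the covariance matrix."""
    # 1. Normalize: z-score every column.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Covariance matrix of the normalized features.
    cov = np.cov(Z, rowvar=False)
    # 3. Eigendecomposition (eigh is meant for symmetric matrices
    #    and returns eigenvalues in ascending order).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort components from most to least important.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    # 5. Keep the top components and project the data onto them.
    components = eigvecs[:, :n_components]
    return Z @ components, components, explained


# Synthetic data (assumed): two strongly correlated columns plus one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

scores, components, explained = pca(X, n_components=2)
print(explained.round(3))  # the values an elbow plot would show
```

Plotting `explained` against the component index gives the elbow plot mentioned above; the components after the bend are the ones to drop.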
Break the covariance matrix into its eigendecomposition
- Eigenvectors = principal components (directions of the new axes)
- Eigenvalues = variance explained along each principal component
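Written out, the decomposition looks roughly like this. The symbols are assumptions introduced here: $Z$ is the z-scored data matrix with $n$ rows, $\Sigma$ its covariance matrix, $v_i$ the eigenvectors (columns of $V$), and $\lambda_i$ the eigenvalues (diagonal of $\Lambda$).

```latex
\Sigma = \frac{1}{n-1} Z^\top Z, \qquad
\Sigma v_i = \lambda_i v_i, \qquad
\Sigma = V \Lambda V^\top

\text{variance explained by component } i
  = \frac{\lambda_i}{\sum_j \lambda_j}
```

Keeping the $k$ eigenvectors with the largest $\lambda_i$ and projecting with $Z V_k$ gives the reduced data.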
Reduction Techniques
(PCA, Factor Analysis, t-SNE, UMAP)
Dimensionality reduction: reduce the number of variables
- PCA combines columns into new ones and removes the old ones
- Attribute subset selection removes unneeded columns, using a statistical test to decide which
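A rough sketch of attribute subset selection follows. The notes do not say which test is used, so this assumes a simple Pearson correlation filter against a target column; the column names, data, and 0.5 threshold are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_columns(columns, target, threshold=0.5):
    """Keep the columns whose |correlation| with the target reaches the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical dataset: one informative column, one unrelated column.
columns = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [9, 7, 10, 8, 9, 8],
}
target = [52, 58, 61, 70, 74, 81]

selected = select_columns(columns, target)
print(selected)
```

Unlike PCA, the surviving columns are original features, so they stay interpretable.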
Numerosity reduction: replace the data with a smaller representation
- Parametric (regression): keep only the fitted model parameters
- Nonparametric (clustering, sampling)
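Both flavours can be illustrated on one toy dataset. This is a sketch under assumptions: the data is synthetic and roughly linear, the parametric model is a straight line fitted by least squares, and the nonparametric method shown is simple random sampling.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    return slope, my - slope * mx


# Synthetic data (assumed): 1000 points close to the line y = 3x + 2.
rng = random.Random(0)
xs = list(range(1000))
ys = [3.0 * x + 2.0 + rng.gauss(0, 1) for x in xs]

# Parametric: 1000 rows are replaced by two numbers.
slope, intercept = fit_line(xs, ys)

# Nonparametric: keep a random sample of 50 rows instead of all 1000.
sample = rng.sample(list(zip(xs, ys)), k=50)

print(round(slope, 3), round(intercept, 3), len(sample))
```

The parametric form is far smaller but only as good as the model's fit; the sample makes no assumption about the shape of the data.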
Data compression: store an encoded form from which the dataset can be reconstructed
- Discrete wavelet transform converts a column into wavelet coefficients, then drops the coefficients that are not significant (close to zero)
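A minimal sketch of the idea, assuming a single level of the Haar wavelet (the simplest wavelet) and a hand-picked threshold; the function names and the sample column are illustrative only.

```python
import math

SQRT2 = math.sqrt(2)


def haar_forward(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / SQRT2 for a, b in pairs]  # smooth part
    detail = [(a - b) / SQRT2 for a, b in pairs]  # local differences
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the signal from approximation and detail coefficients."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def compress(signal, threshold):
    """Zero out detail coefficients whose magnitude is below the threshold."""
    approx, detail = haar_forward(signal)
    kept = [d if abs(d) >= threshold else 0.0 for d in detail]
    return approx, kept


column = [4.0, 4.1, 8.0, 8.0, 2.0, 9.0, 5.0, 5.1]
approx, detail = compress(column, threshold=0.5)
restored = haar_inverse(approx, detail)

print(detail)
print([round(v, 2) for v in restored])
```

The zeroed coefficients need not be stored, and the reconstruction differs from the original only by the small differences that were discarded, which makes this a lossy compression.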