Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a widely used technique in statistics and machine learning for reducing the number of variables in a dataset while preserving as much of its variability (information) as possible.
Definition
PCA transforms a set of possibly correlated variables into a new set of uncorrelated variables called principal components.
- First principal component (PC1): the direction along which the projected data has the maximum possible variance.
- Second principal component (PC2): captures the maximum remaining variance and is orthogonal to PC1.
- Each additional component captures the maximum remaining variance while staying orthogonal to all earlier components.
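The properties listed above can be checked numerically. The following is a minimal sketch using NumPy on synthetic data; the data, the seed, and all variable names are assumptions made for illustration and do not come from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical data: two strongly correlated variables, 500 observations.
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)  # center each column

# Principal directions are eigenvectors of the sample covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder so PC1 comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X @ eigvecs                    # column 0 = PC1 scores, column 1 = PC2 scores
score_cov = np.cov(scores, rowvar=False)

print(np.round(score_cov, 6))            # off-diagonal entries are ~0: the PCs are uncorrelated
print(np.round(eigvecs.T @ eigvecs, 6))  # ~identity matrix: the directions are orthonormal
```

The diagonal of `score_cov` holds the variance of each component, with the PC1 entry at least as large as the PC2 entry.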
Mathematical Transformation
PCA projects the standardized data matrix \(X\) onto the lower-dimensional subspace spanned by the leading principal directions:
Z = XW
where:
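- \(X\): standardized data matrix (one row per observation, one column per variable)
- \(W\): matrix whose columns are the selected eigenvectors of the covariance matrix (principal directions)
- \(Z\): transformed data (principal component scores)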
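For reference, the link between \(W\) and the eigenvalues can be written out. This is a sketch in notation that is not in the original: \(n\) observations, \(p\) variables, \(k\) retained components, \(S\) for the sample covariance matrix, and \(w_j\), \(z_j\) for the \(j\)-th columns of \(W\) and \(Z\). It assumes the columns of \(X\) are already standardized, and therefore centered.

\[
S = \frac{1}{n-1} X^{\top} X, \qquad S\, w_j = \lambda_j w_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
\]
\[
W = [\, w_1 \;\; \cdots \;\; w_k \,] \in \mathbb{R}^{p \times k}, \qquad Z = XW \in \mathbb{R}^{n \times k}, \qquad \operatorname{Var}(z_j) = w_j^{\top} S\, w_j = \lambda_j
\]

Under these assumptions, each column of scores has sample variance equal to its eigenvalue, which is why the eigenvalues measure how much variance each component carries.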
Steps in PCA
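- Standardize the variables (subtract each variable's mean and divide by its standard deviation).
- Compute the covariance matrix of the standardized data (equivalently, the correlation matrix of the original variables).
- Calculate its eigenvalues and eigenvectors.
- Sort the components by decreasing eigenvalue.
- Select the top \(k\) components.
- Project the data onto those components.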
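The steps above can be sketched with NumPy. This is one possible illustration, not a reference implementation: the function name `pca`, the synthetic data, and the choice of `np.linalg.eigh` (suited to symmetric matrices) are assumptions, and the sketch assumes no column of the input is constant.

```python
import numpy as np

def pca(X, k):
    """Sketch of the six steps; returns (Z, W, eigvals). Hypothetical helper."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize each column (assumes no column has zero variance).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance of the standardized data (the correlation matrix of X).
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort by decreasing eigenvalue (eigh returns ascending order).
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep the top k eigenvectors as the columns of W.
    W = eigvecs[:, :k]
    # Step 6: project the standardized data, Z = XW.
    Z = X_std @ W
    return Z, W, eigvals

# Synthetic example data (illustrative only): 200 observations, 5 variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # make two columns correlated

Z, W, eigvals = pca(X, k=2)
print(Z.shape)  # (200, 2)
```

Because the data are standardized, the eigenvalues sum to the number of variables (5 here), and each column of `Z` has sample variance equal to its eigenvalue.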
Variance Explained
The proportion of variance explained by component \(i\) is:
\[
\frac{\lambda_i}{\sum_j \lambda_j}
\]
where \(\lambda_i\) is the corresponding eigenvalue.
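As a small worked instance of this formula, the sketch below divides each eigenvalue by the total. The eigenvalues are made up for illustration, and `explained_variance_ratio` is a hypothetical helper name.

```python
def explained_variance_ratio(eigenvalues):
    """Return lambda_i / sum_j lambda_j for every eigenvalue."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

eigenvalues = [2.5, 1.0, 0.5]  # hypothetical eigenvalues, already sorted
ratios = explained_variance_ratio(eigenvalues)
print(ratios)  # [0.625, 0.25, 0.125]
```

Here the first component alone would account for 62.5% of the total variance, and the ratios sum to 1.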
Advantages
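- Reduces dimensionality
- Removes redundancy from correlated variables
- Can reduce noise by discarding low-variance components
- Can speed up model training by reducing the number of input features
- Facilitates visualization in two or three dimensions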
Limitations
- Captures only linear relationships
- Components may be difficult to interpret
- Sensitive to feature scaling: variables with larger variance dominate unless the data are standardized
- Variance does not always correspond to predictive importance
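The sensitivity to feature scaling can be illustrated with a short sketch. The two features below (a height in metres and an income) are hypothetical, generated independently of each other, and `first_pc_share` is an illustrative helper, not a library function.

```python
import numpy as np

def first_pc_share(X):
    """Return (PC1 direction, share of total variance explained by PC1)."""
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending order, so PC1 is last
    return eigvecs[:, -1], eigvals[-1] / eigvals.sum()

rng = np.random.default_rng(1)
# Hypothetical, independent features on very different scales.
height_m = rng.normal(1.7, 0.1, size=1000)
income = rng.normal(50_000, 10_000, size=1000)

X_raw = np.column_stack([height_m, income])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

raw_dir, raw_share = first_pc_share(X_raw)
std_dir, std_share = first_pc_share(X_std)

print(raw_share)  # very close to 1: income dominates PC1 purely because of its scale
print(std_share)  # close to 0.5: after standardizing, neither feature dominates
```

On the raw data, PC1 points almost entirely along the income axis even though the two features are unrelated, which is the reason standardization is usually applied first.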
Applications
- Gene expression analysis
- Image compression
- Finance and risk modeling
- Marketing segmentation
- Exploratory data analysis
Example
If a dataset contains 100 correlated features, PCA may reveal that the first 10 principal components explain 95% of the total variance, allowing the data to be represented with only 10 variables instead of 100.
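One common way to arrive at such a number is to keep the smallest number of components whose cumulative share of variance reaches a chosen threshold. The sketch below assumes a made-up eigenvalue spectrum constructed to mirror the example; `components_needed` is a hypothetical helper name, and the 0.95 default is just the threshold used in the example.

```python
from itertools import accumulate

def components_needed(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues reach `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= threshold:
            return k
    return len(ordered)

# Made-up spectrum for 100 standardized features:
# 10 large eigenvalues and 90 small ones, summing to about 100.
eigenvalues = [9.6] * 10 + [4 / 90] * 90

k = components_needed(eigenvalues)
print(k)  # 10
```

With real data the spectrum rarely splits this cleanly, so the threshold is a judgment call rather than a fixed rule.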
Summary
Principal Component Analysis converts correlated variables into a smaller set of orthogonal principal components that retain most of the data’s variability. It is widely used for dimensionality reduction, data compression, noise filtering, and visualization.