I thought I understood PCA quite well until I began to study this tutorial. PCA is a dimension reduction/embedding technique, and remarkably we can explain it from quite different points of view:
  • Energy-preserving approximation: by dropping the smallest eigenvalues and their eigenvectors, the information (energy) lost by the low-rank approximation is minimized; see the first sketch after this list;
  • Local structure embedding: given the centered data matrix Y, A=Y'*Y (the Gram/kernel matrix), after normalization, can be seen as a pairwise correlation matrix in which each element Aij is the cosine of the angle between samples i and j. Because of the energy-preserving property of PCA, the approximation preserves the local structure of nearby points, since their high cosine values carry more energy. This property can also be read as cluster discrimination within the embedding; the second sketch after this list makes the cosine interpretation concrete;
  • Feature dimension reduction, clustering and embedding are related to each other: their objective functions can be reduced to a common form. The solution of dimension reduction (PCA) is the set of k largest principal components Vk (Y'Y*Vk = λ*Vk). The Vk are also continuous relaxations (lower bounds) of the indicator vectors of k clusters (the integer indicators (0,...,1,...,1,0,...) can be recovered from Vk), and the Vk, which are up to scaling the projections of the data onto the principal directions, can be used as coordinates to achieve a low-dimensional embedding; see the third sketch after this list;
  • Given a centered data matrix X, X*X' is the dimension covariance matrix (ignoring the factor 1/n) by definition (m x m); it measures the variance of the data around the mean. But because the dimensions are correlated, we cannot directly read off the dominant direction with the largest variance. Eigenvalue decomposition of the covariance matrix simplifies it and makes the dimensions uncorrelated by rotating the coordinate system (again, see the first sketch below); the normalized X'*X is the Gram (kernel) matrix, which is related to the cosine of the angle between each pair of data points, so it corresponds to the pairwise similarity (affinity) matrix.
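
To make the energy-preserving and covariance views concrete, here is a minimal NumPy sketch (the matrix sizes, the random data, and the rank k are made-up assumptions for illustration, not anything from the tutorial). It checks that rotating into the eigenbasis of X*X' decorrelates the dimensions, and that the squared reconstruction error of the rank-k approximation equals the sum of the dropped eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 200, 2                          # assumed sizes: m features, n samples, rank k

X = rng.standard_normal((m, n))
X -= X.mean(axis=1, keepdims=True)           # center each dimension

C = X @ X.T                                  # m x m covariance matrix (factor 1/n ignored)
eigvals, U = np.linalg.eigh(C)               # ascending eigenvalues, orthonormal eigenvectors
eigvals, U = eigvals[::-1], U[:, ::-1]       # sort descending

# Rotating into the eigenbasis decorrelates the dimensions:
Z = U.T @ X
print(np.allclose(Z @ Z.T, np.diag(eigvals), atol=1e-8))   # covariance of Z is diagonal

# Energy-preserving approximation: keep only the k largest eigenvectors.
Uk = U[:, :k]
X_hat = Uk @ (Uk.T @ X)                      # rank-k reconstruction
err = np.linalg.norm(X - X_hat, 'fro') ** 2
print(np.isclose(err, eigvals[k:].sum()))    # loss equals the sum of dropped eigenvalues
```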
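The Gram-matrix/cosine interpretation can be checked the same way. In this sketch (toy data again assumed), each entry of the normalized Y'*Y is exactly the cosine of the angle between two samples.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 6                                  # assumed sizes

Y = rng.standard_normal((m, n))
Y -= Y.mean(axis=1, keepdims=True)           # center the data

Yn = Y / np.linalg.norm(Y, axis=0)           # normalize each sample (column) to unit length
A = Yn.T @ Yn                                # n x n Gram matrix = pairwise cosine similarities

i, j = 0, 1
cos_ij = Y[:, i] @ Y[:, j] / (np.linalg.norm(Y[:, i]) * np.linalg.norm(Y[:, j]))
print(np.isclose(A[i, j], cos_ij))           # Gram entries are cosines between samples
print(np.allclose(np.diag(A), 1.0))          # each sample has cosine 1 with itself
```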
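Finally, a sketch of the clustering/embedding connection on an assumed toy dataset with two well-separated clusters: the sign of the leading principal component of the Gram matrix recovers the cluster indicator, and the top-k components, scaled by the square roots of the eigenvalues, give the low-dimensional embedding coordinates.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n_per = 10, 30                            # assumed sizes: m features, n_per samples per cluster

# Two well-separated clusters, stacked as columns of X (m features x n samples).
X = np.hstack([rng.standard_normal((m, n_per)) + 4.0,
               rng.standard_normal((m, n_per)) - 4.0])
labels = np.array([0] * n_per + [1] * n_per)

X = X - X.mean(axis=1, keepdims=True)        # center
G = X.T @ X                                  # n x n Gram matrix Y'Y

eigvals, V = np.linalg.eigh(G)               # ascending order
v1 = V[:, -1]                                # leading principal component (relaxed indicator)

# Thresholding the continuous component at zero recovers the two clusters
# (possibly with labels swapped, hence the max below).
pred = (v1 > 0).astype(int)
acc = max((pred == labels).mean(), (pred != labels).mean())
print(acc)                                   # 1.0 on this well-separated toy data

# The same components, scaled by sqrt(eigenvalue), are the PCA embedding coordinates.
k = 2
coords = V[:, -k:] * np.sqrt(np.maximum(eigvals[-k:], 0))   # n x k embedding
print(coords.shape)
```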