PCA

Overview

In real-world data analysis tasks we work with complex data. At first, adding features tends to improve performance. However, at some point, as the dimensionality of the data increases, it becomes harder to visualize the data and to perform computations on it. This is called the curse of dimensionality.

Principal Component Analysis (PCA) addresses this problem by performing dimensionality reduction. It does so through what is known in data science as feature extraction.

If a data scientist could choose an ideal dataset, it would have plenty of features, each largely independent of the others. Unfortunately, real-world datasets are rarely like that.

Check out the Variance and Covariance concepts!
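As a quick refresher, here is a minimal NumPy sketch of variance and covariance (the toy data and variable names are just illustrative assumptions):

import numpy as np

# Hypothetical toy data: two features measured on five samples
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Variance of a single feature (ddof=1 gives the sample variance)
print(np.var(x, ddof=1))           # spread of x around its mean

# Covariance matrix of the two features:
# diagonal entries are variances, off-diagonal entries are covariances
print(np.cov(np.stack([x, y])))    # 2x2 matrix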

Eigenvectors

An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it.

Eigenvalues

An eigenvalue is the scalar associated with each eigenvector. It indicates how much variance there is in the data along that eigenvector (or principal component).
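This relationship can be checked numerically. In the sketch below (the correlated toy data is an assumption for illustration), each eigenvalue of the covariance matrix equals the variance of the data projected onto the corresponding eigenvector:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D data: y is roughly 0.8 * x plus noise
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
data = np.stack([x, y])                    # shape (2, 500), rows are features

cov = np.cov(data)                         # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices

# Each eigenvalue equals the variance of the data along its eigenvector
for val, vec in zip(eigvals, eigvecs.T):
    projected = vec @ data                 # project every point onto this direction
    print(val, np.var(projected, ddof=1))  # the two numbers agree (up to rounding)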

How does it work?

PCA finds a new set of dimensions (a new basis) such that all the dimensions are orthogonal (and hence linearly independent) and ranked by the variance of the data along them. In other words, the most important principal axis comes first (more important = more variance = more spread-out data).

The steps (a small code sketch follows the list):

  1. Calculate the covariance matrix of the data points X.

  2. Calculate the eigenvectors and corresponding eigenvalues.

  3. Choose the first k eigenvectors; these will be the new k dimensions.

  4. Transform the original n-dimensional data points into the k dimensions.
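A minimal from-scratch sketch of these four steps in NumPy (the function name, the k parameter, and the example data are illustrative assumptions, not a reference implementation):

import numpy as np

def pca(X, k):
    """Reduce n-dimensional points in X (shape: samples x n) to k dimensions."""
    # Center the data so the covariance matrix reflects spread around the mean
    X_centered = X - X.mean(axis=0)

    # Step 1: covariance matrix of the data points (n x n)
    cov = np.cov(X_centered, rowvar=False)

    # Step 2: eigenvectors and corresponding eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices

    # Sort by eigenvalue, largest variance first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 3: keep the first k eigenvectors as the new k dimensions
    components = eigvecs[:, :k]

    # Step 4: project the original n-dimensional points onto the k dimensions
    return X_centered @ components

# Hypothetical 3-D data reduced to 2 dimensions
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X_reduced = pca(X, k=2)
print(X_reduced.shape)   # (200, 2)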

The principal components are the eigenvectors of the covariance matrix of the original dataset. They correspond to the directions (in the original n-dimensional space) with the greatest variance in the data.
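For comparison, scikit-learn's PCA exposes exactly these quantities. This is only a sketch of how one might inspect them (the toy data is an assumption, and the signs of the axes may differ from the from-scratch version, since an eigenvector is only defined up to sign):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))

pca_sklearn = PCA(n_components=2)
X_reduced = pca_sklearn.fit_transform(X)

print(pca_sklearn.components_)           # rows: principal axes (eigenvectors of the covariance matrix)
print(pca_sklearn.explained_variance_)   # variance along each axis (the eigenvalues)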

Under construction...
