Covariance vs Correlation Matrix

Overview

  • Covariance ⇒ indicates the direction of the linear relationship between variables.

  • Correlation ⇒ measures both the strength and the direction of the linear relationship.

Correlation values are standardized, whereas covariance values are not: they depend on the units of the variables.

Covariance Matrix

In the two-dimensional case, the covariance matrix for two variables x and y is given by:

C = \begin{pmatrix} \sigma(x,x) & \sigma(x,y) \\ \sigma(y,x) & \sigma(y,y) \end{pmatrix}

import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 8)

mean = 0
std = 1
num_samples = 500

x = np.random.normal(mean, std, num_samples)
y = np.random.normal(mean, std, num_samples)
X = np.vstack((x, y)).T  # Join both arrays and transpose
# X = np.stack(arrays=[x, y], axis=1) # Equivalent transformation

plt.scatter(X[:, 0], X[:, 1])
plt.title('Generated Data')
plt.axis('equal');
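
The snippet above only generates and plots the data. As a minimal sketch of how the matrix C could then be estimated from such samples, the block below uses `np.cov` and checks it against a by-hand computation. The seed and the names `rng`, `Xc` and `C_manual` are illustrative assumptions, not part of the original example.

```python
import numpy as np

# Assumed stand-in for the generated data: 500 samples of two independent
# standard-normal variables (the seed is only there for repeatability).
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 2))

# np.cov treats rows as variables by default, so pass rowvar=False
# because here each column of X is one variable.
C = np.cov(X, rowvar=False)

# The same matrix by hand: centre each column, then average the products
# using the unbiased (N - 1) normalisation that np.cov uses by default.
Xc = X - X.mean(axis=0)
C_manual = Xc.T @ Xc / (X.shape[0] - 1)

print(C)
```

The diagonal entries are the variances of x and y, and the off-diagonal entries are equal because σ(x,y) = σ(y,x). For independent variables like these, the off-diagonal entries should be close to zero.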

Correlation Matrix

Unlike covariance, correlation is bounded: its values always lie in the range [-1, 1].

The correlation coefficient of two variables is obtained by dividing their covariance by the product of their standard deviations.

\rho_{x,y} = corr(x,y) = \frac{\sigma(x,y)}{\sigma_{x}\sigma_{y}}
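
import numpy as np
import pandas as pd

# Seeded generator so the random table is reproducible
rng = np.random.RandomState(seed=0)
# 10 x 10 uniform random values; .corr() returns the pairwise Pearson correlation of the columns
correlation = pd.DataFrame(rng.rand(10, 10)).corr()

# Colour the matrix by value (renders in a notebook)
correlation.style.background_gradient(cmap='coolwarm')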
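
To connect the formula to the covariance matrix, here is a small sketch that rescales a covariance matrix into a correlation matrix by dividing each entry by the product of the corresponding standard deviations. The data, the seed, and the names `rng`, `std` and `R` are assumptions made for illustration.

```python
import numpy as np

# Illustrative data: y partly depends on x, so the two are correlated.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
y = 0.5 * x + rng.normal(0, 1, 500)

C = np.cov(x, y)            # 2 x 2 covariance matrix
std = np.sqrt(np.diag(C))   # standard deviations of x and y
R = C / np.outer(std, std)  # divide each covariance by the product of std devs

print(R)
```

The diagonal of `R` is 1 because each variable is perfectly correlated with itself, and the result should agree with `np.corrcoef(x, y)` up to floating-point error.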
