The basics

Credits & Sources:

Properties of distributions

Mean

Found by adding all of the numbers together and dividing by the number of items in the set:

{\bar {x}}={\frac {1}{n}}\left(\sum _{i=1}^{n}{x_{i}}\right)={\frac {x_{1}+x_{2}+\cdots +x_{n}}{n}}
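As a minimal sketch of the formula above, the mean can be computed by hand and checked against NumPy. The `values` list is an arbitrary example chosen for illustration.

```python
import numpy as np

# Arbitrary example data (an assumption for illustration).
values = [1, 3, 3, 6, 7, 8, 9]

# Direct translation of the formula: sum of the items divided by their count.
manual_mean = sum(values) / len(values)

# NumPy's built-in equivalent.
numpy_mean = np.mean(values)

print(manual_mean, numpy_mean)
```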

Median

Once the values are sorted, the median depends on whether the number of terms in the distribution is odd or even.

If the given number of terms is odd:
- It's the value in the middle.
- 1, 3, 3, 6, 7, 8, 9 $\Rightarrow$ 6
If the given number of terms is even:
- It's the average of the two terms in the middle.
- 1, 2, 3, 4, 5, 6, 8, 9 $\Rightarrow$ (4+5) $\div$ 2 = 4.5
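The two cases above can be sketched as a small function. The helper name `median` is hypothetical; in practice `np.median` does the same job.

```python
import numpy as np

def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of terms: the single middle value.
        return ordered[mid]
    # Even number of terms: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median([1, 3, 3, 6, 7, 8, 9])       # 6
even_result = median([1, 2, 3, 4, 5, 6, 8, 9])   # 4.5

print(odd_result, even_result)
```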

Mode

It's the most repeated value within the distribution. Example:

1, 2, 2, 3, 4, 7, 9 $\Rightarrow$ 2

It's quite common to find more than one mode, especially if there aren't many terms. A distribution with two modes is called bimodal, and one with three modes is called trimodal.
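Since a distribution can have several modes, one possible sketch returns all of them as a list. The helper name `modes` and the second data set are assumptions made for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that reaches the highest frequency, sorted."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

single_mode = modes([1, 2, 2, 3, 4, 7, 9])   # [2]
two_modes = modes([1, 2, 2, 3, 3, 7, 9])     # [2, 3], a bimodal example

print(single_mode, two_modes)
```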

Range

It's the difference between the maximum value and the minimum value. Example:

1, 2, 2, 3, 4, 7, 9 $\Rightarrow$ 9 - 1 = 8
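The same example can be checked in NumPy, either by subtracting the extremes or with `np.ptp` ("peak to peak"). This is only a short sketch using the numbers from the example above.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 4, 7, 9])

# Maximum minus minimum, as in the definition.
value_range = values.max() - values.min()

# np.ptp computes the same quantity in one call.
ptp_range = np.ptp(values)

print(value_range, ptp_range)
```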

Variance

It measures how far a set of random numbers is spread out from its mean.

\sigma^2 = \frac {1}{N} \sum _{i=1}^{N} (x_{i} - \mu)^2

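import numpy as np

# Fix the seed so the random sample is reproducible.
np.random.seed(333)
# Population variance of 100 uniform random numbers (np.var divides by N by default).
np.var(np.random.rand(100))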

Standard deviation

It's the square root of the variance. A low standard deviation indicates that the data points tend to be close to the mean.

\sigma = \sqrt{ \frac {1}{N} \sum _{i=1}^{N} (x_{i} - \mu)^2}

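import numpy as np

# Fix the seed so the random sample is reproducible.
np.random.seed(333)
# Population standard deviation of 100 uniform random numbers (np.std divides by N by default).
np.std(np.random.rand(100))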
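To make the link between the two formulas explicit, here is a small sketch that computes the variance from its definition and takes its square root. The `values` array is an arbitrary example, and both results are compared with NumPy's defaults, which divide by $N$.

```python
import numpy as np

# Arbitrary example data (an assumption for illustration).
values = np.array([1, 2, 2, 3, 4, 7, 9], dtype=float)

mu = values.mean()

# Variance: mean of the squared deviations from the mean.
variance = np.mean((values - mu) ** 2)

# Standard deviation: square root of the variance.
std_dev = np.sqrt(variance)

print(variance, std_dev)
```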

Covariance

It's a measure of the joint variability of two random variables: the extent to which corresponding elements from two sets of ordered data move in the same direction. It's similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.

\sigma_{XY} = \frac{ \sum_{(x,y) \in S} (x - \mu_X) (y - \mu_Y) }{N}
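A minimal sketch of this formula, assuming two small made-up arrays `x` and `y`. Note that `np.cov` divides by $N - 1$ by default, so `bias=True` is passed to match the division by $N$ used here.

```python
import numpy as np

# Made-up paired observations (an assumption for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Mean of the products of the deviations, as in the formula.
manual_cov = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y).
numpy_cov = np.cov(x, y, bias=True)[0, 1]

print(manual_cov, numpy_cov)
```

A positive result means the two variables tend to move in the same direction.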

Covariance vs Correlation Matrix

Mean vs Median

Median is much less sensitive to outliers.
However, almost all analytic calculations on sets of data are more natural in terms of the mean than the median.
The difference between the median and the mean is useful to represent how skewed the data is.
The real use of the median comes when the data set may contain extreme outliers. Then, describing the distribution in terms of quartiles can be more informative than quoting $\mu$ and $\sigma$ .
For skewed distributions, the mean is not necessarily the same as the median or the mode. For example, mean income is typically skewed upwards by a small number of people with very large incomes, so that the majority have an income lower than the mean. By contrast, the median income is the level at which half the population is below and half is above. The mode income is the most likely income and favors the larger number of people with lower incomes. Median and mode are often more intuitive measures for such skewed data, but many skewed distributions are in fact best described by their mean, including the Exponential and Poisson distributions.
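The sensitivity to outliers can be illustrated with a small sketch. The income figures are invented for the example (in thousands) and do not come from real data.

```python
import numpy as np

# Invented incomes in thousands (an assumption for illustration).
incomes = np.array([20, 22, 25, 27, 30, 32, 35])

# The same data with one extreme outlier added.
with_outlier = np.append(incomes, 1000)

mean_before, median_before = incomes.mean(), np.median(incomes)
mean_after, median_after = with_outlier.mean(), np.median(with_outlier)

print(mean_before, median_before)   # mean and median are close
print(mean_after, median_after)     # the mean jumps, the median barely moves
```

The gap between the mean and the median after adding the outlier is a rough sign of how skewed the data has become.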

Difference between standard deviation of a sample and standard error of the population mean

While the standard deviation expresses how dispersed the data is with respect to the mean, the standard error of a statistic measures the standard deviation of its sampling distribution.

The sampling distribution of the sample mean is generated by repeatedly drawing samples from the population and recording the mean of each one. This forms a distribution of different means, and this distribution has its own mean and variance.

Given the standard deviation of the population $\sigma$ and the size $n$ of a sample, the standard error of the sample mean is expressed as:

\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
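The repeated-sampling idea can be sketched with a simulation. The normal population, its standard deviation `sigma`, the sample size `n`, and the number of repetitions are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(333)

sigma = 2.0           # assumed population standard deviation
n = 50                # assumed size of each sample
num_samples = 20_000  # assumed number of repeated samples

# Draw many samples of size n and record the mean of each one.
samples = rng.normal(loc=0.0, scale=sigma, size=(num_samples, n))
sample_means = samples.mean(axis=1)

# Standard deviation of the recorded means vs. the formula sigma / sqrt(n).
empirical_se = sample_means.std()
theoretical_se = sigma / np.sqrt(n)

print(empirical_se, theoretical_se)
```

With enough repetitions, the two numbers should agree closely, although the empirical value varies slightly with the seed.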
