Sampling
Sampling consists of selecting some part of the population to observe so that one may estimate something about the whole population.
Probability vs Non-probability sampling
The difference is whether the sample is selected by randomization or not.
With probability sampling, every element has a known, non-zero chance of being selected for the sample.
In simple random sampling, every element has an equal chance of being selected for the sample.
It is used when we have no prior information about the target population.
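A minimal sketch of simple random sampling, using only Python's standard library. The function name `simple_random_sample` and the toy population are illustrative assumptions, not part of the original notes.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; every element is equally likely to be picked."""
    rng = random.Random(seed)  # local generator so the seed does not leak globally
    return rng.sample(population, n)  # sampling without replacement

# Hypothetical population of 100 element ids.
population = list(range(100))
srs = simple_random_sample(population, 10, seed=0)
print(srs)
```

Passing a `seed` only makes the draw reproducible; it does not change the sampling scheme.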
In stratified sampling, elements of the population are divided into small subgroups based on similarity (e.g. data labels or classes).
A subset is then selected from every subgroup in proportion to that subgroup's share of the overall population.
Prior information about the population is needed to create the subgroups.
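The proportional selection used in stratified sampling can be sketched as follows. This is one possible reading of the description, assuming the same sampling fraction is applied within every subgroup; `stratified_sample`, `label_of`, and the toy data are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(items, label_of, fraction, seed=None):
    """Sample the same fraction from every subgroup (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[label_of(item)].append(item)  # group elements by their label
    sample = []
    for members in strata.values():
        # Assumption: take at least one element so that no subgroup is dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Toy population: 80 elements of class "a" and 20 of class "b".
data = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
strat = stratified_sample(data, label_of=lambda x: x[0], fraction=0.1, seed=0)
```

With a 10% fraction, the sample keeps the 80/20 class ratio of the population: 8 elements from "a" and 2 from "b".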
In cluster sampling, the entire population is divided into clusters, and then some of the clusters are randomly selected.
Clusters are created or identified based on features.
Sub-types:
Single-Stage Cluster Sampling: every element of each selected cluster is included in the sample.
Two-Stage Cluster Sampling: elements are randomly selected from within each selected cluster.
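Both cluster sampling sub-types can be sketched with one small function. This is an illustrative sketch under the assumption that clusters are given as a dictionary mapping a cluster name to its members; the names `cluster_sample` and `per_cluster` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Randomly pick clusters, then take all (single-stage) or some (two-stage) members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: pick clusters
    sample = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample.extend(members)  # single-stage: keep the whole cluster
        else:
            # two-stage: random subset from within the selected cluster
            sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen, sample

# Toy population: 5 clusters of 10 elements each.
clusters = {f"c{i}": [f"c{i}_e{j}" for j in range(10)] for i in range(5)}
chosen_1, single_stage = cluster_sample(clusters, 2, seed=1)
chosen_2, two_stage = cluster_sample(clusters, 2, per_cluster=3, seed=1)
```

Here the single-stage call returns 20 elements (two whole clusters) and the two-stage call returns 6 (three from each of two clusters).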
Non-probability sampling relies on the researcher's ability to select elements for the sample. The outcome may be biased, because not all elements of the population have an equal chance of being part of the sample.
Statistical resampling methods are procedures that describe how to economically use available data to estimate a population parameter. The result can be both a more accurate estimate of the parameter (such as taking the mean of the estimates) and a quantification of the uncertainty of the estimate (such as adding a confidence interval).
The bootstrap is a resampling method that draws samples independently and with replacement from an existing sample, each with the same size n as the original, and performs inference on these resampled data.
Steps:
Get a sample of size n from a population.
Draw a sample of size n from the original sample with replacement, and repeat this B times. Each resampled set is called a Bootstrap Sample, so there are B Bootstrap Samples in total.
Evaluate the statistic θ on each Bootstrap Sample, giving B estimates of θ in total.
Construct a sampling distribution from these B Bootstrap statistics and use it for further statistical inference, such as:
Estimating the standard error of the statistic θ.
Obtaining a Confidence Interval for θ.
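The bootstrap steps above can be sketched as follows. This is a minimal sketch, assuming a percentile confidence interval (one of several ways to build the interval); the function name `bootstrap`, its parameters, and the toy data are illustrative assumptions.

```python
import random
import statistics

def bootstrap(sample, statistic, B=1000, alpha=0.05, seed=None):
    """Return the bootstrap standard error and a percentile confidence interval."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(B):
        resample = rng.choices(sample, k=n)  # size n, drawn with replacement
        estimates.append(statistic(resample))
    estimates.sort()
    se = statistics.stdev(estimates)  # spread of the B bootstrap estimates
    # Assumption: simple percentile interval read off the sorted estimates.
    lo = estimates[int(B * alpha / 2)]
    hi = estimates[min(B - 1, int(B * (1 - alpha / 2)))]
    return se, (lo, hi)

# Toy sample of size n = 10; the statistic here is the mean.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.3]
se, (lo, hi) = bootstrap(data, statistics.mean, B=1000, seed=0)
print(f"SE = {se:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Any function of a sample can be passed as `statistic`, for example `statistics.median`, since the procedure never relies on a formula specific to the mean.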
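In K-fold cross-validation, the dataset is partitioned into K groups (folds).
In each round, K-1 groups are used for training and the remaining one for testing, so that every group serves as the test set once.
It is useful when the training dataset is fairly small.
It helps guard against overfitting, because the model is always evaluated on data it was not trained on.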
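The K-fold partitioning can be sketched as an index generator. This is an illustrative sketch, assuming the data are shuffled once before being split; `k_fold_splits` is a hypothetical helper, not a library function.

```python
import random

def k_fold_splits(n_samples, k, seed=None):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once before partitioning
    folds = [indices[i::k] for i in range(k)]  # k groups of near-equal size
    for i in range(k):
        test_idx = folds[i]  # one group is held out for testing
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_splits(10, k=5, seed=0))
for train_idx, test_idx in splits:
    print(len(train_idx), "train /", len(test_idx), "test")
```

Each index lands in exactly one test fold, so averaging the K test scores uses every observation for evaluation once.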
With imbalanced datasets, the problem we often face is that most of the data falls into the majority class, whereas only a little falls into the minority class.
To overcome the poor model performance that such imbalance causes during training, over-sampling and under-sampling techniques are commonly suggested so that the data are equally distributed across the classes.
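A minimal sketch of both ideas, assuming the simplest variants: random over-sampling duplicates minority-class rows, and random under-sampling discards majority-class rows. The function `random_resample` and its `strategy` parameter are hypothetical names used for illustration.

```python
import random
from collections import Counter, defaultdict

def random_resample(X, y, strategy="over", seed=None):
    """Balance classes by random over-sampling ("over") or under-sampling ("under")."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    sizes = [len(rows) for rows in by_class.values()]
    # Over-sampling grows every class to the largest; under-sampling shrinks to the smallest.
    target = max(sizes) if strategy == "over" else min(sizes)
    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) < target:
            # keep all originals and add random duplicates (with replacement)
            rows = rows + rng.choices(rows, k=target - len(rows))
        elif len(rows) > target:
            rows = rng.sample(rows, target)  # drop rows at random (without replacement)
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
    return X_out, y_out

# Toy imbalanced dataset: 90 rows of class 0 and 10 rows of class 1.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
X_over, y_over = random_resample(X, y, strategy="over", seed=0)
X_under, y_under = random_resample(X, y, strategy="under", seed=0)
print(Counter(y_over), Counter(y_under))
```

Resampling of this kind is normally applied to the training split only, so that the test data keep the original class distribution.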