Sampling

Sampling consists of selecting some part of the population to observe so that one may estimate something about the whole population.

Probability vs Non-probability sampling

  • The difference lies in whether the selection of elements is based on randomization.

  • With randomization, every element gets an equal chance of being picked and becoming part of the sample.

Statistical Sampling

Simple Random Sampling

  • Every element has an equal chance of being selected as part of the sample.

  • It is used when we do not have any kind of prior information about the target population.
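
A minimal sketch of simple random sampling with NumPy; the population and sample size below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population of 10,000 measurements
population = rng.normal(loc=50, scale=10, size=10_000)

# Simple random sample: every element has the same chance of selection
sample = rng.choice(population, size=100, replace=False)

print(f"Population mean: {population.mean():.2f}")
print(f"Sample mean:     {sample.mean():.2f}")
```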

Stratified Sampling

  • Elements of the population are divided into small subgroups based on similarity (i.e., data labels or classes).

  • A subset is selected from every group or class in proportion to that group's share of the overall population.

  • We need to have prior information about the population to create subgroups.
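
As a sketch, scikit-learn's train_test_split can draw a stratified subset when class labels are available; the 90/10 class split below is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Hypothetical features with imbalanced labels: 90% class 0, 10% class 1
X = rng.normal(size=(1_000, 5))
y = np.array([0] * 900 + [1] * 100)

# stratify=y preserves the 90/10 class ratio inside the drawn subset
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0
)

print(np.bincount(y_sample))  # -> [180  20], matching the population ratio
```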

Cluster Sampling

  • The entire population is divided into clusters and then the clusters are randomly selected.

  • Clusters are created or identified based on features.

  • Sub-types:

    • Single-Stage Cluster Sampling: every element of each picked cluster is included.

    • Two-Stage Cluster Sampling: elements are randomly sampled within each picked cluster.
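
A minimal NumPy sketch of both sub-types; the cluster layout (20 clusters of 50 elements each) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population grouped into 20 clusters of 50 elements each
clusters = {c: rng.normal(loc=c, size=50) for c in range(20)}

# Randomly select 4 clusters
picked = rng.choice(list(clusters), size=4, replace=False)

# Single-stage: keep every element of each picked cluster
single_stage = np.concatenate([clusters[c] for c in picked])

# Two-stage: randomly sample 10 elements within each picked cluster
two_stage = np.concatenate(
    [rng.choice(clusters[c], size=10, replace=False) for c in picked]
)

print(single_stage.shape, two_stage.shape)  # (200,) (40,)
```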

Non-probability Sampling

This technique relies on the researcher's judgment to select elements for the sample. The outcome may therefore be biased, since not every element of the population has an equal chance of being included.

Resampling techniques

Statistical resampling methods are procedures that describe how to economically use available data to estimate a population parameter. The result can be both a more accurate estimate of the parameter (such as taking the mean of the estimates) and a quantification of the uncertainty of the estimate (such as adding a confidence interval).

Bootstrapping

Bootstrapping is a resampling method that independently draws samples with replacement from existing sample data, keeping the same sample size n, and performs inference on those resampled datasets.

Steps:

  1. Get a sample from a population with sample size n.

  2. Draw a sample of size n from the original data with replacement, and repeat this B times. Each re-sampled dataset is called a bootstrap sample, so there are B bootstrap samples in total.

  3. Evaluate the statistic θ on each bootstrap sample, yielding B estimates of θ.

  4. Construct a sampling distribution from these B bootstrap statistics and use it for further statistical inference, such as:

    • Estimating the standard error of the estimate of θ.

    • Obtaining a confidence interval for θ.
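
A minimal NumPy sketch of these steps, estimating the standard error and a 95% percentile confidence interval for the mean; the data and the choice of B are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Step 1: an observed sample of size n (hypothetical data)
sample = rng.exponential(scale=2.0, size=200)
n, B = sample.size, 5_000

# Steps 2-3: draw B bootstrap samples with replacement and
# evaluate the statistic (here, the mean) on each one
boot_means = np.array(
    [rng.choice(sample, size=n, replace=True).mean() for _ in range(B)]
)

# Step 4: use the bootstrap distribution for inference
std_error = boot_means.std(ddof=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"Standard error: {std_error:.3f}")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```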

K-Fold Cross-Validation

  • The dataset is partitioned into K groups (folds).

  • K-1 folds are used for training and the remaining one for testing; the procedure is repeated K times so that each fold serves as the test set exactly once.

  • Useful when the training dataset is fairly small.

  • Helps detect and avoid overfitting.
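
A short scikit-learn sketch; the dataset and model are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical small classification dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold CV: each fold serves as the test set exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```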

Oversampling & Undersampling

  • In imbalanced datasets, most of the data falls into the majority class while only a small fraction falls into the minority class.

  • To overcome the poor model performance caused by such imbalance, over-sampling and under-sampling techniques are commonly used to produce a more even distribution of data across classes; a minimal sketch follows.
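
A minimal NumPy sketch of random over- and under-sampling; the 950/50 class counts are made up, and libraries such as imbalanced-learn offer ready-made implementations.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical imbalanced data: 950 majority vs. 50 minority samples
X = rng.normal(size=(1_000, 4))
y = np.array([0] * 950 + [1] * 50)
maj_idx, min_idx = np.where(y == 0)[0], np.where(y == 1)[0]

# Over-sampling: duplicate minority samples (with replacement) up to 950
over_idx = np.concatenate(
    [maj_idx, rng.choice(min_idx, size=maj_idx.size, replace=True)]
)

# Under-sampling: keep only 50 randomly chosen majority samples
under_idx = np.concatenate(
    [rng.choice(maj_idx, size=min_idx.size, replace=False), min_idx]
)

X_over, y_over = X[over_idx], y[over_idx]
X_under, y_under = X[under_idx], y[under_idx]
print(np.bincount(y_over), np.bincount(y_under))  # [950 950] [50 50]
```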

Sources:

  • Methods of sampling from a population (Health Knowledge)

  • Sampling Techniques (Seema Singh)

  • A Gentle Introduction to Statistical Sampling and Resampling

  • Statistical Learning (II): Data Sampling & Resampling