Markov Chain Monte Carlo (MCMC)

Sources: Chapter 3 of Probabilistic Programming and Bayesian Methods for Hackers.

The landscape

Given a Bayesian inference problem with N unknowns, we are implicitly creating an N-dimensional space for the prior distributions to exist in. Associated with this space is an additional dimension: a surface that reflects the prior probability of each particular point.

For example, given p1, p2 ~ Uni(0, 5), the space created is a square and the prior surface is a flat plane sitting on top of it. Given p1 ~ Exp(3), p2 ~ Exp(10) instead, the prior probability is recreated as a surface that is highest near the origin and decays away from it.

What happens to our space after we incorporate our observed data X? The data X does not change the space, but it changes the surface of the space by pulling and stretching the fabric of the prior surface to reflect where the true parameters likely live. More data means more pulling and stretching, and our original shape becomes mangled or insignificant compared to the newly formed shape. With less data, our original shape is more present. Regardless, the resulting surface describes the posterior distribution. Note that if the prior probability at a point is 0, the posterior probability there is 0 as well.

Comparing the landscapes of both priors given an observation: notice that the posterior landscapes look different from one another, even though the observed data is identical in both cases. The reason is as follows: the exponential-prior landscape puts very little posterior weight on values in the upper-right corner of the figure, because the prior does not put much weight there. On the other hand, the uniform-prior landscape is happy to put posterior weight in the upper-right corner, as the prior puts more weight there.

The black dot represents the true parameters. Even with a single sample point, the mountains attempt to contain the true parameters. Of course, inference with a sample size of one is incredibly naive, and choosing such a small sample size is only illustrative.

Through this approach we can find the posterior mountains (and therefore the "best" parameters). However, we cannot naively search the whole N-dimensional space, since that becomes an O(exp(N)) problem. The idea behind MCMC is instead to perform an intelligent and efficient search of the space.
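In two dimensions the posterior surface can still be evaluated by brute force on a grid (exactly the approach that becomes infeasible as N grows): multiply the prior density by the likelihood of the observed data at every grid point. A minimal NumPy/SciPy sketch follows; the exponential priors, the Poisson likelihood and the observed counts are illustrative assumptions rather than values taken from the figures:

import numpy as np
from scipy import stats

# Illustrative toy model (assumed for this sketch): priors p1 ~ Exp(3), p2 ~ Exp(10),
# and one Poisson observation tied to each unknown.
obs_1, obs_2 = 1, 3

# Grid over the 2-D parameter space (the "space" the priors live in).
p1, p2 = np.meshgrid(np.linspace(0.01, 5, 200), np.linspace(0.01, 5, 200))

# Prior surface: product of the two independent exponential densities.
prior = stats.expon.pdf(p1, scale=1/3) * stats.expon.pdf(p2, scale=1/10)

# Likelihood of the observed data at each grid point.
likelihood = stats.poisson.pmf(obs_1, p1) * stats.poisson.pmf(obs_2, p2)

# Unnormalized posterior surface: the prior "pulled and stretched" by the data.
posterior = prior * likelihood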

The MCMC landscape

MCMC returns samples from the posterior distribution, not the distribution itself. It performs a task similar to repeatedly asking "How likely is this point I found to be from the mountain I am searching for?", and completes its task by returning thousands of accepted points in hopes of reconstructing the original mountain.

The MCMC algorithm

There is a large family of algorithms that perform MCMC. Most of them can be broadly expressed as follows:

  1. Start at the current position.

  2. Propose moving to a new position (explore a point close to the current one).

  3. Accept or reject the proposed position, based on how well it adheres to the data and the prior distributions:

    1. If accepted: move to the new position. Return to Step 1.

    2. If rejected: stay at the current position. Return to Step 1.

  4. After a large number of iterations, return all accepted positions.

This way we move in the general direction of the regions where the posterior distribution exists, collecting samples sparingly along the way. If the current position of the MCMC algorithm is in an area of extremely low probability, which is often the case when the algorithm begins, it will move towards positions that are likely not from the posterior but are better than everything else nearby. Thus the first moves of the algorithm are not reflective of the posterior distribution.

Inference using the first few thousand points is therefore a bad idea, as they are unrelated to the final distribution we are interested in, so it is good practice to discard those samples before using the rest for inference. We call this period before convergence the burn-in period.
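As a concrete sketch of the loop above (including discarding the burn-in period), here is a minimal random-walk Metropolis implementation in Python. The log_posterior target, the proposal scale, the iteration count and the burn-in length are all illustrative assumptions, not values from the text:

import numpy as np

def metropolis(log_posterior, start, n_iter=50_000, step=0.5, burn_in=5_000, seed=0):
    # Random-walk Metropolis: propose a nearby point, then accept or reject it.
    rng = np.random.default_rng(seed)
    position = np.asarray(start, dtype=float)
    samples = []
    for _ in range(n_iter):
        proposal = position + rng.normal(scale=step, size=position.shape)
        # Accept with probability min(1, posterior(proposal) / posterior(position)).
        if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(position):
            position = proposal
        samples.append(position.copy())  # the current position is recorded every iteration
    # Discard the burn-in period before returning the chain.
    return np.array(samples[burn_in:])

# Illustrative target: a 2-D Gaussian "mountain" centered at the origin.
trace = metropolis(lambda p: -0.5 * np.sum(p ** 2), start=[5.0, 5.0])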

Alternatives to MCMC

There are other procedures for determining the posterior distribution, for example:

  • Laplace approximation
  • Variational Bayes

Prediction step

A naive method to compute the probability of a new data point is to re-run the above MCMC with the additional data point appended to the dataset. The disadvantage of this method is that inference becomes slow for every novel data point. Alternatively, we can use a less precise but much quicker method: apply Bayes' theorem using the posterior samples (the "found" parameters) we have already drawn.
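For example, the predictive probability of a new observation can be approximated by averaging its likelihood over the posterior samples already in hand. A minimal sketch, assuming a 1-D Gaussian likelihood with a known noise scale (both assumptions made purely for illustration):

import numpy as np
from scipy import stats

def posterior_predictive(x_new, mu_samples, sigma=1.0):
    # p(x_new | data) ~= average over posterior samples of p(x_new | mu).
    return np.mean(stats.norm.pdf(x_new, loc=mu_samples, scale=sigma))

# mu_samples would be a 1-D array of posterior samples, e.g. one column of an MCMC trace.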

About convergence

Autocorrelation

By the nature of the MCMC algorithm, the returned samples will always exhibit autocorrelation (this is a consequence of the "from your current position, move to a position near you" step). A chain that is not exploring the space well will exhibit very high autocorrelation. Visually, if the trace seems to meander like a river and does not settle down, the chain has high autocorrelation. This does not imply that a converged MCMC has low autocorrelation; hence low autocorrelation is not necessary for convergence, but it is sufficient.
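A quick way to inspect this is to compute the chain's sample autocorrelation at increasing lags; a meandering trace shows values that stay high for many lags. A minimal NumPy sketch, assuming a 1-D chain such as one column of the MCMC output above:

import numpy as np

def autocorrelation(trace, max_lag=100):
    # Sample autocorrelation of a 1-D chain for lags 1..max_lag.
    x = trace - trace.mean()
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:-k], x[k:]) / (len(x) * var) for k in range(1, max_lag + 1)])

acf = autocorrelation(trace[:, 0])  # chain of the first parameter from the sketch above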

Thinning

High autocorrelation between posterior samples can be a problem, since many post-processing algorithms require the samples to be (approximately) independent of each other. This can be solved, or at least reduced, by only returning every n-th sample to the user, thus removing some of the autocorrelation. With more aggressive thinning, the autocorrelation drops more quickly. There is a tradeoff, though: higher thinning requires more MCMC iterations to obtain the same number of returned samples.
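In code, thinning is just a strided slice over the returned chain; the thinning factor below is an arbitrary choice for illustration:

thin = 10                      # keep every 10th sample (arbitrary factor)
thinned_trace = trace[::thin]  # autocorrelation between the kept samples is lower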

Intelligent starting values

It would be great to start the MCMC algorithm off near the posterior distribution, so that it will take little time to start sampling correctly. We can aid the algorithm by telling it where we "think" the posterior distribution will be, e.g. via the testval argument when creating a stochastic variable in PyMC, or via the chain's initial state in TensorFlow Probability. In many cases we can produce a reasonable guess for the parameter.

# Start the chain for mu at the data's empirical mean rather than at a random prior draw.
initial_chain_state = [tf.constant(data.mean(), dtype=tf.float32, name="init_mu")]

What if it does not converge?

If the priors are poorly chosen, the MCMC algorithm may not converge, or at least have difficulty converging. Consider what may happen if the prior chosen does not even contain the true parameter.
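As a tiny illustration of that failure mode (the numbers are made up): if the true parameter lies outside the prior's support, the prior density there is zero, so the posterior is zero no matter what the likelihood says, and the chain can never reach it:

from scipy import stats

true_param = 6.0
prior = stats.uniform(loc=0, scale=5)              # Uniform(0, 5): does not contain 6.0
likelihood_at_true = 0.9                           # even if the data strongly favors 6.0...
print(prior.pdf(true_param) * likelihood_at_true)  # 0.0 -> posterior weight stays zero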
