Optimization


About optimization

Given the weights matrix W, we need to find the values of W that minimize the loss function and optimize the success metric (or metrics).

Random search

Randomly generate N candidate weight matrices and select the one that performs best on the training set.
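
A minimal self-contained sketch of random search, using a toy linear-regression loss purely for illustration (the data, the loss function and the number of trials are all made up):

import numpy as np

# Toy setup, just for illustration: candidate weights scored by mean squared error.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5])

def loss(W, X, y):
    return np.mean((X @ W - y) ** 2)      # training loss of the candidate weights

best_loss, best_W = float("inf"), None
for _ in range(1000):                     # N random trials
    W = rng.normal(size=3)                # random candidate weight vector
    candidate_loss = loss(W, X_train, y_train)
    if candidate_loss < best_loss:
        best_loss, best_W = candidate_loss, W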

Follow the slope

$$\text{slope in one dimension} = \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of the direction with the gradient. The direction of the steepest descent is the negative gradient.
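
In symbols (a restatement of the sentence above):

$$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right), \qquad \text{slope along a unit direction } u = \nabla f(x) \cdot u,$$

$$\text{which is most negative (steepest descent) when } u = -\frac{\nabla f(x)}{\lVert \nabla f(x) \rVert}.$$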

In practice, we follow the analytic gradient, computed exactly with calculus, rather than approximating it numerically with finite differences.
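
As a quick sanity check of this idea, here is a minimal sketch comparing the analytic gradient with a centered finite-difference (numerical) estimate for the toy function f(x) = Σ xᵢ² (the function and step size are chosen just for illustration):

import numpy as np

def f(x):
    return (x ** 2).sum()              # simple test function: f(x) = sum_i x_i^2

def analytic_grad(x):
    return 2 * x                       # exact gradient of f

def numerical_grad(x, h=1e-5):
    # centered finite differences, one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

x = np.array([1.0, -2.0, 3.0])
print(np.allclose(analytic_grad(x), numerical_grad(x)))   # True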

Gradient descent

Vanilla Gradient Descent (GD)

import numpy as np

weights = np.random.randn(N)  # random initialization; N is the number of weights

while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)  # gradient of the loss at the current weights
    weights -= step_size * weights_grad  # parameter update: step against the gradient

The update moves the weights in the opposite direction to the gradient: since the gradient points in the direction of the greatest increase of the function, subtracting step_size * weights_grad slowly moves us in the direction of the greatest decrease. step_size is a hyperparameter of the algorithm, also called the learning rate.

We want the slope to get close to 0, since a zero slope indicates a (local) minimum of the loss. To compute the partial derivative of the loss w.r.t. $w_1$, we can apply the chain rule:
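
A sketch of that expansion, using the names from the pseudocode below (z = pred(weight1, bias, x), output = activation_fn(z), loss = loss_fn(output, y)):

$$\frac{\partial \, \text{loss}}{\partial w_1} = \frac{\partial \, \text{loss}}{\partial \, \text{output}} \cdot \frac{\partial \, \text{output}}{\partial z} \cdot \frac{\partial z}{\partial w_1}$$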

To go into more detail about how GD works, consider this pseudocode:

from ai_framework import activation_fn  # placeholder import: the model's activation function

n = 1000     # number of data samples
epochs = 10  # number of passes over the whole dataset
alpha = 0.2  # learning rate
weight1 = 0  # model weight
bias = 0     # model bias

for _ in range(epochs):
    overall_loss = 0
    for x, y in dataset:  # the n (x, y) training pairs
        z = pred(weight1, bias, x)          # linear prediction, e.g. weight1 * x + bias
        output = activation_fn(z)           # model output
        overall_loss += loss_fn(output, y)
    overall_loss /= n
    # partial derivatives of the average loss w.r.t. each parameter
    derivative_weight1, derivative_bias = compute_partial_derivatives(overall_loss)
    weight1 -= alpha * derivative_weight1
    bias -= alpha * derivative_bias
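
A minimal runnable version of the same loop, assuming one-dimensional linear regression with a squared-error loss and synthetic data (all names and values here are illustrative):

import numpy as np

# Synthetic data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3 * x + 2 + 0.1 * rng.normal(size=1000)

alpha = 0.1                               # learning rate
w, b = 0.0, 0.0                           # model weight and bias

for _ in range(100):                      # epochs
    error = (w * x + b) - y               # prediction error on the whole dataset
    grad_w = 2 * np.mean(error * x)       # d(MSE)/dw
    grad_b = 2 * np.mean(error)           # d(MSE)/db
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)                               # approximately 3 and 2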

Stochastic Gradient Descent

Source: Deep Learning Part 2: Vanilla vs Stochastic Gradient Descent

In stochastic gradient descent, rather than calculating the error as an average over all the training examples, we select M random training examples from the entire dataset and use only those in our cost function. We call this subset a mini-batch.

Once our network has been trained on all the data points in our mini-batch, we select a new subset of random points and train our model with that. We continue this process until we’ve exhausted all training points, at which point we’ve completed an epoch. We then start with a new epoch and continue until convergence.

Main differences w.r.t. the "standard" Gradient Descent (a minimal sketch follows the list):

  • Partial derivatives are computed on every individual training sample

  • Meaning that the model weights and bias are updated after each sample

  • The step size (alpha) is smaller, to avoid big jumps between consecutive samples
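
A minimal sketch of these per-sample updates, reusing the same kind of toy linear-regression problem as above (illustrative values):

import numpy as np

# Synthetic data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3 * x + 2 + 0.1 * rng.normal(size=1000)

alpha = 0.01                              # smaller step size than in batch GD
w, b = 0.0, 0.0

for _ in range(10):                       # epochs
    for xi, yi in zip(x, y):              # one training sample at a time
        error = (w * xi + b) - yi
        w -= alpha * 2 * error * xi       # gradient of this sample's squared error w.r.t. w
        b -= alpha * 2 * error            # ... and w.r.t. b

print(w, b)                               # approximately 3 and 2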

There's a variant called Mini-batch GD (see the sketch after this list):

  • Mini-batches are used (with sizes that are powers of two, $2^x$)

  • Parameters are updated after each mini-batch

  • This solution is less noisy than SGD

  • But converges faster than Vanilla GD

  • Better GPU usage
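
A minimal sketch of Mini-batch GD on the same toy problem, with the dataset reshuffled every epoch and a power-of-two batch size (illustrative values):

import numpy as np

# Synthetic data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3 * x + 2 + 0.1 * rng.normal(size=1000)

alpha = 0.05
batch_size = 32                           # a power of two (2^5)
w, b = 0.0, 0.0

for _ in range(20):                       # epochs
    order = rng.permutation(len(x))       # reshuffle the dataset every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (w * x[idx] + b) - y[idx]
        w -= alpha * 2 * np.mean(error * x[idx])   # gradient averaged over the mini-batch
        b -= alpha * 2 * np.mean(error)

print(w, b)                               # approximately 3 and 2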

GD with momentum

TBA

Adam optimizer

TBA
