Optimization
About optimization
Given a weights matrix W, we want to find the values of W that minimize the loss function and optimize the chosen success metric or metrics.
Random search
Randomly generate N weight matrices, evaluate each one on the training set, and keep the one that performs best.
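A minimal sketch of this idea, using an illustrative least-squares loss on synthetic data (the dataset and loss are stand-ins for whatever model is being trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: inputs X and targets y generated from known weights.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def loss(W):
    return np.mean((X @ W - y) ** 2)  # mean squared error on the training set

best_W, best_loss = None, float("inf")
for _ in range(1000):                 # check N = 1000 random candidates
    W = rng.normal(size=3)            # random weight vector
    current = loss(W)
    if current < best_loss:           # keep the one that performs best
        best_W, best_loss = W, current
```

Random search scales very poorly with the number of parameters, which is why the gradient-based methods below are used instead.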
Follow the slope
In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of the direction with the gradient. The direction of the steepest descent is the negative gradient.
In practice, we will follow the analytic gradient, derived with calculus, rather than a slow numerical approximation.
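The analytic gradient is usually verified against a slow numerical estimate (a "gradient check"). A minimal sketch on a toy loss whose gradient is known by hand:

```python
import numpy as np

def loss(w):
    # Toy loss with a known analytic gradient: L(w) = sum(w^2)
    return np.sum(w ** 2)

def analytic_grad(w):
    return 2 * w  # dL/dw derived by hand with calculus

def numerical_grad(f, w, h=1e-5):
    # Centered finite differences along each dimension
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

w = np.array([1.0, -2.0, 3.0])
assert np.allclose(analytic_grad(w), numerical_grad(loss, w), atol=1e-6)
```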
Gradient descent
Vanilla Gradient Descent (GD)
```python
weights = np.random.randn(N)
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)  # current gradient
    weights -= step_size * weights_grad                        # parameter update
```
The step size controls how far we move the weights in the direction opposite to the gradient. Since the gradient points in the direction of the greatest increase of the function, stepping against it slowly moves us toward the greatest decrease. This variable is a hyperparameter of the algorithm, also called the learning rate.
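The loop above, made concrete on a toy quadratic loss (all names and values here are illustrative, not a real framework API):

```python
import numpy as np

step_size = 0.1                      # learning rate (hyperparameter)
weights = np.array([5.0, -3.0])      # arbitrary starting point

def evaluate_gradient(w):
    # Analytic gradient of the toy loss L(w) = sum(w^2), whose minimum is w = 0
    return 2 * w

for _ in range(100):
    weights -= step_size * evaluate_gradient(weights)  # parameter update
```

After 100 updates the weights have converged very close to the minimizer [0, 0]; with a step size that is too large the same loop can overshoot and diverge instead.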
We want the gradient to approach 0, since that indicates we are at a minimum. To compute the partial derivative of the loss w.r.t. a weight, we can apply the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

where $z$ is the model's pre-activation output, $\hat{y} = \mathrm{activation}(z)$ is the prediction, and $L$ is the loss.
To go into more detail about how GD works, we can look at this pseudocode:

```python
from ai_framework import activation, pred, loss_fn, compute_partial_derivatives  # hypothetical helpers

n = 1000          # number of data samples
epochs = 10       # number of iterations across the whole dataset
alpha = 0.2       # learning rate
weight1 = 0       # model weight
bias = 0          # model bias

for _ in range(epochs):
    overall_loss = 0
    for x, y in dataset:                  # iterate over the n (input, label) pairs
        z = pred(weight1, bias, x)        # model output before activation
        output = activation(z)            # model prediction
        overall_loss += loss_fn(output, y)
    overall_loss /= n
    derivative_weight1, derivative_bias = compute_partial_derivatives(overall_loss)
    weight1 -= alpha * derivative_weight1
    bias -= alpha * derivative_bias
```
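The partial-derivative step in the pseudocode is an application of the chain rule. A minimal numeric check for a single sample, assuming (for illustration) a sigmoid activation and a squared loss:

```python
import numpy as np

# One training pair and a tiny model: z = w*x + b, y_hat = sigmoid(z),
# L = (y_hat - y)^2. All values are illustrative.
x, y = 2.0, 1.0
w, b = 0.5, 0.1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = w * x + b
y_hat = sigmoid(z)

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dyhat = 2 * (y_hat - y)           # derivative of the squared loss
dyhat_dz = y_hat * (1 - y_hat)       # derivative of the sigmoid
dz_dw = x                            # derivative of the linear model
dL_dw = dL_dyhat * dyhat_dz * dz_dw

# Verify against a finite-difference estimate of the same derivative
h = 1e-6
L = lambda w_: (sigmoid(w_ * x + b) - y) ** 2
numeric = (L(w + h) - L(w - h)) / (2 * h)
assert abs(dL_dw - numeric) < 1e-6
```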
Stochastic Gradient Descent
In stochastic gradient descent, rather than calculating the error as an average of all the training examples, we select M random training examples from the entire dataset and use that in our cost function. We call this subset a mini-batch.
Once our network has been trained on all the data points in our mini-batch, we select a new subset of random points and train our model with that. We continue this process until we’ve exhausted all training points, at which point we’ve completed an epoch. We then start with a new epoch and continue until convergence.
Main differences w.r.t. the "standard" Gradient Descent:

- Partial derivatives are computed on each mini-batch rather than on the full dataset
- Meaning, model weights and bias are updated after each mini-batch
- The step size (alpha) is smaller, to avoid big jumps between mini-batches
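The process above can be sketched as follows, again on illustrative linear-regression data with a mean-squared-error loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data; the dataset, model, and loss are stand-ins.
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + rng.normal(scale=0.05, size=200)

w = np.zeros(2)       # model weights
alpha = 0.05          # smaller step size, as noted above
M = 16                # mini-batch size
epochs = 30

for _ in range(epochs):
    order = rng.permutation(len(X))            # new random mini-batches each epoch
    for start in range(0, len(X), M):
        batch = order[start:start + M]         # indices of the next mini-batch
        Xb, yb = X[batch], y[batch]
        err = Xb @ w - yb                      # predictions minus targets
        grad = 2 * Xb.T @ err / len(batch)     # MSE gradient on this mini-batch only
        w -= alpha * grad                      # update after each mini-batch
```

Each epoch reshuffles the data so the mini-batches differ between epochs; the loop ends after a fixed number of epochs here, but in practice training continues until convergence.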
GD with momentum
TBA
Adam optimizer
TBA