# Optimization

## About optimization

Given *W* the weights matrix, we need to find out the best solution that minimizes the loss function and optimizes the success metric or metrics.

### Random search

Randomly checking N random weight matrixes and selecting the one that behaves the best in the training set.

### Follow the slope

$$
slope\_in\_one\_dimension= \frac{df(x)}{dx} = \lim\_{h\to0} \frac{f(x + h) - f(x)}{h}
$$

In multiple dimensions, the **gradient** is the vector of partial derivatives along each dimension. The slope in any direction is the **dot product** of the direction with the gradient. The direction of the **steepest descent is the negative gradient**.

In practice, we will follow the analytic gradient (Newton-Raphson method).

{% embed url="<https://www.youtube.com/watch?v=f9wIElV4s0A>" %}

## Gradient descent

### Vanilla Gradient Descent (GD)

```python
weights = np.random(N)

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)  # current gradient
  weights += - step_size * weights_grad  # parameter update
```

*Step size* updates the weights in the opposite direction to the gradient. Since the gradient points to the direction of the greatest increase of the function `- step_size` helps us to slowly move in the direction of the greatest decrease. This variable is a hyperparameter of the algorithm, also called the **learning rate**.

{% content-ref url="/pages/-MganRSiaDYaFX1e0dwW" %}
[Hyperparameter tuning](/iron-data-science-notebook/ml-datascience/ml-techniques/hyperparameter-tuning.md)
{% endcontent-ref %}

We want the Slope to be closer to 0, to minimize it. To compute the partial derivative of the loss w\.r.t. $$w\_1$$, we can apply the chain rule:

<figure><img src="/files/vE9H5sXbW1Ogbw6oj9b1" alt="" width="420"><figcaption></figcaption></figure>

<figure><img src="/files/Mg0EVPpjEtNzhN4WasTV" alt=""><figcaption></figcaption></figure>

To go more in detail about how GD works, we can observe this pseudo code:

```python
from ai_framework import activation_fn  # output of the model

n = 1000  # number of data samples
epochs = 10  # number of iterations across all the dataset
alpha = 0.2  # learning rate
weight1 = 0  # model weigth
bias = 0  # model bias

for _ in range(epochs):
    overall_loss = 0
    for x, y in range(n):
        z = pred(weight1, bias, x)
        output = activation(z)
        overall_loss += loss_fn(output, y)
    overall_loss /= n    
    derivative_weight1, derivative_bias = compute_partial_derivatives(overall_loss)
    weight1 -= alpha * derivative_weight1
    bias -= alpha * derivative_bias
```

### Stochastic Gradient Descent

{% hint style="info" %}
Source:&#x20;

* [Deep Learning Part 2: Vanilla vs Stochastic Gradient Descent](https://medium.com/geekculture/deep-learning-part-2-vanilla-vs-stochastic-gradient-descent-6bcecc26fd51)
  {% endhint %}

In stochastic gradient descent, rather than calculating the error as an average of **all** the training examples, we select *M random training examples* from the entire dataset and use that in our cost function. We call this subset a **mini-batch**.

Once our network has been trained on all the data points in our mini-batch, we select a new subset of random points and train our model with that. We continue this process until we’ve exhausted all training points, at which point we’ve completed an **epoch**. We then start with a new **epoch** and continue until convergence.

{% hint style="success" %}
Main differences w\.r.t. the "standard" Gradient Descent:

* Partial derivates are computed on every training sample
* Meaning, model weights, and bias are updated after each sample
* The step size (alpha) is smaller to avoid big changes between samples
  {% endhint %}

{% hint style="info" %}
There's a variant called **Minibatch GD:**

* Mini batches are used (with $$2^x$$ sizes)
* Parameters are updated after each mini-batch
* This solution is **less noisy** than SGD
* But converges **faster** than Vanilla GD
* Better GPU usage
  {% endhint %}

### GD with momentum

TBA

### Adam optimizer

TBA


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://jgoodman8.gitbook.io/iron-data-science-notebook/ml-datascience/ml-techniques/optimization.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
