# Optimizers

The vast majority of learning applications solve the associated high-dimensional optimization problem iteratively.

More formally, given a cost function $F(\mathbf{p})$ to be minimized, depending on a set of parameters $\mathbf{p} = p_1 \dots p_M$, at each step of the optimization we perform a change in the parameters:

$$
p^\prime_k = p_k + \mathcal{S}_k,
$$

i.e. each component is updated with a variation $\mathcal{S}_k$ that depends on the current and past values of the parameters and on the cost function $F$. Typically, but not exclusively, $\mathcal{S}_k$ contains information directly related to the gradient of $F$.

NetKet implements a series of *optimizers*, suitable for situations where the gradient of $F$ (and $F$ itself) is known only stochastically.
Optimizers are used in conjunction with one of the available learning Methods, by specifying the field `Name` in the `Optimizer` section of the input (see, for example, the Example blocks in the sections below).

## Common Parameters: Gradient Clipping

Gradient clipping can be used with all the optimizers. We implement two methods to clip gradients: `ClipNorm` and `ClipVal`. Clipping is applied only if the corresponding parameter is specified; otherwise no clipping is performed. The two methods can be used independently or together.

`ClipNorm` clips the variation based on its L2 norm, i.e. if $\|\mathcal{S}\|_2$ is greater than a certain threshold $c_{\mathrm{norm}}$, we rescale the whole variation:

$$
\mathcal{S} \rightarrow \frac{c_{\mathrm{norm}}}{\|\mathcal{S}\|_2}\, \mathcal{S}.
$$

`ClipVal` clips the variation based on the absolute value of its elements, i.e. if an element $|\mathcal{S}_k|$ is greater than a certain threshold $c_{\mathrm{val}}$, we rescale that element:

$$
\mathcal{S}_k \rightarrow \frac{c_{\mathrm{val}}}{|\mathcal{S}_k|}\, \mathcal{S}_k.
$$

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| `ClipNorm` | Float > 0 | Threshold for L2 norm of the variation | None |
| `ClipVal` | Float > 0 | Threshold for each element of the variation | None |
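
The two clipping rules can be sketched in a few lines of plain Python. This is only an illustration, not NetKet's implementation: the function name `clip_variation`, the use of real-valued lists, and the order in which the two rules are applied (norm first, then element-wise) are assumptions made for the example.

```python
import math


def clip_variation(variation, clip_norm=None, clip_val=None):
    """Return a clipped copy of `variation` (a list of real numbers)."""
    clipped = list(variation)

    # ClipNorm: rescale the whole vector if its L2 norm exceeds the threshold.
    if clip_norm is not None:
        norm = math.sqrt(sum(s * s for s in clipped))
        if norm > clip_norm:
            clipped = [s * clip_norm / norm for s in clipped]

    # ClipVal: rescale every element whose magnitude exceeds the threshold.
    # For real numbers this is the same as clamping to [-clip_val, clip_val].
    if clip_val is not None:
        clipped = [max(-clip_val, min(clip_val, s)) for s in clipped]

    return clipped


print(clip_variation([3.0, 4.0], clip_norm=1.0))   # norm 5 is rescaled to norm 1
print(clip_variation([3.0, -4.0], clip_val=2.0))   # each element is limited to magnitude 2
print(clip_variation([3.0, -4.0]))                 # no thresholds given: unchanged
```

Leaving both thresholds at `None` mirrors the default behaviour in the table above, where no clipping is performed unless a parameter is given.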

## Stochastic Gradient Descent

Stochastic Gradient Descent is one of the most popular *optimizers* in machine learning applications.
Given a stochastic estimate of the gradient of the cost function ($G(\mathbf{p})$), it performs the update:

$$
p^\prime_k = p_k - \eta\, G_k(\mathbf{p}),
$$

where $\eta$ is the so-called learning rate. NetKet also implements two extensions to simple SGD: the first is $L_2$ regularization, and the second is the possibility to set a decay factor $\gamma \leq 1$ for the learning rate, such that at iteration $n$ the learning rate is $\eta \gamma^n$.
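
A minimal sketch of the SGD update with both extensions, assuming real parameters stored in a Python list. The helper names (`sgd_step`, `run_sgd`) are hypothetical, and the way the regularization enters (adding `l2_reg * p_k` to each gradient component) is an assumed convention rather than a statement about NetKet's internals.

```python
def sgd_step(params, grad, learning_rate, l2_reg=0.0):
    """One SGD update: p_k <- p_k - eta * (G_k + l2_reg * p_k)."""
    return [p - learning_rate * (g + l2_reg * p) for p, g in zip(params, grad)]


def run_sgd(grad_fn, params, learning_rate, decay_factor=1.0, l2_reg=0.0, n_iter=100):
    """Iterate SGD; at iteration n the learning rate is learning_rate * decay_factor**n."""
    eta = learning_rate
    for _ in range(n_iter):
        params = sgd_step(params, grad_fn(params), eta, l2_reg)
        eta *= decay_factor  # decay the learning rate for the next iteration
    return params


# Toy cost F(p) = sum_k p_k^2, whose exact gradient is 2 p.
def toy_gradient(p):
    return [2.0 * x for x in p]


result = run_sgd(toy_gradient, [1.0, -2.0], learning_rate=0.1)
print(result)  # both components shrink towards the minimum at zero
```

With `decay_factor=1.0` and `l2_reg=0.0` the loop reduces to plain SGD, which matches the default values listed for `DecayFactor` and `L2Reg` below.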

The Stochastic Gradient Descent optimizer can be chosen by specifying `'Name' : 'Sgd'` in `pars['Optimizer']`.

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| `DecayFactor` | Float | The decay factor | 1 |
| `LearningRate` | Float | The learning rate | None |
| `L2Reg` | Float | The amount of regularization | 0 |

### Example

```
pars['Optimizer'] = {
    'Name': 'Sgd',
    'LearningRate': 0.01,
}
```

## AdaMax

NetKet implements AdaMax, an adaptive stochastic gradient descent method that is a variant of Adam based on the infinity norm. Compared with SGD, AdaMax has the important advantage of being much less sensitive to the choice of hyper-parameters (for example, the learning rate).

Given a stochastic estimate of the gradient of the cost function ($G(\mathbf{p})$), AdaMax performs an update:

$$
p^\prime_k = p_k + \mathcal{S}_k,
$$

where $\mathcal{S}_k$ implicitly depends on the whole history of the optimization up to the current point. The NetKet naming convention of the parameters strictly follows the one introduced by the authors of AdaMax. For an in-depth description of this method, please refer to Reference 1 (Algorithm 2 therein).
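
To make the history dependence concrete, here is a small sketch of one AdaMax step following Algorithm 2 of Reference 1, written for real parameters in plain Python. The function name `adamax_step` and the `eps` guard against division by zero are assumptions added for this illustration; they are not taken from NetKet.

```python
def adamax_step(params, grad, m, u, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One AdaMax update; `t` is the 1-based iteration counter."""
    # Exponential moving average of the gradient (first moment).
    m = [beta1 * mk + (1.0 - beta1) * gk for mk, gk in zip(m, grad)]
    # Exponentially weighted infinity norm of the past gradients.
    u = [max(beta2 * uk, abs(gk)) for uk, gk in zip(u, grad)]
    # Step size with bias correction for the first moment.
    step = alpha / (1.0 - beta1 ** t)
    # `eps` only protects against u_k == 0 (an assumption, not part of Algorithm 2).
    params = [pk - step * mk / max(uk, eps) for pk, mk, uk in zip(params, m, u)]
    return params, m, u


# Toy cost F(p) = p_0^2 + p_1^2, whose exact gradient is 2 p.
params, m, u = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 501):
    grad = [2.0 * pk for pk in params]
    params, m, u = adamax_step(params, grad, m, u, t, alpha=0.01)
print(params)  # both components have moved towards zero
```

The keyword arguments `alpha`, `beta1` and `beta2` play the role of the `Alpha`, `Beta1` and `Beta2` parameters in the table below.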

The AdaMax optimizer can be chosen by specifying `'Name' : 'AdaMax'` in `pars['Optimizer']`.

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| `Alpha` | Float | The step size | 0.001 |
| `Beta1` | Float | First exponential decay rate | 0.9 |
| `Beta2` | Float | Second exponential decay rate | 0.999 |

### Example

```
pars['Optimizer'] = {
    'Name': 'AdaMax',
    'Alpha': 0.01,
}
```

## Momentum

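The momentum update incorporates an exponentially weighted moving average over previous gradients to speed up descent [2]. The momentum vector $\mathbf{m}$ is initialized to zero. Given a stochastic estimate of the gradient of the cost function $G(\mathbf{p})$, the updates for the parameter $p_k$ and the corresponding component of the momentum $m_k$ are

$$
\begin{aligned}
m^\prime_k &= \beta m_k + (1-\beta)\, G_k(\mathbf{p}) \\
p^\prime_k &= p_k - \eta\, m^\prime_k
\end{aligned}
$$

where $\eta$ is the learning rate and $\beta$ is the momentum exponential decay rate.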
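
As an illustration, a momentum step can be sketched as follows, assuming the moving-average form described above (the momentum is an exponentially weighted average of gradients, and the parameters move against it). The function name `momentum_step` and the use of plain Python lists are choices made for this example only.

```python
def momentum_step(params, grad, m, learning_rate=0.001, beta=0.9):
    """One momentum update; `m` is the momentum vector (zeros at the start)."""
    # Exponentially weighted moving average over the gradients seen so far.
    m = [beta * mk + (1.0 - beta) * gk for mk, gk in zip(m, grad)]
    # Move the parameters against the averaged gradient.
    params = [pk - learning_rate * mk for pk, mk in zip(params, m)]
    return params, m


# Toy cost F(p) = p^2 with exact gradient 2 p.
params, m = [1.0], [0.0]
for _ in range(200):
    grad = [2.0 * pk for pk in params]
    params, m = momentum_step(params, grad, m, learning_rate=0.1)
print(params)  # close to the minimum at zero
```

Because the momentum starts at zero, the first few steps are smaller than a plain SGD step with the same learning rate; the average then builds up over the following iterations.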

The Momentum optimizer can be chosen by specifying `'Name' : 'Momentum'` in `pars['Optimizer']`.

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| `LearningRate` | Float | The learning rate | 0.001 |
| `Beta` | Float | Momentum exponential decay rate | 0.9 |

### Example

```
pars['Optimizer'] = {
    'Name': 'Momentum',
    'LearningRate': 0.01,
}
```

## AdaGrad

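In many cases, the learning rate should decay as a function of training iteration to prevent overshooting as the optimum is approached. AdaGrad [3] is an adaptive learning rate algorithm that automatically scales the learning rate with a sum over past squared gradients. The vector $\mathbf{g}$ is initialized to zero. Given a stochastic estimate of the gradient of the cost function $G(\mathbf{p})$, the updates for $g_k$ and the parameter $p_k$ are

$$
\begin{aligned}
g^\prime_k &= g_k + G_k(\mathbf{p})^2 \\
p^\prime_k &= p_k - \frac{\eta}{\sqrt{g^\prime_k + \epsilon}}\, G_k(\mathbf{p})
\end{aligned}
$$

where $\eta$ is the learning rate and $\epsilon$ is a small positive constant that prevents division by zero.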

AdaGrad has been shown to perform particularly well when the gradients are sparse, but the learning rate may become too small after many updates because the sum over the squares of past gradients is cumulative.
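
The shrinking step size can be seen in a short sketch. It assumes real parameters, a hypothetical function name `adagrad_step`, and a stabilizing constant `eps` whose value (`1e-7`) is chosen only for this illustration.

```python
import math


def adagrad_step(params, grad, g2, learning_rate=0.001, eps=1e-7):
    """One AdaGrad update; `g2` accumulates the squared gradients (zeros at the start)."""
    # Cumulative sum of squared gradients: it can only grow.
    g2 = [ak + gk * gk for ak, gk in zip(g2, grad)]
    # Per-component learning rate, scaled down by the accumulated history.
    params = [pk - learning_rate * gk / math.sqrt(ak + eps)
              for pk, gk, ak in zip(params, grad, g2)]
    return params, g2


# Feed the same gradient twice: the second step is shorter than the first.
p0, g2 = [1.0], [0.0]
p1, g2 = adagrad_step(p0, [2.0], g2, learning_rate=0.1)
p2, g2 = adagrad_step(p1, [2.0], g2, learning_rate=0.1)
print(p0[0] - p1[0], p1[0] - p2[0])
```

The two printed step lengths differ even though the gradient is identical, which is the cumulative effect described above.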

The AdaGrad optimizer can be chosen by specifying `'Name' : 'AdaGrad'` in `pars['Optimizer']`.

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| `LearningRate` | Float | The learning rate | 0.001 |

### Example

```
pars['Optimizer'] = {
    'Name': 'AdaGrad',
    'LearningRate': 0.01,
}
```

## RMSProp

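RMSProp is a well-known update algorithm proposed by Geoff Hinton in his Neural Networks course notes [4]. It corrects the problem with AdaGrad by using an exponentially weighted moving average over past squared gradients instead of a cumulative sum. After initializing the vector $\mathbf{s}$ to zero, $s_k$ and the parameters $p_k$ are updated as

$$
\begin{aligned}
s^\prime_k &= \beta s_k + (1-\beta)\, G_k(\mathbf{p})^2 \\
p^\prime_k &= p_k - \frac{\eta}{\sqrt{s^\prime_k + \epsilon}}\, G_k(\mathbf{p})
\end{aligned}
$$

where $\eta$ is the learning rate, $\beta$ is the exponential decay rate, and $\epsilon$ is a small positive constant that prevents division by zero.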
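
A sketch of one RMSProp step, under the same assumptions as before: real parameters in plain Python lists, a hypothetical function name `rmsprop_step`, and an illustrative stabilizing constant `eps`.

```python
import math


def rmsprop_step(params, grad, s, learning_rate=0.001, beta=0.9, eps=1e-7):
    """One RMSProp update; `s` is the running average of squared gradients."""
    # Exponentially weighted moving average instead of AdaGrad's cumulative sum.
    s = [beta * sk + (1.0 - beta) * gk * gk for sk, gk in zip(s, grad)]
    # Scale each component of the gradient by the root of its running average.
    params = [pk - learning_rate * gk / math.sqrt(sk + eps)
              for pk, gk, sk in zip(params, grad, s)]
    return params, s


# A large gradient followed by zero gradients: the average decays again,
# so old gradients are gradually forgotten.
params, s = [1.0], [0.0]
params, s = rmsprop_step(params, [2.0], s, learning_rate=0.1)
s_after_first = s[0]
for _ in range(10):
    params, s = rmsprop_step(params, [0.0], s, learning_rate=0.1)
print(s_after_first, s[0])
```

The second printed value is smaller than the first, in contrast to the AdaGrad accumulator, which never decreases.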

The RMSProp optimizer can be chosen by specifying `'Name' : 'RMSProp'` in `pars['Optimizer']`.

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| `LearningRate` | Float | The learning rate | 0.001 |
| `Beta` | Float | Exponential decay rate | 0.9 |

### Example

```
pars['Optimizer'] = {
    'Name': 'RMSProp',
    'LearningRate': 0.01,
}
```

## AdaDelta

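Like RMSProp, AdaDelta [5] corrects the monotonic decay of learning rates associated with AdaGrad, while additionally eliminating the need to choose a global learning rate $\eta$. The NetKet naming convention of the parameters strictly follows the one introduced in Ref. 5; here $E[g^2]$ is equivalent to the running average of squared gradients used in RMSProp. $E[g^2]$ and $E[\Delta x^2]$ are initialized as zero vectors and updated, together with the parameters $p_k$, as

$$
\begin{aligned}
E[g^2]^\prime_k &= \rho\, E[g^2]_k + (1-\rho)\, G_k(\mathbf{p})^2 \\
\Delta p_k &= -\frac{\sqrt{E[\Delta x^2]_k + \epsilon}}{\sqrt{E[g^2]^\prime_k + \epsilon}}\, G_k(\mathbf{p}) \\
E[\Delta x^2]^\prime_k &= \rho\, E[\Delta x^2]_k + (1-\rho)\, \Delta p_k^2 \\
p^\prime_k &= p_k + \Delta p_k
\end{aligned}
$$

where $\rho$ is the exponential decay rate and $\epsilon$ is a small positive constant.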
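
The AdaDelta recursion from Ref. 5 can be sketched as below. The names `adadelta_step`, `eg2` (for $E[g^2]$) and `edx2` (for $E[\Delta x^2]$), as well as the value of `eps`, are assumptions made for this example; note that no learning rate appears anywhere.

```python
import math


def adadelta_step(params, grad, eg2, edx2, rho=0.95, eps=1e-7):
    """One AdaDelta update; `eg2` and `edx2` are running averages (zeros at the start)."""
    # Running average of squared gradients, as in RMSProp.
    eg2 = [rho * a + (1.0 - rho) * g * g for a, g in zip(eg2, grad)]
    # Step: ratio of the RMS of past steps to the RMS of gradients, times the gradient.
    delta = [-math.sqrt(b + eps) / math.sqrt(a + eps) * g
             for a, b, g in zip(eg2, edx2, grad)]
    # Running average of squared steps, used by the next iteration.
    edx2 = [rho * b + (1.0 - rho) * d * d for b, d in zip(edx2, delta)]
    params = [p + d for p, d in zip(params, delta)]
    return params, eg2, edx2


# Toy cost F(p) = p^2 with exact gradient 2 p.
start = 1.0
params, eg2, edx2 = [start], [0.0], [0.0]
for _ in range(100):
    grad = [2.0 * p for p in params]
    params, eg2, edx2 = adadelta_step(params, grad, eg2, edx2)
print(params)  # has moved from the starting value towards zero
```

Since `edx2` starts at zero, the very first steps are set by `eps` alone and are therefore small; the step size then adapts as the running averages fill in.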

The AdaDelta optimizer can be chosen by specifying `'Name' : 'AdaDelta'` in `pars['Optimizer']`.

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| `Rho` | Float | Exponential decay rate | 0.95 |

### Example

```
pars['Optimizer'] = {
    'Name': 'AdaDelta',
    'Rho': 0.9,
}
```

## AMSGrad

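In some cases, adaptive learning rate methods such as AdaMax fail to converge to the optimal solution because they rely on an exponential moving average over past gradients. To address this problem, Sashank J. Reddi, Satyen Kale and Sanjiv Kumar proposed the AMSGrad update algorithm [6]. The update rule for $\mathbf{v}$ (equivalent to $E[g^2]$ in AdaDelta and $\mathbf{s}$ in RMSProp) is modified such that $v^\prime_k \geq v_k$ is guaranteed, giving the algorithm a "long-term memory" of past gradients. The vectors $\mathbf{m}$ and $\mathbf{v}$ are initialized to zero, and are updated with the parameters $\mathbf{p}$:

$$
\begin{aligned}
m^\prime_k &= \beta_1 m_k + (1-\beta_1)\, G_k(\mathbf{p}) \\
v^\prime_k &= \max\left(v_k,\; \beta_2 v_k + (1-\beta_2)\, G_k(\mathbf{p})^2\right) \\
p^\prime_k &= p_k - \frac{\eta}{\sqrt{v^\prime_k} + \epsilon}\, m^\prime_k
\end{aligned}
$$

where $\eta$ is the learning rate, $\beta_1$ and $\beta_2$ are the exponential decay rates, and $\epsilon$ is a small positive constant that prevents division by zero.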
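
The role of the maximum can be illustrated with a short sketch. As in the other examples, the function name `amsgrad_step`, the real-valued lists and the value of `eps` are assumptions made for illustration.

```python
import math


def amsgrad_step(params, grad, m, v, learning_rate=0.001,
                 beta1=0.9, beta2=0.999, eps=1e-7):
    """One AMSGrad update; `m` and `v` are zero vectors at the start."""
    # Moving average of the gradient.
    m = [beta1 * mk + (1.0 - beta1) * gk for mk, gk in zip(m, grad)]
    # Moving average of the squared gradient, never allowed to decrease.
    v = [max(vk, beta2 * vk + (1.0 - beta2) * gk * gk) for vk, gk in zip(v, grad)]
    params = [pk - learning_rate * mk / (math.sqrt(vk) + eps)
              for pk, mk, vk in zip(params, m, v)]
    return params, m, v


# One large gradient followed by zero gradients: `v` keeps its largest value.
params, m, v = [1.0], [0.0], [0.0]
params, m, v = amsgrad_step(params, [2.0], m, v)
v_peak = v[0]
for _ in range(10):
    params, m, v = amsgrad_step(params, [0.0], m, v)
print(v_peak, v[0])  # the two values are identical
```

Without the `max`, the second value would have decayed by a factor of `beta2` per step, which is the forgetting behaviour that AMSGrad is designed to avoid.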

The AMSGrad optimizer can be chosen by specifying `'Name' : 'AMSGrad'` in `pars['Optimizer']`.

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| `LearningRate` | Float | The learning rate | 0.001 |
| `Beta1` | Float | First exponential decay rate | 0.9 |
| `Beta2` | Float | Second exponential decay rate | 0.999 |

### Example

```
pars['Optimizer'] = {
    'Name': 'AMSGrad',
    'LearningRate': 0.01,
}
```

## References

[1] Kingma, D., & Ba, J. (2015). Adam: a method for stochastic optimization

[2] Qian, N. (1999). On the momentum term in gradient descent learning algorithms

[3] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization

[4] Hinton, G. Lecture 6a: Overview of mini-batch gradient descent

[5] Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method

[6] Reddi, S., Kale, S., & Kumar, S. (2018). On the convergence of Adam and beyond
