# Optimizers

The vast majority of learning applications solve the associated high-dimensional optimization problem iteratively.

More formally, consider a cost function $F(\mathbf{p})$ to be minimized, depending on a set of parameters $\mathbf{p} = p_1 \dots p_M$. At each step of the optimization, the parameters are changed according to

$$
p^\prime_k = p_k + \mathcal{S}_k,
$$

i.e. each component is updated with a variation $\mathcal{S}_k$ that depends on the current and past values of the parameters and on the cost function $F$. Typically, but not exclusively, $\mathcal{S}$ contains information directly related to the gradient of $F$.

NetKet implements a series of optimizers suitable for situations where the gradient of $F$ (and $F$ itself) is known only stochastically. Optimizers are used in conjunction with one of the available learning Methods, and are selected by specifying the field Name in the Optimizer section of the input.

Gradient clipping can be used with all the optimizers. Two clipping methods are implemented, ClipNorm and ClipVal, which can be used independently or together. Clipping is applied only if the corresponding parameter is specified; otherwise no clipping is performed.

ClipNorm clips the variation $\mathcal{S}_k$ based on its L2 norm: if $\|\mathcal{S}_k\|$ is greater than a threshold $N$, the variation is rescaled as

$$
\mathcal{S}_k \leftarrow \frac{N}{\|\mathcal{S}_k\|} \mathcal{S}_k .
$$

ClipVal clips the variation based on the absolute value of its elements: if $|\mathcal{S}_{k,i}|$ is greater than a threshold $N$, that element is rescaled as

$$
\mathcal{S}_{k,i} \leftarrow \frac{N}{|\mathcal{S}_{k,i}|} \mathcal{S}_{k,i} .
$$

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| ClipNorm | Float $> 0$ | Threshold for the L2 norm of the variation | None |
| ClipVal | Float $> 0$ | Threshold for each element of the variation | None |
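
The two clipping rules can be illustrated with a short NumPy sketch. This is not NetKet's implementation: the helper name `clip_variation` and its keyword arguments are hypothetical, and the order in which the two rules are applied when both thresholds are given (norm first, then element-wise) is an assumption.

```python
import numpy as np


def clip_variation(s, clip_norm=None, clip_val=None):
    """Return a clipped copy of the variation `s` (hypothetical helper)."""
    s = np.array(s)  # work on a copy; real or complex entries both work

    if clip_norm is not None:
        norm = np.linalg.norm(s)
        if norm > clip_norm:
            # Rescale the whole vector so that its L2 norm equals the threshold.
            s = s * (clip_norm / norm)

    if clip_val is not None:
        mag = np.abs(s)
        too_big = mag > clip_val
        # Rescale only the offending elements to magnitude clip_val;
        # the inner np.where avoids dividing by zero for untouched entries.
        s = np.where(too_big, s * clip_val / np.where(too_big, mag, 1.0), s)

    return s


# With no threshold given, the variation is returned unchanged.
unclipped = clip_variation([3.0, 4.0])
# With clip_norm, the vector keeps its direction but is shortened.
shortened = clip_variation([3.0, 4.0], clip_norm=1.0)
```

Rescaling by $N / |\mathcal{S}_{k,i}|$ preserves the sign (or complex phase) of each clipped element, which is why the sketch does not simply truncate values.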

## Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is one of the most popular optimizers in machine learning applications. Given a stochastic estimate $G(\mathbf{p})$ of the gradient of the cost function, it performs the update

$$
p^\prime_k = p_k - \eta G_k(\mathbf{p}),
$$

where $\eta$ is the so-called learning rate. NetKet also implements two extensions to plain SGD: $L_2$ regularization, and a decay factor $\gamma \leq 1$ for the learning rate, such that at iteration $n$ the learning rate is $\eta \gamma^n$.
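
As a rough illustration, a single SGD step with both extensions might look as follows in NumPy. The function name `sgd_step` is hypothetical, and the way the $L_2$ term enters (as an extra `l2_reg * p` added to the gradient) is an assumption based on the common convention; the text above does not spell it out.

```python
import numpy as np


def sgd_step(p, grad, n, learning_rate, decay_factor=1.0, l2_reg=0.0):
    """One SGD update at iteration n (hypothetical helper, real parameters)."""
    # Learning rate at iteration n: eta * gamma^n.
    eta_n = learning_rate * decay_factor ** n
    # Assumed form of L2 regularization: add l2_reg * p to the gradient.
    return p - eta_n * (grad + l2_reg * p)


# Toy usage: minimize F(p) = |p|^2 / 2, whose exact gradient is p itself.
p = np.array([1.0, -2.0])
for n in range(100):
    p = sgd_step(p, grad=p, n=n, learning_rate=0.1, decay_factor=0.99)
```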

The Stochastic Gradient Descent optimizer can be chosen by specifying Name:Sgd in pars[Optimizer].

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| DecayFactor | Float | The decay factor $\gamma$ | 1 |
| LearningRate | Float $\leq 1$ | The learning rate $\eta$ | None |
| L2Reg | Float $\geq 0$ | The amount of $L_2$ regularization | 0 |

### Example

```python
pars['Optimizer'] = {
    'Name': 'Sgd',
    'LearningRate': 0.01,
}
```


## AdaMax

NetKet implements AdaMax, an adaptive stochastic gradient descent method that is a variant of Adam based on the infinity norm. Compared with plain SGD, AdaMax offers the important advantage of being much less sensitive to the choice of the hyper-parameters (for example, the learning rate).

Given a stochastic estimate $G(\mathbf{p})$ of the gradient of the cost function, AdaMax performs an update

$$
p^\prime_k = p_k + \mathcal{S}_k,
$$

where $\mathcal{S}_k$ implicitly depends on all the history of the optimization up to the current point. The NetKet naming convention of the parameters strictly follows the one introduced by the authors of AdaMax. For an in-depth description of this method, please refer to Reference 1 (Algorithm 2 therein).
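
To make the dependence of $\mathcal{S}_k$ on the optimization history concrete, here is a minimal NumPy sketch of the standard AdaMax recursion, which we take to be what Algorithm 2 of Reference 1 describes. The class name `AdaMaxSketch` is hypothetical, parameters are assumed real, and the guard against a zero denominator is an addition of this sketch rather than part of the published algorithm.

```python
import numpy as np


class AdaMaxSketch:
    """Illustrative AdaMax state and update (not NetKet's implementation)."""

    def __init__(self, n_params, alpha=0.001, beta1=0.9, beta2=0.999):
        self.alpha, self.beta1, self.beta2 = alpha, beta1, beta2
        self.m = np.zeros(n_params)  # running average of the gradient
        self.u = np.zeros(n_params)  # exponentially weighted infinity norm
        self.t = 0                   # number of updates performed so far

    def step(self, p, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.u = np.maximum(self.beta2 * self.u, np.abs(grad))
        # Bias-corrected step size, as in the Adam family.
        step_size = self.alpha / (1 - self.beta1 ** self.t)
        # Sketch-only guard: where u is still zero, m is zero as well,
        # so dividing by 1 instead leaves those parameters unchanged.
        safe_u = np.where(self.u > 0, self.u, 1.0)
        return p - step_size * self.m / safe_u


opt = AdaMaxSketch(n_params=2, alpha=0.01)
p = opt.step(np.array([1.0, 1.0]), grad=np.array([0.5, -2.0]))
```

On the first step each parameter moves by `alpha` against the sign of its gradient, independently of the gradient's magnitude, which is one way to see why the method is comparatively insensitive to the overall gradient scale.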

The AdaMax optimizer can be chosen by specifying Name:AdaMax in pars[Optimizer].

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| Alpha | Float | The step size | 0.001 |
| Beta1 | Float $\in [0,1]$ | First exponential decay rate | 0.9 |
| Beta2 | Float $\in [0,1]$ | Second exponential decay rate | 0.999 |

### Example

```python
pars['Optimizer'] = {
    'Name': 'AdaMax',
    'Alpha': 0.01,
}
```


## Momentum

The momentum update incorporates an exponentially weighted moving average over previous gradients to speed up descent [2]. The momentum vector $\mathbf{m}$ is initialized to zero. Given a stochastic estimate $G(\mathbf{p})$ of the gradient of the cost function, the updates for the component $m_k$ of the momentum and for the corresponding parameter $p_k$ are

$$
\begin{aligned}
m^\prime_k &= \beta m_k + (1-\beta) G_k(\mathbf{p}) \\
p^\prime_k &= p_k - \eta m^\prime_k
\end{aligned}
$$
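
The momentum update can be sketched in a few lines of NumPy. The function name `momentum_step` is hypothetical, and the sketch assumes the moving-average form in which the new gradient is weighted by $(1-\beta)$; some formulations of momentum omit that factor.

```python
import numpy as np


def momentum_step(p, m, grad, learning_rate=0.001, beta=0.9):
    """One momentum update; returns the new parameters and momentum."""
    # Exponentially weighted moving average over past gradients.
    m_new = beta * m + (1 - beta) * grad
    # Move along the averaged direction rather than the raw gradient.
    p_new = p - learning_rate * m_new
    return p_new, m_new


# Toy usage on F(p) = |p|^2 / 2 (gradient equal to p), momentum starting at zero.
p, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    p, m = momentum_step(p, m, grad=p, learning_rate=0.1)
```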

The Momentum optimizer can be chosen by specifying Name:Momentum in pars[Optimizer].

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| LearningRate | Float $\leq 1$ | The learning rate $\eta$ | 0.001 |
| Beta | Float $\in [0,1]$ | Momentum exponential decay rate | 0.9 |

### Example

```python
pars['Optimizer'] = {
    'Name': 'Momentum',
    'LearningRate': 0.01,
}
```


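## AdaGrad

In many cases, the learning rate $\eta$ should decay as a function of the training iteration to prevent overshooting as the optimum is approached. AdaGrad [3] is an adaptive learning rate algorithm that automatically scales the learning rate with a sum over the squares of past gradients. The vector $\mathbf{g}$ is initialized to zero. Given a stochastic estimate $G(\mathbf{p})$ of the gradient of the cost function, the updates for $g_k$ and the parameter $p_k$ are

$$
\begin{aligned}
g^\prime_k &= g_k + G_k(\mathbf{p})^2 \\
p^\prime_k &= p_k - \frac{\eta}{\sqrt{g^\prime_k + \epsilon}} G_k(\mathbf{p})
\end{aligned}
$$

where $\epsilon$ is a small positive constant that prevents division by zero.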

AdaGrad has been shown to perform particularly well when the gradients are sparse, but the learning rate may become too small after many updates because the sum over the squares of past gradients is cumulative.
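
The shrinking effective learning rate is easy to see in a small NumPy sketch. The function name `adagrad_step` and the stabilizing constant `eps` (its value and its placement inside the square root) are assumptions of this sketch, not settings taken from NetKet.

```python
import numpy as np


def adagrad_step(p, g, grad, learning_rate=0.001, eps=1e-7):
    """One AdaGrad update; returns the new parameters and accumulator."""
    # Cumulative sum of squared gradients: it can only grow.
    g_new = g + grad ** 2
    # Per-parameter learning rate eta / sqrt(g_new + eps).
    p_new = p - learning_rate * grad / np.sqrt(g_new + eps)
    return p_new, g_new


# Feeding the same gradient repeatedly gives smaller and smaller steps.
p, g = np.array([1.0]), np.zeros(1)
step_sizes = []
for _ in range(5):
    p_next, g = adagrad_step(p, g, grad=np.array([2.0]), learning_rate=0.1)
    step_sizes.append(abs(p_next[0] - p[0]))
    p = p_next
```

After `n` identical gradients the step is proportional to $1/\sqrt{n}$, which illustrates both the built-in decay and the risk of the learning rate becoming too small.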

The AdaGrad optimizer can be chosen by specifying Name:AdaGrad in pars[Optimizer].

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| LearningRate | Float $\leq 1$ | The learning rate $\eta$ | 0.001 |

### Example

```python
pars['Optimizer'] = {
    'Name': 'AdaGrad',
    'LearningRate': 0.01,
}
```


## RMSProp

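RMSProp is a well-known update algorithm proposed by Geoff Hinton in his Neural Networks course notes [4]. It corrects the problem with AdaGrad by using an exponentially weighted moving average over past squared gradients instead of a cumulative sum. After initializing the vector $\mathbf{s}$ to zero, $s_k$ and the parameters $p_k$ are updated as

$$
\begin{aligned}
s^\prime_k &= \beta s_k + (1-\beta) G_k(\mathbf{p})^2 \\
p^\prime_k &= p_k - \frac{\eta}{\sqrt{s^\prime_k} + \epsilon} G_k(\mathbf{p})
\end{aligned}
$$

where $\epsilon$ is a small positive constant that prevents division by zero.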
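
A minimal NumPy sketch of this update is given below. The function name `rmsprop_step` and the stabilizing constant `eps` (value and placement) are assumptions of the sketch.

```python
import numpy as np


def rmsprop_step(p, s, grad, learning_rate=0.001, beta=0.9, eps=1e-7):
    """One RMSProp update; returns the new parameters and running average."""
    # Moving average of squared gradients: old contributions decay away,
    # so the effective learning rate does not shrink monotonically.
    s_new = beta * s + (1 - beta) * grad ** 2
    p_new = p - learning_rate * grad / (np.sqrt(s_new) + eps)
    return p_new, s_new


# Unlike a cumulative sum, the running average forgets old gradients:
# after a large gradient, s decays geometrically while the gradient is zero.
s = np.zeros(1)
_, s = rmsprop_step(np.zeros(1), s, grad=np.array([10.0]))
s_after_spike = s.copy()
for _ in range(50):
    _, s = rmsprop_step(np.zeros(1), s, grad=np.zeros(1))
```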

The RMSProp optimizer can be chosen by specifying Name:RMSProp in pars[Optimizer].

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| LearningRate | Float $\leq 1$ | The learning rate $\eta$ | 0.001 |
| Beta | Float $\in [0,1]$ | Exponential decay rate | 0.9 |

### Example

```python
pars['Optimizer'] = {
    'Name': 'RMSProp',
    'LearningRate': 0.01,
}
```


## AdaDelta

Like RMSProp, AdaDelta [5] corrects the monotonic decay of learning rates associated with AdaGrad, while additionally eliminating the need to choose a global learning rate $\eta$. The NetKet naming convention of the parameters strictly follows the one introduced in Ref. 5; here $E[g^2]$ is equivalent to the vector $\mathbf{s}$ from RMSProp. $E[g^2]$ and $E[\Delta x^2]$ are initialized as zero vectors.
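
As a hedged illustration of how the two running averages interact, the sketch below follows the usual statement of the AdaDelta recursion, with `eg2` standing for $E[g^2]$ and `edx2` for $E[\Delta x^2]$. The function name `adadelta_step` and the value of the stabilizing constant `eps` are assumptions; note that `eps` is what allows the first step to be non-zero when both averages start at zero.

```python
import numpy as np


def adadelta_step(p, eg2, edx2, grad, rho=0.95, eps=1e-7):
    """One AdaDelta update; returns new parameters and both running averages."""
    # Running average of squared gradients (same role as s in RMSProp).
    eg2_new = rho * eg2 + (1 - rho) * grad ** 2
    # Step: ratio of RMS of past steps to RMS of gradients, no global eta.
    delta = -np.sqrt(edx2 + eps) / np.sqrt(eg2_new + eps) * grad
    # Running average of squared steps, used by the next update.
    edx2_new = rho * edx2 + (1 - rho) * delta ** 2
    return p + delta, eg2_new, edx2_new


# Toy usage on F(p) = |p|^2 / 2 (gradient equal to p).
p = np.array([1.0])
eg2, edx2 = np.zeros(1), np.zeros(1)
for _ in range(10):
    p, eg2, edx2 = adadelta_step(p, eg2, edx2, grad=p, rho=0.9)
```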

The AdaDelta optimizer can be chosen by specifying Name:AdaDelta in pars[Optimizer].

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| Rho | Float $\in [0,1]$ | Exponential decay rate | 0.95 |

### Example

```python
pars['Optimizer'] = {
    'Name': 'AdaDelta',
    'Rho': 0.9,
}
```


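## AMSGrad

In some cases, adaptive learning rate methods such as AdaMax fail to converge to the optimal solution because of the exponential moving average over past gradients. To address this problem, Sashank J. Reddi, Satyen Kale and Sanjiv Kumar proposed the AMSGrad update algorithm [6]. The update rule for $\mathbf{v}$ (equivalent to $E[g^2]$ in AdaDelta and $\mathbf{s}$ in RMSProp) is modified such that $v^\prime_k \geq v_k$ is guaranteed, giving the algorithm a "long-term memory" of past gradients. The vectors $\mathbf{m}$ and $\mathbf{v}$ are initialized to zero, and are updated together with the parameters $\mathbf{p}$:

$$
\begin{aligned}
m^\prime_k &= \beta_1 m_k + (1-\beta_1) G_k(\mathbf{p}) \\
v^\prime_k &= \max\left(v_k,\ \beta_2 v_k + (1-\beta_2) G_k(\mathbf{p})^2\right) \\
p^\prime_k &= p_k - \frac{\eta}{\sqrt{v^\prime_k} + \epsilon} m^\prime_k
\end{aligned}
$$

where $\epsilon$ is a small positive constant that prevents division by zero.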
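
The guarantee $v^\prime_k \geq v_k$ can be sketched in NumPy as follows. The function name `amsgrad_step` and the constant `eps` are assumptions. The sketch keeps a single vector $\mathbf{v}$, as the description above suggests; the formulation in Ref. 6 tracks the running average and its running maximum as two separate vectors, so the two variants are not identical.

```python
import numpy as np


def amsgrad_step(p, m, v, grad, learning_rate=0.001,
                 beta1=0.9, beta2=0.999, eps=1e-7):
    """One AMSGrad-style update; returns new parameters, m and v."""
    # First moment: moving average of gradients.
    m_new = beta1 * m + (1 - beta1) * grad
    # Second moment: moving average of squared gradients, never allowed
    # to decrease (the "long-term memory" of large past gradients).
    v_new = np.maximum(v, beta2 * v + (1 - beta2) * grad ** 2)
    p_new = p - learning_rate * m_new / (np.sqrt(v_new) + eps)
    return p_new, m_new, v_new


# A large gradient followed by zero gradients: v keeps its peak value.
p, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
p, m, v = amsgrad_step(p, m, v, grad=np.array([2.0]), beta2=0.9)
v_peak = v.copy()
for _ in range(10):
    p, m, v = amsgrad_step(p, m, v, grad=np.zeros(1), beta2=0.9)
```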

The AMSGrad optimizer can be chosen by specifying Name:AMSGrad in pars[Optimizer].

| Parameter | Possible values | Description | Default value |
|---|---|---|---|
| LearningRate | Float $\leq 1$ | The learning rate $\eta$ | 0.001 |
| Beta1 | Float $\in [0,1]$ | First exponential decay rate | 0.9 |
| Beta2 | Float $\in [0,1]$ | Second exponential decay rate | 0.999 |

### Example

pars['Optimizer']={