Optimizers

The vast majority of learning applications solve the associated high-dimensional optimization problem iteratively.

More formally, given a cost function $\mathcal{C}(p)$ to be minimized, depending on a set of parameters $p = (p_1, \dots, p_N)$, at each step of the optimization we perform a change in the parameters:

$$ p_k \rightarrow p_k + \Delta p_k , $$

i.e. each component $p_k$ is updated with a variation $\Delta p_k$ depending on the current/past sets of parameters and on the cost function $\mathcal{C}$. Typically, but not exclusively, $\Delta p_k$ contains information directly related to the gradient of $\mathcal{C}$.

NetKet implements a series of optimizers, suitable for situations where the gradient of $\mathcal{C}$ (and $\mathcal{C}$ itself) is known only stochastically. Optimizers are used in conjunction with one of the available learning Methods, by specifying the field Name in the Optimizer section of the input.

Common Parameters: Gradient Clipping

Gradient clipping can be used with all the optimizers. We implement two methods to clip gradients: ClipNorm and ClipVal. Clipping is performed only if the corresponding parameter is specified; otherwise no clipping is applied. The two methods can be used independently or together.

ClipNorm clips the variation based on its L2 norm, i.e. if $\lVert \Delta p \rVert_2$ is greater than a certain threshold $c$, we rescale the variation as

$$ \Delta p \rightarrow c \, \frac{\Delta p}{\lVert \Delta p \rVert_2} . $$

ClipVal clips the variation based on the absolute value of its elements, i.e. if an element $\lvert \Delta p_k \rvert$ is greater than a certain threshold $c$, we clip it as

$$ \Delta p_k \rightarrow c \, \mathrm{sign}(\Delta p_k) . $$

Parameter   Possible values   Description                                   Default value
ClipNorm    Float > 0         Threshold for the L2 norm of the variation    None
ClipVal     Float > 0         Threshold for each element of the variation   None
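As a rough illustration of the two rules above, here is a minimal NumPy sketch (this is not NetKet code; the example vector and thresholds are arbitrary):

import numpy as np

def clip_norm(delta, threshold):
    # Rescale the whole update so its L2 norm does not exceed the threshold.
    norm = np.linalg.norm(delta)
    if norm > threshold:
        delta = delta * (threshold / norm)
    return delta

def clip_val(delta, threshold):
    # Clip each element of the update to [-threshold, threshold].
    return np.clip(delta, -threshold, threshold)

delta = np.array([0.5, -2.0, 3.0])
print(clip_norm(delta, 1.0))   # rescaled so that the L2 norm equals 1
print(clip_val(delta, 1.0))    # element-wise clipped to [-1, 1]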

Stochastic Gradient Descent

Stochastic Gradient Descent is one of the most popular optimizers in machine learning applications. Given a stochastic estimate $G_k(p)$ of the gradient of the cost function, it performs the update:

$$ p_k \rightarrow p_k - \eta \, G_k(p) , $$

where $\eta$ is the so-called learning rate. NetKet also implements two extensions to the simple SGD: the first is $L_2$ regularization, and the second is the possibility to set a decay factor $\gamma \leq 1$ for the learning rate, such that at iteration $n$ the learning rate is $\eta \gamma^n$.

The Stochastic Gradient Descent optimizer can be chosen by specifying Name:Sgd in pars[Optimizer].

Parameter      Possible values   Description                       Default value
DecayFactor    Float             The decay factor                  1
LearningRate   Float             The learning rate                 None
L2Reg          Float             The amount of L2 regularization   0

Example

pars['Optimizer']={
    'Name'    : 'Sgd',
    'LearningRate'   : 0.01,
}
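As a rough illustration, here is a minimal NumPy sketch of the plain SGD update with a learning-rate decay factor (not NetKet code; the toy quadratic cost and the parameter values are illustrative assumptions):

import numpy as np

# Plain SGD: p_k -> p_k - eta * G_k(p), with learning rate eta * gamma**n at iteration n.
def sgd(grad_fn, p, eta=0.01, gamma=1.0, steps=100):
    for n in range(steps):
        g = grad_fn(p)                  # stochastic gradient estimate G(p)
        p = p - (eta * gamma**n) * g    # decayed learning rate
    return p

# Toy quadratic cost C(p) = |p|^2 / 2, whose gradient is simply p.
print(sgd(lambda p: p, np.array([1.0, -2.0]), eta=0.1, gamma=0.999))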

AdaMax

NetKet implements AdaMax, an adaptive stochastic gradient descent method and a variant of Adam based on the infinity norm. In contrast to SGD, AdaMax offers the important advantage of being much less sensitive to the choice of hyper-parameters (for example, the learning rate).

Given a stochastic estimate $G_k(p)$ of the gradient of the cost function, AdaMax performs the update:

$$ p_k \rightarrow p_k + \Delta p_k , $$

where $\Delta p_k$ implicitly depends on the whole history of the optimization up to the current point. The NetKet naming convention for the parameters strictly follows the one introduced by the authors of AdaMax. For an in-depth description of this method, please refer to Ref. 1 (Algorithm 2 therein).

The AdaMax optimizer can be chosen by specifying Name:AdaMax in pars[Optimizer].

Parameter   Possible values   Description                     Default value
Alpha       Float             The step size                   0.001
Beta1       Float             First exponential decay rate    0.9
Beta2       Float             Second exponential decay rate   0.999

Example

pars['Optimizer']={
    'Name'    : 'AdaMax',
    'Alpha'   : 0.01,
}
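As a rough illustration, here is a minimal NumPy sketch of the AdaMax update following Algorithm 2 of Ref. 1 (not NetKet's implementation; the toy quadratic cost and step count are illustrative assumptions):

import numpy as np

def adamax(grad_fn, p, alpha=0.001, beta1=0.9, beta2=0.999, steps=200):
    m = np.zeros_like(p)    # first-moment estimate
    u = np.zeros_like(p)    # exponentially weighted infinity norm
    for t in range(1, steps + 1):
        g = grad_fn(p)
        m = beta1 * m + (1 - beta1) * g
        u = np.maximum(beta2 * u, np.abs(g))
        p = p - (alpha / (1 - beta1**t)) * m / u
    return p

# Toy quadratic cost C(p) = |p|^2 / 2, gradient G(p) = p.
print(adamax(lambda p: p, np.array([1.0, -2.0]), alpha=0.01))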

Momentum

The momentum update incorporates an exponentially weighted moving average over previous gradients to speed up descent [2]. The momentum vector $m$ is initialized to zero. Given a stochastic estimate $G_k(p)$ of the gradient of the cost function $\mathcal{C}$, the updates for the parameter $p_k$ and the corresponding component $m_k$ of the momentum are

$$ m_k \rightarrow \beta \, m_k + (1-\beta) \, G_k(p) , \qquad p_k \rightarrow p_k - \eta \, m_k . $$

The Momentum optimizer can be chosen by specifying Name:Momentum in pars[Optimizer].

Parameter      Possible values   Description                       Default value
LearningRate   Float             The learning rate                 0.001
Beta           Float             Momentum exponential decay rate   0.9

Example

pars['Optimizer']={
    'Name'    : 'Momentum',
    'LearningRate'   : 0.01,
}
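As a rough illustration, here is a minimal NumPy sketch of the momentum update described above (not NetKet code; the toy quadratic cost and parameter values are illustrative assumptions):

import numpy as np

def momentum(grad_fn, p, eta=0.001, beta=0.9, steps=200):
    m = np.zeros_like(p)    # momentum vector, initialized to zero
    for _ in range(steps):
        g = grad_fn(p)
        m = beta * m + (1 - beta) * g    # exponentially weighted average of past gradients
        p = p - eta * m
    return p

# Toy quadratic cost C(p) = |p|^2 / 2, gradient G(p) = p.
print(momentum(lambda p: p, np.array([1.0, -2.0]), eta=0.01))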

AdaGrad

In many cases, the learning rate should decay as a function of the training iteration to prevent overshooting as the optimum is approached. AdaGrad [3] is an adaptive learning rate algorithm that automatically scales the learning rate with a sum over past gradients. The vector $s$ is initialized to zero. Given a stochastic estimate $G_k(p)$ of the gradient of the cost function, the updates for $s_k$ and the parameter $p_k$ are

$$ s_k \rightarrow s_k + G_k(p)^2 , \qquad p_k \rightarrow p_k - \frac{\eta}{\sqrt{s_k + \epsilon}} \, G_k(p) , $$

where $\epsilon$ is a small cutoff that avoids division by zero.

AdaGrad has been shown to perform particularly well when the gradients are sparse, but the learning rate may become too small after many updates because the sum over the squares of past gradients is cumulative.

The AdaGrad optimizer can be chosen by specifying Name:AdaGrad in pars[Optimizer].

Parameter      Possible values   Description         Default value
LearningRate   Float             The learning rate   0.001

Example

pars['Optimizer']={
    'Name'    : 'AdaGrad',
    'LearningRate'   : 0.01,
}
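As a rough illustration, here is a minimal NumPy sketch of the AdaGrad update described above (not NetKet's implementation; the small cutoff eps, the toy quadratic cost, and the step count are illustrative assumptions):

import numpy as np

def adagrad(grad_fn, p, eta=0.001, eps=1e-7, steps=200):
    s = np.zeros_like(p)    # cumulative sum of squared gradients
    for _ in range(steps):
        g = grad_fn(p)
        s = s + g**2
        p = p - eta * g / np.sqrt(s + eps)
    return p

# Toy quadratic cost C(p) = |p|^2 / 2, gradient G(p) = p.
print(adagrad(lambda p: p, np.array([1.0, -2.0]), eta=0.1))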

RMSProp

RMSProp is a well-known update algorithm proposed by Geoff Hinton in his Neural Networks course notes [4]. It corrects the problem with AdaGrad by using an exponentially weighted moving average over past squared gradients instead of a cumulative sum. After initializing the vector $s$ to zero, $s_k$ and the parameters $p_k$ are updated as

$$ s_k \rightarrow \beta \, s_k + (1-\beta) \, G_k(p)^2 , \qquad p_k \rightarrow p_k - \frac{\eta}{\sqrt{s_k + \epsilon}} \, G_k(p) . $$

The RMSProp optimizer can be chosen by specifying Name:RMSProp in pars[Optimizer].

Parameter      Possible values   Description              Default value
LearningRate   Float             The learning rate        0.001
Beta           Float             Exponential decay rate   0.9

Example

pars['Optimizer']={
    'Name'    : 'RMSProp',
    'LearningRate'   : 0.01,
}
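As a rough illustration, here is a minimal NumPy sketch of the RMSProp update described above (not NetKet's implementation; the small cutoff eps and the toy quadratic cost are illustrative assumptions):

import numpy as np

def rmsprop(grad_fn, p, eta=0.001, beta=0.9, eps=1e-7, steps=200):
    s = np.zeros_like(p)    # moving average of squared gradients
    for _ in range(steps):
        g = grad_fn(p)
        s = beta * s + (1 - beta) * g**2
        p = p - eta * g / np.sqrt(s + eps)
    return p

# Toy quadratic cost C(p) = |p|^2 / 2, gradient G(p) = p.
print(rmsprop(lambda p: p, np.array([1.0, -2.0]), eta=0.01))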

AdaDelta

Like RMSProp, AdaDelta [5] corrects the monotonic decay of learning rates associated with AdaGrad, while additionally eliminating the need to choose a global learning rate $\eta$. The NetKet naming convention for the parameters strictly follows the one introduced in Ref. 5; here $E[g^2]$ is equivalent to the vector $s$ from RMSProp. $E[g^2]$ and $E[\Delta x^2]$ are initialized as zero vectors.

The AdaDelta optimizer can be chosen by specifying Name:AdaDelta in pars[Optimizer].

Parameter   Possible values   Description              Default value
Rho         Float             Exponential decay rate   0.95

Example

pars['Optimizer']={
    'Name'    : 'AdaDelta',
    'Rho'   : 0.9,
}
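As a rough illustration, here is a minimal NumPy sketch of the AdaDelta update following Algorithm 1 of Ref. 5 (not NetKet's implementation; the conditioning constant eps, the toy quadratic cost, and the step count are illustrative assumptions):

import numpy as np

def adadelta(grad_fn, p, rho=0.95, eps=1e-6, steps=500):
    eg2 = np.zeros_like(p)     # running average of squared gradients, E[g^2]
    edx2 = np.zeros_like(p)    # running average of squared updates, E[dx^2]
    for _ in range(steps):
        g = grad_fn(p)
        eg2 = rho * eg2 + (1 - rho) * g**2
        dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * g   # no global learning rate needed
        edx2 = rho * edx2 + (1 - rho) * dx**2
        p = p + dx
    return p

# Toy quadratic cost C(p) = |p|^2 / 2, gradient G(p) = p.
print(adadelta(lambda p: p, np.array([1.0, -2.0])))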

AMSGrad

In some cases, adaptive learning rate methods such as AdaMax fail to converge to the optimal solution because of the exponential moving average over past gradients. To address this problem, Sashank J. Reddi, Satyen Kale and Sanjiv Kumar proposed the AMSGrad update algorithm [6]. The update rule for $\hat{v}$ (equivalent to $E[g^2]$ in AdaDelta and $s$ in RMSProp) is modified such that $\hat{v}_{t+1} \geq \hat{v}_t$ is guaranteed, giving the algorithm a “long-term memory” of past gradients. The vectors $m$, $v$ and $\hat{v}$ are initialized to zero and updated together with the parameters $p_k$:

$$ m_k \rightarrow \beta_1 \, m_k + (1-\beta_1) \, G_k(p) , \qquad v_k \rightarrow \beta_2 \, v_k + (1-\beta_2) \, G_k(p)^2 , $$
$$ \hat{v}_k \rightarrow \max(\hat{v}_k, v_k) , \qquad p_k \rightarrow p_k - \frac{\eta}{\sqrt{\hat{v}_k} + \epsilon} \, m_k , $$

where $\epsilon$ is a small cutoff.

The AMSGrad optimizer can be chosen by specifying Name:AMSGrad in pars[Optimizer].

Parameter      Possible values   Description                     Default value
LearningRate   Float             The learning rate               0.001
Beta1          Float             First exponential decay rate    0.9
Beta2          Float             Second exponential decay rate   0.999

Example

pars['Optimizer']={
    'Name'    : 'AMSGrad',
    'LearningRate'   : 0.01,
}
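As a rough illustration, here is a minimal NumPy sketch of the AMSGrad update described above (not NetKet's implementation; the small cutoff eps and the toy quadratic cost are illustrative assumptions):

import numpy as np

def amsgrad(grad_fn, p, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7, steps=200):
    m = np.zeros_like(p)       # first-moment estimate
    v = np.zeros_like(p)       # second-moment estimate
    vhat = np.zeros_like(p)    # running maximum of v ("long-term memory")
    for _ in range(steps):
        g = grad_fn(p)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        vhat = np.maximum(vhat, v)    # enforce vhat_{t+1} >= vhat_t
        p = p - eta * m / (np.sqrt(vhat) + eps)
    return p

# Toy quadratic cost C(p) = |p|^2 / 2, gradient G(p) = p.
print(amsgrad(lambda p: p, np.array([1.0, -2.0]), eta=0.01))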

References


[1] Kingma, D., & Ba, J. (2015). Adam: a method for stochastic optimization

[2] Qian, N. (1999) On the momentum term in gradient descent learning algorithms

[3] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization

[4] Hinton, G. Lecture 6a: Overview of mini-batch gradient descent

[5] Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method

[6] Reddi, S., Kale, S., & Kumar, S. (2018) On the convergence of Adam and beyond