author     Alfredo Canziani <alfredo.canziani@gmail.com>   2016-06-29 07:09:27 +0300
committer  Alfredo Canziani <alfredo.canziani@gmail.com>   2016-06-30 05:51:21 +0300
commit     c0c4bbfcc14fad7bc484358821563fddd0b9031e (patch)
tree       5cbac231bfa6bc4e961e075e80604a1b0fd52cba
parent     8755acb1fc6e91afaa9c7973f9efd4239e295d1a (diff)
Fix improper state/config documentation
-rw-r--r--   doc/algos.md | 110
-rw-r--r--   doc/intro.md |  11
2 files changed, 64 insertions, 57 deletions
diff --git a/doc/algos.md b/doc/algos.md
index b69cca7..a671420 100644
--- a/doc/algos.md
+++ b/doc/algos.md
@@ -25,12 +25,12 @@ Some of these algorithms support a line search, which can be passed as a functio
 General interface:
 
 ```lua
-x*, {f}, ... = optim.method(opfunc, x, state)
+x*, {f}, ... = optim.method(opfunc, x[, config][, state])
 ```
 
 <a name='optim.sgd'></a>
-## sgd(opfunc, x, state)
+## sgd(opfunc, x[, config][, state])
 
 An implementation of *Stochastic Gradient Descent* (*SGD*).
 
@@ -56,7 +56,7 @@ Returns:
 
 <a name='optim.asgd'></a>
-## asgd(opfunc, x, state)
+## asgd(opfunc, x[, config][, state])
 
 An implementation of *Averaged Stochastic Gradient Descent* (*ASGD*):
 
@@ -72,11 +72,11 @@ Arguments:
 
 * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
 * `x`: the initial point
- * `state`: a table describing the state of the optimizer; after each call the state is modified
- * `state.eta0`: learning rate
- * `state.lambda`: decay term
- * `state.alpha`: power for eta update
- * `state.t0`: point at which to start averaging
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.eta0`: learning rate
+ * `config.lambda`: decay term
+ * `config.alpha`: power for eta update
+ * `config.t0`: point at which to start averaging
 
 Returns:
 
@@ -86,7 +86,7 @@ Returns:
 
 <a name='optim.lbfgs'></a>
-## lbfgs(opfunc, x, state)
+## lbfgs(opfunc, x[, config][, state])
 
 An implementation of *L-BFGS* that relies on a user-provided line search function (`state.lineSearch`). If this function is not provided, then a simple learning rate is used to produce fixed size steps.
 
@@ -100,13 +100,13 @@ Arguments:
 
 * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
 * `x`: the initial point
- * `state`: a table describing the state of the optimizer; after each call the state is modified
- * `state.maxIter`: Maximum number of iterations allowed
- * `state.maxEval`: Maximum number of function evaluations
- * `state.tolFun`: Termination tolerance on the first-order optimality
- * `state.tolX`: Termination tol on progress in terms of func/param changes
- * `state.lineSearch`: A line search function
- * `state.learningRate`: If no line search provided, then a fixed step size is used
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.maxIter`: Maximum number of iterations allowed
+ * `config.maxEval`: Maximum number of function evaluations
+ * `config.tolFun`: Termination tolerance on the first-order optimality
+ * `config.tolX`: Termination tol on progress in terms of func/param changes
+ * `config.lineSearch`: A line search function
+ * `config.learningRate`: If no line search provided, then a fixed step size is used
 
 Returns:
 * `x*`: the new `x` vector, at the optimal point
@@ -116,7 +116,7 @@ Returns:
 
 <a name='optim.cg'></a>
-## cg(opfunc, x, state)
+## cg(opfunc, x[, config][, state])
 
 An implementation of the *Conjugate Gradient* method which is a rewrite of `minimize.m` written by Carl E. Rasmussen. It is supposed to produce exactly same results (give or take numerical accuracy due to some changed order of operations).
 
@@ -132,11 +132,12 @@ Arguments:
 
 * `opfunc`: a function that takes a single input, the point of evaluation.
 * `x`: the initial point
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.maxEval`: max number of function evaluations
+ * `config.maxIter`: max number of iterations
 * `state`: a table of parameters and temporary allocations.
- * `state.maxEval`: max number of function evaluations
- * `state.maxIter`: max number of iterations
- * `state.df[0, gc, gc, gc]`: if you pass torch.Tensor they will be used for temp storage
- * `state.[s, gc0]`: if you pass torch.Tensor they will be used for temp storage
+ * `state.df[0, 1, 2, 3]`: if you pass `Tensor` they will be used for temp storage
+ * `state.[s, x0]`: if you pass `Tensor` they will be used for temp storage
 
 Returns:
 
@@ -147,9 +148,9 @@ Returns:
 
 <a name='optim.adadelta'></a>
-## adadelta(opfunc, x, config, state)
+## adadelta(opfunc, x[, config][, state])
 
-*AdaDelta* implementation for *SGD* http://arxiv.org/abs/1212.5701
+*AdaDelta* implementation for *SGD* http://arxiv.org/abs/1212.5701.
 
 Arguments:
 
@@ -169,16 +170,17 @@ Returns:
 
 <a name='optim.adagrad'></a>
-## adagrad(opfunc, x, config, state)
+## adagrad(opfunc, x[, config][, state])
 
-*AdaGrad* implementation for *SGD*
+*AdaGrad* implementation for *SGD*.
 
 Arguments:
 
-* `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
-* `x`: the initial point
-* `state`: a table describing the state of the optimizer; after each call the state is modified
-* `state.learningRate`: learning rate
+ * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.learningRate`: learning rate
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
 * `state.paramVariance`: vector of temporal variances of parameters
 
 Returns:
 
@@ -188,7 +190,7 @@ Returns:
 
 <a name='optim.adam'></a>
-## adam(opfunc, x, config, state)
+## adam(opfunc, x[, config][, state])
 
 An implementation of *Adam* from http://arxiv.org/pdf/1412.6980.pdf.
 
@@ -210,9 +212,9 @@ Returns:
 
 <a name='optim.adamax'></a>
-## adamax(opfunc, x, config, state)
+## adamax(opfunc, x[, config][, state])
 
-An implementation of *AdaMax* http://arxiv.org/pdf/1412.6980.pdf
+An implementation of *AdaMax* http://arxiv.org/pdf/1412.6980.pdf.
 
 Arguments:
 
@@ -223,7 +225,7 @@ Arguments:
 * `config.beta1`: first moment coefficient
 * `config.beta2`: second moment coefficient
 * `config.epsilon`: for numerical stability
- * `state`: a table describing the state of the optimizer; after each call the state is modified.
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
 
 Returns:
 
@@ -232,7 +234,7 @@ Returns:
 
 <a name='optim.FistaLS'></a>
-## FistaLS(f, g, pl, xinit, params)
+## FistaLS(f, g, pl, xinit[, params])
 
 *Fista* with backtracking *Line Search*:
 
@@ -264,7 +266,7 @@ Algorithm is published in http://epubs.siam.org/doi/abs/10.1137/080716542
 
 <a name='optim.nag'></a>
-## nag(opfunc, x, config, state)
+## nag(opfunc, x[, config][, state])
 
 An implementation of *SGD* adapted with features of *Nesterov's Accelerated Gradient method*, based on the paper "On the Importance of Initialization and Momentum in Deep Learning" (Sutsveker et. al., ICML 2013) http://www.cs.toronto.edu/~fritz/absps/momentum.pdf.
 
@@ -272,12 +274,12 @@ Arguments:
 
 * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
 * `x`: the initial point
- * `state`: a table describing the state of the optimizer; after each call the state is modified
- * `state.learningRate`: learning rate
- * `state.learningRateDecay`: learning rate decay
- * `astate.weightDecay`: weight decay
- * `state.momentum`: momentum
- * `state.learningRates`: vector of individual learning rates
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.learningRate`: learning rate
+ * `config.learningRateDecay`: learning rate decay
+ * `config.weightDecay`: weight decay
+ * `config.momentum`: momentum
+ * `config.learningRates`: vector of individual learning rates
 
 Returns:
 
@@ -286,7 +288,7 @@ Returns:
 
 <a name='optim.rmsprop'></a>
-## rmsprop(opfunc, x, config, state)
+## rmsprop(opfunc, x[, config][, state])
 
 An implementation of *RMSprop*.
 
@@ -309,21 +311,21 @@ Returns:
 
 <a name='optim.rprop'></a>
-## rprop(opfunc, x, config, state)
+## rprop(opfunc, x[, config][, state])
 
-A plain implementation of *Rprop* (Martin Riedmiller, Koray Kavukcuoglu 2013)
+A plain implementation of *Rprop* (Martin Riedmiller, Koray Kavukcuoglu 2013).
 
 Arguments:
 
 * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
 * `x`: the initial point
- * `state`: a table describing the state of the optimizer; after each call the state is modified
- * `state.stepsize`: initial step size, common to all components
- * `state.etaplus`: multiplicative increase factor, > 1 (default 1.2)
- * `state.etaminus`: multiplicative decrease factor, < 1 (default 0.5)
- * `state.stepsizemax`: maximum stepsize allowed (default 50)
- * `state.stepsizemin`: minimum stepsize allowed (default 1e-6)
- * `state.niter`: number of iterations (default 1)
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.stepsize`: initial step size, common to all components
+ * `config.etaplus`: multiplicative increase factor, > 1 (default 1.2)
+ * `config.etaminus`: multiplicative decrease factor, < 1 (default 0.5)
+ * `config.stepsizemax`: maximum stepsize allowed (default 50)
+ * `config.stepsizemin`: minimum stepsize allowed (default 1e-6)
+ * `config.niter`: number of iterations (default 1)
 
 Returns:
 
@@ -332,7 +334,7 @@ Returns:
 
 <a name='optim.cmaes'></a>
-## cmaes(opfunc, x, config, state)
+## cmaes(opfunc, x[, config][, state])
 
 An implementation of *CMAES* (*Covariance Matrix Adaptation Evolution Strategy*), ported from https://www.lri.fr/~hansen/barecmaes2.html.
 
@@ -341,8 +343,12 @@ Note that this method will on average take much more function evaluations to con
 Arguments:
 
+If `state` is specified, then `config` is not used at all.
+Otherwise `state` is `config`.
+
 * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`.
 Note that `df/dX` is not used and can be left 0
 * `x`: the initial point
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
 * `state.sigma`: float, initial step-size (standard deviation in each coordinate)
 * `state.maxEval`: int, maximal number of function evaluations
 * `state.ftarget`: float, target function value
diff --git a/doc/intro.md b/doc/intro.md
index 54d167f..b387235 100644
--- a/doc/intro.md
+++ b/doc/intro.md
@@ -1,17 +1,18 @@
 <a name='optim.overview'></a>
-## Overview
+# Overview
 
 Most optimization algorithms have the following interface:
 
 ```lua
-x*, {f}, ... = optim.method(opfunc, x, state)
+x*, {f}, ... = optim.method(opfunc, x[, config][, state])
 ```
 
 where:
 
 * `opfunc`: a user-defined closure that respects this API: `f, df/dx = func(x)`
 * `x`: the current parameter vector (a 1D `Tensor`)
-* `state`: a table of parameters, and state variables, dependent upon the algorithm
+* `config`: a table of parameters, dependent upon the algorithm
+* `state`: a table of state variables, if `nil`, `config` will contain the state
 * `x*`: the new parameter vector that minimizes `f, x* = argmin_x f(x)`
 * `{f}`: a table of all `f` values, in the order they've been evaluated (for some simple algorithms, like SGD, `#f == 1`)
@@ -24,7 +25,7 @@ It's usually initialized once, by the user, and then passed to the optim functio
 Example:
 
 ```lua
-state = {
+config = {
    learningRate = 1e-3,
    momentum = 0.5
 }
@@ -34,7 +35,7 @@ for i, sample in ipairs(training_samples) do
     -- define eval function
     return f, df_dx
   end
-  optim.sgd(func, x, state)
+  optim.sgd(func, x, config)
 end
 ```
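To make the convention this patch documents concrete, here is a minimal sketch (not part of the commit) of calling `optim.sgd` with separate `config` and `state` tables. Only the `optim.method(opfunc, x[, config][, state])` interface and the `learningRate`/`momentum` fields come from the documentation above; the toy objective `feval`, the tensor size, and the hyper-parameter values are illustrative assumptions.

```lua
require 'torch'
require 'optim'

local x = torch.randn(10)            -- parameter vector (a 1D Tensor)

-- user-defined closure respecting the documented API: f, df/dx = feval(x)
local function feval(x)
   local f = 0.5 * x:dot(x)          -- toy objective: 1/2 * ||x||^2 (an assumption)
   local df_dx = x:clone()           -- its gradient w.r.t. x
   return f, df_dx
end

-- `config` holds the optimizer's configuration parameters...
local config = { learningRate = 1e-3, momentum = 0.5 }
-- ...while `state` is a separate table the optimizer fills with its internal
-- variables (e.g. the momentum buffer) and modifies after each call.
local state = {}

for i = 1, 100 do
   optim.sgd(feval, x, config, state)
end

-- Passing a single table also works: if `state` is nil, `config` doubles as
-- the state table, which is the behaviour described in doc/intro.md.
```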