author | Alfredo Canziani <alfredo.canziani@gmail.com> | 2016-06-27 05:17:43 +0300
committer | Alfredo Canziani <alfredo.canziani@gmail.com> | 2016-06-27 05:22:03 +0300
commit | 8755acb1fc6e91afaa9c7973f9efd4239e295d1a (patch)
tree | 55bc07657034e04a22851f9c8d46e24a1ebff993
parent | f6be4bb195e3e128ab027326255172ff36b6c63c (diff)
Refactoring documentation
-rw-r--r-- | README.md    |  47
-rw-r--r-- | doc/algos.md | 357
-rw-r--r-- | doc/index.md | 409
-rw-r--r-- | doc/intro.md |  40

4 files changed, 402 insertions, 451 deletions
@@ -1,45 +1,8 @@
+<a name='optim.dok'></a>
 # Optimization package
 
-This package contains several optimization routines for [Torch](https://github.com/torch/torch7/blob/master/README.md).
-Each optimization algorithm is based on the same interface:
+This package contains several optimization routines and a logger for [Torch](https://github.com/torch/torch7/blob/master/README.md):
 
-```lua
-x*, {f}, ... = optim.method(func, x, state)
-```
-
-where:
-
-* `func`: a user-defined closure that respects this API: `f, df/dx = func(x)`
-* `x`: the current parameter vector (a 1D `torch.Tensor`)
-* `state`: a table of parameters, and state variables, dependent upon the algorithm
-* `x*`: the new parameter vector that minimizes `f, x* = argmin_x f(x)`
-* `{f}`: a table of all f values, in the order they've been evaluated
-  (for some simple algorithms, like SGD, `#f == 1`)
-
-## Available algorithms
-
-Please check [this file](doc/index.md) for the full list of
-optimization algorithms available and examples. Get also into the
-[`test`](test/) directory for straightforward examples using the
-[Rosenbrock's](test/rosenbrock.lua) function.
-
-## Important Note
-
-The state table is used to hold the state of the algorithm.
-It's usually initialized once, by the user, and then passed to the optim function
-as a black box. Example:
-
-```lua
-state = {
-   learningRate = 1e-3,
-   momentum = 0.5
-}
-
-for i,sample in ipairs(training_samples) do
-   local func = function(x)
-      -- define eval function
-      return f,df_dx
-   end
-   optim.sgd(func,x,state)
-end
-```
+ * [Overview](doc/intro.md);
+ * [Optimization algorithms](doc/algos.md);
+ * Logger.
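The closure-based interface that this hunk removes from the README (now described in doc/intro.md) can be illustrated with a small sketch. Everything below is hypothetical, not code from this commit: `sgd_like` is a stand-in for `optim.sgd`, and a plain Lua number stands in for the 1D `torch.Tensor`, so the sketch runs without Torch installed.

```lua
-- The closure contract: opfunc(x) returns f(x) and df/dx evaluated at x.
local opfunc = function(x)
   return (x - 3)^2, 2 * (x - 3)
end

-- A hypothetical SGD-like step with the same shape as
-- x*, {f} = optim.method(opfunc, x, state)  (stand-in, not optim.sgd)
local function sgd_like(func, x, state)
   local f, dfdx = func(x)                    -- evaluate objective and gradient
   return x - state.learningRate * dfdx, {f}  -- new parameters, table of f values
end

local x, fs = 0, nil
local state = {learningRate = 0.1}
for i = 1, 100 do
   x, fs = sgd_like(opfunc, x, state)
end
-- x has converged near the minimizer 3; fs[1] holds the objective value
-- evaluated before the last update (#fs == 1, as for SGD)
```

The real routines follow exactly this shape: the closure is evaluated, and the updated parameter vector plus the table of objective values are returned.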
diff --git a/doc/algos.md b/doc/algos.md
new file mode 100644
index 0000000..b69cca7
--- /dev/null
+++ b/doc/algos.md
@@ -0,0 +1,357 @@
+<a name='optim.algos'></a>
+# Optimization algorithms
+
+The following algorithms are provided:
+
+ * [*Stochastic Gradient Descent*](#optim.sgd)
+ * [*Averaged Stochastic Gradient Descent*](#optim.asgd)
+ * [*L-BFGS*](#optim.lbfgs)
+ * [*Conjugate Gradients*](#optim.cg)
+ * [*AdaDelta*](#optim.adadelta)
+ * [*AdaGrad*](#optim.adagrad)
+ * [*Adam*](#optim.adam)
+ * [*AdaMax*](#optim.adamax)
+ * [*FISTA with backtracking line search*](#optim.FistaLS)
+ * [*Nesterov's Accelerated Gradient method*](#optim.nag)
+ * [*RMSprop*](#optim.rmsprop)
+ * [*Rprop*](#optim.rprop)
+ * [*CMAES*](#optim.cmaes)
+
+All these algorithms are designed to support batch optimization as well as stochastic optimization.
+It's up to the user to construct an objective function that represents the batch, mini-batch, or single sample on which to evaluate the objective.
+
+Some of these algorithms support a line search, which can be passed as a function (*L-BFGS*), whereas others only support a learning rate (*SGD*).
+
+General interface:
+
+```lua
+x*, {f}, ... = optim.method(opfunc, x, state)
+```
+
+
+<a name='optim.sgd'></a>
+## sgd(opfunc, x, state)
+
+An implementation of *Stochastic Gradient Descent* (*SGD*).
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of an evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.learningRate`: learning rate
+ * `config.learningRateDecay`: learning rate decay
+ * `config.weightDecay`: weight decay
+ * `config.weightDecays`: vector of individual weight decays
+ * `config.momentum`: momentum
+ * `config.dampening`: dampening for momentum
+ * `config.nesterov`: enables Nesterov momentum
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
+ * `state.learningRates`: vector of individual learning rates
+
+Returns:
+
+ * `x*`: the new x vector
+ * `f(x)`: the function, evaluated before the update
+
+
+<a name='optim.asgd'></a>
+## asgd(opfunc, x, state)
+
+An implementation of *Averaged Stochastic Gradient Descent* (*ASGD*):
+
+```lua
+x = (1 - lambda eta_t) x - eta_t df / dx(z, x)
+a = a + mu_t [ x - a ]
+
+eta_t = eta0 / (1 + lambda eta0 t) ^ 0.75
+mu_t = 1 / max(1, t - t0)
+```
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
+ * `state.eta0`: learning rate
+ * `state.lambda`: decay term
+ * `state.alpha`: power for eta update
+ * `state.t0`: point at which to start averaging
+
+Returns:
+
+ * `x*`: the new x vector
+ * `f(x)`: the function, evaluated before the update
+ * `ax`: the averaged x vector
+
+
+<a name='optim.lbfgs'></a>
+## lbfgs(opfunc, x, state)
+
+An implementation of *L-BFGS* that relies on a user-provided line search function (`state.lineSearch`).
+If this function is not provided, then a simple learning rate is used to produce fixed size steps.
+Fixed size steps are much less costly than line searches, and can be useful for stochastic problems.
+
+The learning rate is used even when a line search is provided.
+This is also useful for large-scale stochastic problems, where `opfunc` is a noisy approximation of `f(x)`.
+In that case, the learning rate allows a reduction of confidence in the step size.
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
+ * `state.maxIter`: maximum number of iterations allowed
+ * `state.maxEval`: maximum number of function evaluations
+ * `state.tolFun`: termination tolerance on the first-order optimality
+ * `state.tolX`: termination tolerance on progress in terms of function/parameter changes
+ * `state.lineSearch`: a line search function
+ * `state.learningRate`: if no line search is provided, then a fixed step size is used
+
+Returns:
+ * `x*`: the new `x` vector, at the optimal point
+ * `f`: a table of all function values:
+   * `f[1]` is the value of the function before any optimization and
+   * `f[#f]` is the final fully optimized value, at `x*`
+
+
+<a name='optim.cg'></a>
+## cg(opfunc, x, state)
+
+An implementation of the *Conjugate Gradient* method which is a rewrite of `minimize.m` written by Carl E. Rasmussen.
+It is supposed to produce exactly the same results (give or take numerical accuracy due to some changed order of operations).
+You can compare the result on rosenbrock with [`minimize.m`](http://www.gatsby.ucl.ac.uk/~edward/code/minimize/example.html).
+
+```lua
+x, fx, c = minimize([0, 0]', 'rosenbrock', -25)
+```
+
+Note that we limit the number of function evaluations only, as this seems much more important in practical use.
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input, the point of evaluation.
+ * `x`: the initial point
+ * `state`: a table of parameters and temporary allocations.
+ * `state.maxEval`: max number of function evaluations
+ * `state.maxIter`: max number of iterations
+ * `state.df[0, 1, 2, 3]`: if you pass `torch.Tensor`s they will be used for temp storage
+ * `state.[s, x0]`: if you pass `torch.Tensor`s they will be used for temp storage
+
+Returns:
+
+ * `x*`: the new `x` vector, at the optimal point
+ * `f`: a table of all function values where
+   * `f[1]` is the value of the function before any optimization and
+   * `f[#f]` is the final fully optimized value, at `x*`
+
+
+<a name='optim.adadelta'></a>
+## adadelta(opfunc, x, config, state)
+
+*AdaDelta* implementation for *SGD* http://arxiv.org/abs/1212.5701
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `config`: a table of hyper-parameters
+ * `config.rho`: interpolation parameter
+ * `config.eps`: for numerical stability
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
+ * `state.paramVariance`: vector of temporal variances of parameters
+ * `state.accDelta`: vector of accumulated deltas of gradients
+
+Returns:
+
+ * `x*`: the new x vector
+ * `f(x)`: the function, evaluated before the update
+
+
+<a name='optim.adagrad'></a>
+## adagrad(opfunc, x, config, state)
+
+*AdaGrad* implementation for *SGD*
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
+ * `state.learningRate`: learning rate
+ * `state.paramVariance`: vector of temporal variances of parameters
+
+Returns:
+
+ * `x*`: the new `x` vector
+ * `f(x)`: the function, evaluated before the update
+
+
+<a name='optim.adam'></a>
+## adam(opfunc, x, config, state)
+
+An implementation of *Adam* from http://arxiv.org/pdf/1412.6980.pdf.
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of an evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.learningRate`: learning rate
+ * `config.beta1`: first moment coefficient
+ * `config.beta2`: second moment coefficient
+ * `config.epsilon`: for numerical stability
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
+
+Returns:
+
+ * `x*`: the new x vector
+ * `f(x)`: the function, evaluated before the update
+
+
+<a name='optim.adamax'></a>
+## adamax(opfunc, x, config, state)
+
+An implementation of *AdaMax* http://arxiv.org/pdf/1412.6980.pdf
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of an evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.learningRate`: learning rate
+ * `config.beta1`: first moment coefficient
+ * `config.beta2`: second moment coefficient
+ * `config.epsilon`: for numerical stability
+ * `state`: a table describing the state of the optimizer; after each call the state is modified.
+
+Returns:
+
+ * `x*`: the new `x` vector
+ * `f(x)`: the function, evaluated before the update
+
+
+<a name='optim.FistaLS'></a>
+## FistaLS(f, g, pl, xinit, params)
+
+*FISTA* with backtracking *line search*:
+
+ * `f`: smooth function
+ * `g`: non-smooth function
+ * `pl`: minimizer of intermediate problem Q(x, y)
+ * `xinit`: initial point
+ * `params`: table of parameters (**optional**)
+ * `params.L`: 1/(step size) for ISTA/FISTA iteration (0.1)
+ * `params.Lstep`: step size multiplier at each iteration (1.5)
+ * `params.maxiter`: max number of iterations (50)
+ * `params.maxline`: max number of line search iterations per iteration (20)
+ * `params.errthres`: error threshold for convergence check (1e-4)
+ * `params.doFistaUpdate`: true: use FISTA, false: use ISTA (true)
+ * `params.verbose`: store each iteration solution and print detailed info (false)
+
+On output, `params` will contain these additional fields that can be reused:
+ * `params.L`: the last used `L` value will be written.
+
+These are temporary storages needed by the algorithm; if the same `params` object is passed a second time, these same storages will be used without new allocation:
+ * `params.xkm`: previous iteration point
+ * `params.y`: FISTA iteration
+ * `params.ply`: `ply = pl(y * 1/L grad(f))`
+
+Returns the solution `x` and a history of `{function evals, number of line searches, ...}`.
+
+The algorithm is published in http://epubs.siam.org/doi/abs/10.1137/080716542.
+
+
+<a name='optim.nag'></a>
+## nag(opfunc, x, config, state)
+
+An implementation of *SGD* adapted with features of *Nesterov's Accelerated Gradient method*, based on the paper "On the Importance of Initialization and Momentum in Deep Learning" (Sutskever et al., ICML 2013) http://www.cs.toronto.edu/~fritz/absps/momentum.pdf.
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
+ * `state.learningRate`: learning rate
+ * `state.learningRateDecay`: learning rate decay
+ * `state.weightDecay`: weight decay
+ * `state.momentum`: momentum
+ * `state.learningRates`: vector of individual learning rates
+
+Returns:
+
+ * `x*`: the new `x` vector
+ * `f(x)`: the function, evaluated before the update
+
+
+<a name='optim.rmsprop'></a>
+## rmsprop(opfunc, x, config, state)
+
+An implementation of *RMSprop*.
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of an evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `config`: a table with configuration parameters for the optimizer
+ * `config.learningRate`: learning rate
+ * `config.alpha`: smoothing constant
+ * `config.epsilon`: value with which to initialise `m`
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
+ * `state.m`: leaky sum of squares of parameter gradients,
+ * `state.tmp`: and the square root (with epsilon smoothing)
+
+Returns:
+
+ * `x*`: the new x vector
+ * `f(x)`: the function, evaluated before the update
+
+
+<a name='optim.rprop'></a>
+## rprop(opfunc, x, config, state)
+
+A plain implementation of *Rprop* (Martin Riedmiller, Koray Kavukcuoglu 2013).
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x`: the initial point
+ * `state`: a table describing the state of the optimizer; after each call the state is modified
+ * `state.stepsize`: initial step size, common to all components
+ * `state.etaplus`: multiplicative increase factor, > 1 (default 1.2)
+ * `state.etaminus`: multiplicative decrease factor, < 1 (default 0.5)
+ * `state.stepsizemax`: maximum stepsize allowed (default 50)
+ * `state.stepsizemin`: minimum stepsize allowed (default 1e-6)
+ * `state.niter`: number of iterations (default 1)
+
+Returns:
+
+ * `x*`: the new x vector
+ * `f(x)`: the function, evaluated before the update
+
+
+<a name='optim.cmaes'></a>
+## cmaes(opfunc, x, config, state)
+
+An implementation of *CMAES* (*Covariance Matrix Adaptation Evolution Strategy*), ported from https://www.lri.fr/~hansen/barecmaes2.html.
+
+*CMAES* is a stochastic, derivative-free method for heuristic global optimization of non-linear or non-convex continuous optimization problems.
+Note that this method will on average take many more function evaluations to converge than a gradient-based method.
+
+Arguments:
+
+ * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`. Note that `df/dX` is not used and can be left 0
+ * `x`: the initial point
+ * `state.sigma`: float, initial step-size (standard deviation in each coordinate)
+ * `state.maxEval`: int, maximal number of function evaluations
+ * `state.ftarget`: float, target function value; stop if `fitness < ftarget`
+ * `state.popsize`: population size. If this is left empty, `4 + int(3 * log(|x|))` will be used
+ * `state.verb_disp`: display info on console every `verb_disp` iterations, 0 for never
+
+Returns:
+ * `x*`: the new `x` vector, at the optimal point
+ * `f`: a table of all function values:
+   * `f[1]` is the value of the function before any optimization and
+   * `f[#f]` is the final fully optimized value, at `x*`
diff --git a/doc/index.md b/doc/index.md
deleted file mode 100644
index a399206..0000000
--- a/doc/index.md
+++ /dev/null
@@ -1,409 +0,0 @@
-<a name='optim.dok'></a>
-# Optim Package
-
-This package provides a set of optimization algorithms, which all follow
-a unified, closure-based API.
-
-This package is fully compatible with the [nn](http://nn.readthedocs.org) package, but can also be
-used to optimize arbitrary objective functions.
-
-For now, the following algorithms are provided:
-
- * [Stochastic Gradient Descent](#optim.sgd)
- * [Averaged Stochastic Gradient Descent](#optim.asgd)
- * [L-BFGS](#optim.lbfgs)
- * [Congugate Gradients](#optim.cg)
- * [AdaDelta](#optim.adadelta)
- * [AdaGrad](#optim.adagrad)
- * [Adam](#optim.adam)
- * [AdaMax](#optim.adamax)
- * [FISTA with backtracking line search](#optim.FistaLS)
- * [Nesterov's Accelerated Gradient method](#optim.nag)
- * [RMSprop](#optim.rmsprop)
- * [Rprop](#optim.rprop)
- * [CMAES](#optim.cmaes)
-
-All these algorithms are designed to support batch optimization as
-well as stochastic optimization. It's up to the user to construct an
-objective function that represents the batch, mini-batch, or single sample
-on which to evaluate the objective.
-
-Some of these algorithms support a line search, which can be passed as
-a function (L-BFGS), whereas others only support a learning rate (SGD).
-
-<a name='optim.overview'></a>
-## Overview
-
-This package contains several optimization routines for [Torch](https://github.com/torch/torch7/blob/master/README.md).
-Most optimization algorithms has the following interface:
-
-```lua
-x*, {f}, ... = optim.method(opfunc, x, state)
-```
-
-where:
-
-* `opfunc`: a user-defined closure that respects this API: `f, df/dx = func(x)`
-* `x`: the current parameter vector (a 1D `torch.Tensor`)
-* `state`: a table of parameters, and state variables, dependent upon the algorithm
-* `x*`: the new parameter vector that minimizes `f, x* = argmin_x f(x)`
-* `{f}`: a table of all f values, in the order they've been evaluated (for some simple algorithms, like SGD, `#f == 1`)
-
-<a name='optim.example'></a>
-## Example
-
-The state table is used to hold the state of the algorihtm.
-
-It's usually initialized once, by the user, and then passed to the optim function
-as a black box. Example:
-
-```lua
-state = {
-   learningRate = 1e-3,
-   momentum = 0.5
-}
-
-for i,sample in ipairs(training_samples) do
-   local func = function(x)
-      -- define eval function
-      return f,df_dx
-   end
-   optim.sgd(func,x,state)
-end
-```
-
-<a name='optim.algorithms'></a>
-## Algorithms
-
-Most algorithms provided rely on a unified interface:
-```lua
-x_new,fs = optim.method(opfunc, x, state)
-```
-where:
-x is the trainable/adjustable parameter vector,
-state contains both options for the algorithm and the state of the algorihtm,
-opfunc is a closure that has the following interface:
-```lua
-f,df_dx = opfunc(x)
-```
-x_new is the new parameter vector (after optimization),
-fs is a a table containing all the values of the objective, as evaluated during
-the optimization procedure: fs[1] is the value before optimization, and fs[#fs]
-is the most optimized one (the lowest).
-
-<a name='optim.sgd'></a>
-### [x] sgd(opfunc, x, state)
-
-An implementation of Stochastic Gradient Descent (SGD).
-
-Arguments:
-
- * `opfunc` : a function that takes a single input (`X`), the point of a evaluation, and returns `f(X)` and `df/dX`
- * `x` : the initial point
- * `config` : a table with configuration parameters for the optimizer
- * `config.learningRate` : learning rate
- * `config.learningRateDecay` : learning rate decay
- * `config.weightDecay` : weight decay
- * `config.weightDecays` : vector of individual weight decays
- * `config.momentum` : momentum
- * `config.dampening` : dampening for momentum
- * `config.nesterov` : enables Nesterov momentum
- * `state` : a table describing the state of the optimizer; after each call the state is modified
- * `state.learningRates` : vector of individual learning rates
-
-Returns :
-
- * `x` : the new x vector
- * `f(x)` : the function, evaluated before the update
-
-<a name='optim.asgd'></a>
-### [x] asgd(opfunc, x, state)
-
-An implementation of Averaged Stochastic Gradient Descent (ASGD):
-
-```
-x = (1 - lambda eta_t) x - eta_t df/dx(z,x)
-a = a + mu_t [ x - a ]
-
-eta_t = eta0 / (1 + lambda eta0 t) ^ 0.75
-mu_t = 1/max(1,t-t0)
-```
-
-Arguments:
-
- * `opfunc` : a function that takes a single input (`X`), the point of evaluation, and returns `f(X)` and `df/dX`
- * `x` : the initial point
- * `state` : a table describing the state of the optimizer; after each call the state is modified
- * `state.eta0` : learning rate
- * `state.lambda` : decay term
- * `state.alpha` : power for eta update
- * `state.t0` : point at which to start averaging
-
-Returns:
-
- * `x` : the new x vector
- * `f(x)` : the function, evaluated before the update
- * `ax` : the averaged x vector
-
-
-<a name='optim.lbfgs'></a>
-### [x] lbfgs(opfunc, x, state)
-
-An implementation of L-BFGS that relies on a user-provided line
-search function (`state.lineSearch`). If this function is not
-provided, then a simple learningRate is used to produce fixed
-size steps. Fixed size steps are much less costly than line
-searches, and can be useful for stochastic problems.
-
-The learning rate is used even when a line search is provided.
-This is also useful for large-scale stochastic problems, where
-opfunc is a noisy approximation of `f(x)`. In that case, the learning
-rate allows a reduction of confidence in the step size.
-
-Arguments :
-
- * `opfunc` : a function that takes a single input (`X`), the point of evaluation, and returns `f(X)` and `df/dX`
- * `x` : the initial point
- * `state` : a table describing the state of the optimizer; after each call the state is modified
- * `state.maxIter` : Maximum number of iterations allowed
- * `state.maxEval` : Maximum number of function evaluations
- * `state.tolFun` : Termination tolerance on the first-order optimality
- * `state.tolX` : Termination tol on progress in terms of func/param changes
- * `state.lineSearch` : A line search function
- * `state.learningRate` : If no line search provided, then a fixed step size is used
-
-Returns :
- * `x*` : the new `x` vector, at the optimal point
- * `f` : a table of all function values:
-  * `f[1]` is the value of the function before any optimization and
-  * `f[#f]` is the final fully optimized value, at `x*`
-
-
-<a name='optim.cg'></a>
-### [x] cg(opfunc, x, state)
-
-An implementation of the Conjugate Gradient method which is a rewrite of
-`minimize.m` written by Carl E. Rasmussen.
-It is supposed to produce exactly same results (give
-or take numerical accuracy due to some changed order of
-operations). You can compare the result on rosenbrock with
-[minimize.m](http://www.gatsby.ucl.ac.uk/~edward/code/minimize/example.html).
-```
-[x fx c] = minimize([0 0]', 'rosenbrock', -25)
-```
-
-Note that we limit the number of function evaluations only, it seems much
-more important in practical use.
-
-Arguments :
-
- * `opfunc` : a function that takes a single input, the point of evaluation.
- * `x` : the initial point
- * `state` : a table of parameters and temporary allocations.
- * `state.maxEval` : max number of function evaluations
- * `state.maxIter` : max number of iterations
- * `state.df[0,1,2,3]` : if you pass torch.Tensor they will be used for temp storage
- * `state.[s,x0]` : if you pass torch.Tensor they will be used for temp storage
-
-Returns :
-
- * `x*` : the new x vector, at the optimal point
- * `f` : a table of all function values where
-  * `f[1]` is the value of the function before any optimization and
-  * `f[#f]` is the final fully optimized value, at x*
-
-<a name='optim.adadelta'></a>
-### [x] adadelta(opfunc, x, config, state)
-ADADELTA implementation for SGD http://arxiv.org/abs/1212.5701
-
-Arguments :
-
-* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
-* `x` : the initial point
-* `config` : a table of hyper-parameters
-* `config.rho` : interpolation parameter
-* `config.eps` : for numerical stability
-* `state` : a table describing the state of the optimizer; after each call the state is modified
-* `state.paramVariance` : vector of temporal variances of parameters
-* `state.accDelta` : vector of accummulated delta of gradients
-
-Returns :
-
-* `x` : the new x vector
-* `f(x)` : the function, evaluated before the update
-
-<a name='optim.adagrad'></a>
-### [x] adagrad(opfunc, x, config, state)
-AdaGrad implementation for SGD
-
-Arguments :
-
-* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
-* `x` : the initial point
-* `state` : a table describing the state of the optimizer; after each call the state is modified
-* `state.learningRate` : learning rate
-* `state.paramVariance` : vector of temporal variances of parameters
-
-Returns :
-
-* `x` : the new x vector
-* `f(x)` : the function, evaluated before the update
-
-<a name='optim.adam'></a>
-### [x] adam(opfunc, x, config, state)
-An implementation of Adam from http://arxiv.org/pdf/1412.6980.pdf
-
-Arguments :
-
-* `opfunc` : a function that takes a single input (X), the point of a evaluation, and returns f(X) and df/dX
-* `x` : the initial point
-* `config` : a table with configuration parameters for the optimizer
-* `config.learningRate` : learning rate
-* `config.beta1` : first moment coefficient
-* `config.beta2` : second moment coefficient
-* `config.epsilon` : for numerical stability
-* `state` : a table describing the state of the optimizer; after each call the state is modified
-
-Returns :
-
-* `x` : the new x vector
-* `f(x)` : the function, evaluated before the update
-
-<a name='optim.adamax'></a>
-### [x] adamax(opfunc, x, config, state)
-An implementation of AdaMax http://arxiv.org/pdf/1412.6980.pdf
-
-Arguments :
-
-* `opfunc` : a function that takes a single input (X), the point of a evaluation, and returns f(X) and df/dX
-* `x` : the initial point
-* `config` : a table with configuration parameters for the optimizer
-* `config.learningRate` : learning rate
-* `config.beta1` : first moment coefficient
-* `config.beta2` : second moment coefficient
-* `config.epsilon` : for numerical stability
-* `state` : a table describing the state of the optimizer; after each call the state is modified.
-
-Returns :
-
-* `x` : the new x vector
-* `f(x)` : the function, evaluated before the update
-
-<a name='optim.FistaLS'></a>
-### [x] FistaLS(f, g, pl, xinit, params)
-FISTA with backtracking line search
-* `f` : smooth function
-* `g` : non-smooth function
-* `pl` : minimizer of intermediate problem Q(x,y)
-* `xinit` : initial point
-* `params` : table of parameters (**optional**)
-* `params.L` : 1/(step size) for ISTA/FISTA iteration (0.1)
-* `params.Lstep` : step size multiplier at each iteration (1.5)
-* `params.maxiter` : max number of iterations (50)
-* `params.maxline` : max number of line search iterations per iteration (20)
-* `params.errthres`: Error thershold for convergence check (1e-4)
-* `params.doFistaUpdate` : true : use FISTA, false: use ISTA (true)
-* `params.verbose` : store each iteration solution and print detailed info (false)
-
-On output, `params` will contain these additional fields that can be reused.
-* `params.L` : last used L value will be written.
-
-These are temporary storages needed by the algo and if the same params object is
-passed a second time, these same storages will be used without new allocation.
-* `params.xkm` : previous iterarion point
-* `params.y` : fista iteration
-* `params.ply` : ply = pl(y * 1/L grad(f))
-
-Returns the solution x and history of {function evals, number of line search ,...}
-
-Algorithm is published in http://epubs.siam.org/doi/abs/10.1137/080716542
-
-<a name='optim.nag'></a>
-### [x] nag(opfunc, x, config, state)
-An implementation of SGD adapted with features of Nesterov's
-Accelerated Gradient method, based on the paper "On the Importance of Initialization and Momentum in Deep Learning" (Sutsveker et. al., ICML 2013).
-
-Arguments :
-
-* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
-* `x` : the initial point
-* `state` : a table describing the state of the optimizer; after each call the state is modified
-* `state.learningRate` : learning rate
-* `state.learningRateDecay` : learning rate decay
-* `astate.weightDecay` : weight decay
-* `state.momentum` : momentum
-* `state.learningRates` : vector of individual learning rates
-
-Returns :
-
-* `x` : the new x vector
-* `f(x)` : the function, evaluated before the update
-
-<a name='optim.rmsprop'></a>
-### [x] rmsprop(opfunc, x, config, state)
-An implementation of RMSprop
-
-Arguments :
-
-* `opfunc` : a function that takes a single input (X), the point of a evaluation, and returns f(X) and df/dX
-* `x` : the initial point
-* `config` : a table with configuration parameters for the optimizer
-* `config.learningRate` : learning rate
-* `config.alpha` : smoothing constant
-* `config.epsilon` : value with which to initialise m
-* `state` : a table describing the state of the optimizer; after each call the state is modified
-* `state.m` : leaky sum of squares of parameter gradients,
-* `state.tmp` : and the square root (with epsilon smoothing)
-
-Returns :
-
-* `x` : the new x vector
-* `f(x)` : the function, evaluated before the update
-
-<a name='optim.rprop'></a>
-### [x] rprop(opfunc, x, config, state)
-A plain implementation of Rprop
-(Martin Riedmiller, Koray Kavukcuoglu 2013)
-
-Arguments :
-
-* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
-* `x` : the initial point
-* `state` : a table describing the state of the optimizer; after each call the state is modified
-* `state.stepsize` : initial step size, common to all components
-* `state.etaplus` : multiplicative increase factor, > 1 (default 1.2)
-* `state.etaminus` : multiplicative decrease factor, < 1 (default 0.5)
-* `state.stepsizemax` : maximum stepsize allowed (default 50)
-* `state.stepsizemin` : minimum stepsize allowed (default 1e-6)
-* `state.niter` : number of iterations (default 1)
-
-Returns :
-
-* `x` : the new x vector
-* `f(x)` : the function, evaluated before the update
-
-
-
-
-<a name='optim.cmaes'></a>
-### [x] cmaes(opfunc, x, config, state)
-An implementation of `CMAES` (Covariance Matrix Adaptation Evolution Strategy),
-ported from https://www.lri.fr/~hansen/barecmaes2.html.
-
-CMAES is a stochastic, derivative-free method for heuristic global optimization of non-linear or non-convex continuous optimization problems. Note that this method will on average take much more function evaluations to converge then a gradient based method.
-
-Arguments:
-
-* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX. Note that df/dX is not used and can be left 0
-* `x` : the initial point
-* `state.sigma` : float, initial step-size (standard deviation in each coordinate)
-* `state.maxEval` : int, maximal number of function evaluations
-* `state.ftarget` : float, target function value
-* `state.popsize` : population size. If this is left empty, 4 + int(3 * log(|x|)) will be used
-* `state.ftarget` : stop if fitness < ftarget
-* `state.verb_disp` : display info on console every verb_disp iteration, 0 for never
-
-Returns:
-* `x*` : the new `x` vector, at the optimal point
-* `f` : a table of all function values:
-  * `f[1]` is the value of the function before any optimization and
-  * `f[#f]` is the final fully optimized value, at `x*`
diff --git a/doc/intro.md b/doc/intro.md
new file mode 100644
index 0000000..54d167f
--- /dev/null
+++ b/doc/intro.md
@@ -0,0 +1,40 @@
+<a name='optim.overview'></a>
+## Overview
+
+Most optimization algorithms have the following interface:
+
+```lua
+x*, {f}, ... = optim.method(opfunc, x, state)
+```
+
+where:
+
+* `opfunc`: a user-defined closure that respects this API: `f, df/dx = opfunc(x)`
+* `x`: the current parameter vector (a 1D `Tensor`)
+* `state`: a table of parameters, and state variables, dependent upon the algorithm
+* `x*`: the new parameter vector that minimizes `f, x* = argmin_x f(x)`
+* `{f}`: a table of all `f` values, in the order they've been evaluated (for some simple algorithms, like SGD, `#f == 1`)
+
+
+<a name='optim.example'></a>
+## Example
+
+The state table is used to hold the state of the algorithm.
+It's usually initialized once, by the user, and then passed to the optim function as a black box.
+Example:
+
+```lua
+state = {
+   learningRate = 1e-3,
+   momentum = 0.5
+}
+
+for i, sample in ipairs(training_samples) do
+   local func = function(x)
+      -- define eval function
+      return f, df_dx
+   end
+   optim.sgd(func, x, state)
+end
+```
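The loop in the example above leaves the eval function as a placeholder. The sketch below is a self-contained toy, not the actual `optim.sgd`: plain Lua numbers replace Tensors and a hypothetical `sgd_like` step replaces the library call, to show why the same `state` table must be passed on every iteration — the momentum velocity it accumulates lives there.

```lua
-- Toy illustration (assumption: scalar parameters, simplified update):
-- momentum velocity is kept in `state`, so re-creating the table on each
-- iteration would silently lose it.
local function sgd_like(opfunc, x, state)
   local f, dfdx = opfunc(x)
   -- velocity update: v = momentum * v + gradient, stored in state
   state.v = (state.momentum or 0) * (state.v or 0) + dfdx
   return x - state.learningRate * state.v, {f}
end

local state = {
   learningRate = 0.05,
   momentum = 0.5
}

-- minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
local x = 0
for i = 1, 200 do
   x = sgd_like(function(p) return (p - 3)^2, 2 * (p - 3) end, x, state)
end
-- x is now close to the minimizer 3, and state.v still holds the last
-- velocity, ready for the next call
```

This is exactly the black-box discipline the Example section asks for: initialize `state` once, then let the optimizer read and mutate it on every call.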