author | MaxReimann <max.reimann@student.hpi.uni-potsdam.de> | 2015-12-20 22:47:48 +0300
committer | MaxReimann <max.reimann@student.hpi.uni-potsdam.de> | 2015-12-20 22:47:48 +0300
commit | 58db496d7380f8bc73a8d224e427920e40f5c168 (patch)
tree | acad9b8a91cfcacf3db0f991385c9e002c5048d1
parent | 810f29258361e2a928eaa5059062a6540dbf9361 (diff)
parent | e963a6942cc7b65e098fec68543df45c25cad544 (diff)
Merge branch 'master' of https://github.com/torch/optim
-rw-r--r-- | doc/index.md | 202
-rw-r--r-- | sgd.lua | 3
2 files changed, 191 insertions, 14 deletions
diff --git a/doc/index.md b/doc/index.md
index 1f5f253..f5f1b00 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -13,6 +13,14 @@ For now, the following algorithms are provided:
  * [Averaged Stochastic Gradient Descent](#optim.asgd)
  * [L-BFGS](#optim.lbfgs)
  * [Conjugate Gradients](#optim.cg)
+ * [AdaDelta](#optim.adadelta)
+ * [AdaGrad](#optim.adagrad)
+ * [Adam](#optim.adam)
+ * [AdaMax](#optim.adamax)
+ * [FISTA with backtracking line search](#optim.FistaLS)
+ * [Nesterov's Accelerated Gradient method](#optim.nag)
+ * [RMSprop](#optim.rmsprop)
+ * [Rprop](#optim.rprop)
 
 All these algorithms are designed to support batch optimization as
 well as stochastic optimization. It's up to the user to construct an
@@ -26,15 +34,15 @@ a function (L-BFGS), whereas others only support a learning rate (SGD).
 ## Overview
 
 This package contains several optimization routines for [Torch](https://github.com/torch/torch7/blob/master/README.md).
-Each optimization algorithm is based on the same interface:
+Most optimization algorithms have the following interface:
 
 ```lua
-x*, {f}, ... = optim.method(func, x, state)
+x*, {f}, ... = optim.method(opfunc, x, state)
 ```
 
 where:
 
-* `func`: a user-defined closure that respects this API: `f, df/dx = func(x)`
+* `opfunc`: a user-defined closure that respects this API: `f, df/dx = opfunc(x)`
 * `x`: the current parameter vector (a 1D `torch.Tensor`)
 * `state`: a table of parameters, and state variables, dependent upon the algorithm
 * `x*`: the new parameter vector that minimizes `f, x* = argmin_x f(x)`
@@ -65,24 +73,24 @@ end
 <a name='optim.algorithms'></a>
 ## Algorithms
 
-All the algorithms provided rely on a unified interface:
+Most algorithms provided rely on a unified interface:
 ```lua
-w_new,fs = optim.method(func,w,state)
+x_new,fs = optim.method(opfunc, x, state)
 ```
 where:
-w is the trainable/adjustable parameter vector,
+x is the trainable/adjustable parameter vector,
 state contains both options for the algorithm and the state of the algorithm,
-func is a closure that has the following interface:
+opfunc is a closure that has the following interface:
 ```lua
-f,df_dw = func(w)
+f,df_dx = opfunc(x)
 ```
-w_new is the new parameter vector (after optimization),
+x_new is the new parameter vector (after optimization),
 fs is a table containing all the values of the objective, as evaluated during
 the optimization procedure: fs[1] is the value before optimization, and fs[#fs]
 is the most optimized one (the lowest).
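To make this unified interface concrete, here is a minimal training-loop sketch built around `optim.sgd`; the `model`, `criterion`, `inputs` and `targets` variables are assumed to be defined elsewhere and are not part of this package.

```lua
-- Minimal sketch of the unified interface (assumes a `model`, a `criterion`,
-- and `inputs`/`targets` mini-batches defined elsewhere).
local x, dl_dx = model:getParameters()   -- flatten parameters into a 1D tensor
local sgdState = {learningRate = 1e-2}   -- options and state share one table

local function opfunc(x_new)
   if x ~= x_new then x:copy(x_new) end
   dl_dx:zero()
   local output = model:forward(inputs)
   local loss = criterion:forward(output, targets)
   model:backward(inputs, criterion:backward(output, targets))
   return loss, dl_dx                    -- f(x) and df/dx, as the API requires
end

local x_new, fs = optim.sgd(opfunc, x, sgdState)
print('objective before this update: ' .. fs[1])
```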
 
 <a name='optim.sgd'></a>
-### [x] sgd(func, w, state)
+### [x] sgd(opfunc, x, state)
 
 An implementation of Stochastic Gradient Descent (SGD).
 
@@ -107,7 +115,7 @@ Returns :
 * `f(x)` : the function, evaluated before the update
 
 <a name='optim.asgd'></a>
-### [x] asgd(func, w, state)
+### [x] asgd(opfunc, x, state)
 
 An implementation of Averaged Stochastic Gradient Descent (ASGD):
 
@@ -137,7 +145,7 @@ Returns:
 
 <a name='optim.lbfgs'></a>
-### [x] lbfgs(func, w, state)
+### [x] lbfgs(opfunc, x, state)
 
 An implementation of L-BFGS that relies on a user-provided line
 search function (`state.lineSearch`). If this function is not
@@ -170,7 +178,7 @@ Returns :
 
 <a name='optim.cg'></a>
-### [x] cg(func, w, state)
+### [x] cg(opfunc, x, state)
 
 An implementation of the Conjugate Gradient method which is a
 rewrite of `minimize.m` written by Carl E. Rasmussen.
@@ -202,4 +210,172 @@ Returns :
 * `f[1]` is the value of the function before any optimization and
 * `f[#f]` is the final fully optimized value, at x*
 
+<a name='optim.adadelta'></a>
+### [x] adadelta(opfunc, x, config, state)
+ADADELTA implementation for SGD http://arxiv.org/abs/1212.5701
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `config` : a table of hyper-parameters
+* `config.rho` : interpolation parameter
+* `config.eps` : for numerical stability
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.paramVariance` : vector of temporal variances of parameters
+* `state.accDelta` : vector of accumulated deltas of gradients
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
+
+<a name='optim.adagrad'></a>
+### [x] adagrad(opfunc, x, config, state)
+AdaGrad implementation for SGD
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.learningRate` : learning rate
+* `state.paramVariance` : vector of temporal variances of parameters
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
+
+<a name='optim.adam'></a>
+### [x] adam(opfunc, x, config, state)
+An implementation of Adam from http://arxiv.org/pdf/1412.6980.pdf
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `config` : a table with configuration parameters for the optimizer
+* `config.learningRate` : learning rate
+* `config.beta1` : first moment coefficient
+* `config.beta2` : second moment coefficient
+* `config.epsilon` : for numerical stability
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
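As a usage sketch for the adaptive methods documented above, switching the earlier loop from SGD to Adam only changes the config table; the hyper-parameter values below are illustrative assumptions, not mandated defaults.

```lua
-- Sketch: Adam updates, reusing `opfunc` and `x` from the SGD sketch above.
-- Hyper-parameter values are illustrative.
local adamConfig = {
   learningRate = 1e-3,
   beta1 = 0.9,      -- decay rate of the first moment estimate
   beta2 = 0.999,    -- decay rate of the second moment estimate
   epsilon = 1e-8,   -- numerical stability term
}
local adamState = {}  -- moment estimates are allocated here on the first call

for i = 1, 100 do
   local x_new, fs = optim.adam(opfunc, x, adamConfig, adamState)
end
```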
+
+<a name='optim.adamax'></a>
+### [x] adamax(opfunc, x, config, state)
+An implementation of AdaMax http://arxiv.org/pdf/1412.6980.pdf
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `config` : a table with configuration parameters for the optimizer
+* `config.learningRate` : learning rate
+* `config.beta1` : first moment coefficient
+* `config.beta2` : second moment coefficient
+* `config.epsilon` : for numerical stability
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
+
+<a name='optim.FistaLS'></a>
+### [x] FistaLS(f, g, pl, xinit, params)
+FISTA with backtracking line search
+* `f` : smooth function
+* `g` : non-smooth function
+* `pl` : minimizer of intermediate problem Q(x,y)
+* `xinit` : initial point
+* `params` : table of parameters (**optional**)
+* `params.L` : 1/(step size) for ISTA/FISTA iteration (0.1)
+* `params.Lstep` : step size multiplier at each iteration (1.5)
+* `params.maxiter` : max number of iterations (50)
+* `params.maxline` : max number of line search iterations per iteration (20)
+* `params.errthres` : error threshold for convergence check (1e-4)
+* `params.doFistaUpdate` : true : use FISTA, false: use ISTA (true)
+* `params.verbose` : store each iteration solution and print detailed info (false)
+
+On output, `params` will contain these additional fields that can be reused:
+* `params.L` : the last used L value will be written here.
+
+These are temporary storages needed by the algorithm; if the same `params` object is
+passed a second time, these same storages will be reused without new allocation:
+* `params.xkm` : previous iteration point
+* `params.y` : fista iteration
+* `params.ply` : ply = pl(y - 1/L grad(f))
+
+Returns the solution x and a history of {function evaluations, number of line searches, ...}
+
+The algorithm is published in http://epubs.siam.org/doi/abs/10.1137/080716542
+
+<a name='optim.nag'></a>
+### [x] nag(opfunc, x, config, state)
+An implementation of SGD adapted with features of Nesterov's
+Accelerated Gradient method, based on the paper "On the Importance of Initialization and Momentum in Deep Learning" (Sutskever et al., ICML 2013).
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.learningRate` : learning rate
+* `state.learningRateDecay` : learning rate decay
+* `state.weightDecay` : weight decay
+* `state.momentum` : momentum
+* `state.learningRates` : vector of individual learning rates
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
+
+<a name='optim.rmsprop'></a>
+### [x] rmsprop(opfunc, x, config, state)
+An implementation of RMSprop
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `config` : a table with configuration parameters for the optimizer
+* `config.learningRate` : learning rate
+* `config.alpha` : smoothing constant
+* `config.epsilon` : value with which to initialise m
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.m` : leaky sum of squares of parameter gradients
+* `state.tmp` : the square root of `state.m` (with epsilon smoothing)
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
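The per-parameter methods above all follow the same calling pattern; as one more sketch, reusing `opfunc` and `x` from the first example and with illustrative values, a single RMSprop step looks like this:

```lua
-- Sketch: a single RMSprop step, reusing `opfunc` and `x` from the SGD sketch.
local rmspropConfig = {
   learningRate = 1e-3,
   alpha = 0.9,      -- smoothing constant for the leaky sum of squared gradients
   epsilon = 1e-8,   -- value with which `state.m` is initialised
}
local rmspropState = {}  -- `state.m` and `state.tmp` are created here on first use

local x_new, fs = optim.rmsprop(opfunc, x, rmspropConfig, rmspropState)
```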
+
+<a name='optim.rprop'></a>
+### [x] rprop(opfunc, x, config, state)
+A plain implementation of Rprop
+(Martin Riedmiller, Koray Kavukcuoglu 2013)
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.stepsize` : initial step size, common to all components
+* `state.etaplus` : multiplicative increase factor, > 1 (default 1.2)
+* `state.etaminus` : multiplicative decrease factor, < 1 (default 0.5)
+* `state.stepsizemax` : maximum stepsize allowed (default 50)
+* `state.stepsizemin` : minimum stepsize allowed (default 1e-6)
+* `state.niter` : number of iterations (default 1)
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
diff --git a/sgd.lua b/sgd.lua
--- a/sgd.lua
+++ b/sgd.lua
@@ -13,9 +13,10 @@ ARGS:
 - `config.momentum` : momentum
 - `config.dampening` : dampening for momentum
 - `config.nesterov` : enables Nesterov momentum
+- `config.learningRates` : vector of individual learning rates
 - `state` : a table describing the state of the optimizer; after each
             call the state is modified
-- `state.learningRates` : vector of individual learning rates
+- `state.evalCounter` : evaluation counter (optional: 0, by default)
 
 RETURN:
 - `x` : the new x vector
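The `sgd.lua` hunk above moves `learningRates` into `config` and documents `state.evalCounter`; a short sketch of how such a per-parameter learning-rate vector might be passed, reusing `opfunc` and `x` from the first example (the all-ones vector is an illustrative placeholder):

```lua
-- Sketch: SGD with individual learning rates via the documented
-- `learningRates` field; `opfunc` and `x` as in the first sketch.
local sgdState = {
   learningRate  = 1e-2,
   learningRates = torch.ones(x:size(1)),  -- one multiplier per parameter (placeholder)
   momentum      = 0.9,
}

local x_new, fs = optim.sgd(opfunc, x, sgdState)
-- sgdState.evalCounter is incremented on every call
```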