author | MaxReimann <max.reimann@student.hpi.uni-potsdam.de> | 2015-12-20 22:47:48 +0300
committer | MaxReimann <max.reimann@student.hpi.uni-potsdam.de> | 2015-12-20 22:47:48 +0300
commit | 58db496d7380f8bc73a8d224e427920e40f5c168 (patch)
tree | acad9b8a91cfcacf3db0f991385c9e002c5048d1
parent | 810f29258361e2a928eaa5059062a6540dbf9361 (diff)
parent | e963a6942cc7b65e098fec68543df45c25cad544 (diff)
Merge branch 'master' of https://github.com/torch/optim
-rw-r--r-- | doc/index.md | 202
-rw-r--r-- | sgd.lua | 3
2 files changed, 191 insertions, 14 deletions
diff --git a/doc/index.md b/doc/index.md
index 1f5f253..f5f1b00 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -13,6 +13,14 @@ For now, the following algorithms are provided:
  * [Averaged Stochastic Gradient Descent](#optim.asgd)
  * [L-BFGS](#optim.lbfgs)
  * [Conjugate Gradients](#optim.cg)
+ * [AdaDelta](#optim.adadelta)
+ * [AdaGrad](#optim.adagrad)
+ * [Adam](#optim.adam)
+ * [AdaMax](#optim.adamax)
+ * [FISTA with backtracking line search](#optim.FistaLS)
+ * [Nesterov's Accelerated Gradient method](#optim.nag)
+ * [RMSprop](#optim.rmsprop)
+ * [Rprop](#optim.rprop)
 
 All these algorithms are designed to support batch optimization as
 well as stochastic optimization. It's up to the user to construct an
@@ -26,15 +34,15 @@ a function (L-BFGS), whereas others only support a learning rate (SGD).
 ## Overview
 
 This package contains several optimization routines for [Torch](https://github.com/torch/torch7/blob/master/README.md).
-Each optimization algorithm is based on the same interface:
+Most optimization algorithms have the following interface:
 
 ```lua
-x*, {f}, ... = optim.method(func, x, state)
+x*, {f}, ... = optim.method(opfunc, x, state)
 ```
 
 where:
 
-* `func`: a user-defined closure that respects this API: `f, df/dx = func(x)`
+* `opfunc`: a user-defined closure that respects this API: `f, df/dx = opfunc(x)`
 * `x`: the current parameter vector (a 1D `torch.Tensor`)
 * `state`: a table of parameters, and state variables, dependent upon the algorithm
 * `x*`: the new parameter vector that minimizes `f, x* = argmin_x f(x)`
@@ -65,24 +73,24 @@ end
 <a name='optim.algorithms'></a>
 ## Algorithms
 
-All the algorithms provided rely on a unified interface:
+Most algorithms provided rely on a unified interface:
 ```lua
-w_new,fs = optim.method(func,w,state)
+x_new,fs = optim.method(opfunc, x, state)
 ```
 where:
-w is the trainable/adjustable parameter vector,
+x is the trainable/adjustable parameter vector,
 state contains both options for the algorithm and the state of the algorithm,
-func is a closure that has the following interface:
+opfunc is a closure that has the following interface:
 ```lua
-f,df_dw = func(w)
+f,df_dx = opfunc(x)
 ```
-w_new is the new parameter vector (after optimization),
+x_new is the new parameter vector (after optimization),
 fs is a table containing all the values of the objective, as evaluated during
 the optimization procedure: fs[1] is the value before optimization, and fs[#fs]
 is the most optimized one (the lowest).
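To make this unified interface concrete, here is a minimal training-loop sketch built around `optim.sgd`; the `model`, `criterion`, `inputs` and `targets` variables are assumed to be defined elsewhere and are not part of this package.

```lua
-- Minimal sketch of the unified interface (assumes a `model`, a `criterion`,
-- and `inputs`/`targets` mini-batches defined elsewhere).
local x, dl_dx = model:getParameters()   -- flatten parameters into a 1D tensor
local sgdState = {learningRate = 1e-2}   -- options and state share one table

local function opfunc(x_new)
   if x ~= x_new then x:copy(x_new) end
   dl_dx:zero()
   local output = model:forward(inputs)
   local loss = criterion:forward(output, targets)
   model:backward(inputs, criterion:backward(output, targets))
   return loss, dl_dx                    -- f(x) and df/dx, as the API requires
end

local x_new, fs = optim.sgd(opfunc, x, sgdState)
print('objective before this update: ' .. fs[1])
```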
 
 <a name='optim.sgd'></a>
-### [x] sgd(func, w, state)
+### [x] sgd(opfunc, x, state)
 
 An implementation of Stochastic Gradient Descent (SGD).
 
@@ -107,7 +115,7 @@ Returns :
 * `f(x)` : the function, evaluated before the update
 
 <a name='optim.asgd'></a>
-### [x] asgd(func, w, state)
+### [x] asgd(opfunc, x, state)
 
 An implementation of Averaged Stochastic Gradient Descent (ASGD):
 
@@ -137,7 +145,7 @@ Returns:
 
 <a name='optim.lbfgs'></a>
-### [x] lbfgs(func, w, state)
+### [x] lbfgs(opfunc, x, state)
 
 An implementation of L-BFGS that relies on a user-provided line
 search function (`state.lineSearch`). If this function is not
@@ -170,7 +178,7 @@ Returns :
 
 <a name='optim.cg'></a>
-### [x] cg(func, w, state)
+### [x] cg(opfunc, x, state)
 
 An implementation of the Conjugate Gradient method which is a
 rewrite of `minimize.m` written by Carl E. Rasmussen.
@@ -202,4 +210,172 @@ Returns :
 * `f[1]` is the value of the function before any optimization and
 * `f[#f]` is the final fully optimized value, at x*
 
+<a name='optim.adadelta'></a>
+### [x] adadelta(opfunc, x, config, state)
+ADADELTA implementation for SGD http://arxiv.org/abs/1212.5701
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `config` : a table of hyper-parameters
+* `config.rho` : interpolation parameter
+* `config.eps` : for numerical stability
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.paramVariance` : vector of temporal variances of parameters
+* `state.accDelta` : vector of accumulated deltas of gradients
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
+
+<a name='optim.adagrad'></a>
+### [x] adagrad(opfunc, x, config, state)
+AdaGrad implementation for SGD
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.learningRate` : learning rate
+* `state.paramVariance` : vector of temporal variances of parameters
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
+
+<a name='optim.adam'></a>
+### [x] adam(opfunc, x, config, state)
+An implementation of Adam from http://arxiv.org/pdf/1412.6980.pdf
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `config` : a table with configuration parameters for the optimizer
+* `config.learningRate` : learning rate
+* `config.beta1` : first moment coefficient
+* `config.beta2` : second moment coefficient
+* `config.epsilon` : for numerical stability
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
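As a usage sketch for the adaptive methods documented above, switching the earlier loop from SGD to Adam only changes the config table; the hyper-parameter values below are illustrative assumptions, not mandated defaults.

```lua
-- Sketch: Adam updates, reusing `opfunc` and `x` from the SGD sketch above.
-- Hyper-parameter values are illustrative.
local adamConfig = {
   learningRate = 1e-3,
   beta1 = 0.9,      -- decay rate of the first moment estimate
   beta2 = 0.999,    -- decay rate of the second moment estimate
   epsilon = 1e-8,   -- numerical stability term
}
local adamState = {}  -- moment estimates are allocated here on the first call

for i = 1, 100 do
   local x_new, fs = optim.adam(opfunc, x, adamConfig, adamState)
end
```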
+
+<a name='optim.adamax'></a>
+### [x] adamax(opfunc, x, config, state)
+An implementation of AdaMax http://arxiv.org/pdf/1412.6980.pdf
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `config` : a table with configuration parameters for the optimizer
+* `config.learningRate` : learning rate
+* `config.beta1` : first moment coefficient
+* `config.beta2` : second moment coefficient
+* `config.epsilon` : for numerical stability
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
+
+<a name='optim.FistaLS'></a>
+### [x] FistaLS(f, g, pl, xinit, params)
+FISTA with backtracking line search
+* `f` : smooth function
+* `g` : non-smooth function
+* `pl` : minimizer of intermediate problem Q(x,y)
+* `xinit` : initial point
+* `params` : table of parameters (**optional**)
+* `params.L` : 1/(step size) for ISTA/FISTA iteration (0.1)
+* `params.Lstep` : step size multiplier at each iteration (1.5)
+* `params.maxiter` : max number of iterations (50)
+* `params.maxline` : max number of line search iterations per iteration (20)
+* `params.errthres` : error threshold for convergence check (1e-4)
+* `params.doFistaUpdate` : true : use FISTA, false: use ISTA (true)
+* `params.verbose` : store each iteration solution and print detailed info (false)
+
+On output, `params` will contain these additional fields that can be reused:
+* `params.L` : the last used L value will be written here.
+
+These are temporary storages needed by the algorithm; if the same `params` object is
+passed a second time, these same storages will be reused without new allocation:
+* `params.xkm` : previous iteration point
+* `params.y` : fista iteration
+* `params.ply` : ply = pl(y - 1/L grad(f))
+
+Returns the solution x and a history of {function evaluations, number of line searches, ...}
+
+The algorithm is published in http://epubs.siam.org/doi/abs/10.1137/080716542
+
+<a name='optim.nag'></a>
+### [x] nag(opfunc, x, config, state)
+An implementation of SGD adapted with features of Nesterov's
+Accelerated Gradient method, based on the paper "On the Importance of Initialization and Momentum in Deep Learning" (Sutskever et al., ICML 2013).
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.learningRate` : learning rate
+* `state.learningRateDecay` : learning rate decay
+* `state.weightDecay` : weight decay
+* `state.momentum` : momentum
+* `state.learningRates` : vector of individual learning rates
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
+
+<a name='optim.rmsprop'></a>
+### [x] rmsprop(opfunc, x, config, state)
+An implementation of RMSprop
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `config` : a table with configuration parameters for the optimizer
+* `config.learningRate` : learning rate
+* `config.alpha` : smoothing constant
+* `config.epsilon` : value with which to initialise m
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.m` : leaky sum of squares of parameter gradients
+* `state.tmp` : the square root of `state.m` (with epsilon smoothing)
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
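The per-parameter methods above all follow the same calling pattern; as one more sketch, reusing `opfunc` and `x` from the first example and with illustrative values, a single RMSprop step looks like this:

```lua
-- Sketch: a single RMSprop step, reusing `opfunc` and `x` from the SGD sketch.
local rmspropConfig = {
   learningRate = 1e-3,
   alpha = 0.9,      -- smoothing constant for the leaky sum of squared gradients
   epsilon = 1e-8,   -- value with which `state.m` is initialised
}
local rmspropState = {}  -- `state.m` and `state.tmp` are created here on first use

local x_new, fs = optim.rmsprop(opfunc, x, rmspropConfig, rmspropState)
```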
+
+<a name='optim.rprop'></a>
+### [x] rprop(opfunc, x, config, state)
+A plain implementation of Rprop
+(Martin Riedmiller, Koray Kavukcuoglu 2013)
+
+Arguments :
+
+* `opfunc` : a function that takes a single input (X), the point of evaluation, and returns f(X) and df/dX
+* `x` : the initial point
+* `state` : a table describing the state of the optimizer; after each call the state is modified
+* `state.stepsize` : initial step size, common to all components
+* `state.etaplus` : multiplicative increase factor, > 1 (default 1.2)
+* `state.etaminus` : multiplicative decrease factor, < 1 (default 0.5)
+* `state.stepsizemax` : maximum stepsize allowed (default 50)
+* `state.stepsizemin` : minimum stepsize allowed (default 1e-6)
+* `state.niter` : number of iterations (default 1)
+
+Returns :
+
+* `x` : the new x vector
+* `f(x)` : the function, evaluated before the update
diff --git a/sgd.lua b/sgd.lua
--- a/sgd.lua
+++ b/sgd.lua
@@ -13,9 +13,10 @@ ARGS:
 - `config.momentum` : momentum
 - `config.dampening` : dampening for momentum
 - `config.nesterov` : enables Nesterov momentum
+- `config.learningRates` : vector of individual learning rates
 - `state` : a table describing the state of the optimizer; after each
             call the state is modified
-- `state.learningRates` : vector of individual learning rates
+- `state.evalCounter` : evaluation counter (optional: 0, by default)
 
 RETURN:
 - `x` : the new x vector
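The `sgd.lua` hunk above moves `learningRates` into `config` and documents `state.evalCounter`; a short sketch of how such a per-parameter learning-rate vector might be passed, reusing `opfunc` and `x` from the first example (the all-ones vector is an illustrative placeholder):

```lua
-- Sketch: SGD with individual learning rates via the documented
-- `learningRates` field; `opfunc` and `x` as in the first sketch.
local sgdState = {
   learningRate  = 1e-2,
   learningRates = torch.ones(x:size(1)),  -- one multiplier per parameter (placeholder)
   momentum      = 0.9,
}

local x_new, fs = optim.sgd(opfunc, x, sgdState)
-- sgdState.evalCounter is incremented on every call
```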