| author | Soumith Chintala <soumith@gmail.com> | 2015-08-30 06:33:35 +0300 |
|---|---|---|
| committer | Soumith Chintala <soumith@gmail.com> | 2015-08-30 06:33:35 +0300 |
| commit | b11ce0a73d8fa56fc7837552a1b25c7adc99581a (patch) | |
| tree | f05c513f1c37cd7a9ad1224915bfb8296003952f | |
| parent | 5906efd4b601e63d5bc6e33be8d4621354111471 (diff) | |
| parent | 7918f15cf0a98678857354e4e29ef6e9b8af020d (diff) | |
Merge pull request #66 from nicholas-leonard/readthedocs
readthedocs
| -rw-r--r-- | doc/index.md | 205 |
| -rw-r--r-- | dok/index.dok | 78 |
| -rw-r--r-- | mkdocs.yml | 8 |
3 files changed, 213 insertions, 78 deletions
diff --git a/doc/index.md b/doc/index.md
new file mode 100644
index 0000000..1f5f253
--- /dev/null
+++ b/doc/index.md
@@ -0,0 +1,205 @@
+<a name='optim.dok'></a>
+# Optim Package
+
+This package provides a set of optimization algorithms, which all follow
+a unified, closure-based API.
+
+This package is fully compatible with the [nn](http://nn.readthedocs.org) package, but can also be
+used to optimize arbitrary objective functions.
+
+For now, the following algorithms are provided:
+
+ * [Stochastic Gradient Descent](#optim.sgd)
+ * [Averaged Stochastic Gradient Descent](#optim.asgd)
+ * [L-BFGS](#optim.lbfgs)
+ * [Conjugate Gradients](#optim.cg)
+
+All these algorithms are designed to support batch optimization as
+well as stochastic optimization. It's up to the user to construct an
+objective function that represents the batch, mini-batch, or single sample
+on which to evaluate the objective.
+
+Some of these algorithms support a line search, which can be passed as
+a function (L-BFGS), whereas others only support a learning rate (SGD).
+
+<a name='optim.overview'></a>
+## Overview
+
+This package contains several optimization routines for [Torch](https://github.com/torch/torch7/blob/master/README.md).
+Each optimization algorithm is based on the same interface:
+
+```lua
+x*, {f}, ... = optim.method(func, x, state)
+```
+
+where:
+
+* `func`: a user-defined closure that respects this API: `f, df/dx = func(x)`
+* `x`: the current parameter vector (a 1D `torch.Tensor`)
+* `state`: a table of parameters and state variables, dependent upon the algorithm
+* `x*`: the new parameter vector that minimizes `f`: `x* = argmin_x f(x)`
+* `{f}`: a table of all `f` values, in the order they've been evaluated (for some simple algorithms, like SGD, `#f == 1`)
+
+<a name='optim.example'></a>
+## Example
+
+The state table is used to hold the state of the algorithm.
+It's usually initialized once, by the user, and then passed to the optim function
+as a black box. Example:
+
+```lua
+state = {
+   learningRate = 1e-3,
+   momentum = 0.5
+}
+
+for i,sample in ipairs(training_samples) do
+   local func = function(x)
+      -- define eval function
+      return f, df_dx
+   end
+   optim.sgd(func, x, state)
+end
+```
+
+<a name='optim.algorithms'></a>
+## Algorithms
+
+All the algorithms provided rely on a unified interface:
+```lua
+w_new, fs = optim.method(func, w, state)
+```
+where:
+w is the trainable/adjustable parameter vector,
+state contains both options for the algorithm and the state of the algorithm,
+func is a closure that has the following interface:
+```lua
+f, df_dw = func(w)
+```
+w_new is the new parameter vector (after optimization),
+fs is a table containing all the values of the objective, as evaluated during
+the optimization procedure: fs[1] is the value before optimization, and fs[#fs]
+is the most optimized one (the lowest).
+
+<a name='optim.sgd'></a>
+### [x] sgd(func, w, state)
+
+An implementation of Stochastic Gradient Descent (SGD).
+
+Arguments:
+
+ * `opfunc` : a function that takes a single input (`X`), the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x` : the initial point
+ * `config` : a table with configuration parameters for the optimizer
+ * `config.learningRate` : learning rate
+ * `config.learningRateDecay` : learning rate decay
+ * `config.weightDecay` : weight decay
+ * `config.weightDecays` : vector of individual weight decays
+ * `config.momentum` : momentum
+ * `config.dampening` : dampening for momentum
+ * `config.nesterov` : enables Nesterov momentum
+ * `state` : a table describing the state of the optimizer; after each call the state is modified
+ * `state.learningRates` : vector of individual learning rates
+
+Returns :
+
+ * `x` : the new x vector
+ * `f(x)` : the function, evaluated before the update
+
+<a name='optim.asgd'></a>
+### [x] asgd(func, w, state)
+
+An implementation of Averaged Stochastic Gradient Descent (ASGD):
+
+```
+x = (1 - lambda eta_t) x - eta_t df/dx(z,x)
+a = a + mu_t [ x - a ]
+
+eta_t = eta0 / (1 + lambda eta0 t) ^ 0.75
+mu_t = 1/max(1,t-t0)
+```
+
+Arguments:
+
+ * `opfunc` : a function that takes a single input (`X`), the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x` : the initial point
+ * `state` : a table describing the state of the optimizer; after each call the state is modified
+ * `state.eta0` : learning rate
+ * `state.lambda` : decay term
+ * `state.alpha` : power for eta update
+ * `state.t0` : point at which to start averaging
+
+Returns:
+
+ * `x` : the new x vector
+ * `f(x)` : the function, evaluated before the update
+ * `ax` : the averaged x vector
+
+
+<a name='optim.lbfgs'></a>
+### [x] lbfgs(func, w, state)
+
+An implementation of L-BFGS that relies on a user-provided line
+search function (`state.lineSearch`). If this function is not
+provided, then a simple learningRate is used to produce fixed
+size steps. Fixed size steps are much less costly than line
+searches, and can be useful for stochastic problems.
+
+The learning rate is used even when a line search is provided.
+This is also useful for large-scale stochastic problems, where
+opfunc is a noisy approximation of `f(x)`. In that case, the learning
+rate allows a reduction of confidence in the step size.
+
+Arguments :
+
+ * `opfunc` : a function that takes a single input (`X`), the point of evaluation, and returns `f(X)` and `df/dX`
+ * `x` : the initial point
+ * `state` : a table describing the state of the optimizer; after each call the state is modified
+ * `state.maxIter` : Maximum number of iterations allowed
+ * `state.maxEval` : Maximum number of function evaluations
+ * `state.tolFun` : Termination tolerance on the first-order optimality
+ * `state.tolX` : Termination tolerance on progress in terms of function/parameter changes
+ * `state.lineSearch` : A line search function
+ * `state.learningRate` : If no line search is provided, then a fixed step size is used
+
+Returns :
+
+ * `x*` : the new `x` vector, at the optimal point
+ * `f` : a table of all function values:
+   * `f[1]` is the value of the function before any optimization and
+   * `f[#f]` is the final fully optimized value, at `x*`
+
+
+<a name='optim.cg'></a>
+### [x] cg(func, w, state)
+
+An implementation of the Conjugate Gradient method, which is a rewrite of
+`minimize.m` written by Carl E. Rasmussen.
+It is supposed to produce exactly the same results (give
+or take numerical accuracy due to some changed order of
+operations). You can compare the result on rosenbrock with
+[minimize.m](http://www.gatsby.ucl.ac.uk/~edward/code/minimize/example.html):
+```
+[x fx c] = minimize([0 0]', 'rosenbrock', -25)
+```
+
+Note that only the number of function evaluations is limited, as this seems
+more important in practical use.
+
+Arguments :
+
+ * `opfunc` : a function that takes a single input, the point of evaluation.
+ * `x` : the initial point
+ * `state` : a table of parameters and temporary allocations.
+ * `state.maxEval` : max number of function evaluations
+ * `state.maxIter` : max number of iterations
+ * `state.df[0,1,2,3]` : if you pass torch.Tensors they will be used for temp storage
+ * `state.[s,x0]` : if you pass torch.Tensors they will be used for temp storage
+
+Returns :
+
+ * `x*` : the new x vector, at the optimal point
+ * `f` : a table of all function values where
+   * `f[1]` is the value of the function before any optimization and
+   * `f[#f]` is the final fully optimized value, at x*
+
diff --git a/dok/index.dok b/dok/index.dok
deleted file mode 100644
index 532ecf5..0000000
--- a/dok/index.dok
+++ /dev/null
@@ -1,78 +0,0 @@
-====== Optimization Package =======
-{{anchor:optim.dok}}
-
-This package provides a set of optimization algorithms, which all follow
-a unified, closure-based API.
-
-This package is fully compatible with the 'nn' package, but can also be
-used to optimize arbitrary objective functions.
-
-For now, the following algorithms are provided:
- * Stochastic Gradient Descent (SGD): [[#optim.sgd|optim.sgd]]
- * Averaged Stochastic Gradient Descent (ASGD): [[#optim.asgd|optim.asgd]]
- * L-BFGS: [[#optim.lbfgs|optim.lbfgs]]
- * Congugate Gradients (CG): [[#optim.cg|optim.cg]]
-
-All these algorithms are designed to support batch optimization as
-well as stochastic optimization. It's up to the user to construct an
-objective function that represents the batch, mini-batch, or single sample
-on which to evaluate the objective.
-
-Some of these algorithms support a line search, which can be passed as
-a function (L-BFGS), whereas others only support a learning rate (SGD).
-
-
-====== Overview of the Optimization Package ======
-{{anchor:optim.overview.dok}}
-
-Rather than long descriptions, let's simply start with a little example.
-
-<file lua>
--- write an example here.
-</file>
-
-===== Simple Objective =====
-
-===== Neural Network Objective =====
-
-
-====== Algorithms ======
-{{anchor:nn.API}}
-
-All the algorithms provided rely on a unified interface:
-<file lua>
-w_new,fs = optim.method(func,w,state)
-</file>
-where:
-w is the trainable/adjustable parameter vector,
-state contains both options for the algorithm and the state of the algorihtm,
-func is a closure that has the following interface:
-<file lua>
-f,df_dw = func(w)
-</file>
-w_new is the new parameter vector (after optimization),
-fs is a a table containing all the values of the objective, as evaluated during
-the optimization procedure: fs[1] is the value before optimization, and fs[#fs]
-is the most optimized one (the lowest).
-
-===== [x] sgd(func, w, state) =====
-{{anchor:optim.sgd}}
-
-An implementation of Stochastic Gradient Descent.
-
-===== [x] asgd(func, w, state) =====
-{{anchor:optim.asgd}}
-
-An implementation of Averaged Stochastic Gradient Descent.
-
-===== [x] lbfgs(func, w, state) =====
-{{anchor:optim.lbfgs}}
-
-An implementation of L-BFGS.
-
-===== [x] cg(func, w, state) =====
-{{anchor:optim.cg}}
-
-An implementation of the Conjugate Gradient method.
-
diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 0000000..9624b2a
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1,8 @@
+site_name: optim
+theme : simplex
+repo_url : https://github.com/torch/optim
+use_directory_urls : false
+markdown_extensions: [extra]
+docs_dir : doc
+pages:
+- [index.md, Optim]
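The closure-based interface documented in the new doc/index.md above — `w_new, fs = optim.method(func, w, state)` with `f, df_dw = func(w)` — can be sketched in plain Lua without any torch dependency. The code below is a hypothetical, scalar-valued stand-in for illustration only: `toySgd` is not the real `optim.sgd` (which operates on `torch.Tensor` parameters and supports many more options), but it shows how the closure, the `state` table, and the returned `fs` table fit together.

```lua
-- A toy, scalar stand-in for the documented interface
--   w_new, fs = optim.method(func, w, state)
-- This is a hypothetical sketch, NOT the real optim.sgd: it applies one
-- SGD-with-momentum step using state.learningRate, state.momentum and a
-- velocity term kept inside state (the "black box" usage described above).
local function toySgd(func, w, state)
   local f, dfdw = func(w)                  -- evaluate the closure: f(w), df/dw
   local lr  = state.learningRate or 1e-3
   local mom = state.momentum or 0
   state.v = mom * (state.v or 0) + dfdw    -- momentum buffer, persisted in state
   local wNew = w - lr * state.v            -- parameter update
   return wNew, {f}                         -- fs[1] is f evaluated before the update
end

-- Objective: f(w) = (w - 3)^2, so df/dw = 2 (w - 3); minimum at w = 3
local func = function(w)
   return (w - 3)^2, 2 * (w - 3)
end

local w, state = 0, { learningRate = 0.1, momentum = 0.5 }
for i = 1, 200 do
   w = toySgd(func, w, state)
end
print(w)  -- approaches the minimizer, 3
```

Note how `state` is initialized once and then treated as a black box across iterations, exactly as the Example section of the added documentation describes.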