<a name='optim.algos'></a>
# Optimization algorithms

The following algorithms are provided:

  * [*Stochastic Gradient Descent*](#optim.sgd)
  * [*Averaged Stochastic Gradient Descent*](#optim.asgd)
  * [*L-BFGS*](#optim.lbfgs)
  * [*Conjugate Gradients*](#optim.cg)
  * [*AdaDelta*](#optim.adadelta)
  * [*AdaGrad*](#optim.adagrad)
  * [*Adam*](#optim.adam)
  * [*AdaMax*](#optim.adamax)
  * [*FISTA with backtracking line search*](#optim.FistaLS)
  * [*Nesterov's Accelerated Gradient method*](#optim.nag)
  * [*RMSprop*](#optim.rmsprop)
  * [*Rprop*](#optim.rprop)
  * [*CMAES*](#optim.cmaes)

All these algorithms are designed to support batch optimization as well as stochastic optimization.
It's up to the user to construct an objective function that represents the batch, mini-batch, or single sample on which to evaluate the objective.

Some of these algorithms support a line search, which can be passed as a function (*L-BFGS*), whereas others only support a learning rate (*SGD*).

General interface:

```lua
x*, {f}, ... = optim.method(opfunc, x[, config][, state])
```
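
For example, here is a minimal sketch with a hypothetical quadratic objective (`f(x) = 0.5 * ||x||^2`, so `df/dx = x`); the same calling pattern applies to all of the methods above:

```lua
require 'optim'

-- Hypothetical toy objective: f(x) = 0.5 * ||x||^2, whose gradient is x.
local function opfunc(x)
   return 0.5 * x:dot(x), x:clone()
end

local x = torch.randn(10)
local config = {learningRate = 0.1}
local state = {}   -- re-passed on every call so the method can accumulate information
local fs

for i = 1, 100 do
   x, fs = optim.sgd(opfunc, x, config, state)
end
print(fs[1])   -- f(x), evaluated before the last update
```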


<a name='optim.sgd'></a>
## sgd(opfunc, x[, config][, state])

An implementation of *Stochastic Gradient Descent* (*SGD*).

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.learningRate`: learning rate
  * `config.learningRateDecay`: learning rate decay
  * `config.weightDecay`: weight decay
  * `config.weightDecays`: vector of individual weight decays
  * `config.momentum`: momentum
  * `config.dampening`: dampening for momentum
  * `config.nesterov`: enables Nesterov momentum
  * `state`: a table describing the state of the optimizer; after each call the state is modified
  * `state.learningRates`: vector of individual learning rates

Returns:

  * `x*`: the new x vector
  * `f(x)`: the function, evaluated before the update
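
In practice, `opfunc` is usually a closure over a model and the current mini-batch. A sketch using the `nn` package (the model architecture and the `inputs`/`targets` tensors are illustrative placeholders):

```lua
require 'nn'
require 'optim'

-- A hypothetical two-layer regression model; any nn module works the same way.
local model = nn.Sequential()
model:add(nn.Linear(10, 20))
model:add(nn.Tanh())
model:add(nn.Linear(20, 1))
local criterion = nn.MSECriterion()

-- Flatten all weights into a single parameter vector, as optim expects.
local params, gradParams = model:getParameters()

local config = {learningRate = 0.01, momentum = 0.9}
local state = {}

-- One optimization step per mini-batch.
local function step(inputs, targets)
   local function feval(x)
      if x ~= params then params:copy(x) end
      gradParams:zero()
      local outputs = model:forward(inputs)
      local loss = criterion:forward(outputs, targets)
      model:backward(inputs, criterion:backward(outputs, targets))
      return loss, gradParams
   end
   optim.sgd(feval, params, config, state)
end
```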


<a name='optim.asgd'></a>
## asgd(opfunc, x[, config][, state])

An implementation of *Averaged Stochastic Gradient Descent* (*ASGD*):

```lua
x = (1 - lambda eta_t) x - eta_t df / dx(z, x)
a = a + mu_t [ x - a ]

eta_t = eta0 / (1 + lambda eta0 t) ^ 0.75
mu_t = 1 / max(1, t - t0)
```

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.eta0`: learning rate
  * `config.lambda`: decay term
  * `config.alpha`: power for eta update
  * `config.t0`: point at which to start averaging

Returns:

  * `x*`: the new x vector
  * `f(x)`: the function, evaluated before the update
  * `ax`: the averaged x vector
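
A minimal sketch (toy objective as in the general-interface example above); note the third return value:

```lua
require 'optim'

local function opfunc(x)   -- hypothetical toy objective
   return 0.5 * x:dot(x), x:clone()
end

local x = torch.randn(10)
local config = {eta0 = 1e-2, lambda = 1e-4, t0 = 100}
local state = {}
local fs, ax

for i = 1, 1000 do
   x, fs, ax = optim.asgd(opfunc, x, config, state)
end
-- ax, the running average of the iterates, is often a better final
-- estimate than the last x itself.
```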


<a name='optim.lbfgs'></a>
## lbfgs(opfunc, x[, config][, state])

An implementation of *L-BFGS* that relies on a user-provided line search function (`state.lineSearch`).
If this function is not provided, then a simple learning rate is used to produce fixed size steps.
Fixed size steps are much less costly than line searches, and can be useful for stochastic problems.

The learning rate is used even when a line search is provided.
This is also useful for large-scale stochastic problems, where opfunc is a noisy approximation of `f(x)`.
In that case, the learning rate allows a reduction of confidence in the step size.

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.maxIter`: Maximum number of iterations allowed
  * `config.maxEval`: Maximum number of function evaluations
  * `config.tolFun`: Termination tolerance on the first-order optimality
  * `config.tolX`: Termination tolerance on progress in terms of function/parameter changes
  * `config.lineSearch`: A line search function
  * `config.learningRate`: If no line search provided, then a fixed step size is used

Returns:
  * `x*`: the new `x` vector, at the optimal point
  * `f`: a table of all function values:
   * `f[1]` is the value of the function before any optimization and
   * `f[#f]` is the final fully optimized value, at `x*`
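
A minimal deterministic (full-batch) sketch on a toy objective; a single call runs up to `maxIter` iterations, and optim's bundled Wolfe line search, `optim.lswolfe`, can be plugged in:

```lua
require 'optim'

local function opfunc(x)   -- hypothetical toy objective
   return 0.5 * x:dot(x), x:clone()
end

local config = {maxIter = 100, lineSearch = optim.lswolfe}
local x, fs = optim.lbfgs(opfunc, torch.randn(10), config)
print(fs[1], fs[#fs])   -- objective before optimization and at x*
```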


<a name='optim.cg'></a>
## cg(opfunc, x[, config][, state])

An implementation of the *Conjugate Gradient* method which is a rewrite of `minimize.m` written by Carl E. Rasmussen.
It is supposed to produce exactly the same results (give or take numerical accuracy due to some changed order of operations).
You can compare the result on rosenbrock with [`minimize.m`](http://www.gatsby.ucl.ac.uk/~edward/code/minimize/example.html).

```matlab
[x, fx, c] = minimize([0, 0]', 'rosenbrock', -25)
```

Note that only the number of function evaluations is limited; in practical use this seems to matter more than capping iterations.

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.maxEval`: max number of function evaluations
  * `config.maxIter`: max number of iterations
  * `state`: a table of parameters and temporary allocations.
  * `state.df[0, 1, 2, 3]`: if you pass `Tensor`s, they will be used for temporary storage
  * `state.[s, x0]`: if you pass `Tensor`s, they will be used for temporary storage

Returns:

  * `x*`: the new `x` vector, at the optimal point
  * `f`: a table of all function values where
   * `f[1]` is the value of the function before any optimization and
   * `f[#f]` is the final fully optimized value, at `x*`
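
A minimal sketch on a toy objective; as noted above, only the number of function evaluations is capped:

```lua
require 'optim'

local function opfunc(x)   -- hypothetical toy objective
   return 0.5 * x:dot(x), x:clone()
end

local x, fs = optim.cg(opfunc, torch.randn(10), {maxEval = 50})
print(fs[1], fs[#fs])   -- objective before optimization and at x*
```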


<a name='optim.adadelta'></a>
## adadelta(opfunc, x[, config][, state])

An implementation of *AdaDelta* for *SGD* (http://arxiv.org/abs/1212.5701).

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table of hyper-parameters
  * `config.rho`: interpolation parameter
  * `config.eps`: for numerical stability
  * `state`: a table describing the state of the optimizer; after each call the state is modified
  * `state.paramVariance`: vector of temporal variances of parameters
  * `state.accDelta`: vector of accumulated deltas of gradients

Returns:

  * `x*`: the new x vector
  * `f(x)`: the function, evaluated before the update
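
A minimal sketch; note that, unlike *SGD*, there is no learning rate to tune (the values below are illustrative):

```lua
require 'optim'

local function opfunc(x)   -- hypothetical toy objective
   return 0.5 * x:dot(x), x:clone()
end

local x = torch.randn(10)
local config = {rho = 0.9, eps = 1e-6}   -- no learning rate needed
local state = {}

for i = 1, 200 do
   x = optim.adadelta(opfunc, x, config, state)
end
```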


<a name='optim.adagrad'></a>
## adagrad(opfunc, x[, config][, state])

An implementation of *AdaGrad* for *SGD*.

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.learningRate`: learning rate
  * `config.learningRateDecay`: learning rate decay
  * `config.weightDecay`: weight decay coefficient for regularization
  * `state`: a table describing the state of the optimizer; after each call the state is modified
  * `state.paramVariance`: vector of temporal variances of parameters

Returns:

  * `x*`: the new `x` vector
  * `f(x)`: the function, evaluated before the update


<a name='optim.adam'></a>
## adam(opfunc, x[, config][, state])

An implementation of *Adam* from http://arxiv.org/pdf/1412.6980.pdf.

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.learningRate`: learning rate
  * `config.learningRateDecay`: learning rate decay
  * `config.weightDecay`: weight decay coefficient for regularization
  * `config.beta1`: first moment coefficient
  * `config.beta2`: second moment coefficient
  * `config.epsilon`: for numerical stability
  * `state`: a table describing the state of the optimizer; after each call the state is modified

Returns:

  * `x*`: the new x vector
  * `f(x)`: the function, evaluated before the update
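
A minimal sketch; the hyper-parameters shown are the common defaults from the Adam paper:

```lua
require 'optim'

local function opfunc(x)   -- hypothetical toy objective
   return 0.5 * x:dot(x), x:clone()
end

local x = torch.randn(10)
local config = {learningRate = 1e-3, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8}
local state = {}   -- accumulates the first and second moment estimates

for i = 1, 1000 do
   x = optim.adam(opfunc, x, config, state)
end
```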


<a name='optim.adamax'></a>
## adamax(opfunc, x[, config][, state])

An implementation of *AdaMax*, a variant of *Adam* based on the infinity norm, from http://arxiv.org/pdf/1412.6980.pdf.

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.learningRate`: learning rate
  * `config.beta1`: first moment coefficient
  * `config.beta2`: second moment coefficient
  * `config.epsilon`: for numerical stability
  * `state`: a table describing the state of the optimizer; after each call the state is modified

Returns:

  * `x*`: the new `x` vector
  * `f(x)`: the function, evaluated before the update


<a name='optim.FistaLS'></a>
## FistaLS(f, g, pl, xinit[, params])

*FISTA* with backtracking line search:

  * `f`: smooth function
  * `g`: non-smooth function
  * `pl`: minimizer of intermediate problem Q(x, y)
  * `xinit`: initial point
  * `params`: table of parameters (**optional**)
  * `params.L`: 1/(step size) for ISTA/FISTA iteration (0.1)
  * `params.Lstep`: step size multiplier at each iteration (1.5)
  * `params.maxiter`: max number of iterations (50)
  * `params.maxline`: max number of line search iterations per iteration (20)
  * `params.errthres`: Error threshold for convergence check (1e-4)
  * `params.doFistaUpdate`: `true`: use FISTA, `false`: use ISTA (true)
  * `params.verbose`: store each iteration solution and print detailed info (false)

On output, `params` will contain these additional fields that can be reused:
  * `params.L`: the last used `L` value.

These are temporary storages needed by the algorithm; if the same `params` object is passed a second time, they are reused without new allocation:
  * `params.xkm`: previous iteration point
  * `params.y`: FISTA iterate
  * `params.ply`: `ply = pl(y - 1/L grad(f))`

Returns the solution `x` and a history of `{function evaluations, number of line-search iterations, ...}`.

The algorithm is published in http://epubs.siam.org/doi/abs/10.1137/080716542 (Beck and Teboulle, 2009).
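
For orientation, `Q(x, y)` is the usual quadratic upper model from the paper cited above; for instance, with `g(x) = lambda ||x||_1`, its minimizer `pl` is elementwise soft-thresholding of `y - 1/L grad(f(y))`:

```
Q(x, y) = f(y) + <x - y, grad f(y)> + L/2 ||x - y||^2 + g(x)
```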


<a name='optim.nag'></a>
## nag(opfunc, x[, config][, state])

An implementation of *SGD* adapted with features of *Nesterov's Accelerated Gradient method*, based on the paper "On the Importance of Initialization and Momentum in Deep Learning" (Sutskever et al., ICML 2013) http://www.cs.toronto.edu/~fritz/absps/momentum.pdf.

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.learningRate`: learning rate
  * `config.learningRateDecay`: learning rate decay
  * `config.weightDecay`: weight decay
  * `config.momentum`: momentum
  * `config.learningRates`: vector of individual learning rates

Returns:

  * `x*`: the new `x` vector
  * `f(x)`: the function, evaluated before the update


<a name='optim.rmsprop'></a>
## rmsprop(opfunc, x[, config][, state])

An implementation of *RMSprop*.

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.learningRate`: learning rate
  * `config.alpha`: smoothing constant
  * `config.epsilon`: value with which to initialise m
  * `state`: a table describing the state of the optimizer; after each call the state is modified
  * `state.m`: leaky sum of squares of parameter gradients
  * `state.tmp`: its square root (with `epsilon` smoothing)

Returns:

  * `x*`: the new x vector
  * `f(x)`: the function, evaluated before the update
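
A minimal sketch (illustrative values; `state.m` is filled in on the first call):

```lua
require 'optim'

local function opfunc(x)   -- hypothetical toy objective
   return 0.5 * x:dot(x), x:clone()
end

local x = torch.randn(10)
local config = {learningRate = 1e-2, alpha = 0.99, epsilon = 1e-8}
local state = {}   -- will hold the leaky squared-gradient average m

for i = 1, 500 do
   x = optim.rmsprop(opfunc, x, config, state)
end
```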


<a name='optim.rprop'></a>
## rprop(opfunc, x[, config][, state])

A plain implementation of *Rprop* (resilient backpropagation; Riedmiller and Braun, 1993).

Arguments:

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`
  * `x`: the initial point
  * `config`: a table with configuration parameters for the optimizer
  * `config.stepsize`: initial step size, common to all components
  * `config.etaplus`: multiplicative increase factor, > 1 (default 1.2)
  * `config.etaminus`: multiplicative decrease factor, < 1 (default 0.5)
  * `config.stepsizemax`: maximum stepsize allowed (default 50)
  * `config.stepsizemin`: minimum stepsize allowed (default 1e-6)
  * `config.niter`: number of iterations (default 1)

Returns:

  * `x*`: the new x vector
  * `f(x)`: the function, evaluated before the update
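
A minimal sketch; setting `niter > 1` runs several *Rprop* iterations inside a single call:

```lua
require 'optim'

local function opfunc(x)   -- hypothetical toy objective
   return 0.5 * x:dot(x), x:clone()
end

local config = {stepsize = 0.1, etaplus = 1.2, etaminus = 0.5, niter = 10}
local x, fx = optim.rprop(opfunc, torch.randn(10), config)
```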


<a name='optim.cmaes'></a>
## cmaes(opfunc, x[, config][, state])

An implementation of *CMAES* (*Covariance Matrix Adaptation Evolution Strategy*), ported from https://www.lri.fr/~hansen/barecmaes2.html.

*CMAES* is a stochastic, derivative-free method for heuristic global optimization of non-linear or non-convex continuous optimization problems.
Note that this method will on average require many more function evaluations to converge than a gradient-based method.

Arguments:

If `state` is provided, `config` is ignored; otherwise `state` defaults to `config`.

  * `opfunc`: a function that takes a single input `X`, the point of evaluation, and returns `f(X)` and `df/dX`. Note that `df/dX` is not used and can be left as `0`
  * `x`: the initial point
  * `state`: a table describing the state of the optimizer; after each call the state is modified
  * `state.sigma`: float, initial step-size (standard deviation in each coordinate)
  * `state.maxEval`: int, maximal number of function evaluations
  * `state.ftarget`: float, target objective function value; optimization stops if `fitness < ftarget`
  * `state.popsize`: population size. If this is left empty, `4 + int(3 * log(|x|))` will be used
  * `state.verb_disp`: display info on console every `verb_disp` iterations; 0 for never

Returns:
  * `x*`: the new `x` vector, at the optimal point
  * `f`: a table of all function values:
   * `f[1]` is the value of the function before any optimization and
   * `f[#f]` is the final fully optimized value, at `x*`
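
A minimal sketch (toy objective, illustrative settings); since the gradient is ignored, `opfunc` can simply return `0` for `df/dX`:

```lua
require 'optim'

local function opfunc(x)   -- hypothetical toy objective
   return x:dot(x), 0      -- df/dX is never used by CMA-ES
end

local config = {sigma = 0.5, maxEval = 2000, verb_disp = 0}
local x, fs = optim.cmaes(opfunc, torch.randn(5), config)
print(fs[#fs])   -- best objective value found
```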