replace the StochasticGradient example with optim example

author: Hugh Perkins <hughperkins@gmail.com> 2016-05-16 16:51:52 +0300
committer: Hugh Perkins <hughperkins@gmail.com> 2016-05-16 16:52:44 +0300
commit: 0594d54443b22c43bd8b5e9e5fcc80361f6e923a (patch)
tree: 2a641b9a13741f39186962f51f35fa81875d8d54
parent: 9f64ffa3e20c73ada3ab8c1564eb821e719f0155 (diff)
1 files changed, 75 insertions, 76 deletions
diff --git a/doc/training.md b/doc/training.md
index 8125ee0..063e08b 100644
--- a/doc/training.md
+++ b/doc/training.md
@@ -4,102 +4,101 @@
 Training a neural network is easy with a [simple `for` loop](#nn.DoItYourself).
 While doing your own loop provides great flexibility, you might
 want sometimes a quick way of training neural
-networks. [StochasticGradient](#nn.StochasticGradient), a simple class
-which does the job for you is provided as standard.
+networks. [optim](https://github.com/torch/optim) is the standard way of training Torch7 neural networks.
 
-<a name="nn.StochasticGradient.dok"></a>
-## StochasticGradient ##
+`optim` is a general-purpose optimizer that can minimize any function that outputs a loss.  In our case, the
+function is the loss of our network, given an input and a set of weights.  The goal of training
+a neural net is to
+optimize the weights to give the lowest loss over our training set of input data.  So, we are going to use `optim`
+to minimize the loss with respect to the weights, over our training set.  We will feed the data to
+`optim` in minibatches.  For this particular example, we will use just one minibatch, but in your own training
+you will almost certainly want to break your training set into minibatches, and feed each minibatch to `optim`,
+one by one.
 
-`StochasticGradient` is a high-level class for training [neural networks](#nn.Module), using a stochastic gradient
-algorithm. This class is [serializable](https://github.com/torch/torch7/blob/master/doc/serialization.md#serialization).
+We need to give `optim` a function that will output the loss and the derivative of the loss with respect to the
+weights, given a set of input weights.  The function will have access to our training minibatch, and use this
+to calculate the loss, for this minibatch.  Typically, the function would be defined inside our loop over
+batches, and therefore have access to the current minibatch data.
 
-<a name="nn.StochasticGradient"></a>
-### StochasticGradient(module, criterion) ###
+Here's how this looks:
 
-Create a `StochasticGradient` class, using the given [Module](module.md#nn.Module) and [Criterion](criterion.md#nn.Criterion).
-The class contains [several parameters](#nn.StochasticGradientParameters) you might want to set after initialization.
-
-<a name="nn.StochasticGradientTrain"></a>
-### train(dataset) ###
-
-Train the module and criterion given in the
-[constructor](#nn.StochasticGradient) over `dataset`, using the
-internal [parameters](#nn.StochasticGradientParameters).
-
-StochasticGradient expect as a `dataset` an object which implements the operator
-`dataset[index]` and implements the method `dataset:size()`. The `size()` methods
-returns the number of examples and `dataset[i]` has to return the i-th example.
-
-An `example` has to be an object which implements the operator
-`example[field]`, where `field` might take the value `1` (input features)
-or `2` (corresponding label which will be given to the criterion). 
-The input is usually a Tensor (except if you use special kind of gradient modules,
-like [table layers](table.md#nn.TableLayers)). The label type depends of the criterion.
-For example, the [MSECriterion](criterion.md#nn.MSECriterion) expects a Tensor, but the
-[ClassNLLCriterion](criterion.md#nn.ClassNLLCriterion) except a integer number (the class).
+__Neural Network__
 
-Such a dataset is easily constructed by using Lua tables, but it could any `C` object
-for example, as long as required operators/methods are implemented. 
-[See an example](#nn.DoItStochasticGradient).
+We create a simple neural network with one hidden layer.
+```lua
+require 'nn'
 
-<a name="nn.StochasticGradientParameters"></a>
-### Parameters ###
+local model = nn.Sequential();  -- make a multi-layer perceptron
+local inputs, outputs, HUs = 2, 1, 20 -- parameters
+model:add(nn.Linear(inputs, HUs))
+model:add(nn.Tanh())
+model:add(nn.Linear(HUs, outputs))
+```
 
-`StochasticGradient` has several field which have an impact on a call to [train()](#nn.StochasticGradientTrain).
+__Criterion__
 
-  * `learningRate`: This is the learning rate used during training. The update of the parameters will be `parameters = parameters - learningRate * parameters_gradient`. Default value is `0.01`.
-  * `learningRateDecay`: The learning rate decay. If non-zero, the learning rate (note: the field learningRate will not change value) will be computed after each iteration (pass over the dataset) with: `current_learning_rate =learningRate / (1 + iteration * learningRateDecay)`
-  * `maxIteration`: The maximum number of iteration (passes over the dataset). Default is `25`.
-  * `shuffleIndices`: Boolean which says if the examples will be randomly sampled or not. Default is `true`. If `false`, the examples will be taken in the order of the dataset.
-  * `hookExample`: A possible hook function which will be called (if non-nil) during training after each example forwarded and backwarded through the network. The function takes `(self, example)` as parameters. Default is `nil`.
-  * `hookIteration`: A possible hook function which will be called (if non-nil) during training after a complete pass over the dataset. The function takes `(self, iteration, currentError)` as parameters. Default is `nil`.
+We choose the Mean Squared Error criterion.
+```lua
+local criterion = nn.MSECriterion()
+```
 
-<a name="nn.DoItStochasticGradient"></a>
-## Example of training using StochasticGradient ##
+__Dataset__
 
-We show an example here on a classical XOR problem.
+We will just create one minibatch of 128 examples.  In your own networks, you'd want to break down your
+rather larger dataset into multiple minibatches, of around 32-512 examples each.
 
-__Dataset__
+```lua
+local batchSize = 128
+local batchInputs = torch.Tensor(batchSize, inputs)
+local batchLabels = torch.DoubleTensor(batchSize)  -- labels are -1 or 1, so an unsigned ByteTensor cannot hold them
 
-We first need to create a dataset, following the conventions described in
-[StochasticGradient](#nn.StochasticGradientTrain).
-```lua
-dataset={};
-function dataset:size() return 100 end -- 100 examples
-for i=1,dataset:size() do 
-  local input = torch.randn(2);     -- normally distributed example in 2d
-  local output = torch.Tensor(1);
+for i=1,batchSize do
+  local input = torch.randn(2)     -- normally distributed example in 2d
+  local label = 1
   if input[1]*input[2]>0 then     -- calculate label for XOR function
-    output[1] = -1;
-  else
-    output[1] = 1
+    label = -1;
   end
-  dataset[i] = {input, output}
+  batchInputs[i]:copy(input)
+  batchLabels[i] = label
 end
 ```
 
-__Neural Network__
-
-We create a simple neural network with one hidden layer.
-```lua
-require "nn"
-mlp = nn.Sequential();  -- make a multi-layer perceptron
-inputs = 2; outputs = 1; HUs = 20; -- parameters
-mlp:add(nn.Linear(inputs, HUs))
-mlp:add(nn.Tanh())
-mlp:add(nn.Linear(HUs, outputs))
-```
-
 __Training__
 
-We choose the Mean Squared Error criterion and train the dataset.
-```lua
-criterion = nn.MSECriterion()  
-trainer = nn.StochasticGradient(mlp, criterion)
-trainer.learningRate = 0.01
-trainer:train(dataset)
-```
+`optim` provides [various training algorithms](https://github.com/torch/optim/blob/master/doc/index.md).  We
+will use [Stochastic Gradient Descent](https://github.com/torch/optim/blob/master/doc/index.md#x-sgdopfunc-x-state).  We
+need to provide the learning rate, via an optimization state table:
 
+```lua
+require 'optim'
+
+local optimState = {learningRate=0.01}
+
+-- retrieve the weights and biases from the model, as 1-dimensional flattened tensors
+-- these are views onto the underlying weights and biases, and we will give them to optim
+-- When optim updates these params, it is implicitly updating the weights and biases of our
+-- model
+local params, gradParams = model:getParameters()
+for epoch=1,50 do
+  -- local function we give to optim
+  -- it takes current weights as input, and outputs the loss
+  -- and the gradient of the loss with respect to the weights
+  -- gradParams is calculated implicitly by calling 'backward'
+  -- because gradParams is a view onto the model's weight and bias
+  -- gradient tensors
+  local function feval(params)
+    gradParams:zero()
+
+    local outputs = model:forward(batchInputs)
+    local loss = criterion:forward(outputs, batchLabels)
+    local dloss_doutput = criterion:backward(outputs, batchLabels)
+    model:backward(batchInputs, dloss_doutput)
+
+    return loss,gradParams
+  end
+  optim.sgd(feval, params, optimState)
+end
+```
 __Test the network__
 
 ```lua