author    | Soumith Chintala <soumith@gmail.com> | 2016-05-17 18:17:54 +0300
committer | Soumith Chintala <soumith@gmail.com> | 2016-05-17 18:17:54 +0300
commit    | 72c7a8ae44d1718989ebf5716f06263a1b33870a (patch)
tree      | 47ac2560bf1735d479aec4fde9685d4c66cf7a05
parent    | 7c6cdf00006a8f4ec8abf9d7345f6ed15f7640c8 (diff)
parent    | 1f9b938eee38bfe30db2b98e0f00c7bace1cac77 (diff)
Merge pull request #272 from hughperkins/add-some-doc
Add some doc
-rw-r--r-- | README.md | 92
1 file changed, 92 insertions, 0 deletions
@@ -3,3 +3,95 @@

This package provides a CUDA implementation for many of the modules in the base nn package: [nn](https://github.com/torch/nn/blob/master/README.md)
* [Modules](doc/cunnmodules.md#nn.cunnmodules.dok): There are also additional GPU-related modules not found in the nn package.

## To use

Simply convert your network model to CUDA by calling `:cuda()`:
```lua
local model = nn.Sequential()
model:add(nn.Linear(2,2))
model:add(nn.LogSoftMax())

model:cuda() -- convert model to CUDA
```

... and similarly for your tensors:
```lua
local input = torch.Tensor(32,2):uniform()
input = input:cuda()
local output = model:forward(input)
```
... or create them directly as `CudaTensor`s:
```lua
local input = torch.CudaTensor(32,2):uniform()
local output = model:forward(input)
```

## To run unit-tests

```bash
luajit -l cunn -e 'cunn.test()'
```

## GPU Training Concepts

__Performance__

* Data should be transferred between main memory and the GPU in batches; otherwise the transfer time will be dominated by fixed latency and per-transfer overheads, rather than by bandwidth.
* Therefore, train and predict using mini-batches.
* Allocating GPU memory causes a sync-point, which will noticeably affect performance.
  * Therefore, try to allocate any `CudaTensor`s once, at the start of the program, and then simply copy data back and forth between main memory and the existing `CudaTensor`s (see the sketch at the end of this section).
* Similarly, try to avoid any operations that implicitly allocate new tensors. For example, if you write:
```lua
require 'cutorch'

local a = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  local b = a + 1
end
```
... this will allocate one thousand new `CudaTensor`s, one for each evaluation of `a + 1`.

Use instead this form:
```lua
require 'cutorch'

local a = torch.CudaTensor(1000):uniform()
local b = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  b:add(a, 1)
end
```
In this form, `b` is allocated only once, before the loop. The `b:add(a, 1)` operation then performs the addition inside the GPU kernel and stores the result directly into the existing `b` `CudaTensor`. This will generally run noticeably faster, is a lot less likely to eat up arbitrary amounts of memory, and is less likely to need frequent calls to `collectgarbage(); collectgarbage()`.

__Benchmarking__

* GPU operations are asynchronous: they will typically still be running after the instruction that launched them has returned.
* E.g., if you do:
```lua
require 'cutorch'
local a = torch.CudaTensor(1000,1000):uniform()
a:add(1)
```
... the GPU kernel that adds 1 is merely scheduled for launch by `a:add(1)`. It won't have completed, and may not even have reached the GPU, by the time the `a:add(1)` call returns.
* Therefore, for wall-clock timings, you should call `cutorch.synchronize()` before each time-check point:
```lua
require 'cutorch'
require 'sys'

local a = torch.CudaTensor(1000,1000):uniform()
cutorch.synchronize()
sys.tic()
a:add(1)
cutorch.synchronize()
print(sys.toc())
```
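To tie the performance notes above together, here is a minimal sketch of a prediction loop that transfers data in mini-batches and reuses a single pre-allocated `CudaTensor` as the GPU-side input buffer. The model, the CPU-side dataset tensor `data`, the batch size, and the buffer name `gpuInput` are all illustrative placeholders, not part of the original examples:

```lua
require 'cunn'

-- hypothetical model and CPU-side dataset, for illustration only
local model = nn.Sequential()
model:add(nn.Linear(2,2))
model:add(nn.LogSoftMax())
model:cuda()

local data = torch.Tensor(1024, 2):uniform()  -- stays in main memory
local batchSize = 32

-- allocate the GPU-side input buffer once, before the loop
local gpuInput = torch.CudaTensor(batchSize, 2)

for i = 1, data:size(1) - batchSize + 1, batchSize do
  -- copy one mini-batch into the existing CudaTensor; no new GPU memory
  -- is allocated inside the loop, so no extra sync-point is introduced
  gpuInput:copy(data:narrow(1, i, batchSize))
  local output = model:forward(gpuInput)
end
```

Each iteration moves `batchSize` samples in one transfer and writes them into the same `CudaTensor`, which follows both the mini-batch and the allocate-once advice above.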