argcheck benchmark
==================

We show here an example of `argcheck` in a real-life case: wrapping a call
to the numerical library [TH](https://github.com/torch/TH), used in
[torch7](https://github.com/torch/torch7).

The code runs a simple loop over this particular function call. In torch7,
it looks like this:
```lua
for i=1,N do
   torch.add(y, x, 5)
   torch.add(y, x, scale, y)
end
```

The add function of torch7 is non-trivial, because it has to handle both
adding a tensor to a tensor and adding a value to a tensor, together with
an optional scale argument. The function is also overloaded for 7 different
tensor types (double, float, int...), which makes things even more
difficult. We define the double overloading last, to study the worst-case
performance.

In the following, we compare:
  - `torch7` (run here with luajit). Torch7 uses the regular lua/C API.
  - `torch9`, an FFI interface for [`luajit`](http://luajit.org) to the TH library, built with `argcheck`.
  - `torch9lua`, running [`lua`](http://www.lua.org) with [`libffi`](https://github.com/jmckaskill/luaffi) and `argcheck`.
  - `C`, plain C calls to the `TH` library. Contrary to the other versions, _it does not include the overhead of handling multiple tensor types_.

What we call `torch9` here is only a thin FFI interface to `TH`, limited to
the purpose of this benchmark. The only thing it shares with the upcoming
`torch9` is the way we use `argcheck` with FFI.

We avoid garbage-collection side effects by not allocating objects.

## Call to argcheck

We create a function `add()`, which is overloaded to handle the various
possible argument combinations.

```lua
add = argcheck{
   overload = add,
   {name="res", type="torch.DoubleTensor", opt=true},
   {name="src", type="torch.DoubleTensor"},
   {name="value", type="number"},
   call =
      function(res, src, value)
         res = res or torch.DoubleTensor()
         C.THDoubleTensor_add(res, src, value)
         return res
      end
}

add = argcheck{
   overload = add,
   {name="res", type="torch.DoubleTensor", opt=true},
   {name="src1", type="torch.DoubleTensor"},
   {name="value", type="number", default=1},
   {name="src2", type="torch.DoubleTensor"},
   call =
      function(res, src1, value, src2)
         res = res or torch.DoubleTensor()
         C.THDoubleTensor_cadd(res, src1, value, src2)
         return res
      end
}
```

As you can see, there are many variations to handle. The generated code is
201 lines of Lua for the DoubleTensor case alone. With all 7 tensor types,
it grows to 5250 lines! This code handles both ordered argument calls (as
in `torch7`) and named argument calls. Named argument calls are just
syntactic sugar, but they are slower (they imply creating argument tables
and looping over them, which is not JIT-compiled in the current
`luajit` 2.1). A sketch of both calling styles is shown below.

The tree generated for DoubleTensor alone is the following:
![](doc/tree1.png)
When all 7 tensor types are included:
![](doc/tree7.png)
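For illustration, here is a minimal sketch of how the generated `add`
function might be called. The variables `x`, `y` and `scale` are assumed to
be the `torch.DoubleTensor`s and number from the benchmark loop above, and
the named-argument form follows the usual `argcheck` convention of passing
a single table:

```lua
-- Ordered calls (as benchmarked):
y = add(x, 5)              -- src=x, value=5; res is allocated by the call
add(y, x, 5)               -- res=y, src=x, value=5
add(y, x, scale, y)        -- res=y, src1=x, value=scale, src2=y

-- Named calls (syntactic sugar, slower: an argument table is created):
add{res=y, src=x, value=5}
add{res=y, src1=x, value=scale, src2=y}
```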
## Running it

We now compare our different setups with matrix sizes of 2, 10, 100 and
300, over 100,000,000, 10,000,000, 1,000,000 and 100,000 iterations
respectively. Running times are given in seconds. Experiments were
performed on a MacBook Pro 2.6GHz quad-core i7, using one core. The
per-call overhead, in nanoseconds, is reported in the last column; it is
computed from the first two columns (w.r.t. the C performance; see the
sketch at the end of this document).

| | 2 | 10 | 100 | 300 | overhead |
|:--------------------------------|---------:|---------:|---------:|---------:|-----------:|
| C | 3.82s | 1.16s | 8.74s | 10.34s | 0ns |
| torch7 (luajit+C API) (jit=on) | 73.45s | 8.22s | 9.47s | 10.47s | 701ns |
| torch7 (luajit+C API) (jit=off) | 72.22s | 8.21s | 9.49s | 10.59s | 694ns |
| torch9 (luajit+ffi) (jit=on) | 3.80s | 1.14s | 8.82s | 10.30s | -1ns |
| torch9 (luajit+ffi) (jit=off) | 167.62s | 17.35s | 10.75s | 10.83s | 1619ns |
| torch9 (lua+luaffi) | 256.20s | 26.93s | 11.30s | 10.66s | 2550ns |

### Comments

Not surprisingly, the old lua/C API incurs a significant overhead when
calling short-duration C code.

`luajit` does an impressive job of calling C functions through FFI. It
stays on par with C performance, even when the C operations themselves are
cheap (small matrix sizes). `argcheck` remains viable even in interpreted
mode with luajit, with only about a 2x overhead compared to the lua/C API.

The Lua interpreter (with the luaffi library) clearly has more overhead.
`argcheck` may still be very usable (here about 2.5µs per call, in a pretty
complicated setup), depending on your use case.

## Named arguments

As mentioned earlier, named argument calls are expected to be slower. Here
is a comparison against ordered argument calls, using the same benchmark.
In our case, the overhead is about 1µs per call with luajit (note that with
jit off, the performance is similar, meaning luajit relies mainly on the
interpreter in that case). Our test case is pretty complicated, your
mileage might vary...

| | 2 | 10 | 100 | 300 | overhead |
|:-----------------------------------------|---------:|---------:|---------:|---------:|-----------:|
| torch9 (luajit+ffi) (jit=on) (ordered) | 3.80s | 1.14s | 8.82s | 10.30s | -1ns |
| torch9 (luajit+ffi) (jit=off) (ordered) | 167.62s | 17.35s | 10.75s | 10.83s | 1628ns |
| torch9 (lua+luaffi) (ordered) | 256.20s | 26.93s | 11.30s | 10.66s | 2550ns |
| torch9 (luajit+ffi) (jit=on) (named) | 110.24s | 11.81s | 9.85s | 10.29s | 1064ns |
| torch9 (luajit+ffi) (jit=off) (named) | 205.99s | 21.92s | 11.08s | 10.72s | 2049ns |
| torch9 (lua+luaffi) (named) | 486.19s | 49.48s | 13.87s | 10.66s | 4828ns |
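For reference, the overhead columns above can apparently be reproduced from
the two smallest matrix sizes. The sketch below is an assumption about the
exact formula (per-iteration difference w.r.t. the C row, averaged over the
2 and 10 columns); the helper `overhead_ns` is purely illustrative:

```lua
-- Sketch (assumption): overhead = mean over the two smallest sizes of
-- (time - C time) / iterations, converted to nanoseconds.
local iters = {1e8, 1e7}     -- iterations for matrix sizes 2 and 10
local ctime = {3.82, 1.16}   -- C timings (seconds) for sizes 2 and 10

local function overhead_ns(t)
   local sum = 0
   for i = 1, 2 do
      sum = sum + (t[i] - ctime[i]) / iters[i] * 1e9
   end
   return sum / 2
end

print(overhead_ns{73.45, 8.22})    -- torch7 (jit=on): ~701
print(overhead_ns{256.20, 26.93})  -- torch9 (lua+luaffi): ~2550
```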