argcheck benchmark
==================

We show here an example of `argcheck` in a real-life case: wrapping a call to the numerical library [TH](https://github.com/torch/TH), used in [torch7](https://github.com/torch/torch7).

The code does a simple loop over this particular function call. In torch7, this looks like:

```lua
for i=1,N do
   torch.add(y, x, 5)
   torch.add(y, x, scale, y)
end
```

The add function of torch7 is non-trivial, because it has to handle cases where one wants to add a tensor to a tensor, or a value to a tensor. There is also an optional scale argument. The function is also overloaded for 7 different tensor types (double, float, int...), which complicates things further. We define the double overload last, to study worst-case performance.

In the following, we compare:

- `torch7` (here run with luajit). Torch7 uses the regular lua/C API.
- `torch9`, an FFI interface to the TH library for [`luajit`](http://luajit.org), built with `argcheck`.
- `torch9lua`, running [`lua`](http://www.lua.org) with [`libffi`](https://github.com/jmckaskill/luaffi) and `argcheck`.
- `C`, plain C calls to the `TH` library. Unlike the other versions, _it does not include the overhead of handling multiple tensor types_.

What we call `torch9` here is only a thin FFI interface to `TH`, limited to the purpose of this benchmark. The only thing it has in common with the upcoming `torch9` is the way we use `argcheck` with FFI.

We avoid garbage-collection side effects by not allocating objects inside the loop.

## Call to argcheck

We create a function `add()`, which is overloaded to handle the various possible argument combinations.

```lua
-- add a scalar to a tensor: res = src + value
add = argcheck{
   overload = add, -- nil here, so this first definition overloads nothing
   {name="res", type="torch.DoubleTensor", opt=true},
   {name="src", type="torch.DoubleTensor"},
   {name="value", type="number"},
   call =
      function(res, src, value)
         res = res or torch.DoubleTensor()
         C.THDoubleTensor_add(res, src, value)
         return res
      end
}

-- add a (possibly scaled) tensor to a tensor: res = src1 + value*src2
add = argcheck{
   overload = add, -- chain onto the previous definition
   {name="res", type="torch.DoubleTensor", opt=true},
   {name="src1", type="torch.DoubleTensor"},
   {name="value", type="number", default=1},
   {name="src2", type="torch.DoubleTensor"},
   call =
      function(res, src1, value, src2)
         res = res or torch.DoubleTensor()
         C.THDoubleTensor_cadd(res, src1, value, src2)
         return res
      end
}
```

As you can see, there are many variations to handle. The generated code is 201 lines of `lua` code for the DoubleTensor case alone. With all 7 tensor types, it is 5250 lines of code! This code handles both ordered argument calls (as in `torch7`) and named argument calls. Named calls are just syntactic sugar, but they are slower (they imply creating argument tables and looping over them, which is not JIT-compiled in the current `luajit` 2.1).

The tree generated for the DoubleTensor case alone is the following:

![](doc/tree1.png)

When it includes all 7 tensor types:

![](doc/tree7.png)

## Running it

We now compare our different setups with matrix sizes 2, 10, 100 and 300, over 100,000,000, 10,000,000, 1,000,000 and 100,000 iterations respectively. Running time is given in seconds. Experiments were performed on a MacBook Pro 2.6GHz quad-core i7, using one core. Per-call overhead (w.r.t. the C performance) is reported in nanoseconds, computed from the first two columns.
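For example, the `overhead` column in the table below can be reproduced from the first two columns with a computation along these lines (a minimal sketch; the helper name is ours, and the iteration counts are those stated above):

```lua
-- Estimate the per-call overhead (in ns) w.r.t. the C timings,
-- averaged over the size-2 and size-10 columns.
local function overhead_ns(t2, t10, c2, c10)
   local o2  = (t2  - c2)  / 1e8 * 1e9 -- ns per call, size 2 (1e8 iterations)
   local o10 = (t10 - c10) / 1e7 * 1e9 -- ns per call, size 10 (1e7 iterations)
   return (o2 + o10) / 2
end

print(overhead_ns(73.45, 8.22, 3.82, 1.16)) -- torch7 (jit=on): ~701
```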
|                                 |       2 |      10 |    100 |    300 | overhead |
|:--------------------------------|--------:|--------:|-------:|-------:|---------:|
| C                               |   3.82s |   1.16s |  8.74s | 10.34s |      0ns |
| torch7 (luajit+C API) (jit=on)  |  73.45s |   8.22s |  9.47s | 10.47s |    701ns |
| torch7 (luajit+C API) (jit=off) |  72.22s |   8.21s |  9.49s | 10.59s |    694ns |
| torch9 (luajit+ffi) (jit=on)    |   3.80s |   1.14s |  8.82s | 10.30s |     -1ns |
| torch9 (luajit+ffi) (jit=off)   | 167.62s |  17.35s | 10.75s | 10.83s |   1619ns |
| torch9 (lua+luaffi)             | 256.20s |  26.93s | 11.30s | 10.66s |   2550ns |

### Comments

Not surprisingly, the old lua/C API incurs significant overhead when calling short-duration C code.

`luajit` does an impressive job of calling C functions through FFI. It stays on par with C performance, even when the C operations themselves are cheap (small matrix sizes).

`argcheck` is viable even in interpreted mode with luajit, with only about a 2x overhead compared to the lua/C API.

The plain Lua interpreter (with the luaffi library) has clearly more overhead. Still, `argcheck` might remain very usable (here about 2.5µs per call, in a pretty complicated setup), depending on your use case.

## Named arguments

As mentioned earlier, named argument calls are expected to be slower. Here is a comparison against ordered argument calls, using the same benchmark. In our case, the overhead is about 1µs per call with luajit (note that with jit off the performance is comparable, suggesting luajit falls back mainly to its interpreter on this code path). Our test case is pretty complicated; your mileage may vary...

|                                          |       2 |      10 |    100 |    300 | overhead |
|:-----------------------------------------|--------:|--------:|-------:|-------:|---------:|
| torch9 (luajit+ffi) (jit=on) (ordered)   |   3.80s |   1.14s |  8.82s | 10.30s |     -1ns |
| torch9 (luajit+ffi) (jit=off) (ordered)  | 167.62s |  17.35s | 10.75s | 10.83s |   1628ns |
| torch9 (lua+luaffi) (ordered)            | 256.20s |  26.93s | 11.30s | 10.66s |   2550ns |
| torch9 (luajit+ffi) (jit=on) (named)     | 110.24s |  11.81s |  9.85s | 10.29s |   1064ns |
| torch9 (luajit+ffi) (jit=off) (named)    | 205.99s |  21.92s | 11.08s | 10.72s |   2049ns |
| torch9 (lua+luaffi) (named)              | 486.19s |  49.48s | 13.87s | 10.66s |   4828ns |
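For reference, here is what the two calling conventions look like for the `add()` function defined above (a sketch, assuming `y` and `x` are pre-allocated DoubleTensors as in the benchmark loop):

```lua
-- ordered call: arguments are matched positionally (the fast path)
add(y, x, 5)

-- equivalent named call: an argument table is built and inspected on
-- every call, which the current luajit 2.1 does not JIT-compile
add{res=y, src=x, value=5}
```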