argcheck benchmark
==================

We show here an example of `argcheck` in a real-life case: wrapping a call
to the numerical library [TH](https://github.com/torch/TH), used in
[torch7](https://github.com/torch/torch7).

The code runs a simple loop over this particular function call. In torch7,
it looks like this:
```lua
for i=1,N do
   torch.add(y, x, 5)
   torch.add(y, x, scale, y)
end
```

The add function of torch7 is non-trivial, because it has to handle both
adding a tensor to a tensor and adding a value to a tensor, together with
an optional scale argument. The function is also overloaded for 7 different
tensor types (double, float, int...), which makes things even more
difficult. We define the double overloading last, to study the worst-case
performance.

In the following, we compare:
  - `torch7` (run here with luajit). Torch7 uses the regular lua/C API.
  - `torch9`, an FFI interface for [`luajit`](http://luajit.org) to the TH library, built with `argcheck`.
  - `torch9lua`, running [`lua`](http://www.lua.org) with [`libffi`](https://github.com/jmckaskill/luaffi) and `argcheck`.
  - `C`, plain C calls to the `TH` library. Contrary to the other versions, _it does not include the overhead of handling multiple tensor types_.

What we call `torch9` here is only a thin FFI interface to `TH`, limited to
the purpose of this benchmark. The only thing it shares with the upcoming
`torch9` is the way we use `argcheck` with FFI.

We avoid garbage-collection side effects by not allocating objects.

## Call to argcheck

We create a function `add()`, which is overloaded to handle the various
possible argument combinations.

```lua
add = argcheck{
   overload = add,
   {name="res", type="torch.DoubleTensor", opt=true},
   {name="src", type="torch.DoubleTensor"},
   {name="value", type="number"},
   call =
      function(res, src, value)
         res = res or torch.DoubleTensor()
         C.THDoubleTensor_add(res, src, value)
         return res
      end
}

add = argcheck{
   overload = add,
   {name="res", type="torch.DoubleTensor", opt=true},
   {name="src1", type="torch.DoubleTensor"},
   {name="value", type="number", default=1},
   {name="src2", type="torch.DoubleTensor"},
   call =
      function(res, src1, value, src2)
         res = res or torch.DoubleTensor()
         C.THDoubleTensor_cadd(res, src1, value, src2)
         return res
      end
}
```

As you can see, there are many variations to handle. The generated code is
201 lines of Lua for the DoubleTensor case alone. With all 7 tensor types,
it grows to 5250 lines! This code handles both ordered argument calls (as
in `torch7`) and named argument calls. Named argument calls are just
syntactic sugar, but they are slower (they imply creating argument tables
and looping over them, which is not JIT-compiled in the current
`luajit` 2.1). A sketch of both calling styles is shown below.

The tree generated for DoubleTensor alone is the following:
![](doc/tree1.png)
When all 7 tensor types are included:
![](doc/tree7.png)
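For illustration, here is a minimal sketch of how the generated `add`
function might be called. The variables `x`, `y` and `scale` are assumed to
be the `torch.DoubleTensor`s and number from the benchmark loop above, and
the named-argument form follows the usual `argcheck` convention of passing
a single table:

```lua
-- Ordered calls (as benchmarked):
y = add(x, 5)              -- src=x, value=5; res is allocated by the call
add(y, x, 5)               -- res=y, src=x, value=5
add(y, x, scale, y)        -- res=y, src1=x, value=scale, src2=y

-- Named calls (syntactic sugar, slower: an argument table is created):
add{res=y, src=x, value=5}
add{res=y, src1=x, value=scale, src2=y}
```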
## Running it

We now compare our different setups with matrix sizes of 2, 10, 100 and
300, over 100,000,000, 10,000,000, 1,000,000 and 100,000 iterations
respectively. Running times are given in seconds. Experiments were
performed on a MacBook Pro 2.6GHz quad-core i7, using one core. The
per-call overhead, in nanoseconds, is reported in the last column; it is
computed from the first two columns (w.r.t. the C performance; see the
sketch at the end of this document).

| | 2 | 10 | 100 | 300 | overhead |
|:--------------------------------|---------:|---------:|---------:|---------:|-----------:|
| C | 3.82s | 1.16s | 8.74s | 10.34s | 0ns |
| torch7 (luajit+C API) (jit=on) | 73.45s | 8.22s | 9.47s | 10.47s | 701ns |
| torch7 (luajit+C API) (jit=off) | 72.22s | 8.21s | 9.49s | 10.59s | 694ns |
| torch9 (luajit+ffi) (jit=on) | 3.80s | 1.14s | 8.82s | 10.30s | -1ns |
| torch9 (luajit+ffi) (jit=off) | 167.62s | 17.35s | 10.75s | 10.83s | 1619ns |
| torch9 (lua+luaffi) | 256.20s | 26.93s | 11.30s | 10.66s | 2550ns |

### Comments

Not surprisingly, the old lua/C API incurs a significant overhead when
calling short-duration C code.

`luajit` does an impressive job of calling C functions through FFI. It
stays on par with C performance, even when the C operations themselves are
cheap (small matrix sizes). `argcheck` remains viable even in interpreted
mode with luajit, with only about a 2x overhead compared to the lua/C API.

The Lua interpreter (with the luaffi library) clearly has more overhead.
`argcheck` may still be very usable (here about 2.5µs per call, in a pretty
complicated setup), depending on your use case.

## Named arguments

As mentioned earlier, named argument calls are expected to be slower. Here
is a comparison against ordered argument calls, using the same benchmark.
In our case, the overhead is about 1µs per call with luajit (note that with
jit off, the performance is similar, meaning luajit relies mainly on the
interpreter in that case). Our test case is pretty complicated, your
mileage might vary...

| | 2 | 10 | 100 | 300 | overhead |
|:-----------------------------------------|---------:|---------:|---------:|---------:|-----------:|
| torch9 (luajit+ffi) (jit=on) (ordered) | 3.80s | 1.14s | 8.82s | 10.30s | -1ns |
| torch9 (luajit+ffi) (jit=off) (ordered) | 167.62s | 17.35s | 10.75s | 10.83s | 1628ns |
| torch9 (lua+luaffi) (ordered) | 256.20s | 26.93s | 11.30s | 10.66s | 2550ns |
| torch9 (luajit+ffi) (jit=on) (named) | 110.24s | 11.81s | 9.85s | 10.29s | 1064ns |
| torch9 (luajit+ffi) (jit=off) (named) | 205.99s | 21.92s | 11.08s | 10.72s | 2049ns |
| torch9 (lua+luaffi) (named) | 486.19s | 49.48s | 13.87s | 10.66s | 4828ns |
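For reference, the overhead columns above can apparently be reproduced from
the two smallest matrix sizes. The sketch below is an assumption about the
exact formula (per-iteration difference w.r.t. the C row, averaged over the
2 and 10 columns); the helper `overhead_ns` is purely illustrative:

```lua
-- Sketch (assumption): overhead = mean over the two smallest sizes of
-- (time - C time) / iterations, converted to nanoseconds.
local iters = {1e8, 1e7}     -- iterations for matrix sizes 2 and 10
local ctime = {3.82, 1.16}   -- C timings (seconds) for sizes 2 and 10

local function overhead_ns(t)
   local sum = 0
   for i = 1, 2 do
      sum = sum + (t[i] - ctime[i]) / iters[i] * 1e9
   end
   return sum / 2
end

print(overhead_ns{73.45, 8.22})    -- torch7 (jit=on): ~701
print(overhead_ns{256.20, 26.93})  -- torch9 (lua+luaffi): ~2550
```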