author | Benoit Jacob <benoitjacob@google.com> | 2020-07-21 06:08:19 +0300 |
---|---|---|
committer | Copybara-Service <copybara-worker@google.com> | 2020-07-21 06:08:41 +0300 |
commit | ec99c704a19d38ea502e81c0a9f5b82026471cef (patch) | |
tree | 68815e0c2cd78cdad56114121816e2d9332f8254 /ruy/kernel_avx512.cc | |
parent | bebf022784e9b22277b84373c9877aebff8411a7 (diff) |
Optimized packing code path for row-major float inputs.
This is implemented in plain C++ with memcpy and memset because:
- The 1x8 kernel block layout lends itself well to such an implementation when the source is row-major.
- This covers ARM64, ARM32, x86 AVX2 and AVX512 all at once. These kernels' layouts differ only in the number of columns, and implementing this in C++ made it possible to express that as an `int KernelCols` template parameter.
- Surprisingly, despite the humble implementation, this already seems to make row-major sources faster than column-major on x86, ARM32 and ARM64. I don't have an explanation for that!
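The idea in the points above can be sketched as follows. This is a minimal illustration, not the code from this commit: the function name `PackRowMajorFloat`, its signature, and the zero-padding of trailing columns are assumptions made for the example. It assumes the packed layout stores, for each block of `KernelCols` source columns, one group of `KernelCols` floats per source row. With a row-major source those floats are already contiguous, so each group is a single `memcpy`, and `memset` zero-fills any columns past the edge of the matrix.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper for illustration only; not ruy's actual API.
// Packs a row-major float matrix (rows x cols, with `stride` floats between
// consecutive rows) into blocks of KernelCols columns. Within each block,
// every source row contributes KernelCols consecutive floats.
template <int KernelCols>
std::vector<float> PackRowMajorFloat(const float* src, int rows, int cols,
                                     int stride) {
  const int num_blocks = (cols + KernelCols - 1) / KernelCols;
  std::vector<float> packed(static_cast<std::size_t>(num_blocks) * rows *
                            KernelCols);
  float* dst = packed.data();
  for (int block = 0; block < num_blocks; ++block) {
    const int col0 = block * KernelCols;
    // Number of real source columns in this block; fewer than KernelCols
    // only for the last block when cols is not a multiple of KernelCols.
    const int valid = std::min(KernelCols, cols - col0);
    for (int r = 0; r < rows; ++r) {
      // The source row segment is contiguous, so one memcpy suffices.
      std::memcpy(dst, src + static_cast<std::size_t>(r) * stride + col0,
                  valid * sizeof(float));
      if (valid < KernelCols) {
        // Assumed padding policy: zero-fill the missing columns.
        // All-zero bytes represent 0.0f for IEEE 754 floats.
        std::memset(dst + valid, 0, (KernelCols - valid) * sizeof(float));
      }
      dst += KernelCols;
    }
  }
  return packed;
}
```

Under these assumptions, supporting a kernel with a different number of columns only requires instantiating the template with a different `KernelCols` value; no architecture-specific code is involved.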
PiperOrigin-RevId: 322279263
Diffstat (limited to 'ruy/kernel_avx512.cc')
0 files changed, 0 insertions, 0 deletions