github.com/marian-nmt/FBGEMM.git
author     Daya S Khudia <dskhudia@fb.com>  2019-01-14 23:53:37 +0300
committer  Facebook Github Bot <facebook-github-bot@users.noreply.github.com>  2019-01-14 23:56:47 +0300
commit     ae9f02719c40fb601ef26dc506b823bed02bfca6 (patch)
tree       1eadf0d9b04b14a52e525c4fa7e17ee5543010d9 /CMakeLists.txt
parent     9a59fbd05f1df91db548547753d8b0e7a79c031e (diff)
Groupwise direct convolution when the number of channels per group is small
Summary: **Summary** This adds a groupwise direct convolution for the case when the number of channels per group is small. Performance on Skylake T1 (turbo off) for a reasonably sized conv layer is 42-45 GOPS without the row-offset calculation and post-processing. Currently, rowoffset and requantization are killing the overall performance.

**Some Highlights:**

1. Works for any convolution, but only certain cases are optimized. Whether a particular convolution is optimized or not can be queried with the function fbgemmSupportedGConv (a hypothetical usage sketch appears after the tables below).
2. We generate only one kernel for different heights and widths, i.e., the same kernel works for H = W = 56, for H = 48, W = 56, or for H = 128, W = 124, etc.
3. We have to generate more code for the edges of an image than for its main interior part, so handling the edge cases is the more time-consuming part of kernel generation.
4. Currently only the case input_channels_per_group == output_channels_per_group == 4 is supported. I will extend it to input_channels_per_group == output_channels_per_group == 8, 16, and 32.

**Desired Extensions:**

1. Share the JIT runtime with the other GEMM kernels we generate.
2. Support the remaining cases.
3. Standalone test case for groupwise convolution.
4. Parallelization: we will parallelize across the minibatch and group dimensions. This should be straightforward, since only the right index ranges need to be calculated from thread_ids and num_threads (a sketch of the index math appears at the end of this message).

**Without rowoffset and requantization**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 42.46
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 42.75
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 43.77

**Without rowoffset and with requantization**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 4.20
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 4.18
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 4.17

**With rowoffset and without requantization**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.85
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.72
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.86

**With rowoffset and requantization**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.66
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.65
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.79
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.66
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.92

So rowoffset + requantization is killing the performance. There isn't much we can do about requantization, but there are two ways we can improve the rowoffset calculation (currently it is done in a very naive way):

1. Calculate it while doing the convolution. This will make the already complicated kernel even more complex.
2. Just generate another kernel that calculates the row offsets.

Let me know your thoughts.
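To make the bottleneck concrete, here is a minimal sketch of what the naive rowoffset computation has to do; the function name, NHWC layout, and the zero-padding treatment are illustrative assumptions, not the FBGEMM code. Per output pixel and group, it sums the uint8 activations over the whole KH x KW x (IC/G) input patch, so it touches as much input data as the convolution itself:

```cpp
#include <cstdint>

// Hypothetical reference loop (not the FBGEMM implementation): per-group
// row offsets for a groupwise conv over an NHWC uint8 input.
// rowOffsetBuf must hold MB * OH * OW * G entries.
void naiveRowOffsets(const std::uint8_t* A, std::int32_t* rowOffsetBuf,
                     int MB, int IH, int IW, int IC, int G, int KH, int KW,
                     int stride_h, int stride_w, int pad_h, int pad_w) {
  const int IC_per_G = IC / G;
  const int OH = (IH + 2 * pad_h - KH) / stride_h + 1;
  const int OW = (IW + 2 * pad_w - KW) / stride_w + 1;
  for (int n = 0; n < MB; ++n)
    for (int oh = 0; oh < OH; ++oh)
      for (int ow = 0; ow < OW; ++ow)
        for (int g = 0; g < G; ++g) {
          std::int32_t sum = 0;
          for (int kh = 0; kh < KH; ++kh)
            for (int kw = 0; kw < KW; ++kw) {
              const int ih = oh * stride_h - pad_h + kh;
              const int iw = ow * stride_w - pad_w + kw;
              if (ih < 0 || ih >= IH || iw < 0 || iw >= IW)
                continue; // padded positions treated as 0 here; a real
                          // implementation would add A_zero_point instead
              const std::uint8_t* patch =
                  A + (((n * IH + ih) * IW + iw) * IC + g * IC_per_G);
              for (int c = 0; c < IC_per_G; ++c)
                sum += patch[c];
            }
          rowOffsetBuf[((n * OH + oh) * OW + ow) * G + g] = sum;
        }
}
```

Option 1 would fuse this reduction into the convolution kernel to reuse activations that are already in registers, at the cost of extra kernel complexity; option 2 keeps the main kernel simple by generating a second small kernel, which is the route the update below takes.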
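Regarding highlight 1, a hypothetical usage sketch of the query. Only the function name fbgemmSupportedGConv comes from this message; the header, the exact signature, and the conv_param_t constructor shape are assumptions (the arguments mirror the benchmark columns), so check the tree for the real declarations:

```cpp
#include "fbgemm/Fbgemm.h" // assumed header for the declarations below

// Assumed signature: takes the conv descriptor and returns whether the
// optimized groupwise path covers it (channels per group == 4 for now).
bool useDirectGConv() {
  fbgemm::conv_param_t<> conv_p(
      /*MB=*/1, /*IC=*/128, /*OC=*/128, /*{IH, IW}=*/{56, 56}, /*G=*/32,
      /*{KH, KW}=*/{3, 3}, /*stride=*/{1, 1}, /*pad=*/{1, 1, 1, 1});
  return fbgemm::fbgemmSupportedGConv(conv_p); // fall back to the generic
                                               // im2col path when false
}
```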
**Update:** includes rowoffset + requantization. We now generate code for the rowoffset calculation as well.

**With rowoffset and requantization (updated)**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.64
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 3.27
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.62
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 2.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.63
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 3.10
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.62
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 2.75

**With rowoffset and without requantization (updated)**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 31.96
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 32.57
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 32.47
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 33.23

Reviewed By: jianyuh

Differential Revision: D13556028

fbshipit-source-id: adc0afcaea5ca624b82c071d103ced3a0b1b6ef5
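On desired extension 4 (parallelization): a minimal sketch of the index math, assuming a plain even split of the flattened minibatch x group task space. Names are illustrative, and this is not the eventual FBGEMM implementation:

```cpp
#include <algorithm>

// Flatten the (minibatch, group) pairs into one index space and hand each
// thread a contiguous slice [begin, end), computed from thread_id and
// num_threads exactly as the commit message suggests.
void partitionMBG(int MB, int G, int thread_id, int num_threads,
                  int& begin, int& end) {
  const int total = MB * G; // one task per (image, group) pair
  const int per_thread = (total + num_threads - 1) / num_threads; // ceil
  begin = std::min(thread_id * per_thread, total);
  end = std::min(begin + per_thread, total);
}

// Inside the worker, each task index decodes back to an image and a group:
//   for (int task = begin; task < end; ++task) {
//     int n = task / G; // minibatch image
//     int g = task % G; // group
//     ... run the groupwise kernel for (n, g) ...
//   }
```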
Diffstat (limited to 'CMakeLists.txt')
-rw-r--r--  CMakeLists.txt | 2 ++
1 file changed, 2 insertions(+), 0 deletions(-)
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 5d889cd..dfb5623 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -34,12 +34,14 @@ set(FBGEMM_GENERIC_SRCS src/ExecuteKernel.cc
src/GenerateKernelU8S8S32ACC16Avx512.cc
src/GenerateKernelU8S8S32ACC32.cc
src/GenerateKernelU8S8S32ACC32Avx512.cc
+ src/GroupwiseConvAcc32Avx2.cc
src/PackAMatrix.cc
src/PackAWithIm2Col.cc
src/PackBMatrix.cc
src/PackMatrix.cc
src/PackAWithQuantRowOffset.cc
src/PackAWithRowOffset.cc
+ src/PackWeightMatrixForGConv.cc
src/QuantUtils.cc
src/RefImplementations.cc
src/Utils.cc)