Welcome to mirror list, hosted at ThFree Co, Russian Federation.

github.com/marian-nmt/FBGEMM.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorDaya S Khudia <dskhudia@fb.com>2019-01-14 23:53:37 +0300
committerFacebook Github Bot <facebook-github-bot@users.noreply.github.com>2019-01-14 23:56:47 +0300
commitae9f02719c40fb601ef26dc506b823bed02bfca6 (patch)
tree1eadf0d9b04b14a52e525c4fa7e17ee5543010d9 /include/fbgemm/Fbgemm.h
parent9a59fbd05f1df91db548547753d8b0e7a79c031e (diff)
Groupwise direct convolution when number of channels per group is small
Summary: **Summary** This adds groupwise convolution when the number of channels per group is small. Performance on Skylake T1 (turbo off) for a reasonable sized conv layer is 42-45 GOPS without row offset calculations and post processing. Currently rowoffset and requantization are killing the overall performance. **Some Highlights:** 1. Works for any convolution but only certain cases are optimized. Whether a particular convolution is optimized or not can be queried with the function fbgemmOptimizedGConv 2. We generate only 1 kernel for different heights and widths, i.e., the same kernel works for H, W = 56 or H = 48, W = 56 or H = 128, W = 124 etc. 3. As you can see, we have to generate more code for the edges than for the main part of an image. Handling edge cases is more time consuming from the kernel generation point of view. 4. Currently only the case when input_channels_per_group == 4 == output_channels_per_group is supported. I will extend it to input_channels_per_group == output_channels_per_group = 8, 16 and 32. **Desired Extensions:** 1. Share the JIT runtime with other gemm kernels we generate. 2. Support the remaining cases. 3. Standalone testcase for groupwise convolution. 4. Parallelization: We will parallelize across the Minibatch and Group dimensions. This should be easier since just the right indexes need to be calculated based on thread_ids and num_threads. 
**Without rowoffset and requantization** MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS 1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 42.46 1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 42.75 2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 43.77 **Without rowoffset and with requantization** MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS 1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 4.20 1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 4.18 2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 4.17 **With rowoffset and without requantization** MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS 1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.85 1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.72 2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.86 **With rowoffset and requantization** MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS 1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.66 1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.92 1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.65 1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.79 2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.66 2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.92 So rowoffset + requantization is killing the performance. There isn't much we can do about requantization but there are two ways we can improve rowoffset calculations (currently it's done in a very naive way). 1. Calculate it while doing convolution. It will make the already complicated kernel more complex. 2. Just generate another kernel that calculates rowoffsets. Let me know your thoughts. 
**Update:** includes rowoffset + requantization. We now generate code for rowoffset calculations as well. MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS 1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.64 1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 3.27 1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.62 1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 2.92 1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.63 1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 3.10 2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.62 2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 2.75 With rowoffset and without requantization: 1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 31.96 1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 32.57 1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 32.47 2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 33.23 Reviewed By: jianyuh Differential Revision: D13556028 fbshipit-source-id: adc0afcaea5ca624b82c071d103ced3a0b1b6ef5
Diffstat (limited to 'include/fbgemm/Fbgemm.h')
-rw-r--r--include/fbgemm/Fbgemm.h85
1 files changed, 85 insertions, 0 deletions
diff --git a/include/fbgemm/Fbgemm.h b/include/fbgemm/Fbgemm.h
index bca5347..f49da57 100644
--- a/include/fbgemm/Fbgemm.h
+++ b/include/fbgemm/Fbgemm.h
@@ -453,6 +453,56 @@ class FBGEMM_API PackBMatrix final
};
/**
+ * @brief Matrix packed for direct group convolution.
+ * The source matrix is already quantized. Default accumulation
+ * type is int32.
+ */
+template <typename T, typename accT = std::int32_t, int SPATIAL_DIM = 2>
+class FBGEMM_API PackWeightMatrixForGConv {
+ public:
+ using This = PackWeightMatrixForGConv<T, accT, SPATIAL_DIM>;
+ using inpType = T;
+ using accType = accT;
+
+ PackWeightMatrixForGConv() = delete; // no default constructor
+
+ /**
+ * @params pmat if nullptr, a buffer is allocated and owned by this class.
+ *
+ */
+ PackWeightMatrixForGConv(
+ matrix_op_t trans,
+ const conv_param_t<SPATIAL_DIM>& conv_param,
+ const inpType* sdata,
+ inpType* pdata = nullptr);
+
+ /**
+ * @brief Packs a block of source matrix into pmat buffer.
+ */
+ void pack();
+
+ /**
+ * @brief Return packed data
+ */
+ inpType* getBuf() {
+ return pdata_;
+ }
+
+ ~PackWeightMatrixForGConv() {
+ if (bufAllocatedHere_) {
+ free(pdata_);
+ }
+ }
+
+ private:
+ matrix_op_t trans_;
+ const conv_param_t<SPATIAL_DIM> conv_param_;
+ const T* sdata_;
+ T* pdata_;
+ bool bufAllocatedHere_;
+};
+
+/**
* @brief Matrix packed for the first input matrix in GEMM (usually activation),
* and row offsets used for requantization is computed during packing.
* Im2col is fused with packing here. The source matrix is already
@@ -1106,6 +1156,35 @@ FBGEMM_API void fbgemmPacked(
int num_threads);
/**
+ * @brief Perform small-channels-per-group groupwise convolution
+ *
+ */
+
+template <
+ typename packed_W,
+ typename outType,
+ typename processOutputType,
+ int SPATIAL_DIM = 2>
+FBGEMM_API void fbgemmGroupwiseConv(
+ const conv_param_t<SPATIAL_DIM>& conv_param,
+ const std::uint8_t* activations,
+ std::int32_t a_zero_point,
+ std::int32_t* rowOffsetBuf,
+ packed_W& packed_weights,
+ outType* out,
+ std::int32_t* outBuffer,
+ const processOutputType& outProcess,
+ int thread_id,
+ int num_threads);
+/**
+ * @return Size of row offset buffer in number of elements needed for
+ * fbgemmGroupwiseConv
+ */
+template <int SPATIAL_DIM = 2>
+FBGEMM_API int rowOffsetBufferSizeGConv(
+ const conv_param_t<SPATIAL_DIM>& conv_param);
+
+/**
* @brief Perform depthwise separable convolution
*/
template <
@@ -1122,6 +1201,12 @@ void convDepthwiseSeparable(
const processOutputType& output);
/**
+ * @brief Is this groupwise convolution supported?
+ */
+template <int SPATIAL_DIM>
+FBGEMM_API bool fbgemmOptimizedGConv(const conv_param_t<SPATIAL_DIM>& conv_p);
+
+/**
* @brief Allocate __size bytes of uninitialized storage whose alignment is
* specified by __align.
*/