
github.com/marian-nmt/FBGEMM.git
path: root/bench
Date | Commit message | Author
2019-09-25 | Merge remote-tracking branch 'upstream/master' into youki/win-jit-debug-int8 | Young Jin Kim
Fix for windows build errors
2019-09-24 | remove template parameter from PackedDepthWiseConvMatrix (#128) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/128 We don't really need to have KERNEL_PROD as a compile time constant template parameter in PackedDepthWiseConvMatrix for performance. Removing the template parameter will make generalizing depth-wise convolution to non 3x3 cases easier. This diff only changes fbgemm while maintaining the old interface. The follow-up diff will change Caffe2 code using the old interface and remove the old interface. This diff also splits FbgemmI8DepthwiseAvx2.cc into FbgemmI8Depthwise3DAvx2.cc and PackDepthwiseConvMatrixAvx2.cc to avoid compilation timeouts in OSS build tests. Reviewed By: dskhudia Differential Revision: D17514003 fbshipit-source-id: 2214637ac0762a585f619f0035d3449cc4f7669e
2019-09-11 | API changes to take unquantized bias for depthwise conv | Daya Khudia
Summary: Changing the interface for on-the-fly bias quantization. Also adding code to quantize the bias on the fly. Reviewed By: jianyuh Differential Revision: D17099709 fbshipit-source-id: 5cca79189c00710e703044350260a9fcaca77bb3
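For context, on-the-fly bias quantization generally means scaling the fp32 bias into int32 at the accumulator scale, i.e. by the product of the input and weight scales. The sketch below illustrates that convention only; the helper name is made up and this is not FBGEMM's actual interface.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Illustrative only: bias_q = round(bias_fp32 / (input_scale * weight_scale)).
    std::vector<int32_t> quantizeBiasOnTheFly(
        const std::vector<float>& bias_fp32,
        float input_scale,
        float weight_scale) {
      std::vector<int32_t> bias_q(bias_fp32.size());
      const float scale = input_scale * weight_scale;
      for (size_t i = 0; i < bias_fp32.size(); ++i) {
        // The int32 bias shares the accumulator scale, so no zero point is involved.
        bias_q[i] = static_cast<int32_t>(std::lrintf(bias_fp32[i] / scale));
      }
      return bias_q;
    }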
2019-09-05 | Modifying PackAWithIm2Col to support dilated convolution and adding test cases | Protonu Basu
Summary: Modifying PackAWithIm2Col to support dilated convolution and adding test cases Reviewed By: dskhudia Differential Revision: D17184638 fbshipit-source-id: e2935b1e1577505440019f732d03be630d1be040
2019-09-04 | remove dw conv refs and use conv_ref instead (#122) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/122 To prepare for depth-wise convolutions other than 3x3. The existing reference depth-wise convolution is limited to 3x3 and we should reuse the conv_ref implementation for easier maintenance. Reviewed By: dskhudia Differential Revision: D17176591 fbshipit-source-id: 9f6f90a801a0ad95091f1d085e66861f86c3a8f1
2019-09-03 | disable clang formatting in a few array definitions (#121) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/121 By adding "// clang-format off" and "// clang-format on" we can still apply clang-format to these files. Reviewed By: jianyuh Differential Revision: D17159312 fbshipit-source-id: de523536df4c33f0efe332f9bc7b0290cdac1ba0
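For illustration, the guard the commit refers to looks like the following; the array itself is a made-up example, not one of the FBGEMM arrays.

    // clang-format off
    static const int remapping_table[8] = {
        0, 4, 1, 5,
        2, 6, 3, 7,
    };
    // clang-format on

Everything between the two markers keeps its hand-written layout, while clang-format can still be run over the rest of the file.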
2019-08-01 | Merge upstream master | Young Jin Kim
2019-07-19 | Support pointwise with unified convolution interface as well (#108) | Daya Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/108 Pointwise gets converted to direct GEMM Reviewed By: jianyuh Differential Revision: D16296356 fbshipit-source-id: 68c88df90e5de669bfcddf426c6488e2a04d55d6
2019-07-16 | Assume input weights to be in transposed format for convUnified (#104) | Daya Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/104 For consistency, we always assume that weights to PackWeightsForConv are in the format K R S C/G, which is the same as G K/G R S C/G. cc: Huihan Liu: Please note this change. Reviewed By: jianyuh Differential Revision: D16186932 fbshipit-source-id: 9ca2562f213d6b296ef8bd2eca1e5b6e98c436ec
2019-06-14 | Improve some memory allocation code | Young Jin Kim
2019-06-13 | Compile on both Windows and Linux | Young Jin Kim
2019-06-05 | Unified convolution interface | Daya Khudia
Summary: We want to combine three different convolution interfaces under one top level function. Reviewed By: protonu Differential Revision: D15399811 fbshipit-source-id: 7390616d92783506fc156f0f6017f10b5f7f8e30
2019-04-19 | make sure cpuinfo_initialize called before fbgemmHasAvx2/512Support (#94) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/94 If we don't call cpuinfo_initialize beforehand, fbgemmHasAvx2/512Support will always return false. We should be really careful about this. Reviewed By: jianyuh Differential Revision: D14994129 fbshipit-source-id: b78028f0543d05595caaa627be2feb743d0694b1
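A minimal sketch of the required call order, using cpuinfo's cpuinfo_initialize() and FBGEMM's fbgemmHasAvx2Support()/fbgemmHasAvx512Support(); the wrapper function itself is illustrative, not code from the repository.

    #include <cpuinfo.h>
    #include "fbgemm/Utils.h"

    // cpuinfo must be initialized first; otherwise the fbgemmHasAvx*Support()
    // checks report no ISA support and the optimized kernels are never selected.
    bool detectIsaSupport(bool& has_avx2, bool& has_avx512) {
      if (!cpuinfo_initialize()) {
        return false; // CPU detection failed
      }
      has_avx2 = fbgemm::fbgemmHasAvx2Support();
      has_avx512 = fbgemm::fbgemmHasAvx512Support();
      return true;
    }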
2019-04-02 | Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave) (#90) | Protonu Basu
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/90 Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave) Reviewed By: dskhudia Differential Revision: D14358148 fbshipit-source-id: 783fb4653fd696dbbd4075ad56cb8682db3011a5
2019-03-13 | optimize requantize for float out processing (#85) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/85 Optimizing performance of output processing when output is dequantized right away. Reviewed By: protonu Differential Revision: D14433141 fbshipit-source-id: f99a8d82000c43e554461acf036462a4e8f7e300
2019-02-26 | barebone int8-acc16 and int8-acc32 benchmarks | Daya S Khudia
Summary: adding barebone gemm benchmarks for comparisons
**Performance on Skylake T6 (turbo off; single thread)**
M, N, K, Type, GOPS
64, 800, 320, MKL_fp32, 91.1
64, 800, 320, FBGEMM_i8_acc32, 118.7
64, 800, 320, FBGEMM_i8_acc16, 137.0
64, 768, 512, MKL_fp32, 102.0
64, 768, 512, FBGEMM_i8_acc32, 132.2
64, 768, 512, FBGEMM_i8_acc16, 160.1
16, 256, 512, MKL_fp32, 39.8
16, 256, 512, FBGEMM_i8_acc32, 55.3
16, 256, 512, FBGEMM_i8_acc16, 63.4
128, 128, 128, MKL_fp32, 49.2
128, 128, 128, FBGEMM_i8_acc32, 54.1
128, 128, 128, FBGEMM_i8_acc16, 54.4
256, 512, 256, MKL_fp32, 97.7
256, 512, 256, FBGEMM_i8_acc32, 126.2
256, 512, 256, FBGEMM_i8_acc16, 170.1
1024, 1024, 1024, MKL_fp32, 114.3
1024, 1024, 1024, FBGEMM_i8_acc32, 150.8
1024, 1024, 1024, FBGEMM_i8_acc16, 202.9
**Breakdown**
M, N, K, Type, Packing (us), Kernel (us), Postproc (us), Total (us), GOPs
64, 800, 320, MKL_fp32, 0, 0, 0, 0, 95.7
64, 800, 320, FBGEMM_i8_acc32, 5.9, 261.9, 2.0, 275.9, 115.5
64, 800, 320, FBGEMM_i8_acc16, 17.4, 210.6, 3.3, 238.2, 132.1
64, 768, 512, MKL_fp32, 0, 0, 0, 0, 103.2
64, 768, 512, FBGEMM_i8_acc32, 9.0, 366.2, 1.9, 383.2, 128.0
64, 768, 512, FBGEMM_i8_acc16, 9.9, 298.3, 1.5, 314.8, 155.4
16, 256, 512, MKL_fp32, 0, 0, 0, 0, 40.8
16, 256, 512, FBGEMM_i8_acc32, 3.3, 60.5, 1.0, 68.3, 54.3
16, 256, 512, FBGEMM_i8_acc16, 3.2, 55.2, 0.5, 61.2, 60.6
128, 128, 128, MKL_fp32, 0, 0, 0, 0, 51.3
128, 128, 128, FBGEMM_i8_acc32, 8.1, 60.4, 0.6, 71.0, 52.4
128, 128, 128, FBGEMM_i8_acc16, 16.0, 44.8, 0.4, 64.6, 56.4
256, 512, 256, MKL_fp32, 0, 0, 0, 0, 95.0
256, 512, 256, FBGEMM_i8_acc32, 12.9, 512.1, 3.9, 542.1, 122.1
256, 512, 256, FBGEMM_i8_acc16, 12.1, 376.4, 2.3, 396.2, 165.8
1024, 1024, 1024, MKL_fp32, 0, 0, 0, 0, 114.9
1024, 1024, 1024, FBGEMM_i8_acc32, 116.9, 13999.2, 47.9, 14276.1, 150.3
1024, 1024, 1024, FBGEMM_i8_acc16, 125.7, 10490.3, 31.8, 10730.1, 200.0
TODO: add mkl-dnn as well.
Reviewed By: jianyuh Differential Revision: D14196397 fbshipit-source-id: 4cfb22374a6553a774d2f92ef37e295b7296de8d
2019-02-15 | simple spmdm optimization (#76) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/76 Create a temp buffer for accumulating results instead of directly accessing C matrix with strides. This speeds up hyper-sparse case implemented w/o transpose so we adjust the threshold between the implementation w/o transpose and w/ transpose accordingly. Reviewed By: jianyuh Differential Revision: D14097154 fbshipit-source-id: 22e37d0a9f38ccb3d15813edcd96f3d341eacf1c
2019-02-14 | clean up depthwise conv interface (#72) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/72 depthwise conv without requantization is not really useful and was generating more template parameter options Reviewed By: jianyuh Differential Revision: D14021514 fbshipit-source-id: 61f646373fcd902fdb2854a96d003a548f29f8eb
2019-02-13 | group conv optimized for 16 channels per group (#68) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/68 Continuing optimizations for group convolution. Even though op-level speedup for 16 channels per group is lower compared to 4 or 8-channel cases, we have a nice overall speedup in resnext101-32x4d because it has many Conv operators with 16 channels per group. Reviewed By: protonu Differential Revision: D13949873 fbshipit-source-id: 1dff4b1acfdabe23616e7df365daf2b7f6e8aea9
2019-02-02 | gconv optimized for 8 channels per group (#65) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/65 As title says Reviewed By: jianyuh Differential Revision: D13834287 fbshipit-source-id: ff174fdfcc27bcc227e435ff27e5c2a7024bf736
2019-01-31 | use 1 thread in benchmarks if OMP_NUM_THREADS is not explicitly set (#66) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/66 As title says Reviewed By: jianyuh Differential Revision: D13834515 fbshipit-source-id: 928778ea3207e25eb9861cce683f88b9164d5521
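A minimal sketch of that default, assuming a plain environment check; the helper name is illustrative rather than the benchmarks' actual code.

    #include <cstdlib>
    #include <omp.h>

    // Run single-threaded unless the user explicitly set OMP_NUM_THREADS,
    // so benchmark numbers are not skewed by an accidental thread count.
    void setDefaultBenchmarkThreads() {
      if (std::getenv("OMP_NUM_THREADS") == nullptr) {
        omp_set_num_threads(1);
      }
    }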
2019-01-31 | Add threading for FBGEMM FP16 | Jianyu Huang
Summary: Add threading support for FBGEMM FP16 routines. Reviewed By: dskhudia, jacobkahn Differential Revision: D13792341 fbshipit-source-id: eb31a11340ac9fd0ee9b4f570d161e7c7e6a7602
2019-01-14 | Groupwise direct convolution when number of channels per group is small | Daya S Khudia
Summary:
**Summary** This adds groupwise convolution when the number of channels per group is small. Performance on Skylake T1 (turbo off) for a reasonably sized conv layer is 42-45 GOPS without row offset calculations and post processing. Currently rowoffset and requantization are killing the overall performance.
**Some Highlights:**
1. Works for any convolution but only certain cases are optimized. Whether a particular convolution is optimized or not can be queried with the function fbgemmSupportedGConv
2. We generate only 1 kernel for different heights and widths, i.e., the same kernel works for H, W = 56 or H = 48, W = 56 or H = 128, W = 124 etc.
3. As you can see, we have to generate more code for the edges than the main part of an image. Handling edge cases is more time consuming from the kernel generation point of view.
4. Currently only the case when input_channels_per_group == 4 == output_channels_per_group is supported. I will extend it for input_channels_per_group == output_channels_per_group = 8, 16 and 32.
**Desired Extensions:**
1. Share the JIT runtime with other gemm kernels we generate.
2. Support the remaining cases
3. Standalone testcase for groupwise convolution.
4. Parallelization: We will parallelize across Minibatch and Group dimensions. This should be easier since just the right indexes need to be calculated based on thread_ids and num_threads.
**Without rowoffset and requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 42.46
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 42.75
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 43.77
**Without rowoffset and with requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 4.20
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 4.18
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 4.17
**With rowoffset and without requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.85
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.72
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.86
**With rowoffset and requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.66
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.65
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.79
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.66
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.92
So rowoffset + requantization is killing the performance. There isn't much we can do about requantization, but there are two ways we can improve the rowoffset calculation (currently it's done in a very naive way):
1. Calculate it while doing convolution. It will make the already complicated kernel more complex.
2. Just generate another kernel that calculates rowoffsets
Let me know your thoughts.
**Update:** includes rowoffset + requantization. We now generate code for rowoffset calculations as well.
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.64
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 3.27
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.62
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 2.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.63
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 3.10
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.62
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 2.75
With rowoffset and without requantization:
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 31.96
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 32.57
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 32.47
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 33.23
Reviewed By: jianyuh Differential Revision: D13556028 fbshipit-source-id: adc0afcaea5ca624b82c071d103ced3a0b1b6ef5
2019-01-14 | FP16Benchmark: Allow fp32 comparison using cblas (#56) | WilliamTambellini
Summary: FP16Benchmark: Allow comparison against fp32 using any local cblas library if MKL not found. Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/56 Reviewed By: jianyuh Differential Revision: D13645545 Pulled By: dskhudia fbshipit-source-id: ca98e84bfb85eb3b0edebad664d211c3af8db309
2019-01-12 | 3x3x3 depthwise convolution with per channel quantization (#15775) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15775 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/55 fbgemm didn't have per-channel quantization for 3x3x3 depth-wise convolution Reviewed By: jianyuh Differential Revision: D13587438 fbshipit-source-id: 91c36fae7a0e8386e3bc49808e18918b01681dd1
2019-01-04 | missing copyright headers | Daya S Khudia
Summary: Adding missing copyright headers in newly added files Reviewed By: jianyuh Differential Revision: D13582255 fbshipit-source-id: bc043ff34cd0cf8f17b99876b9c738d9a92c922a
2019-01-03 | optimize remainder loops of requantization and rowoffset (#54) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/54 The optimizations I implemented in ver 2 don't seem to help (will remove this). Looks like also using JIT for row_offset is the right long-term solution. AVX512 has new instructions that could help row_offset and requantization computation. Added benchmarks for row_offset and requantization computation to make measuring their performance easier. Reviewed By: dskhudia Differential Revision: D13561062 fbshipit-source-id: f11678395c4f9e62a64874e1a0b1f8833fda779f
2019-01-02 | use 1 omp thread unless OMP_NUM_THREADS is explicitly set (#53) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/53 As title Reviewed By: jianyuh Differential Revision: D13561724 fbshipit-source-id: 815ab310f2f4862c65ad0e3d61bf221cb8cf679b
2018-12-21 | Update the profiling format for Acc32 Benchmark (#50) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/50
Before this DIFF:
M, N, K, Packing (ms), Kernel (ms), Postprocessing (ms), Total (ms), GOPs
3136, 256, 64, MKL_fp32, 64.5
0.1, 1.3, 0.3, 1.8, 3136, 256, 64, FBGEMM_i8_acc32, 55.7
3136, 64, 64, MKL_fp32, 54.9
0.1, 0.3, 0.1, 0.5, 3136, 64, 64, FBGEMM_i8_acc32, 50.7
3136, 64, 576, MKL_fp32, 60.9
0.4, 2.7, 0.1, 3.3, 3136, 64, 576, FBGEMM_i8_acc32, 70.3
...
After this DIFF:
M, N, K, Packing (ms), Kernel (ms), Postprocessing (ms), Total (ms), GOPs
3136, 256, 64, MKL_fp32, 62.4
3136, 256, 64, 0.1, 1.3, 0.3, 1.8, FBGEMM_i8_acc32, 54.8
3136, 64, 64, MKL_fp32, 49.4
3136, 64, 64, 0.1, 0.3, 0.1, 0.5, FBGEMM_i8_acc32, 46.3
3136, 64, 576, MKL_fp32, 65.6
3136, 64, 576, 0.4, 2.7, 0.1, 3.3, FBGEMM_i8_acc32, 70.0
...
Reviewed By: dskhudia Differential Revision: D13531989 fbshipit-source-id: 267b8aea76bd11cd0aedec05b2f9b1ae75c10779
2018-12-21 | Update with clang format (#51) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/51 Use Clang formatting with "clang-format -i *.cc *.h". Reviewed By: dskhudia Differential Revision: D13532121 fbshipit-source-id: 6792d008f3295c128942f4896e8221aebbf2566e
2018-12-06 | File name change for FbgemmI8Depthwise.h and FbgemmI8Depthwise.cc (#14725) | Daya S Khudia
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14725 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/33 Renaming FbgemmI8Depthwise.h to FbgemmI8DepthwiseAvx2.h and FbgemmI8Depthwise.cc to FbgemmI8DepthwiseAvx2.cc since FbgemmI8DepthwiseAvx2.cc will be compiled with avx2 flags Reviewed By: jianyuh Differential Revision: D13313898 fbshipit-source-id: a8111eacf3d79a466ce0565bfe5f2f0b200a5c33
2018-12-04 | Fix the group issue in the benchmark and use ResNext101 conv shapes (#32) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/32
- Fix the group convolution issues in the benchmark
- Add the convolution shapes in ResNext101
Result: (The following results are tested and collected on my devserver.)
ResNext101: int16 Accumulation: batch_size:1; remove 1x1 convolutions: P60395456
ResNext101: int32 Accumulation: batch_size:1; remove 1x1 convolutions: P60395457
ResNext101: int16 Accumulation: batch_size:1 P60394563
ResNext101: int32 Accumulation: batch_size:1 P60394565
ResNext101: int16 Accumulation: batch_size:50 P60394548
ResNext101: int32 Accumulation: batch_size:50 P60394552
Xray OCR: int16 Accumulation: P60394527
Xray OCR: int32 Accumulation: P60394534
Reviewed By: jspark1105 Differential Revision: D13286215 fbshipit-source-id: e78b691999006c25e92a746783b8bd1b87703a38
2018-11-30 | protect omp.h include by a pragma | Daya S Khudia
Summary: Fixes build when there is no openmp Reviewed By: jianyuh Differential Revision: D13271068 fbshipit-source-id: d5c80818c168465b9f76a28943b2c2d81667bb99
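The guard is presumably along these lines (a sketch of the idea, not the exact diff):

    // _OPENMP is only defined when the compiler is invoked with OpenMP enabled
    // (e.g. -fopenmp), so the include is skipped on builds without OpenMP.
    #ifdef _OPENMP
    #include <omp.h>
    #endif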
2018-11-27 | per-group and per-channel quantization (#14340) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14340 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/25 Per-group and per-channel quantization in fbgemm. This diff also cleans up explicit template instantiation using macro expansion. This diff also changes the randFill interface, which previously made it easy to mistakenly generate integer random numbers for floating point vectors. Using this in DNNLOWP operators will be done in a separate diff. Reviewed By: dskhudia Differential Revision: D13176386 fbshipit-source-id: e46c53e31e21520bded71b8ed86e8b19e010e2dd
2018-11-26 | remove unnecessary zero_point argument from constructors (#14323) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14323 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/24 As title says. Reviewed By: dskhudia Differential Revision: D13167073 fbshipit-source-id: 6d6c526fd6e29a14e97f71a0881f28ada8703107
2018-11-20 | Parallelize the benchmark | Jianyu Huang
Summary: Add omp parallel to parallelize the benchmark Reviewed By: jspark1105 Differential Revision: D13106978 fbshipit-source-id: cdc8ce3db86d38745487ac0cafa5bd656f182604
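A minimal sketch of wrapping the measured region in an OpenMP parallel region, as the commit describes; the function and the placeholder kernel call are illustrative, not the benchmark's actual code.

    #include <omp.h>

    // Illustrative only: each OpenMP thread runs its share of the benchmark,
    // identified by its thread id and the total thread count.
    void runBenchmarkParallel(int iterations) {
      #pragma omp parallel
      {
        const int tid = omp_get_thread_num();
        const int num_threads = omp_get_num_threads();
        for (int i = 0; i < iterations; ++i) {
          // ... invoke the kernel under test with (tid, num_threads) ...
          (void)tid;
          (void)num_threads;
        }
      }
    }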
2018-11-19 | clang-format (#11) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/11 clang format of fbgemm Reviewed By: dskhudia Differential Revision: D13115202 fbshipit-source-id: 6dab29cb8b5f4fabcc165019663351567a2a2952
2018-11-16 | grouped (batched) gemm (#7) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/7 This diff allows groups > 1. Will have a separate diff for im2col + gemm fusion and conv with group > 1. Reviewed By: jianyuh Differential Revision: D13039210 fbshipit-source-id: f7b3b0dbdb67fc6bc865de88292f034b252d029d
2018-11-08 | Sync with internal copy: Asymmetric padding; fbgemm2 -> fbgemm | Jianyu Huang
2018-11-06 | generalized conv_param_t and download third party libraries in build dir | dskhudia
2018-11-05 | CMake minimum version required update | dskhudia
2018-11-03 | Manually syncing with internal copy | dskhudia
2018-10-31 | Initial commit | Daya S Khudia