Age | Commit message | Author
---|---|---
2019-09-25 | Merge remote-tracking branch 'upstream/master' into youki/win-jit-debug-int8 | Young Jin Kim
Fix for windows build errors
2019-09-24 | remove template parameter from PackedDepthWiseConvMatrix (#128) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/128 We don't really need to have KERNEL_PROD as a compile-time constant template parameter in PackedDepthWiseConvMatrix for performance. Removing the template parameter will make generalizing depth-wise convolution to non-3x3 cases easier. This diff only changes fbgemm while maintaining the old interface. The follow-up diff will change the Caffe2 code using the old interface and remove the old interface. This diff also splits FbgemmI8DepthwiseAvx2.cc into FbgemmI8Depthwise3DAvx2.cc and PackDepthwiseConvMatrixAvx2.cc to avoid compilation timeouts in OSS build tests.
Reviewed By: dskhudia
Differential Revision: D17514003
fbshipit-source-id: 2214637ac0762a585f619f0035d3449cc4f7669e
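The change described above moves the kernel-size product from a template parameter to a runtime argument. A rough before/after sketch of the shape of that interface change (illustrative only, not the exact FBGEMM declarations):

```cpp
#include <cstdint>

// Before: the kernel product (e.g. 3 * 3 = 9) was baked in at compile time,
// so every new kernel shape needed another template instantiation.
template <int KERNEL_PROD>
class PackedDepthWiseConvMatrixTpl {
 public:
  PackedDepthWiseConvMatrixTpl(int K, const std::int8_t* smat);
};

// After: the kernel product is an ordinary constructor argument, so the same
// class can pack weights for depth-wise kernels other than 3x3.
class PackedDepthWiseConvMatrix {
 public:
  PackedDepthWiseConvMatrix(int K, int kernel_prod, const std::int8_t* smat);
};
```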
2019-09-11 | API changes to take unquantized bias for depthwise conv | Daya Khudia
Summary: Changing the interface for on-the-fly bias quantization, and adding code to quantize the bias on the fly.
Reviewed By: jianyuh
Differential Revision: D17099709
fbshipit-source-id: 5cca79189c00710e703044350260a9fcaca77bb3
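For context, quantizing a bias "on the fly" usually means converting the fp32 bias into the int32 accumulator scale, i.e. the product of the activation scale and the weight scale. A minimal sketch of that convention (names are illustrative, not FBGEMM's API):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantize an fp32 bias into int32 in the accumulator scale
// act_scale * weight_scale[channel] (per-output-channel weight scales assumed).
std::vector<std::int32_t> quantizeBias(const std::vector<float>& bias_fp32,
                                       float act_scale,
                                       const std::vector<float>& weight_scales) {
  std::vector<std::int32_t> bias_q(bias_fp32.size());
  for (std::size_t i = 0; i < bias_fp32.size(); ++i) {
    bias_q[i] = static_cast<std::int32_t>(
        std::lround(bias_fp32[i] / (act_scale * weight_scales[i])));
  }
  return bias_q;
}
```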
2019-09-05 | Modifying PackAWithIm2Col to support dilated convolution and adding test cases | Protonu Basu
Summary: Modifying PackAWithIm2Col to support dilated convolution and adding test cases.
Reviewed By: dskhudia
Differential Revision: D17184638
fbshipit-source-id: e2935b1e1577505440019f732d03be630d1be040
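For reference, dilation only changes the im2col indexing through the effective kernel extent; the standard output-size arithmetic is:

```cpp
// Standard output-dimension formula for a dilated convolution: the effective
// kernel extent grows to dilation * (kernel - 1) + 1. Not FBGEMM-specific code.
inline int convOutputDim(int in, int kernel, int pad, int stride, int dilation) {
  return (in + 2 * pad - (dilation * (kernel - 1) + 1)) / stride + 1;
}
```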
2019-09-04 | remove dw conv refs and use conv_ref instead (#122) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/122 To prepare for depth-wise convolutions other than 3x3. The existing reference depth-wise convolution is limited to 3x3, and we should reuse the conv_ref implementation for easier maintenance.
Reviewed By: dskhudia
Differential Revision: D17176591
fbshipit-source-id: 9f6f90a801a0ad95091f1d085e66861f86c3a8f1
2019-09-03 | disable clang formatting in a few array definitions (#121) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/121 By adding "// clang-format off" and "// clang-format on" we can still apply clang-format to these files.
Reviewed By: jianyuh
Differential Revision: D17159312
fbshipit-source-id: de523536df4c33f0efe332f9bc7b0290cdac1ba0
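The markers work like this: clang-format skips everything between the two comments, so hand-aligned tables keep their layout while the rest of the file stays formatted (the array below is just a hypothetical example):

```cpp
// clang-format off
static const int kExampleTable[4][4] = {
  { 1, 0, 0, 0 },
  { 1, 1, 0, 0 },
  { 1, 1, 1, 0 },
  { 1, 1, 1, 1 },
};
// clang-format on
```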
2019-08-01 | Merge upstream master | Young Jin Kim
2019-07-19 | Support pointwise with unified convolution interface as well (#108) | Daya Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/108 Pointwise gets converted to direct GEMM.
Reviewed By: jianyuh
Differential Revision: D16296356
fbshipit-source-id: 68c88df90e5de669bfcddf426c6488e2a04d55d6
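The mapping from a pointwise convolution to a direct GEMM is the usual one: a 1x1 kernel with unit stride and no padding makes the im2col matrix identical to the activations, so the convolution is a single GEMM. A hedged illustration of the dimensions involved (not FBGEMM code):

```cpp
// For a 1x1 convolution with stride 1 and no padding, the output spatial size
// equals the input spatial size, and the GEMM dimensions are:
//   M = batch * H * W   (rows of the activation matrix)
//   N = out_channels    (columns of the weight matrix)
//   K = in_channels     (reduction dimension)
struct GemmShape { int M, N, K; };

inline GemmShape pointwiseAsGemm(int batch, int H, int W, int IC, int OC) {
  return {batch * H * W, OC, IC};
}
```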
2019-07-16 | Assume input weights to be in transposed format for convUnified (#104) | Daya Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/104 For consistency, we always assume that weights passed to PackWeightsForConv are in the format K R S C/G, which is the same as G K/G R S C/G. cc Huihan Liu: please note this change.
Reviewed By: jianyuh
Differential Revision: D16186932
fbshipit-source-id: 9ca2562f213d6b296ef8bd2eca1e5b6e98c436ec
2019-06-14 | Improve some memory allocation code | Young Jin Kim
2019-06-13 | Compile on both Windows and Linux | Young Jin Kim
2019-06-05 | Unified convolution interface | Daya Khudia
Summary: We want to combine three different convolution interfaces under one top-level function.
Reviewed By: protonu
Differential Revision: D15399811
fbshipit-source-id: 7390616d92783506fc156f0f6017f10b5f7f8e30
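Conceptually, a unified interface like this is a thin dispatcher that inspects the convolution parameters and routes to a specialized path (depth-wise, groupwise direct, or im2col + GEMM). The sketch below is only an illustration of that idea; the selection rules and names are assumptions, not FBGEMM's actual logic:

```cpp
enum class ConvImpl { kDepthwise, kGroupwiseDirect, kIm2ColGemm };

// Pick a specialized convolution path from a few shape parameters.
ConvImpl selectConvImpl(int groups, int in_channels, int out_channels) {
  const int ic_per_g = in_channels / groups;
  const int oc_per_g = out_channels / groups;
  if (groups == in_channels && groups == out_channels) {
    return ConvImpl::kDepthwise;        // one channel per group
  }
  if (groups > 1 && ic_per_g <= 4 && oc_per_g <= 4) {
    return ConvImpl::kGroupwiseDirect;  // few channels per group
  }
  return ConvImpl::kIm2ColGemm;         // general case
}
```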
2019-04-19 | make sure cpuinfo_initialize called before fbgemmHasAvx2/512Support (#94) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/94 If we don't call cpuinfo_initialize beforehand, fbgemmHasAvx2/512Support will always return false. We should be really careful about this.
Reviewed By: jianyuh
Differential Revision: D14994129
fbshipit-source-id: b78028f0543d05595caaa627be2feb743d0694b1
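A minimal sketch of the calling pattern the commit describes, assuming the cpuinfo library's cpuinfo_initialize() and FBGEMM's fbgemmHasAvx2Support() (the header path below is an assumption):

```cpp
#include <cpuinfo.h>
#include "fbgemm/Utils.h"  // assumed location of fbgemmHasAvx2Support()

bool hasAvx2() {
  // Without this call, the feature queries below always report false.
  if (!cpuinfo_initialize()) {
    return false;
  }
  return fbgemm::fbgemmHasAvx2Support();
}
```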
2019-04-02 | Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave) (#90) | Protonu Basu
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/90 Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave).
Reviewed By: dskhudia
Differential Revision: D14358148
fbshipit-source-id: 783fb4653fd696dbbd4075ad56cb8682db3011a5
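These parameters are the usual GEMM blocking knobs. An illustrative grouping of what each one controls (the struct is a sketch, not FBGEMM's actual declaration):

```cpp
// Cache- and register-blocking parameters for an int8 GEMM kernel.
struct GemmTuningParams {
  int MCB;            // rows of A processed per cache block
  int NCB;            // columns of B processed per cache block
  int KCB;            // reduction-dimension block size
  int MR;             // rows of the register tile in the micro-kernel
  int NR;             // columns of the register tile in the micro-kernel
  int ROW_INTERLEAVE; // rows of B interleaved so int8/int16 FMAs stay contiguous
};
```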
2019-03-13 | optimize requantize for float out processing (#85) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/85 Optimizing performance of output processing when the output is dequantized right away.
Reviewed By: protonu
Differential Revision: D14433141
fbshipit-source-id: f99a8d82000c43e554461acf036462a4e8f7e300
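For reference, "float out" processing dequantizes the 32-bit accumulator directly to fp32. A sketch of the standard affine-quantization arithmetic involved (not FBGEMM's output-pipeline code):

```cpp
#include <cstdint>

// C_fp32 = A_scale * B_scale *
//          (acc - A_zero * col_sum_B - B_zero * row_sum_A + K * A_zero * B_zero)
float dequantizeAccumulator(std::int32_t acc, float a_scale, float b_scale,
                            std::int32_t a_zero, std::int32_t b_zero,
                            std::int32_t row_sum_a, std::int32_t col_sum_b, int K) {
  const std::int32_t corrected =
      acc - a_zero * col_sum_b - b_zero * row_sum_a + K * a_zero * b_zero;
  return a_scale * b_scale * static_cast<float>(corrected);
}
```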
2019-02-26 | barebone int8-acc16 and int8-acc32 benchmarks | Daya S Khudia
Summary: adding barebone gemm benchmarks for comparisons

**Performance on Skylake T6 (turbo off; single thread)**

M, N, K, Type, GOPS
64, 800, 320, MKL_fp32, 91.1
64, 800, 320, FBGEMM_i8_acc32, 118.7
64, 800, 320, FBGEMM_i8_acc16, 137.0
64, 768, 512, MKL_fp32, 102.0
64, 768, 512, FBGEMM_i8_acc32, 132.2
64, 768, 512, FBGEMM_i8_acc16, 160.1
16, 256, 512, MKL_fp32, 39.8
16, 256, 512, FBGEMM_i8_acc32, 55.3
16, 256, 512, FBGEMM_i8_acc16, 63.4
128, 128, 128, MKL_fp32, 49.2
128, 128, 128, FBGEMM_i8_acc32, 54.1
128, 128, 128, FBGEMM_i8_acc16, 54.4
256, 512, 256, MKL_fp32, 97.7
256, 512, 256, FBGEMM_i8_acc32, 126.2
256, 512, 256, FBGEMM_i8_acc16, 170.1
1024, 1024, 1024, MKL_fp32, 114.3
1024, 1024, 1024, FBGEMM_i8_acc32, 150.8
1024, 1024, 1024, FBGEMM_i8_acc16, 202.9

**Breakdown**

M, N, K, Type, Packing (us), Kernel (us), Postproc (us), Total (us), GOPs
64, 800, 320, MKL_fp32, 0, 0, 0, 0, 95.7
64, 800, 320, FBGEMM_i8_acc32, 5.9, 261.9, 2.0, 275.9, 115.5
64, 800, 320, FBGEMM_i8_acc16, 17.4, 210.6, 3.3, 238.2, 132.1
64, 768, 512, MKL_fp32, 0, 0, 0, 0, 103.2
64, 768, 512, FBGEMM_i8_acc32, 9.0, 366.2, 1.9, 383.2, 128.0
64, 768, 512, FBGEMM_i8_acc16, 9.9, 298.3, 1.5, 314.8, 155.4
16, 256, 512, MKL_fp32, 0, 0, 0, 0, 40.8
16, 256, 512, FBGEMM_i8_acc32, 3.3, 60.5, 1.0, 68.3, 54.3
16, 256, 512, FBGEMM_i8_acc16, 3.2, 55.2, 0.5, 61.2, 60.6
128, 128, 128, MKL_fp32, 0, 0, 0, 0, 51.3
128, 128, 128, FBGEMM_i8_acc32, 8.1, 60.4, 0.6, 71.0, 52.4
128, 128, 128, FBGEMM_i8_acc16, 16.0, 44.8, 0.4, 64.6, 56.4
256, 512, 256, MKL_fp32, 0, 0, 0, 0, 95.0
256, 512, 256, FBGEMM_i8_acc32, 12.9, 512.1, 3.9, 542.1, 122.1
256, 512, 256, FBGEMM_i8_acc16, 12.1, 376.4, 2.3, 396.2, 165.8
1024, 1024, 1024, MKL_fp32, 0, 0, 0, 0, 114.9
1024, 1024, 1024, FBGEMM_i8_acc32, 116.9, 13999.2, 47.9, 14276.1, 150.3
1024, 1024, 1024, FBGEMM_i8_acc16, 125.7, 10490.3, 31.8, 10730.1, 200.0

TODO: add mkl-dnn as well.
Reviewed By: jianyuh
Differential Revision: D14196397
fbshipit-source-id: 4cfb22374a6553a774d2f92ef37e295b7296de8d
2019-02-15 | simple spmdm optimization (#76) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/76 Create a temporary buffer for accumulating results instead of directly accessing the C matrix with strides. This speeds up the hyper-sparse case implemented without transpose, so we adjust the threshold between the implementations without and with transpose accordingly.
Reviewed By: jianyuh
Differential Revision: D14097154
fbshipit-source-id: 22e37d0a9f38ccb3d15813edcd96f3d341eacf1c
2019-02-14 | clean up depthwise conv interface (#72) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/72 Depthwise conv without requantization is not really useful and was generating more template parameter options.
Reviewed By: jianyuh
Differential Revision: D14021514
fbshipit-source-id: 61f646373fcd902fdb2854a96d003a548f29f8eb
2019-02-13 | group conv optimized for 16 channels per group (#68) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/68 Continuing optimizations for group convolution. Even though the op-level speedup for 16 channels per group is lower compared to the 4- or 8-channel cases, we have a nice overall speedup in resnext101-32x4d because it has many Conv operators with 16 channels per group.
Reviewed By: protonu
Differential Revision: D13949873
fbshipit-source-id: 1dff4b1acfdabe23616e7df365daf2b7f6e8aea9
2019-02-02 | gconv optimized for 8 channels per group (#65) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/65 As title says.
Reviewed By: jianyuh
Differential Revision: D13834287
fbshipit-source-id: ff174fdfcc27bcc227e435ff27e5c2a7024bf736
2019-01-31 | use 1 thread in benchmarks if OMP_NUM_THREADS is not explicitly set (#66) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/66 As title says.
Reviewed By: jianyuh
Differential Revision: D13834515
fbshipit-source-id: 928778ea3207e25eb9861cce683f88b9164d5521
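The behavior described here (and in the 2019-01-02 commit below for the library-level default) can be sketched as a small helper, assuming standard OpenMP and C library calls:

```cpp
#include <cstdlib>
#include <omp.h>

// Default to a single OpenMP thread unless the user explicitly set
// OMP_NUM_THREADS in the environment.
void setDefaultBenchmarkThreads() {
  if (std::getenv("OMP_NUM_THREADS") == nullptr) {
    omp_set_num_threads(1);
  }
}
```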
2019-01-31 | Add threading for FBGEMM FP16 | Jianyu Huang
Summary: Add threading support for FBGEMM FP16 routines.
Reviewed By: dskhudia, jacobkahn
Differential Revision: D13792341
fbshipit-source-id: eb31a11340ac9fd0ee9b4f570d161e7c7e6a7602
2019-01-14 | Groupwise direct convolution when number of channels per group is small | Daya S Khudia
Summary: **Summary** This adds groupwise convolution when the number of channels per group is small. Performance on Skylake T1 (turbo off) for a reasonably sized conv layer is 42-45 GOPS without row offset calculations and post processing. Currently rowoffset and requantization are killing the overall performance.

**Some Highlights:**
1. Works for any convolution, but only certain cases are optimized. Whether a particular convolution is optimized or not can be queried with the function fbgemmSupportedGConv.
2. We generate only 1 kernel for different heights and widths, i.e., the same kernel works for H, W = 56 or H = 48, W = 56 or H = 128, W = 124, etc.
3. As you can see, we have to generate more code for the edges than for the main part of an image. Handling edge cases is more time consuming from the kernel generation point of view.
4. Currently only the case when input_channels_per_group == 4 == output_channels_per_group is supported. I will extend it for input_channels_per_group == output_channels_per_group = 8, 16 and 32.

**Desired Extensions:**
1. Share the JIT runtime with other gemm kernels we generate.
2. Support the remaining cases.
3. Standalone testcase for groupwise convolution.
4. Parallelization: we will parallelize across the minibatch and group dimensions. This should be easier since just the right indexes need to be calculated based on thread_ids and num_threads.

**Without rowoffset and requantization**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 42.46
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 42.75
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 43.77

**Without rowoffset and with requantization**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 4.20
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 4.18
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 4.17

**With rowoffset and without requantization**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.85
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.72
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.86

**With rowoffset and requantization**

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.66
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.65
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.79
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.66
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.92

So rowoffset + requantization is killing the performance. There isn't much we can do about requantization, but there are two ways we can improve the rowoffset calculations (currently they are done in a very naive way):
1. Calculate them while doing convolution. This will make the already complicated kernel more complex.
2. Just generate another kernel that calculates rowoffsets.
Let me know your thoughts.

**Update:** includes rowoffset + requantization. We now generate code for rowoffset calculations as well.

MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.64
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 3.27
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.62
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 2.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.63
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 3.10
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.62
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 2.75

With rowoffset and without requantization:
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 31.96
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 32.57
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 32.47
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 33.23

Reviewed By: jianyuh
Differential Revision: D13556028
fbshipit-source-id: adc0afcaea5ca624b82c071d103ced3a0b1b6ef5
2019-01-14 | FP16Benchmark: Allow fp32 comparison using cblas (#56) | WilliamTambellini
Summary: FP16Benchmark: Allow comparison against fp32 using any local cblas library if MKL not found.
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/56
Reviewed By: jianyuh
Differential Revision: D13645545
Pulled By: dskhudia
fbshipit-source-id: ca98e84bfb85eb3b0edebad664d211c3af8db309
2019-01-12 | 3x3x3 depthwise convolution with per channel quantization (#15775) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15775 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/55 fbgemm didn't have per-channel quantization for 3x3x3 depth-wise convolution.
Reviewed By: jianyuh
Differential Revision: D13587438
fbshipit-source-id: 91c36fae7a0e8386e3bc49808e18918b01681dd1
2019-01-04 | missing copyright headers | Daya S Khudia
Summary: Adding missing copyright headers in newly added files.
Reviewed By: jianyuh
Differential Revision: D13582255
fbshipit-source-id: bc043ff34cd0cf8f17b99876b9c738d9a92c922a
2019-01-03 | optimize remainder loops of requantization and rowoffset (#54) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/54 The optimizations I implemented in ver 2 don't seem to help (will remove this). It looks like also using JIT for row_offset is the right long-term solution. AVX512 has new instructions that could help row_offset and requantization computation. Added benchmarks for row_offset and requantization computation to make measuring their performance easier.
Reviewed By: dskhudia
Differential Revision: D13561062
fbshipit-source-id: f11678395c4f9e62a64874e1a0b1f8833fda779f
2019-01-02 | use 1 omp thread unless OMP_NUM_THREADS is explicitly set (#53) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/53 As title.
Reviewed By: jianyuh
Differential Revision: D13561724
fbshipit-source-id: 815ab310f2f4862c65ad0e3d61bf221cb8cf679b
2018-12-21 | Update the profiling format for Acc32 Benchmark (#50) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/50

Before this DIFF:

M, N, K, Packing (ms), Kernel (ms), Postprocessing (ms), Total (ms), GOPs
3136, 256, 64, MKL_fp32, 64.5 0.1, 1.3, 0.3, 1.8, 3136, 256, 64, FBGEMM_i8_acc32, 55.7
3136, 64, 64, MKL_fp32, 54.9 0.1, 0.3, 0.1, 0.5, 3136, 64, 64, FBGEMM_i8_acc32, 50.7
3136, 64, 576, MKL_fp32, 60.9 0.4, 2.7, 0.1, 3.3, 3136, 64, 576, FBGEMM_i8_acc32, 70.3
...

After this DIFF:

M, N, K, Packing (ms), Kernel (ms), Postprocessing (ms), Total (ms), GOPs
3136, 256, 64, MKL_fp32, 62.4
3136, 256, 64, 0.1, 1.3, 0.3, 1.8, FBGEMM_i8_acc32, 54.8
3136, 64, 64, MKL_fp32, 49.4
3136, 64, 64, 0.1, 0.3, 0.1, 0.5, FBGEMM_i8_acc32, 46.3
3136, 64, 576, MKL_fp32, 65.6
3136, 64, 576, 0.4, 2.7, 0.1, 3.3, FBGEMM_i8_acc32, 70.0
...

Reviewed By: dskhudia
Differential Revision: D13531989
fbshipit-source-id: 267b8aea76bd11cd0aedec05b2f9b1ae75c10779
2018-12-21 | Update with clang format (#51) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/51 Use Clang formatting with "clang-format -i *.cc *.h".
Reviewed By: dskhudia
Differential Revision: D13532121
fbshipit-source-id: 6792d008f3295c128942f4896e8221aebbf2566e
2018-12-06 | File name change for FbgemmI8Depthwise.h and FbgemmI8Depthwise.cc (#14725) | Daya S Khudia
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14725 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/33 Renaming FbgemmI8Depthwise.h to FbgemmI8DepthwiseAvx2.h and FbgemmI8Depthwise.cc to FbgemmI8DepthwiseAvx2.cc since FbgemmI8DepthwiseAvx2.cc will be compiled with avx2 flags.
Reviewed By: jianyuh
Differential Revision: D13313898
fbshipit-source-id: a8111eacf3d79a466ce0565bfe5f2f0b200a5c33
2018-12-04 | Fix the group issue in the benchmark and use ResNext101 conv shapes (#32) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/32
- Fix the group convolution issues in the benchmark
- Add the convolution shapes in ResNext101

Result (the following results are tested and collected on my devserver):
ResNext101: int16 Accumulation: batch_size:1; remove 1x1 convolutions: P60395456
ResNext101: int32 Accumulation: batch_size:1; remove 1x1 convolutions: P60395457
ResNext101: int16 Accumulation: batch_size:1: P60394563
ResNext101: int32 Accumulation: batch_size:1: P60394565
ResNext101: int16 Accumulation: batch_size:50: P60394548
ResNext101: int32 Accumulation: batch_size:50: P60394552
Xray OCR: int16 Accumulation: P60394527
Xray OCR: int32 Accumulation: P60394534
Reviewed By: jspark1105
Differential Revision: D13286215
fbshipit-source-id: e78b691999006c25e92a746783b8bd1b87703a38
2018-11-30 | protect omp.h include by a pragma | Daya S Khudia
Summary: Fixes build when there is no OpenMP.
Reviewed By: jianyuh
Differential Revision: D13271068
fbshipit-source-id: d5c80818c168465b9f76a28943b2c2d81667bb99
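The usual way to make an omp.h include safe when OpenMP is disabled is a preprocessor guard; a minimal sketch (the exact guard used in this commit may differ):

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Fall back to a single "thread" when OpenMP is not available.
inline int numThreads() {
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}
```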
2018-11-27 | per-group and per-channel quantization (#14340) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14340 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/25 Per-group and per-channel quantization in fbgemm. This diff also cleans up explicit template instantiation using macro expansion. It also changes the randFill interface, which made it easy to mistakenly generate integer random numbers for floating-point vectors. Using this in DNNLOWP operators will be done in a separate diff.
Reviewed By: dskhudia
Differential Revision: D13176386
fbshipit-source-id: e46c53e31e21520bded71b8ed86e8b19e010e2dd
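For context, per-channel (or per-group) quantization gives each output channel its own weight scale, so the requantization multiplier varies by channel. A conceptual sketch of that arithmetic, assuming the usual affine scheme (not FBGEMM's output-pipeline API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

std::uint8_t requantizePerChannel(std::int32_t acc, int channel, float act_scale,
                                  const float* weight_scales, float out_scale,
                                  std::int32_t out_zero_point) {
  // Each output channel has its own multiplier act_scale * weight_scale / out_scale.
  const float multiplier = act_scale * weight_scales[channel] / out_scale;
  const std::int32_t q =
      static_cast<std::int32_t>(std::lround(acc * multiplier)) + out_zero_point;
  return static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
}
```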
2018-11-26 | remove unnecessary zero_point argument from constructors (#14323) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14323 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/24 As title says.
Reviewed By: dskhudia
Differential Revision: D13167073
fbshipit-source-id: 6d6c526fd6e29a14e97f71a0881f28ada8703107
2018-11-20 | Parallelize the benchmark | Jianyu Huang
Summary: Add omp parallel to parallelize the benchmark.
Reviewed By: jspark1105
Differential Revision: D13106978
fbshipit-source-id: cdc8ce3db86d38745487ac0cafa5bd656f182604
2018-11-19 | clang-format (#11) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/11 clang-format of fbgemm.
Reviewed By: dskhudia
Differential Revision: D13115202
fbshipit-source-id: 6dab29cb8b5f4fabcc165019663351567a2a2952
2018-11-16 | grouped (batched) gemm (#7) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/7 This diff allows groups > 1. Will have a separate diff for im2col + gemm fusion and conv with groups > 1.
Reviewed By: jianyuh
Differential Revision: D13039210
fbshipit-source-id: f7b3b0dbdb67fc6bc865de88292f034b252d029d
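Conceptually, a grouped GEMM with G groups is just G independent GEMMs, one per group. A plain reference loop illustrating that view (not FBGEMM's packed implementation):

```cpp
#include <cstdint>

// A[g] is M x K, B[g] is K x N, C[g] is M x N, all row-major, for g = 0..G-1.
void groupedGemmRef(int G, int M, int N, int K,
                    const std::uint8_t* const* A, const std::int8_t* const* B,
                    std::int32_t* const* C) {
  for (int g = 0; g < G; ++g) {
    for (int i = 0; i < M; ++i) {
      for (int j = 0; j < N; ++j) {
        std::int32_t acc = 0;
        for (int k = 0; k < K; ++k) {
          acc += static_cast<std::int32_t>(A[g][i * K + k]) *
                 static_cast<std::int32_t>(B[g][k * N + j]);
        }
        C[g][i * N + j] = acc;
      }
    }
  }
}
```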
2018-11-08 | Sync with internal copy: Asymmetric padding; fbgemm2 -> fbgemm | Jianyu Huang
2018-11-06 | generalized conv_param_t and download third party libraries in build dir | dskhudia
2018-11-05 | CMake minimum version required update | dskhudia
2018-11-03 | Manually syncing with internal copy | dskhudia
2018-10-31 | Initial commit | Daya S Khudia