Age | Commit message | Author | |
---|---|---|---|
2021-03-22 | Remove unused comment | Young Jin Kim | |
2021-03-22 | gcc 9.3+ build fix (#10) | Young Jin Kim | |
* Turn -march=native off when using gcc 9.3+ (-march=x86-64) | |||
2020-09-03 | Restore CMake 3.5.1 compatibility by reimplementing list(TRANSFORM...PREPEND) with a foreach() (#8) | Aaron Burke | |
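The `list(TRANSFORM ... PREPEND)` subcommand only exists in CMake 3.12+, so restoring 3.5.1 compatibility means emulating it with a loop. A minimal sketch of the equivalent `foreach()` (variable names here are illustrative, not necessarily the ones used in the commit):

```cmake
# list(TRANSFORM headers PREPEND "${prefix}/") requires CMake >= 3.12.
# Equivalent construction that works on CMake 3.5.1:
set(prefixed_headers "")
foreach(header IN LISTS headers)
  list(APPEND prefixed_headers "${prefix}/${header}")
endforeach()
```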
2020-08-21 | Fix dependent library interface include directories to use build/install generator expressions (#7) | Aaron Burke | |
I verified that it works well with marian-dev master and stand-alone fbgemm. | |||
2020-08-12 | Fix public header property in cpuinfo and clog to support submodule installs (#6) | Aaron Burke | |
Looks good. Thanks! | |||
2019-09-25 | Merge remote-tracking branch 'upstream/master' into youki/win-jit-debug-int8 | Young Jin Kim | |
Fix for windows build errors | |||
2019-09-24 | remove template parameter from PackedDepthWiseConvMatrix (#128) | Jongsoo Park | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/128 We don't really need to have KERNEL_PROD as a compile time constant template parameter in PackedDepthWiseConvMatrix for performance. Removing the template parameter will make generalizing depth-wise convolution to non 3x3 cases easier. This diff only changes fbgemm while maintaining the old interface. The follow-up diff will change Caffe2 code using the old interface and remove the old interface. This diff also splits FbgemmI8DepthwiseAvx2.cc into FbgemmI8Depthwise3DAvx2.cc and PackDepthwiseConvMatrixAvx2.cc to avoid compilation timeouts in OSS build tests. Reviewed By: dskhudia Differential Revision: D17514003 fbshipit-source-id: 2214637ac0762a585f619f0035d3449cc4f7669e | |||
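The interface change described above can be illustrated with a stripped-down sketch (a hypothetical class shape, not fbgemm's actual declaration; only the move of the parameter matters):

```cpp
#include <cstdint>

// Before: the kernel product (e.g. 3*3 = 9) was baked in at compile time:
//
//   template <int KERNEL_PROD>
//   class PackedDepthWiseConvMatrix { ... };
//
// After: the kernel product is a runtime constructor argument, so the same
// class can pack weights for non-3x3 depthwise convolutions.
class PackedDepthWiseConvMatrix {
 public:
  PackedDepthWiseConvMatrix(int K, int kernel_prod, const std::int8_t* smat)
      : K_(K), kernel_prod_(kernel_prod), smat_(smat) {}

  int kernelProd() const { return kernel_prod_; }

 private:
  int K_;                    // number of channels
  int kernel_prod_;          // e.g. 9 for 3x3, now chosen at runtime
  const std::int8_t* smat_;  // source weights; packing omitted in this sketch
};
```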
2019-08-15 | Merge branch 'upstream/master' into youki/prepack_constrcopyPublic | Young Jin Kim | |
2019-08-09 | Integrate VNNI into FBGEMM master branch (#114) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/114 Adding the VNNI support in FBGEMM. Previously, we have the issue on CMake version. Currently PyTorch and FBGEMM OSS test has the CMake 3.5 test, while ASMJIT requires CMake to be 3.8+. This caused the build failure for some platforms. Now the CMake version issue is resolved by a PR to ASMJIT to downgrade the CMake requirement: https://github.com/asmjit/asmjit/pull/252. Reviewed By: dskhudia Differential Revision: D16720839 fbshipit-source-id: e5e5f2d26f924df8d9fb955f4a3758561fa73288 | |||
2019-08-06 | Back out "[fbgemm] Integrate VNNI into FBGEMM master branch" | Jianyu Huang | |
Summary: Original commit changeset: fcaa13cc3159 ASMJIT requires the CMake version to be 3.8. However, FBGEMM and PyTorch only need the CMake version to be 3.5+. This caused the build failure in FBGEMM: https://circleci.com/gh/pytorch/FBGEMM/122#build-timing/containers/0 Reviewed By: dskhudia Differential Revision: D16670547 fbshipit-source-id: 506714c3db1cb82cf98895f58f82f235128f5285 | |||
2019-08-06 | Integrate VNNI into FBGEMM master branch (#113) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/113 Adding the VNNI support in FBGEMM. Reviewed By: dskhudia Differential Revision: D16276574 fbshipit-source-id: 832ccdb27339489ebc138f3b2678e53d107c1b79 | |||
2019-06-13 | Compile both on windows and linux | Young Jin Kim | |
2019-06-05 | Unified convolution interface | Daya Khudia | |
Summary: We want to combine three different convolution interfaces under one top level function. Reviewed By: protonu Differential Revision: D15399811 fbshipit-source-id: 7390616d92783506fc156f0f6017f10b5f7f8e30 | |||
2019-05-30 | Adding -O2 flag to the cmake build | Protonu Basu | |
Summary: Adding this flag makes up for perf diff seen with the cmake build system Reviewed By: dskhudia Differential Revision: D15377782 fbshipit-source-id: cf5308ff2b5d8d42ac57b555a94d845268a857c6 | |||
2019-05-14 | Use submodules instead of cmake downloads | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/96 Reviewed By: jianyuh Differential Revision: D15336047 Pulled By: dskhudia fbshipit-source-id: 93435ba920baa3a712c5741e60c479901c95115d | |||
2019-05-13 | Back out "[FBGEMM][PR] switch from cmake downloads to git submodules" | Daya S Khudia | |
Summary: Original commit changeset: 9a33573ba34b Reviewed By: jianyuh Differential Revision: D15320950 fbshipit-source-id: f6501b57346cc5e82fa2198dcf6b60b26cd4f7c6 | |||
2019-05-13 | switch from cmake downloads to git submodules (#95) | David Pollack | |
Summary: I created a pull request for #87. I also tend to do a lot of hacking without an internet connection and it is nice to have the required library offline. I also get a cryptic error message when I build pytorch without an internet connection because these modules aren't available. Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/95 Reviewed By: jianyuh Differential Revision: D15299133 Pulled By: dskhudia fbshipit-source-id: 6cf9ed47482eceee5f0444a8361720e0cfe25a13 | |||
2019-03-06 | Add Avx512BW/VL/DQ check (#84) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/84 Add AVX512BW Check: AVX-512 Byte and Word Instructions add support for 8-bit and 16-bit integer operations such as vpmaddubsw. Similarly, add AVX512VL/DQ check. Reviewed By: jspark1105 Differential Revision: D14321050 fbshipit-source-id: bd34745fd488ce4efe3248aeb78c54e1c2d91d47 | |||
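A runtime check along these lines can be sketched with GCC/Clang's `__builtin_cpu_supports` (this illustrates the idea only; fbgemm's actual detection goes through the cpuinfo library rather than this builtin):

```cpp
// Check the AVX-512 subsets mentioned above before dispatching to kernels
// that use instructions like vpmaddubsw on 512-bit registers.
bool hasAvx512BwVlDq() {
  return __builtin_cpu_supports("avx512f") &&   // foundation
         __builtin_cpu_supports("avx512bw") &&  // byte/word integer ops
         __builtin_cpu_supports("avx512vl") &&  // 128/256-bit vector lengths
         __builtin_cpu_supports("avx512dq");    // doubleword/quadword ops
}
```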
2019-01-23 | add missing include files to public headers so that they get installed properly | Daya S Khudia | |
Summary: Same as title Reviewed By: jspark1105 Differential Revision: D13787161 fbshipit-source-id: c3d44afc812e7676d618b4b940e15ef0a2b12436 | |||
2019-01-14 | Groupwise direct convolution when number of channels per group is small | Daya S Khudia | |
Summary: **Summary** This adds groupwise convolution when the number of channels per group is small. Performance on Skylake T1 (turbo off) for a reasonably sized conv layer is 42-45 GOPS without row offset calculations and post processing. Currently rowoffset and requantization are killing the overall performance.
**Some Highlights:**
1. Works for any convolution, but only certain cases are optimized. Whether a particular convolution is optimized or not can be queried with the function fbgemmSupportedGConv.
2. We generate only 1 kernel for different heights and widths, i.e., the same kernel works for H, W = 56 or H = 48, W = 56 or H = 128, W = 124 etc.
3. We have to generate more code for the edges than for the main part of an image; handling edge cases is more time consuming from the kernel generation point of view.
4. Currently only the case input_channels_per_group == 4 == output_channels_per_group is supported. I will extend it to input_channels_per_group == output_channels_per_group = 8, 16 and 32.
**Desired Extensions:**
1. Share the JIT runtime with the other gemm kernels we generate.
2. Support the remaining cases.
3. Standalone testcase for groupwise convolution.
4. Parallelization: We will parallelize across the Minibatch and Group dimensions. This should be easier since just the right indexes need to be calculated based on thread_ids and num_threads.
**Without rowoffset and requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 42.46
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 42.75
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 43.77
**Without rowoffset and with requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 4.20
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 4.18
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 4.17
**With rowoffset and without requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.85
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.72
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.86
**With rowoffset and requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.66
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.65
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.79
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.66
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.92
So rowoffset + requantization is killing the performance. There isn't much we can do about requantization, but there are two ways we can improve the rowoffset calculation (currently it's done in a very naive way):
1. Calculate it while doing convolution. This will make the already complicated kernel more complex.
2. Just generate another kernel that calculates rowoffsets.
Let me know your thoughts.
**Update:** includes rowoffset + requantization. We now generate code for the rowoffset calculation as well.
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.64
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 3.27
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.62
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 2.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.63
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 3.10
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.62
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 2.75
With rowoffset and without requantization:
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 31.96
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 32.57
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 32.47
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 33.23
Reviewed By: jianyuh Differential Revision: D13556028 fbshipit-source-id: adc0afcaea5ca624b82c071d103ced3a0b1b6ef5 | |||
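The "naive" rowoffset computation the commit refers to is just a per-row sum over the uint8 activations; a minimal sketch of that step (illustrative code, not fbgemm's generated kernel):

```cpp
#include <cstdint>
#include <vector>

// For each row of the uint8 activation matrix A (rows x cols, row-major),
// sum its entries. These per-row sums are later needed to correct the
// int32 accumulators when the weight quantization has a non-zero zero point.
std::vector<std::int32_t> computeRowOffsets(
    const std::vector<std::uint8_t>& A, int rows, int cols) {
  std::vector<std::int32_t> rowOffsets(rows, 0);
  for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
      rowOffsets[i] += A[i * cols + j];
    }
  }
  return rowOffsets;
}
```

Doing this as a separate pass is what the commit calls naive; the update above moves it into generated code instead.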
2018-12-06 | Final cleanup for avx2 isolation and consistent file names (#40) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/40 File name changes + removal of -mavx2 compiler flag non-avx files This completes the separation of avx2 code to few files that make minimal use of c++ std lib. Reviewed By: jianyuh Differential Revision: D13330577 fbshipit-source-id: b469ebee484168800ce2d12fd2356edecbf0fa4d | |||
2018-12-06 | avx2 intrinsic separation from OutputProcessing-inl.h (#38) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/38 Moves intrinsics code from OutputProcessing-inl.h (included in Fbgemm.h) to src/QuantUtilsAvx2.cc Reviewed By: Maratyszcza Differential Revision: D13328841 fbshipit-source-id: 0a5c7b065ba9d69573390f3fbcd68df8d82827a0 | |||
2018-12-06 | File name change for FbgemmI8Depthwise.h and FbgemmI8Depthwise.cc (#14725) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14725 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/33 Renaming FbgemmI8Depthwise.h to FbgemmI8DepthwiseAvx2.h and FbgemmI8Depthwise.cc to FbgemmI8DepthwiseAvx2.cc since FbgemmI8DepthwiseAvx2.cc will be compiled with avx2 flags Reviewed By: jianyuh Differential Revision: D13313898 fbshipit-source-id: a8111eacf3d79a466ce0565bfe5f2f0b200a5c33 | |||
2018-12-05 | Removed avx2 code from PackAWithRowOffset.cc (#34) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/34 Removing avx2 intrinsics from PackAWithRowOffset.cc Reviewed By: jspark1105 Differential Revision: D13281312 fbshipit-source-id: fcaf70fc666674ace26b72b733af03e3b8586e6e | |||
2018-12-05 | avx2 specific code in a separate file for QuantUtils (#29) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/29 avx2 code separation for QuantUtils Reviewed By: jianyuh Differential Revision: D13269041 fbshipit-source-id: df798cc0d93e0f2081cb832f4341fb2effa68294 | |||
2018-12-05 | Move avx2 specific code in different source files (#28) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/28 Pull Request resolved: https://github.com/pytorch/pytorch/pull/14516 This is the first diff in a series of diffs that will separate out avx2 specific code in separate files. The goal is to compile as little as possible code with avx2 and avx512 compiler flags. Reviewed By: jianyuh Differential Revision: D13248376 fbshipit-source-id: 401c2e9d3cd96c420fd08c3efa011febce96ffbb | |||
2018-11-30 | Only export symbols that are required while building shared library | Daya S Khudia | |
Summary: We now use -fvisibility=hidden flag for compiling fbgemm as a shared library and only explicitly exported symbols will be visible to applications linking against fbgemm shared library. Reviewed By: jianyuh Differential Revision: D13221957 fbshipit-source-id: 2283727a7f9bc8b05015a621ae1116f3cb3231bc | |||
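With `-fvisibility=hidden`, the usual pattern is an export macro on the symbols that should remain public; a sketch of the idea (the macro name `FBGEMM_API` here is illustrative, not necessarily the one the library uses):

```cpp
// Under -fvisibility=hidden every symbol defaults to hidden; only symbols
// annotated with the export macro appear in the shared library's dynamic
// symbol table.
#if defined(_WIN32)
#define FBGEMM_API __declspec(dllexport)
#else
#define FBGEMM_API __attribute__((visibility("default")))
#endif

FBGEMM_API int exportedEntryPoint() { return 42; }  // visible to library users
int internalHelper() { return 7; }  // hidden when built with -fvisibility=hidden
```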
2018-11-22 | adding quantization utility functions (#19) | Jongsoo Park | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/19 Copying some of quantization utility functions from caffe2/quantization/server/dnnlowp.h to fbgemm/include/QuantUtils.h Will have another diff that removes the utility functions in caffe2/quantization/server/dnnlowp.h Reviewed By: jianyuh Differential Revision: D13159231 fbshipit-source-id: e409c0adc16b9ae1f32a3a62926817588a860855 | |||
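The core of such quantization utilities is uniform affine quantization with a scale and zero point; a self-contained sketch of the two basic helpers (simplified signatures, assumed for illustration rather than copied from QuantUtils.h):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantization parameters mapping real values x to uint8:
//   q = clamp(round(x / scale) + zero_point, 0, 255)
struct TensorQuantizationParams {
  float scale;
  std::int32_t zero_point;
};

std::uint8_t Quantize(float x, const TensorQuantizationParams& qp) {
  const float transformed = qp.zero_point + x / qp.scale;
  return static_cast<std::uint8_t>(
      std::min(255.0f, std::max(0.0f, std::nearbyint(transformed))));
}

float Dequantize(std::uint8_t q, const TensorQuantizationParams& qp) {
  // Inverse map; lossy only up to rounding and clamping.
  return qp.scale * (static_cast<std::int32_t>(q) - qp.zero_point);
}
```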
2018-11-22 | Unify the PackA file names (#21) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/21 PackWithQuantRowOffset.cc -> PackAWithQuantRowOffset.cc PackWithRowOffset.cc -> PackAWithRowOffset.cc Reviewed By: jspark1105, dskhudia Differential Revision: D13165323 fbshipit-source-id: 5e039647b0b22d2c24dd9a8131ed2f9305073525 | |||
2018-11-22 | Fix minor issues | Daya S Khudia | |
Summary: 1) Building of the shared library was broken. 2) Pin asmjit to a particular commit so that we don't accidentally break anything due to changes in asmjit. Reviewed By: jspark1105, jianyuh Differential Revision: D13165343 fbshipit-source-id: 21ea6cf16c2e7d9b341339fccc0f50b4bf78f903 | |||
2018-11-20 | Simple parallelism, add -openmp flags and omp parallel for Acc16/32 Unit Test (#14) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/14 This DIFF triggered a concurrency bug in the unit test. It is weird that there are no errors for "SpMDMTest", while errors are reported for "NoRequantizeTest".
Update 1: There might be problems with the "memCopy" function. I changed "Cint32_buffer.data()" to "Cint32_fb.data()" (see my inline comment) so that the accumulation buffer and the output buffer are the same, and it appears that we then get the correct result. After a discussion with Daya, I now understand the reason for the failure of this unit test:
- For the purpose of this unit test, we should just use the same buffer "Cint32_fb.data()" for the accumulation and output. Not sure why this issue was not found in the original code.
- If the thread number is not 1 and we use different buffers ("Cint32_buffer" for the accumulation buffer and "Cint32_fb" for the output buffer), then the pointer "Cint32_buffer.data()" is actually shared by different threads. When doing the accumulation inside "ExecuteKernelU8S8.cc", different threads will just write to the same memory location. Check the code below: int32_t* C_buffer_row_start = C_buffer_ + ((C_buffer_ == reinterpret_cast<int32_t*>(matC_)) ? row_start_A * ldc_ : 0);
- If the thread number is not 1 and we use the same buffer "Cint32_fb.data()" for the accumulation and output, then according to the above code different threads will write to different memory locations.
Update 2: I added a new test case "{1024, 512, 258}" to the Acc16 and Acc32 unit tests. "PackedRequantizeAcc16Test" runs well, but "PackedRequantizeTest" is broken.
Update 3: I changed the above code snippet to int32_t* C_buffer_row_start = C_buffer_ + row_start_A * ldc_; and finally both the Acc16 and Acc32 tests pass. Now different threads will always write to different memory locations.
Update 4: Jongsoo comments that reusing the first row block of C_buffer_ is mostly to optimize for cache, not for memory allocation size (this was making a big difference in xray ocr perf; I don't remember the exact number). The right thing to do is to have each thread use a different portion of C_buffer_. So I optimized the above code snippet to // If the accumulation buffer C_buffer_ is the same as matC_ (inplace output // processing), then each thread uses different parts of the output buffer // matC_; // Otherwise, each thread uses a different portion of the accumulation // buffer C_buffer_. Note that each thread can use at most an MC * n portion of // C_buffer_. If the number of threads is 1, the only thread (thread 0) will // always reuse the first row block of C_buffer_. int32_t* C_buffer_row_start = C_buffer_ + ((C_buffer_ == reinterpret_cast<int32_t*>(matC_)) ? row_start_A * ldc_ : std::min(thread_id_ * mbSize_ * ldc_, row_start_A * ldc_)); Note that `thread_id` and `num_threads` are passed as arguments into `ExecuteKernel`.
Update 5: Rebase; also add the parts of D12937408 to remove the dependency. Reviewed By: jspark1105 Differential Revision: D13001149 fbshipit-source-id: b16c20863dc467de6faaefcaf1134cf1036f8a65 | |||
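The per-thread buffer indexing described in Update 4 can be condensed into a small sketch (names follow the commit's snippet, but the free function and its signature are illustrative, not fbgemm's actual code):

```cpp
#include <algorithm>
#include <cstdint>

// Pick where a thread starts writing its accumulation rows.
// - In-place output processing (C_buffer == matC): each thread writes its own
//   rows of the output, at row_start_A * ldc.
// - Separate accumulation buffer: each thread gets its own slice of at most
//   mbSize * ldc entries; the min() keeps thread 0 reusing the first row
//   block, preserving the cache-friendly behavior the commit mentions.
std::int32_t* rowStart(
    std::int32_t* C_buffer, std::int32_t* matC,
    int row_start_A, int ldc, int thread_id, int mbSize) {
  if (C_buffer == matC) {
    return C_buffer + row_start_A * ldc;
  }
  return C_buffer + std::min(thread_id * mbSize * ldc, row_start_A * ldc);
}
```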
2018-11-06 | generalized conv_param_t and download third party libraries in build dir | dskhudia | |
2018-11-04 | Syncing with internal version. Fixes for Mac/clang build. Other minor fixes | dskhudia | |
2018-11-03 | Manually syncing with internal copy | dskhudia | |
2018-10-31 | Initial commit | Daya S Khudia | |