Age | Commit message | Author | |
---|---|---|---|
2021-03-22 | Remove unused comment | Young Jin Kim | |
2021-03-22 | gcc 9.3+ build fix (#10) | Young Jin Kim | |
* Turn -march=native off when using gcc 9.3+ (-march=x86-64) | |||
2020-09-03 | Restore CMake 3.5.1 compatibility by reimplementing list(TRANSFORM...PREPEND) with a foreach() (#8) | Aaron Burke | |
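The `list(TRANSFORM ... PREPEND)` subcommand only exists in CMake 3.12+, so restoring 3.5.1 compatibility means emulating it with a loop. A minimal sketch of the equivalent `foreach()` (variable names here are illustrative, not necessarily the ones used in the commit):

```cmake
# list(TRANSFORM headers PREPEND "${prefix}/") requires CMake >= 3.12.
# Equivalent construction that works on CMake 3.5.1:
set(prefixed_headers "")
foreach(header IN LISTS headers)
  list(APPEND prefixed_headers "${prefix}/${header}")
endforeach()
```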
2020-08-21 | Fix dependent library interface include directories to use build/install generator expressions (#7) | Aaron Burke | |
I verified that it works well with marian-dev master and stand-alone fbgemm. | |||
2020-08-12 | Fix public header property in cpuinfo and clog to support submodule installs (#6) | Aaron Burke | |
Looks good. Thanks! | |||
2019-09-25 | Merge remote-tracking branch 'upstream/master' into youki/win-jit-debug-int8 | Young Jin Kim | |
Fix for windows build errors | |||
2019-09-24 | remove template parameter from PackedDepthWiseConvMatrix (#128) | Jongsoo Park | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/128 We don't really need to have KERNEL_PROD as a compile time constant template parameter in PackedDepthWiseConvMatrix for performance. Removing the template parameter will make generalizing depth-wise convolution to non 3x3 cases easier. This diff only changes fbgemm while maintaining the old interface. The follow-up diff will change Caffe2 code using the old interface and remove the old interface. This diff also splits FbgemmI8DepthwiseAvx2.cc into FbgemmI8Depthwise3DAvx2.cc and PackDepthwiseConvMatrixAvx2.cc to avoid compilation timeouts in OSS build tests. Reviewed By: dskhudia Differential Revision: D17514003 fbshipit-source-id: 2214637ac0762a585f619f0035d3449cc4f7669e | |||
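The interface change described above can be illustrated with a stripped-down sketch (a hypothetical class shape, not fbgemm's actual declaration; only the move of the parameter matters):

```cpp
#include <cstdint>

// Before: the kernel product (e.g. 3*3 = 9) was baked in at compile time:
//
//   template <int KERNEL_PROD>
//   class PackedDepthWiseConvMatrix { ... };
//
// After: the kernel product is a runtime constructor argument, so the same
// class can pack weights for non-3x3 depthwise convolutions.
class PackedDepthWiseConvMatrix {
 public:
  PackedDepthWiseConvMatrix(int K, int kernel_prod, const std::int8_t* smat)
      : K_(K), kernel_prod_(kernel_prod), smat_(smat) {}

  int kernelProd() const { return kernel_prod_; }

 private:
  int K_;                    // number of channels
  int kernel_prod_;          // e.g. 9 for 3x3, now chosen at runtime
  const std::int8_t* smat_;  // source weights; packing omitted in this sketch
};
```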
2019-08-15 | Merge branch 'upstream/master' into youki/prepack_constrcopyPublic | Young Jin Kim | |
2019-08-09 | Integrate VNNI into FBGEMM master branch (#114) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/114 Adding the VNNI support in FBGEMM. Previously, we have the issue on CMake version. Currently PyTorch and FBGEMM OSS test has the CMake 3.5 test, while ASMJIT requires CMake to be 3.8+. This caused the build failure for some platforms. Now the CMake version issue is resolved by a PR to ASMJIT to downgrade the CMake requirement: https://github.com/asmjit/asmjit/pull/252. Reviewed By: dskhudia Differential Revision: D16720839 fbshipit-source-id: e5e5f2d26f924df8d9fb955f4a3758561fa73288 | |||
2019-08-06 | Back out "[fbgemm] Integrate VNNI into FBGEMM master branch" | Jianyu Huang | |
Summary: Original commit changeset: fcaa13cc3159 ASMJIT requires the CMake version to be 3.8. However, FBGEMM and PyTorch only need the CMake version to be 3.5+. This caused the build failure in FBGEMM: https://circleci.com/gh/pytorch/FBGEMM/122#build-timing/containers/0 Reviewed By: dskhudia Differential Revision: D16670547 fbshipit-source-id: 506714c3db1cb82cf98895f58f82f235128f5285 | |||
2019-08-06 | Integrate VNNI into FBGEMM master branch (#113) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/113 Adding the VNNI support in FBGEMM. Reviewed By: dskhudia Differential Revision: D16276574 fbshipit-source-id: 832ccdb27339489ebc138f3b2678e53d107c1b79 | |||
2019-06-13 | Compile both on windows and linux | Young Jin Kim | |
2019-06-05 | Unified convolution interface | Daya Khudia | |
Summary: We want to combine three different convolution interfaces under one top level function. Reviewed By: protonu Differential Revision: D15399811 fbshipit-source-id: 7390616d92783506fc156f0f6017f10b5f7f8e30 | |||
2019-05-30 | Adding -O2 flag to the cmake build | Protonu Basu | |
Summary: Adding this flag makes up for perf diff seen with the cmake build system Reviewed By: dskhudia Differential Revision: D15377782 fbshipit-source-id: cf5308ff2b5d8d42ac57b555a94d845268a857c6 | |||
2019-05-14 | Use submodules instead of cmake downloads | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/96 Reviewed By: jianyuh Differential Revision: D15336047 Pulled By: dskhudia fbshipit-source-id: 93435ba920baa3a712c5741e60c479901c95115d | |||
2019-05-13 | Back out "[FBGEMM][PR] switch from cmake downloads to git submodules" | Daya S Khudia | |
Summary: Original commit changeset: 9a33573ba34b Reviewed By: jianyuh Differential Revision: D15320950 fbshipit-source-id: f6501b57346cc5e82fa2198dcf6b60b26cd4f7c6 | |||
2019-05-13 | switch from cmake downloads to git submodules (#95) | David Pollack | |
Summary: I created a pull request for #87. I also tend to do a lot of hacking without an internet connection and it is nice to have the required library offline. I also get a cryptic error message when I build pytorch without an internet connection because these modules aren't available. Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/95 Reviewed By: jianyuh Differential Revision: D15299133 Pulled By: dskhudia fbshipit-source-id: 6cf9ed47482eceee5f0444a8361720e0cfe25a13 | |||
2019-03-06 | Add Avx512BW/VL/DQ check (#84) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/84 Add AVX512BW Check: AVX-512 Byte and Word Instructions add support for 8-bit and 16-bit integer operations such as vpmaddubsw. Similarly, add AVX512VL/DQ check. Reviewed By: jspark1105 Differential Revision: D14321050 fbshipit-source-id: bd34745fd488ce4efe3248aeb78c54e1c2d91d47 | |||
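A runtime check along these lines can be sketched with GCC/Clang's `__builtin_cpu_supports` (this illustrates the idea only; fbgemm's actual detection goes through the cpuinfo library rather than this builtin):

```cpp
// Check the AVX-512 subsets mentioned above before dispatching to kernels
// that use instructions like vpmaddubsw on 512-bit registers.
bool hasAvx512BwVlDq() {
  return __builtin_cpu_supports("avx512f") &&   // foundation
         __builtin_cpu_supports("avx512bw") &&  // byte/word integer ops
         __builtin_cpu_supports("avx512vl") &&  // 128/256-bit vector lengths
         __builtin_cpu_supports("avx512dq");    // doubleword/quadword ops
}
```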
2019-01-23 | add missing include files to public headers so that they get installed properly | Daya S Khudia | |
Summary: Same as title Reviewed By: jspark1105 Differential Revision: D13787161 fbshipit-source-id: c3d44afc812e7676d618b4b940e15ef0a2b12436 | |||
2019-01-14 | Groupwise direct convolution when number of channels per group is small | Daya S Khudia | |
Summary: **Summary** This adds groupwise convolution when the number of channels per group is small. Performance on Skylake T1 (turbo off) for a reasonably sized conv layer is 42-45 GOPS without row offset calculations and post processing. Currently rowoffset and requantization are killing the overall performance.
**Some Highlights:**
1. Works for any convolution, but only certain cases are optimized. Whether a particular convolution is optimized or not can be queried with the function fbgemmSupportedGConv.
2. We generate only 1 kernel for different heights and widths, i.e., the same kernel works for H, W = 56 or H = 48, W = 56 or H = 128, W = 124 etc.
3. We have to generate more code for the edges than for the main part of an image; handling edge cases is more time consuming from the kernel generation point of view.
4. Currently only the case input_channels_per_group == 4 == output_channels_per_group is supported. I will extend it to input_channels_per_group == output_channels_per_group = 8, 16 and 32.
**Desired Extensions:**
1. Share the JIT runtime with the other gemm kernels we generate.
2. Support the remaining cases.
3. Standalone testcase for groupwise convolution.
4. Parallelization: We will parallelize across the Minibatch and Group dimensions. This should be easier since just the right indexes need to be calculated based on thread_ids and num_threads.
**Without rowoffset and requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 42.46
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 42.75
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 43.77
**Without rowoffset and with requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 4.20
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 4.18
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 4.17
**With rowoffset and without requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.85
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.72
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.86
**With rowoffset and requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.66
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.65
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.79
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.66
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.92
So rowoffset + requantization is killing the performance. There isn't much we can do about requantization, but there are two ways we can improve the rowoffset calculation (currently it's done in a very naive way):
1. Calculate it while doing convolution. This will make the already complicated kernel more complex.
2. Just generate another kernel that calculates rowoffsets.
Let me know your thoughts.
**Update:** includes rowoffset + requantization. We now generate code for the rowoffset calculation as well.
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.64
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 3.27
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.62
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 2.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.63
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 3.10
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.62
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 2.75
With rowoffset and without requantization:
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 31.96
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 32.57
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 32.47
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 33.23
Reviewed By: jianyuh Differential Revision: D13556028 fbshipit-source-id: adc0afcaea5ca624b82c071d103ced3a0b1b6ef5 | |||
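The "naive" rowoffset computation the commit refers to is just a per-row sum over the uint8 activations; a minimal sketch of that step (illustrative code, not fbgemm's generated kernel):

```cpp
#include <cstdint>
#include <vector>

// For each row of the uint8 activation matrix A (rows x cols, row-major),
// sum its entries. These per-row sums are later needed to correct the
// int32 accumulators when the weight quantization has a non-zero zero point.
std::vector<std::int32_t> computeRowOffsets(
    const std::vector<std::uint8_t>& A, int rows, int cols) {
  std::vector<std::int32_t> rowOffsets(rows, 0);
  for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
      rowOffsets[i] += A[i * cols + j];
    }
  }
  return rowOffsets;
}
```

Doing this as a separate pass is what the commit calls naive; the update above moves it into generated code instead.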
2018-12-06 | Final cleanup for avx2 isolation and consistent file names (#40) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/40 File name changes + removal of -mavx2 compiler flag non-avx files This completes the separation of avx2 code to few files that make minimal use of c++ std lib. Reviewed By: jianyuh Differential Revision: D13330577 fbshipit-source-id: b469ebee484168800ce2d12fd2356edecbf0fa4d | |||
2018-12-06 | avx2 intrinsic separation from OutputProcessing-inl.h (#38) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/38 Moves intrinsics code from OutputProcessing-inl.h (included in Fbgemm.h) to src/QuantUtilsAvx2.cc Reviewed By: Maratyszcza Differential Revision: D13328841 fbshipit-source-id: 0a5c7b065ba9d69573390f3fbcd68df8d82827a0 | |||
2018-12-06 | File name change for FbgemmI8Depthwise.h and FbgemmI8Depthwise.cc (#14725) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14725 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/33 Renaming FbgemmI8Depthwise.h to FbgemmI8DepthwiseAvx2.h and FbgemmI8Depthwise.cc to FbgemmI8DepthwiseAvx2.cc since FbgemmI8DepthwiseAvx2.cc will be compiled with avx2 flags Reviewed By: jianyuh Differential Revision: D13313898 fbshipit-source-id: a8111eacf3d79a466ce0565bfe5f2f0b200a5c33 | |||
2018-12-05 | Removed avx2 code from PackAWithRowOffset.cc (#34) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/34 Removing avx2 intrinsics from PackAWithRowOffset.cc Reviewed By: jspark1105 Differential Revision: D13281312 fbshipit-source-id: fcaf70fc666674ace26b72b733af03e3b8586e6e | |||
2018-12-05 | avx2 specific code in a separate file for QuantUtils (#29) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/29 avx2 code separation for QuantUtils Reviewed By: jianyuh Differential Revision: D13269041 fbshipit-source-id: df798cc0d93e0f2081cb832f4341fb2effa68294 | |||
2018-12-05 | Move avx2 specific code in different source files (#28) | Daya S Khudia | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/28 Pull Request resolved: https://github.com/pytorch/pytorch/pull/14516 This is the first diff in a series of diffs that will separate out avx2 specific code in separate files. The goal is to compile as little as possible code with avx2 and avx512 compiler flags. Reviewed By: jianyuh Differential Revision: D13248376 fbshipit-source-id: 401c2e9d3cd96c420fd08c3efa011febce96ffbb | |||
2018-11-30 | Only export symbols that are required while building shared library | Daya S Khudia | |
Summary: We now use -fvisibility=hidden flag for compiling fbgemm as a shared library and only explicitly exported symbols will be visible to applications linking against fbgemm shared library. Reviewed By: jianyuh Differential Revision: D13221957 fbshipit-source-id: 2283727a7f9bc8b05015a621ae1116f3cb3231bc | |||
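With `-fvisibility=hidden`, the usual pattern is an export macro on the symbols that should remain public; a sketch of the idea (the macro name `FBGEMM_API` here is illustrative, not necessarily the one the library uses):

```cpp
// Under -fvisibility=hidden every symbol defaults to hidden; only symbols
// annotated with the export macro appear in the shared library's dynamic
// symbol table.
#if defined(_WIN32)
#define FBGEMM_API __declspec(dllexport)
#else
#define FBGEMM_API __attribute__((visibility("default")))
#endif

FBGEMM_API int exportedEntryPoint() { return 42; }  // visible to library users
int internalHelper() { return 7; }  // hidden when built with -fvisibility=hidden
```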
2018-11-22 | adding quantization utility functions (#19) | Jongsoo Park | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/19 Copying some of quantization utility functions from caffe2/quantization/server/dnnlowp.h to fbgemm/include/QuantUtils.h Will have another diff that removes the utility functions in caffe2/quantization/server/dnnlowp.h Reviewed By: jianyuh Differential Revision: D13159231 fbshipit-source-id: e409c0adc16b9ae1f32a3a62926817588a860855 | |||
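The core of such quantization utilities is uniform affine quantization with a scale and zero point; a self-contained sketch of the two basic helpers (simplified signatures, assumed for illustration rather than copied from QuantUtils.h):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantization parameters mapping real values x to uint8:
//   q = clamp(round(x / scale) + zero_point, 0, 255)
struct TensorQuantizationParams {
  float scale;
  std::int32_t zero_point;
};

std::uint8_t Quantize(float x, const TensorQuantizationParams& qp) {
  const float transformed = qp.zero_point + x / qp.scale;
  return static_cast<std::uint8_t>(
      std::min(255.0f, std::max(0.0f, std::nearbyint(transformed))));
}

float Dequantize(std::uint8_t q, const TensorQuantizationParams& qp) {
  // Inverse map; lossy only up to rounding and clamping.
  return qp.scale * (static_cast<std::int32_t>(q) - qp.zero_point);
}
```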
2018-11-22 | Unify the PackA file names (#21) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/21 PackWithQuantRowOffset.cc -> PackAWithQuantRowOffset.cc PackWithRowOffset.cc -> PackAWithRowOffset.cc Reviewed By: jspark1105, dskhudia Differential Revision: D13165323 fbshipit-source-id: 5e039647b0b22d2c24dd9a8131ed2f9305073525 | |||
2018-11-22 | Fix minor issues | Daya S Khudia | |
Summary: 1) Building of the shared library was broken. 2) Pin asmjit to a particular commit so that we don't accidentally break anything due to changes in asmjit. Reviewed By: jspark1105, jianyuh Differential Revision: D13165343 fbshipit-source-id: 21ea6cf16c2e7d9b341339fccc0f50b4bf78f903 | |||
2018-11-20 | Simple parallelism, add -openmp flags and omp parallel for Acc16/32 Unit Test (#14) | Jianyu Huang | |
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/14 This DIFF triggered a concurrency bug in the unit test. It is weird that there are no errors for "SpMDMTest", while errors are reported for "NoRequantizeTest".
Update 1: There might be problems with the "memCopy" function. I changed "Cint32_buffer.data()" to "Cint32_fb.data()" (see my inline comment) so that the accumulation buffer and the output buffer are the same, and it appears that we then get the correct result. After a discussion with Daya, I now understand the reason for the failure of this unit test:
- For the purpose of this unit test, we should just use the same buffer "Cint32_fb.data()" for the accumulation and output. Not sure why this issue was not found in the original code.
- If the thread number is not 1 and we use different buffers ("Cint32_buffer" for the accumulation buffer and "Cint32_fb" for the output buffer), then the pointer "Cint32_buffer.data()" is actually shared by different threads. When doing the accumulation inside "ExecuteKernelU8S8.cc", different threads will just write to the same memory location. Check the code below: int32_t* C_buffer_row_start = C_buffer_ + ((C_buffer_ == reinterpret_cast<int32_t*>(matC_)) ? row_start_A * ldc_ : 0);
- If the thread number is not 1 and we use the same buffer "Cint32_fb.data()" for the accumulation and output, then according to the above code different threads will write to different memory locations.
Update 2: I added a new test case "{1024, 512, 258}" to the Acc16 and Acc32 unit tests. "PackedRequantizeAcc16Test" runs well, but "PackedRequantizeTest" is broken.
Update 3: I changed the above code snippet to int32_t* C_buffer_row_start = C_buffer_ + row_start_A * ldc_; and finally both the Acc16 and Acc32 tests pass. Now different threads will always write to different memory locations.
Update 4: Jongsoo comments that reusing the first row block of C_buffer_ is mostly to optimize for cache, not for memory allocation size (this was making a big difference in xray ocr perf; I don't remember the exact number). The right thing to do is to have each thread use a different portion of C_buffer_. So I optimized the above code snippet to // If the accumulation buffer C_buffer_ is the same as matC_ (inplace output // processing), then each thread uses different parts of the output buffer // matC_; // Otherwise, each thread uses a different portion of the accumulation // buffer C_buffer_. Note that each thread can use at most an MC * n portion of // C_buffer_. If the number of threads is 1, the only thread (thread 0) will // always reuse the first row block of C_buffer_. int32_t* C_buffer_row_start = C_buffer_ + ((C_buffer_ == reinterpret_cast<int32_t*>(matC_)) ? row_start_A * ldc_ : std::min(thread_id_ * mbSize_ * ldc_, row_start_A * ldc_)); Note that `thread_id` and `num_threads` are passed as arguments into `ExecuteKernel`.
Update 5: Rebase; also add the parts of D12937408 to remove the dependency. Reviewed By: jspark1105 Differential Revision: D13001149 fbshipit-source-id: b16c20863dc467de6faaefcaf1134cf1036f8a65 | |||
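The per-thread buffer indexing described in Update 4 can be condensed into a small sketch (names follow the commit's snippet, but the free function and its signature are illustrative, not fbgemm's actual code):

```cpp
#include <algorithm>
#include <cstdint>

// Pick where a thread starts writing its accumulation rows.
// - In-place output processing (C_buffer == matC): each thread writes its own
//   rows of the output, at row_start_A * ldc.
// - Separate accumulation buffer: each thread gets its own slice of at most
//   mbSize * ldc entries; the min() keeps thread 0 reusing the first row
//   block, preserving the cache-friendly behavior the commit mentions.
std::int32_t* rowStart(
    std::int32_t* C_buffer, std::int32_t* matC,
    int row_start_A, int ldc, int thread_id, int mbSize) {
  if (C_buffer == matC) {
    return C_buffer + row_start_A * ldc;
  }
  return C_buffer + std::min(thread_id * mbSize * ldc, row_start_A * ldc);
}
```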
2018-11-06 | generalized conv_param_t and download third party libraries in build dir | dskhudia | |
2018-11-04 | Syncing with internal version. Fixes for Mac/clang build. Other minor fixes | dskhudia | |
2018-11-03 | Manually syncing with internal copy | dskhudia | |
2018-10-31 | Initial commit | Daya S Khudia | |