github.com/marian-nmt/FBGEMM.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2019-09-11	fbgemmPacked and fbgemmConv apis with float bias + tests	Daya Khudia
	Summary: fbgemmPacked and fbgemmConv api changes to take float bias. Reviewed By: jianyuh Differential Revision: D17244262 fbshipit-source-id: 0531c829190d20e31cb957a3f1861d4a65645cee
2019-08-09	Integrate VNNI into FBGEMM master branch (#114)	Jianyu Huang
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/114 Adding the VNNI support in FBGEMM. Previously, we have the issue on CMake version. Currently PyTorch and FBGEMM OSS test has the CMake 3.5 test, while ASMJIT requires CMake to be 3.8+. This caused the build failure for some platforms. Now the CMake version issue is resolved by a PR to ASMJIT to downgrade the CMake requirement: https://github.com/asmjit/asmjit/pull/252. Reviewed By: dskhudia Differential Revision: D16720839 fbshipit-source-id: e5e5f2d26f924df8d9fb955f4a3758561fa73288
2019-08-06	Back out "[fbgemm] Integrate VNNI into FBGEMM master branch"	Jianyu Huang
	Summary: Original commit changeset: fcaa13cc3159 ASMJIT requires the CMake version to be 3.8 However, FBGEMM and PyTorch only need the CMake version to be 3.5+. This caused the build failure in FBGEMM: https://circleci.com/gh/pytorch/FBGEMM/122#build-timing/containers/0 Reviewed By: dskhudia Differential Revision: D16670547 fbshipit-source-id: 506714c3db1cb82cf98895f58f82f235128f5285
2019-08-06	Integrate VNNI into FBGEMM master branch (#113)	Jianyu Huang
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/113 Adding the VNNI support in FBGEMM. Reviewed By: dskhudia Differential Revision: D16276574 fbshipit-source-id: 832ccdb27339489ebc138f3b2678e53d107c1b79
2019-04-19	make sure cpuinfo_initialize called before fbgemmHasAvx2/512Support (#94)	Jongsoo Park
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/94 If we don't call cpuinfo_initialize before hand, fbgemmHasAvx2/512Support will always return false. We should really careful about this. Reviewed By: jianyuh Differential Revision: D14994129 fbshipit-source-id: b78028f0543d05595caaa627be2feb743d0694b1
2019-04-02	Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave) ↵	Protonu Basu
	(#90) Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/90 Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave) Reviewed By: dskhudia Differential Revision: D14358148 fbshipit-source-id: 783fb4653fd696dbbd4075ad56cb8682db3011a5
2019-03-21	Improves small N cases back to what they were	Daya S Khudia
	Summary: In D14507536 and D14516232 small N cases suffered if we increased the NR. This fixes those cases. Reviewed By: jianyuh Differential Revision: D14529494 fbshipit-source-id: 6f53797948de760d6ed24b767cbbe8d27768660f
2019-03-06	Add Avx512BW/VL/DQ check (#84)	Jianyu Huang
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/84 Add AVX512BW Check: AVX-512 Byte and Word Instructions add support for for 8-bit and 16-bit integer operations such as vpmaddubsw. Similarly, add AVX512VL/DQ check. Reviewed By: jspark1105 Differential Revision: D14321050 fbshipit-source-id: bd34745fd488ce4efe3248aeb78c54e1c2d91d47
2019-02-19	remove unused member var kBlock_ (#77)	Jongsoo Park
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/77 As title Reviewed By: protonu Differential Revision: D14124479 fbshipit-source-id: 3a44a1de8bf5da75e0d69d98d93f55b6b058b7ce
2019-02-14	JIT kernel should only handle a small portion of NCB for the last block: ↵	Jianyu Huang
	multiple of NR Summary: Before this Diff: we pass into the JIT kernel with nc = NCB ( packedB_.blockColSize() ) instead of nc = leftover size (packedB_.lastBcol() ) for the last block of B (diffusion/FBS/browse/master/fbcode/deeplearning/fbgemm/src/ExecuteKernelU8S8.cc;1adfe7977ef7ea2a1aee0ed785bd3fed5b7c4a20$102), which cause the additional computation when n is small. After this Diff: we pass into the JIT kernel with a small portion of NCB (still multiple of NR) for the last block of B. The main performance gain is for Acc16, because NCB = 4 * NR for Acc16 and NCB = NR for Acc32 in our current settings (AVX2 and AVX512). Reviewed By: jspark1105 Differential Revision: D14063628 fbshipit-source-id: 5829d06553daf617e2fefa7d26cb2d761af402c1
2019-02-13	no need to subtract col offset if a_zp is 0 (#69)	Jongsoo Park
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/69 This diff prepares for D14013931 that folds column offsets into bias. In depthwise convolution, we allow passing column_offsets == nullptr which means column_offsets are folded into bias. We bypass adding column_offset * A_zero_point if either column_offset == nullptr or A_zero_point == 0 Reviewed By: jianyuh Differential Revision: D14017772 fbshipit-source-id: ad4a79402f43cbf78dbad68e1bff6d07c19dded0
2018-12-21	Update with clang format (#51)	Jianyu Huang
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/51 Use Clang formatting with "clang-format -i .cc .h". Reviewed By: dskhudia Differential Revision: D13532121 fbshipit-source-id: 6792d008f3295c128942f4896e8221aebbf2566e
2018-12-11	instantiate more kernels for PackAmatrix (#47)	Jongsoo Park
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/47 PackAMatrix (compared to PackAWithRowOffset) can be a faster alternative when B_zero_point = 0 Reviewed By: jianyuh Differential Revision: D13413605 fbshipit-source-id: 2cac4560e8f166d19c58c65ae25400d1b0795b19
2018-11-29	sparse convolution output processing (#27)	Jongsoo Park
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/27 DoSpmdmOnInpBuffer can't be used together with PackAWithIm2Col because DoSpmdmOnInpBuffer expects im2col'ed A matrix. This diff implements DoSConvOnInpBuffer that does sparse convolution directly on A input without im2col. The performance is well optimized and need to see if this implementation is good enough to get good resnet50 performance. Reviewed By: dskhudia Differential Revision: D13192336 fbshipit-source-id: 2076555ba9749e111afbaec408a2bfa0f55bd5bc
2018-11-27	per-group and per-channel quantization (#14340)	Jongsoo Park
	Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14340 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/25 Per-group and per-channel quantization in fbgemm This diff also cleans up explicit template instantiation using macro expansion This diff also changes randFill interface which was easy to make mistakes of generating integer random numbers for floating point vectors. Using this in DNNLOWP operators will be done in a separate diff. Reviewed By: dskhudia Differential Revision: D13176386 fbshipit-source-id: e46c53e31e21520bded71b8ed86e8b19e010e2dd
2018-11-20	Simple parallelism, add -openmp flags and omp parallel for Acc16/32 Unit ↵	Jianyu Huang
	Test (#14) Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/14 This DIFF triggered a concurrent bug in the unit test. It is weird that there are no errors for "SpMDMTest", while errors are reported for "NoRequantizeTest". Update 1: There might be problems with "memCopy" function. Then I change "Cint32_buffer.data()" to "Cint32_fb.data()" (see my inline comment) so that the accumulation buffer and the output buffer are the same. It appears that we can output the correct result. I have a discussion with Daya. Now I understand the reason for the failure of this unit test - For the purpose of this unit test, we should just use the same buffer "Cint32_fb.data()" for the accumulation and output. Not sure why this issue is not found in the original code. - If the thread number is not 1, and we we use different buffers: "Cint32_buffer" for the accumulation buffer and "Cint32_fb" for the output buffer, then the pointers of "Cint32_buffer.data()" is actually shared by different threads. When doing the accumulation inside "ExecuteKernelU8S8.cc", different threads will just write to the same memory location: Check the code below int32_t* C_buffer_row_start = C_buffer_ + ((C_buffer_ == reinterpret_cast<int32_t>(matC_)) ? row_start_A ldc_ : 0); - If the thread number is not 1, and we use the same buffers: "Cint32_fb.data()" for the accumulation and output. According to the above code, different threads will write to different memory locations. Update 2: I add a new test case "{1024, 512, 258}" in Acc16 and Acc32 unit tests. "PackedRequantizeAcc16Test" runs well, but "PackedRequantizeTest" is broken. Update 3: I change the above code snippet to int32_t* C_buffer_row_start = C_buffer_ + row_start_A * ldc_; Finally we get both Acc16 and Acc32 tests passed. Now different threads will always write to different memory locations. Update 4: Jongsoo comments that reusing the first row block of C_buffer_ is mostly to optimize for cache not for memory allocation size (this was making a big difference in xray ocr perf. don't remember exact number). A right thing to do is to have each thread to use different portion of C_buffer_. So I optimize the above code snippet to // If the accumulation buffer C_buffer_ is the same as matC_ (inplace output // processing), then each thread use the different parts of output buffer // matC_; // Otherwise, each thread uses different portions of the accumulation // buffer C_buffer_. Note that each thread can use at most MC * n portion of // C_buffer_. If the number of threads is 1, the only thread (thread 0) will // always reuse the first rowblock of C_buffer_. int32_t* C_buffer_row_start = C_buffer_ + ((C_buffer_ == reinterpret_cast<int32_t>(matC_)) ? row_start_A ldc_ : std::min(thread_id_ * mbSize_ * ldc_, row_start_A * ldc_)); Note that `thread_id` and `num_threads` is passed as the arguments into `ExecuteKernel`. Update 5: Rebase, Also add the parts of D12937408 to remove the dependency. Reviewed By: jspark1105 Differential Revision: D13001149 fbshipit-source-id: b16c20863dc467de6faaefcaf1134cf1036f8a65
2018-11-16	grouped (batched) gemm (#7)	Jongsoo Park
	Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/7 This diff allows groups > 1 . Will have a separate diff for im2col + gemm fusion and conv with group > 1 . Reviewed By: jianyuh Differential Revision: D13039210 fbshipit-source-id: f7b3b0dbdb67fc6bc865de88292f034b252d029d
2018-11-08	Sync with internal copy: Asymmetric padding; fbgemm2 -> fbgemm	Jianyu Huang

2018-11-06	generalized conv_param_t and download third party libraries in build dir	dskhudia

2018-10-31	Initial commit	Daya S Khudia