
github.com/marian-nmt/FBGEMM.git - commit log
Age  Commit message  Author
2019-04-02  Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave) (#90)  [Protonu Basu]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/90
Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave).
Reviewed By: dskhudia
Differential Revision: D14358148
fbshipit-source-id: 783fb4653fd696dbbd4075ad56cb8682db3011a5
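For readers unfamiliar with these knobs, the sketch below (illustrative names only, not FBGEMM's actual API) shows what such tuning parameters control in a cache-blocked GEMM: MCB, NCB and KCB set the cache-block sizes of the M, N and K loops, MR and NR set the register tile handled by the innermost micro-kernel, and the row-interleave factor describes how many K rows of B are interleaved when B is packed.

```cpp
// Illustrative cache-blocked GEMM showing the role of MCB/NCB/KCB/MR/NR.
// Plain C++ sketch, not FBGEMM's JIT'ed kernels.
#include <algorithm>
#include <cstdint>
#include <vector>

struct BlockingParams {            // names chosen for illustration only
  int MCB, NCB, KCB, MR, NR;
};

// C[m][n] += sum_k A[m][k] * B[k][n], with the loops tiled by the parameters.
void blockedGemm(const std::vector<int8_t>& A, const std::vector<int8_t>& B,
                 std::vector<int32_t>& C, int M, int N, int K,
                 const BlockingParams& p) {
  for (int mc = 0; mc < M; mc += p.MCB)        // cache block of A rows
    for (int nc = 0; nc < N; nc += p.NCB)      // cache block of B columns
      for (int kc = 0; kc < K; kc += p.KCB)    // cache block of the reduction
        for (int mr = mc; mr < std::min(mc + p.MCB, M); mr += p.MR)
          for (int nr = nc; nr < std::min(nc + p.NCB, N); nr += p.NR)
            // A real micro-kernel keeps this MR x NR tile in registers.
            for (int m = mr; m < std::min({mr + p.MR, mc + p.MCB, M}); ++m)
              for (int n = nr; n < std::min({nr + p.NR, nc + p.NCB, N}); ++n)
                for (int k = kc; k < std::min(kc + p.KCB, K); ++k)
                  C[m * N + n] +=
                      int32_t(A[m * K + k]) * int32_t(B[k * N + n]);
}
```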
2019-02-20  optimize PackAWithIm2Col for symmetric b quant  [Jongsoo Park]
Summary: Add an additional option b_symmetric and skip the row offset computation if it is true.
Reviewed By: jianyuh
Differential Revision: D14119128
fbshipit-source-id: fa079347562b7f75727b3a1414e9bdda3f9c65dd
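As background on why a symmetric B makes the row offsets unnecessary (a sketch with illustrative names, not FBGEMM's API): when both operands are quantized, the integer accumulator is corrected by a term B_zero_point * row_offset_A[i] and a term A_zero_point * col_offset_B[j]; with B_zero_point == 0 the first correction vanishes, so the per-row sums of A never need to be computed.

```cpp
// Illustrative correction of a raw int32 accumulator before requantization.
// real_C[i][j] is approximately scaleA * scaleB *
//   ( sum_k Aq*Bq - Bzp * rowOffsetA[i] - Azp * colOffsetB[j] + K*Azp*Bzp )
#include <cstdint>
#include <vector>

int32_t correctedAccumulator(int32_t rawAcc, int i, int j, int K,
                             int32_t Azp, int32_t Bzp,
                             const std::vector<int32_t>& rowOffsetA,
                             const std::vector<int32_t>& colOffsetB) {
  int32_t v = rawAcc - Azp * colOffsetB[j] + K * Azp * Bzp;
  if (Bzp != 0)                  // with b_symmetric (Bzp == 0) this never runs,
    v -= Bzp * rowOffsetA[i];    // so rowOffsetA need not be computed at all
  return v;
}
```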
2019-02-13  optimize gconv for b symmetric quantization (#70)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/70
Skip the row offset computation if B_zero_point == 0.
Reviewed By: jianyuh
Differential Revision: D14020675
fbshipit-source-id: 88a6e225671762c67afefc15538b79f879d125a6
2019-02-02  Remove inappropriate consts (#67)  [Lu Fang]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/67
The inappropriate consts will fail the builds of PyTorch. Let's remove them.
Please check https://circleci.com/api/v1.1/project/github/pytorch/pytorch/704282/output/104/0?file=true
Reviewed By: bddppq, BIT-silence
Differential Revision: D13935557
fbshipit-source-id: 5dea01310be8bce38e3864d69a7b1ac97ed976c7
2019-02-01  specialized requantization for gconv (#61)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/61
Requantization was the bottleneck of group conv with 4 channels per group. This diff implements a version of requantization specialized for group conv with 4 channels per group. TODO: generalize for different group conv configurations.
Reviewed By: dskhudia
Differential Revision: D13831466
fbshipit-source-id: 1ac7225d3133a2304c5b07730374584afc6ec259
2019-01-14  Groupwise direct convolution when the number of channels per group is small  [Daya S Khudia]
Summary: This adds groupwise convolution for the case when the number of channels per group is small. Performance on Skylake T1 (turbo off) for a reasonably sized conv layer is 42-45 GOPS without row offset calculations and post processing. Currently rowoffset and requantization are killing the overall performance.

**Some Highlights:**
1. Works for any convolution, but only certain cases are optimized. Whether a particular convolution is optimized or not can be queried with the function fbgemmSupportedGConv.
2. We generate only 1 kernel for different heights and widths, i.e., the same kernel works for H, W = 56, or H = 48, W = 56, or H = 128, W = 124, etc.
3. As you can see, we have to generate more code for the edges than for the main part of an image. Handling edge cases is more time consuming from the kernel generation point of view.
4. Currently only the case when input_channels_per_group == 4 == output_channels_per_group is supported. I will extend it for input_channels_per_group == output_channels_per_group = 8, 16 and 32.

**Desired Extensions:**
1. Share the JIT runtime with other gemm kernels we generate.
2. Support the remaining cases.
3. Standalone testcase for groupwise convolution.
4. Parallelization: We will parallelize across Minibatch and Group dimensions. This should be easier since just the right indexes need to be calculated based on thread_ids and num_threads.

**Without rowoffset and requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 42.46
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 42.75
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 43.77

**Without rowoffset and with requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 4.20
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 4.18
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 4.17

**With rowoffset and without requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.85
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.72
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.86

**With rowoffset and requantization**
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.66
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 1.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.65
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 1.79
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.66
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 1.92

So rowoffset + requantization is killing the performance. There isn't much we can do about requantization, but there are two ways we can improve the rowoffset calculation (currently it is done in a very naive way):
1. Calculate it while doing convolution. This will make the already complicated kernel more complex.
2. Just generate another kernel that calculates rowoffsets.
Let me know your thoughts.

**Update:** (includes rowoffset + requantization) We now generate code for the rowoffset calculation as well.
MB, IC, OC, IH, IW, G, KH, KW, stride_h, stride_w, pad_h, pad_w, Type, M, N, K, GOPS
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.64
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 3.27
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 2688, 4, 1152, 0.62
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 2.92
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 3136, 4, 1152, 0.63
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 3.10
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, FusedIm2Col, 6272, 4, 1152, 0.62
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 2.75

With rowoffset and without requantization:
1, 128, 128, 56, 48, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 31.96
1, 128, 128, 48, 56, 32, 3, 3, 1, 1, 1, 1, direct, 2688, 4, 1152, 32.57
1, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 3136, 4, 1152, 32.47
2, 128, 128, 56, 56, 32, 3, 3, 1, 1, 1, 1, direct, 6272, 4, 1152, 33.23

Reviewed By: jianyuh
Differential Revision: D13556028
fbshipit-source-id: adc0afcaea5ca624b82c071d103ced3a0b1b6ef5
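A quick sanity check of the GEMM shapes in the tables above, assuming the sixth column is the group count G (the 128-channel layers with 4 channels per group imply G = 32): M is minibatch times output pixels, N is output channels per group, and K is the per-group input patch size times the number of groups.

```cpp
// Sanity-checking the M, N, K columns of the first table row
// (MB=1, IC=OC=128, IH=56, IW=48, G=32, 3x3 kernel, stride 1, pad 1).
#include <cassert>

int main() {
  int MB = 1, IC = 128, OC = 128, IH = 56, IW = 48, G = 32;
  int KH = 3, KW = 3, stride = 1, pad = 1;

  int OH = (IH + 2 * pad - KH) / stride + 1;    // 56
  int OW = (IW + 2 * pad - KW) / stride + 1;    // 48

  int M = MB * OH * OW;                         // 1 * 56 * 48 = 2688
  int N = OC / G;                               // 128 / 32    = 4
  int K = (IC / G) * KH * KW * G;               // 4 * 9 * 32  = 1152

  assert(M == 2688 && N == 4 && K == 1152);
  return 0;
}
```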
2019-01-11  don't keep conv_param_t member as a const reference (#57)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/57
conv_p passed to constructors can be destroyed later (it could be allocated on the stack).
Reviewed By: dskhudia
Differential Revision: D13619634
fbshipit-source-id: f9e86672b1f49db163ccedde4ab22c12dac2d0f1
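The lifetime hazard being removed looks like the sketch below (hypothetical ConvParams standing in for conv_param_t): a member declared as a const reference silently dangles once the caller's stack object goes away, whereas storing a copy stays valid.

```cpp
// Hypothetical illustration of the dangling-reference hazard.
#include <iostream>

struct ConvParams { int pad_h = 1, pad_w = 1; };

struct PackerByRef {
  const ConvParams& p_;                 // dangles if the argument dies first
  explicit PackerByRef(const ConvParams& p) : p_(p) {}
};

struct PackerByValue {
  ConvParams p_;                        // owns its copy; always valid
  explicit PackerByValue(const ConvParams& p) : p_(p) {}
};

PackerByRef makeBad() {
  ConvParams local;                     // stack object
  return PackerByRef(local);            // returned packer refers to a dead object
}

int main() {
  PackerByRef bad = makeBad();          // reading bad.p_ would be undefined behavior
  PackerByValue good(ConvParams{});     // no such issue with a stored copy
  std::cout << good.p_.pad_h << "\n";
  return 0;
}
```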
2018-12-21  Update with clang format (#51)  [Jianyu Huang]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/51
Use Clang formatting with "clang-format -i *.cc *.h".
Reviewed By: dskhudia
Differential Revision: D13532121
fbshipit-source-id: 6792d008f3295c128942f4896e8221aebbf2566e
2018-12-17  add comments on col_offsets (#48)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/48
Adding more comments on how we should provide various buffers. In general, we should improve the documentation instead of just providing examples in the test directory.
Reviewed By: dskhudia, jianyuh
Differential Revision: D13489925
fbshipit-source-id: 89ede9410c823dcd86b31dada3faf773ebd22f0f
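As an illustration of the kind of buffer these comments describe: the column offsets of a quantized B matrix are its per-column sums, which the output pipeline multiplies by A_zero_point during requantization. The sketch below uses plain column sums; the exact convention FBGEMM expects (for example whether K * B_zero_point is already folded in) is what the added comments document.

```cpp
// Reference computation of per-column offsets for a K x N int8 weight matrix.
#include <cstdint>
#include <vector>

std::vector<int32_t> computeColOffsets(const std::vector<int8_t>& B,
                                       int K, int N) {
  std::vector<int32_t> colOffsets(N, 0);
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      colOffsets[n] += B[k * N + n];   // sum down each column of B
  return colOffsets;
}
```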
2018-12-06  Final cleanup for avx2 isolation and consistent file names (#40)  [Daya S Khudia]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/40
File name changes + removal of the -mavx2 compiler flag for non-avx files.
This completes the separation of avx2 code into a few files that make minimal use of the C++ std lib.
Reviewed By: jianyuh
Differential Revision: D13330577
fbshipit-source-id: b469ebee484168800ce2d12fd2356edecbf0fa4d
2018-12-06  avx2 intrinsic separation from OutputProcessing-inl.h (#38)  [Daya S Khudia]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/38
Moves intrinsics code from OutputProcessing-inl.h (included in Fbgemm.h) to src/QuantUtilsAvx2.cc.
Reviewed By: Maratyszcza
Differential Revision: D13328841
fbshipit-source-id: 0a5c7b065ba9d69573390f3fbcd68df8d82827a0
2018-12-01  Build fix with fbgemm shared lib (#31)  [Daya S Khudia]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/31
Fixes this bug with the shared library build: https://circleci.com/gh/pytorch/FBGEMM/23
Reviewed By: jianyuh
Differential Revision: D13286222
fbshipit-source-id: 82e8a7004c39de9c7e3d0cb9f364000d10fd5c2c
2018-11-30  Only export symbols that are required while building shared library  [Daya S Khudia]
Summary: We now use the -fvisibility=hidden flag when compiling fbgemm as a shared library, and only explicitly exported symbols will be visible to applications linking against the fbgemm shared library.
Reviewed By: jianyuh
Differential Revision: D13221957
fbshipit-source-id: 2283727a7f9bc8b05015a621ae1116f3cb3231bc
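A typical pattern for this setup, shown as a sketch (macro and function names are illustrative, not necessarily the ones fbgemm uses): compile everything with hidden visibility and mark the public entry points with an export macro.

```cpp
// Illustrative export macro for a library built with -fvisibility=hidden.
// Only symbols tagged with the macro remain visible to users of the .so.
#if defined(_WIN32)
  #ifdef MYLIB_BUILDING          // defined while compiling the library itself
    #define MYLIB_API __declspec(dllexport)
  #else
    #define MYLIB_API __declspec(dllimport)
  #endif
#else
  #define MYLIB_API __attribute__((visibility("default")))
#endif

// Exported: callable by applications linking against the shared library.
MYLIB_API int mylibSupportedCpu();

// Not exported: internal helper, hidden by -fvisibility=hidden.
int internalDetail();
```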
2018-11-29  sparse convolution output processing (#27)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/27
DoSpmdmOnInpBuffer can't be used together with PackAWithIm2Col because DoSpmdmOnInpBuffer expects an im2col'ed A matrix. This diff implements DoSConvOnInpBuffer, which does sparse convolution directly on the A input without im2col. The performance is well optimized, and we need to see if this implementation is good enough to get good resnet50 performance.
Reviewed By: dskhudia
Differential Revision: D13192336
fbshipit-source-id: 2076555ba9749e111afbaec408a2bfa0f55bd5bc
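A sketch of the general idea (names and layout hypothetical, not the library's actual kernel): instead of running im2col and then a sparse matrix product, iterate over the nonzero weights and accumulate straight out of the activation tensor.

```cpp
// Illustrative direct sparse convolution: accumulate each nonzero weight
// against the activation without materializing an im2col buffer.
// Assumes stride 1 and no padding, so OH = IH - KH + 1 and OW = IW - KW + 1.
#include <cstdint>
#include <vector>

struct NonZeroWeight { int oc, ic, kh, kw; int8_t value; };

// Activation layout: [IC][IH][IW]; output accumulators: [OC][OH][OW].
void sparseDirectConv(const std::vector<NonZeroWeight>& weights,
                      const std::vector<uint8_t>& act, int IC, int IH, int IW,
                      std::vector<int32_t>& out, int OC, int OH, int OW) {
  for (const auto& w : weights)
    for (int oh = 0; oh < OH; ++oh)
      for (int ow = 0; ow < OW; ++ow) {
        int ih = oh + w.kh, iw = ow + w.kw;
        out[(w.oc * OH + oh) * OW + ow] +=
            int32_t(w.value) * int32_t(act[(w.ic * IH + ih) * IW + iw]);
      }
}
```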
2018-11-27  per-group and per-channel quantization (#14340)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14340
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/25
Per-group and per-channel quantization in fbgemm.
This diff also cleans up explicit template instantiation using macro expansion.
This diff also changes the randFill interface, which made it easy to mistakenly generate integer random numbers for floating-point vectors.
Using this in DNNLOWP operators will be done in a separate diff.
Reviewed By: dskhudia
Differential Revision: D13176386
fbshipit-source-id: e46c53e31e21520bded71b8ed86e8b19e010e2dd
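For reference, the difference between the granularities (a sketch, not FBGEMM's API): per-tensor quantization applies one scale and zero point to the whole weight matrix, while per-channel (or per-group) quantization keeps one scale per output channel and indexes it by the output column during requantization.

```cpp
// Illustrative per-channel requantization of an int32 accumulator matrix.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void requantizePerChannel(const std::vector<int32_t>& acc,     // M x N accumulators
                          std::vector<uint8_t>& out, int M, int N,
                          float actScale,
                          const std::vector<float>& weightScale, // one per column
                          float outScale, int32_t outZeroPoint) {
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) {
      float multiplier = actScale * weightScale[j] / outScale;  // per-channel
      int32_t q = outZeroPoint +
                  int32_t(std::lround(acc[i * N + j] * multiplier));
      out[i * N + j] = uint8_t(std::min(255, std::max(0, q)));  // saturate to uint8
    }
}
```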
2018-11-27  fix group convention in B packing (#26)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/26
Set the convention that group is the leading (slowest moving) dimension of B.
Reviewed By: dskhudia
Differential Revision: D13176477
fbshipit-source-id: 64d5f168434e7fa0f90b46b0a8559569804c844b
2018-11-26  remove unnecessary zero_point argument from constructors (#14323)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14323
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/24
As title says.
Reviewed By: dskhudia
Differential Revision: D13167073
fbshipit-source-id: 6d6c526fd6e29a14e97f71a0881f28ada8703107
2018-11-22  optimize for symmetric quantization in requantization (#18)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/18
Optimize requantization when A_zero_point == 0, B_zero_point == 0, or bias == nullptr.
Reviewed By: jianyuh
Differential Revision: D13152110
fbshipit-source-id: f36aa844bef9017ade2b32df1703947e4d53e3e3
2018-11-20  Optimize parallelization performance (#15)  [Jianyu Huang]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/15
Better balance the workload among different threads.
Reviewed By: jspark1105
Differential Revision: D13108873
fbshipit-source-id: ae75971b5ff2cc7cf19907eb95cf2df071f7bbe3
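One simple way to get an even split, shown as a sketch (not the library's actual scheduling code): instead of giving every thread ceil(M / nthreads) rows and leaving the last thread underloaded, distribute the remainder one row at a time so each thread gets either the floor or the ceiling of the average.

```cpp
// Illustrative even partition of M rows across nthreads workers.
#include <utility>

std::pair<int, int> rowRange(int M, int nthreads, int tid) {
  int base = M / nthreads;       // every thread gets at least this many rows
  int extra = M % nthreads;      // the first `extra` threads get one more row
  int begin = tid * base + (tid < extra ? tid : extra);
  int end = begin + base + (tid < extra ? 1 : 0);
  return {begin, end};           // thread `tid` processes rows [begin, end)
}
```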
2018-11-20  A function to check if we are running on a fbgemm supported cpu (#13)  [Daya S Khudia]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/13
See title.
Reviewed By: jianyuh
Differential Revision: D13131301
fbshipit-source-id: 2dafdf0fe3dfd26f1b944d550d6cce29f3653a74
2018-11-16  grouped (batched) gemm (#7)  [Jongsoo Park]
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/7
This diff allows groups > 1. A separate diff will cover im2col + gemm fusion and conv with group > 1.
Reviewed By: jianyuh
Differential Revision: D13039210
fbshipit-source-id: f7b3b0dbdb67fc6bc865de88292f034b252d029d
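Conceptually, a grouped GEMM runs G independent multiplications on slices of the operands. The sketch below assumes a row-major layout with the groups laid out side by side along the K and N dimensions, and with group as the slowest-moving dimension of B (matching the packing convention mentioned above); it is an illustration, not FBGEMM's kernel.

```cpp
// Illustrative grouped GEMM: each of the G groups is an independent
// GEMM of shape (M, N/G, K/G) on its slice of A, B, and C.
#include <cstdint>
#include <vector>

void groupedGemm(const std::vector<int8_t>& A,   // M x K, row-major
                 const std::vector<int8_t>& B,   // G blocks of (K/G) x (N/G)
                 std::vector<int32_t>& C,        // M x N, row-major
                 int M, int N, int K, int G) {
  int Kg = K / G, Ng = N / G;
  for (int g = 0; g < G; ++g)                    // groups are independent
    for (int m = 0; m < M; ++m)
      for (int n = 0; n < Ng; ++n) {
        int32_t acc = 0;
        for (int k = 0; k < Kg; ++k)
          acc += int32_t(A[m * K + g * Kg + k]) *
                 int32_t(B[(g * Kg + k) * Ng + n]);
        C[m * N + g * Ng + n] = acc;
      }
}
```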
2018-11-11  Fix the issue caused by isA(); Remove use of avx512 cast intrinsics for GCC 4.9.2 compatibility  [Jianyu Huang]
2018-11-08  Sync with internal copy: Asymmetric padding; fbgemm2 -> fbgemm  [Jianyu Huang]
2018-11-06  Add equals and metaEquals method to PackBMatrix  [Jongsoo Park]
2018-11-06  generalized conv_param_t and download third party libraries in build dir  [dskhudia]
2018-11-04  Syncing with internal version. Fixes for Mac/clang build. Other minor fixes  [dskhudia]
2018-10-31  Initial commit  [Daya S Khudia]