Age | Commit message | Author |
|
Summary: fbgemmPacked and fbgemmConv api changes to take float bias.
Reviewed By: jianyuh
Differential Revision: D17244262
fbshipit-source-id: 0531c829190d20e31cb957a3f1861d4a65645cee
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/114
Adding the VNNI support in FBGEMM.
Previously, we ran into a CMake version issue: the PyTorch and FBGEMM OSS tests build with CMake 3.5, while ASMJIT required CMake 3.8+, which broke the build on some platforms. The issue is now resolved by a PR to ASMJIT that lowers its CMake requirement: https://github.com/asmjit/asmjit/pull/252.
Reviewed By: dskhudia
Differential Revision: D16720839
fbshipit-source-id: e5e5f2d26f924df8d9fb955f4a3758561fa73288
|
|
Summary:
Original commit changeset: fcaa13cc3159
ASMJIT requires CMake 3.8+.
However, FBGEMM and PyTorch only require CMake 3.5+.
This caused a build failure in FBGEMM:
https://circleci.com/gh/pytorch/FBGEMM/122#build-timing/containers/0
Reviewed By: dskhudia
Differential Revision: D16670547
fbshipit-source-id: 506714c3db1cb82cf98895f58f82f235128f5285
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/113
Adding the VNNI support in FBGEMM.
Reviewed By: dskhudia
Differential Revision: D16276574
fbshipit-source-id: 832ccdb27339489ebc138f3b2678e53d107c1b79
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/94
If we don't call cpuinfo_initialize beforehand, fbgemmHasAvx2/512Support will always return false. We should be really careful about this.
Reviewed By: jianyuh
Differential Revision: D14994129
fbshipit-source-id: b78028f0543d05595caaa627be2feb743d0694b1
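Note: the ordering requirement described in this commit can be sketched as a guard that runs the one-time initialization before any feature query. This is only an illustration under stated assumptions: `initializeCpuInfo`, `rawHasAvx2`, and `hasAvx2Support` are hypothetical stand-ins, not the cpuinfo or FBGEMM functions, and the stand-in query simply pretends the machine supports AVX2 once initialization has run.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical stand-in for the library's "has been initialized" state.
inline bool& initializedFlag() {
  static bool flag = false;
  return flag;
}

// Hypothetical stand-in for a one-time CPU-info initialization call.
inline bool initializeCpuInfo() {
  initializedFlag() = true;
  return true;
}

// Hypothetical stand-in for a raw feature query: like the behavior described
// in the commit message, it reports "false" until initialization has run.
inline bool rawHasAvx2() {
  return initializedFlag();
}

// Guarded query: a function-local static is initialized exactly once
// (thread-safe since C++11), so initialization always precedes the query.
inline bool hasAvx2Support() {
  static const bool initialized = initializeCpuInfo();
  return initialized && rawHasAvx2();
}

} // namespace sketch
```

Calling `sketch::rawHasAvx2()` first reproduces the "always false" pitfall; routing every query through `sketch::hasAvx2Support()` avoids it.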
|
|
(#90)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/90
Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave)
Reviewed By: dskhudia
Differential Revision: D14358148
fbshipit-source-id: 783fb4653fd696dbbd4075ad56cb8682db3011a5
|
|
Summary: In D14507536 and D14516232, small-N cases suffered when we increased NR. This fixes those cases.
Reviewed By: jianyuh
Differential Revision: D14529494
fbshipit-source-id: 6f53797948de760d6ed24b767cbbe8d27768660f
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/84
Add AVX512BW Check:
AVX-512 Byte and Word Instructions add support for 8-bit and 16-bit integer operations such as vpmaddubsw.
Similarly, add AVX512VL/DQ check.
Reviewed By: jspark1105
Differential Revision: D14321050
fbshipit-source-id: bd34745fd488ce4efe3248aeb78c54e1c2d91d47
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/77
As title
Reviewed By: protonu
Differential Revision: D14124479
fbshipit-source-id: 3a44a1de8bf5da75e0d69d98d93f55b6b058b7ce
|
|
multiple of NR
Summary:
Before this Diff:
we pass nc = NCB (packedB_.blockColSize()) into the JIT kernel instead of nc = leftover size (packedB_.lastBcol()) for the last block of B (diffusion/FBS/browse/master/fbcode/deeplearning/fbgemm/src/ExecuteKernelU8S8.cc;1adfe7977ef7ea2a1aee0ed785bd3fed5b7c4a20$102), which causes additional computation when n is small.
After this Diff:
we pass a smaller portion of NCB (still a multiple of NR) into the JIT kernel for the last block of B.
The main performance gain is for Acc16, because NCB = 4 * NR for Acc16 and NCB = NR for Acc32 in our current settings (AVX2 and AVX512).
Reviewed By: jspark1105
Differential Revision: D14063628
fbshipit-source-id: 5829d06553daf617e2fefa7d26cb2d761af402c1
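Note: the "still a multiple of NR" rule for the last block of B can be sketched as a round-up of the leftover column count. The helper name `kernelColsForBlock` and its parameters are hypothetical and not part of FBGEMM; the sketch assumes NCB is itself a multiple of NR, as in the Acc16 (NCB = 4 * NR) and Acc32 (NCB = NR) settings mentioned above.

```cpp
#include <cassert>

// Hypothetical helper, not an FBGEMM function: number of columns (nc) handed
// to the kernel for one block of B.
//   ncb       : full column block size (NCB)
//   nr        : register tile width (NR)
//   last_bcol : leftover columns in the last block, 1 <= last_bcol <= ncb
inline int kernelColsForBlock(bool is_last_block, int last_bcol, int ncb, int nr) {
  assert(nr > 0 && ncb % nr == 0);
  assert(last_bcol >= 1 && last_bcol <= ncb);
  if (!is_last_block) {
    return ncb; // full blocks always use the whole NCB
  }
  // Round the leftover size up to the next multiple of NR instead of
  // padding all the way to NCB.
  return (last_bcol + nr - 1) / nr * nr;
}
```

With illustrative values NCB = 64 and NR = 16, a last block with 10 leftover columns is processed as 16 columns rather than 64. When NCB = NR the round-up always yields NCB, which is consistent with the gain being mainly for Acc16.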
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/69
This diff prepares for D14013931 that folds column offsets into bias.
In depthwise convolution, we allow passing column_offsets == nullptr, which means the column offsets are already folded into the bias. We bypass adding column_offsets * A_zero_point if either column_offsets == nullptr or A_zero_point == 0.
Reviewed By: jianyuh
Differential Revision: D14017772
fbshipit-source-id: ad4a79402f43cbf78dbad68e1bff6d07c19dded0
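Note: the bypass condition above can be sketched for a single output channel as follows. The function name `applyColumnOffset` is hypothetical, and the sign of the correction (subtracting A_zero_point * column_offsets[channel]) is an assumption taken from the usual expansion of (A - A_zero_point) * B, not something the commit message states.

```cpp
#include <cstdint>

// Hypothetical sketch, not the FBGEMM implementation: apply the column-offset
// correction to one int32 accumulator value.
inline std::int32_t applyColumnOffset(std::int32_t acc,
                                      std::int32_t A_zero_point,
                                      const std::int32_t* column_offsets,
                                      int channel) {
  // nullptr means the offsets were already folded into the bias;
  // a zero A_zero_point makes the correction term vanish anyway.
  if (column_offsets == nullptr || A_zero_point == 0) {
    return acc;
  }
  // Assumed sign convention: the correction is subtracted.
  return acc - A_zero_point * column_offsets[channel];
}
```

Either condition lets the caller skip both the multiplication and the memory read of the offsets.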
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/51
Use Clang formatting with "clang-format -i *.cc *.h".
Reviewed By: dskhudia
Differential Revision: D13532121
fbshipit-source-id: 6792d008f3295c128942f4896e8221aebbf2566e
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/47
PackAMatrix (compared to PackAWithRowOffset) can be a faster alternative when B_zero_point = 0
Reviewed By: jianyuh
Differential Revision: D13413605
fbshipit-source-id: 2cac4560e8f166d19c58c65ae25400d1b0795b19
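Note: one way to see why the row offsets become unnecessary when B_zero_point = 0 is to expand a single quantized dot product. The functions below are a self-contained illustration with hypothetical names, not FBGEMM code; the sketch assumes the row offset is the sum of one row of A.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: sum_k (a[k] - a_zp) * (b[k] - b_zp).
inline std::int32_t dotReference(const std::vector<std::uint8_t>& a,
                                 const std::vector<std::int8_t>& b,
                                 std::int32_t a_zp, std::int32_t b_zp) {
  assert(a.size() == b.size());
  std::int32_t sum = 0;
  for (std::size_t k = 0; k < a.size(); ++k) {
    sum += (static_cast<std::int32_t>(a[k]) - a_zp) *
           (static_cast<std::int32_t>(b[k]) - b_zp);
  }
  return sum;
}

// Expanded form: raw - b_zp * row_offset - a_zp * col_offset + K * a_zp * b_zp.
// The row offset (sum of a) is only ever multiplied by b_zp, so it is not
// needed when b_zp == 0.
inline std::int32_t dotWithOffsets(const std::vector<std::uint8_t>& a,
                                   const std::vector<std::int8_t>& b,
                                   std::int32_t a_zp, std::int32_t b_zp) {
  assert(a.size() == b.size());
  std::int32_t raw = 0, row_offset = 0, col_offset = 0;
  for (std::size_t k = 0; k < a.size(); ++k) {
    raw += static_cast<std::int32_t>(a[k]) * static_cast<std::int32_t>(b[k]);
    row_offset += a[k];
    col_offset += b[k];
  }
  const std::int32_t K = static_cast<std::int32_t>(a.size());
  return raw - b_zp * row_offset - a_zp * col_offset + K * a_zp * b_zp;
}
```

Both functions agree for any zero points; with b_zp = 0 the `row_offset` term drops out, which is presumably where the saving of the plain packing routine comes from.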
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/27
DoSpmdmOnInpBuffer can't be used together with PackAWithIm2Col because DoSpmdmOnInpBuffer expects an im2col'ed A matrix. This diff implements DoSConvOnInpBuffer, which does sparse convolution directly on the A input without im2col. The performance is well optimized, and we still need to see whether this implementation is good enough to get good resnet50 performance.
Reviewed By: dskhudia
Differential Revision: D13192336
fbshipit-source-id: 2076555ba9749e111afbaec408a2bfa0f55bd5bc
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14340
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/25
Per-group and per-channel quantization in fbgemm
This diff also cleans up explicit template instantiation using macro expansion
This diff also changes the randFill interface, which made it easy to mistakenly generate integer random numbers for floating-point vectors.
Using this in DNNLOWP operators will be done in a separate diff.
Reviewed By: dskhudia
Differential Revision: D13176386
fbshipit-source-id: e46c53e31e21520bded71b8ed86e8b19e010e2dd
|
|
Test (#14)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/14
This diff triggered a concurrency bug in the unit test.
It is weird that there are no errors for "SpMDMTest", while errors are reported for "NoRequantizeTest".
Update 1:
There might be a problem with the "memCopy" function. I changed "Cint32_buffer.data()" to "Cint32_fb.data()" (see my inline comment) so that the accumulation buffer and the output buffer are the same, and the result then appears to be correct.
I discussed this with Daya and now understand why this unit test fails:
- For the purpose of this unit test, we should just use the same buffer "Cint32_fb.data()" for the accumulation and output. Not sure why this issue was not found in the original code.
- If the thread number is not 1 and we use different buffers ("Cint32_buffer" for accumulation and "Cint32_fb" for output), then the pointer "Cint32_buffer.data()" is shared by all threads. When doing the accumulation inside "ExecuteKernelU8S8.cc", different threads write to the same memory location, as the code below shows:
    int32_t* C_buffer_row_start = C_buffer_ +
        ((C_buffer_ == reinterpret_cast<int32_t*>(matC_)) ? row_start_A * ldc_
                                                          : 0);
- If the thread number is not 1 and we use the same buffer "Cint32_fb.data()" for accumulation and output, then according to the above code different threads write to different memory locations.
Update 2:
I added a new test case "{1024, 512, 258}" to the Acc16 and Acc32 unit tests. "PackedRequantizeAcc16Test" runs well, but "PackedRequantizeTest" is broken.
Update 3:
I changed the above code snippet to
    int32_t* C_buffer_row_start = C_buffer_ + row_start_A * ldc_;
With this change, both the Acc16 and Acc32 tests pass. Different threads now always write to different memory locations.
Update 4:
Jongsoo comments that reusing the first row block of C_buffer_ is mostly a cache optimization, not a way to reduce the memory allocation size (this was making a big difference in xray ocr perf; don't remember the exact number). The right thing to do is to have each thread use a different portion of C_buffer_.
So I optimized the above code snippet to
    // If the accumulation buffer C_buffer_ is the same as matC_ (in-place
    // output processing), then each thread uses a different part of the
    // output buffer matC_;
    // otherwise, each thread uses a different portion of the accumulation
    // buffer C_buffer_. Note that each thread can use at most an MC * n
    // portion of C_buffer_. If the number of threads is 1, the only thread
    // (thread 0) will always reuse the first row block of C_buffer_.
    int32_t* C_buffer_row_start = C_buffer_ +
        ((C_buffer_ == reinterpret_cast<int32_t*>(matC_))
             ? row_start_A * ldc_
             : std::min(thread_id_ * mbSize_ * ldc_, row_start_A * ldc_));
Note that `thread_id` and `num_threads` are passed as arguments into `ExecuteKernel`.
Update 5:
Rebase. Also add the parts of D12937408 to remove the dependency.
Reviewed By: jspark1105
Differential Revision: D13001149
fbshipit-source-id: b16c20863dc467de6faaefcaf1134cf1036f8a65
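Note: the offset rule from Update 4 can be isolated into a small function to check which region each thread writes to. The name `cBufferRowOffset` and its parameter list are hypothetical; the arithmetic mirrors the snippet in the commit message, with the offset computed in 64-bit to avoid overflow on large buffers (an addition of this sketch, not of the original code).

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: element offset into the accumulation buffer for the
// row block starting at row_start_A.
//   in_place : true when the accumulation buffer is the output buffer itself
//   mb_size  : rows per row block; ldc : leading dimension of C
inline std::int64_t cBufferRowOffset(bool in_place, int thread_id, int mb_size,
                                     int ldc, int row_start_A) {
  const std::int64_t in_place_offset =
      static_cast<std::int64_t>(row_start_A) * ldc;
  if (in_place) {
    // Each row block maps to its own rows of the output buffer.
    return in_place_offset;
  }
  // Separate accumulation buffer: each thread reuses its own row-block-sized
  // slot, never going past the in-place position.
  const std::int64_t per_thread_offset =
      static_cast<std::int64_t>(thread_id) * mb_size * ldc;
  return std::min(per_thread_offset, in_place_offset);
}
```

With a single thread and a separate buffer the offset is always 0, so thread 0 keeps reusing the first row block, which preserves the cache benefit Jongsoo describes.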
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/7
This diff allows groups > 1. A separate diff will cover im2col + gemm fusion and conv with groups > 1.
Reviewed By: jianyuh
Differential Revision: D13039210
fbshipit-source-id: f7b3b0dbdb67fc6bc865de88292f034b252d029d
|
|