Summary:
Pass the blocking params in to compute the correct buffer size for each group.
This fixes the bug for this conv shape:
`conv_param_t<2>(1, 32, 16, {12, 14}, 4, {3, 3}, {1, 1}, {0, 0, 0, 0})`
The corresponding GEMM dimensions are M, N, K = 120, 4, 288 (derived in the sketch after the parameter list), with these blocking params:
BlockingFactors params;
params.MCB = 48;
params.NCB = 16;
params.KCB = 256;
params.MR = 1;
params.NR = 16;
params.ROW_INTERLEAVE = 4;
params.NR_MIN = 16;
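For reference, a minimal standalone sketch (not part of this diff; the im2col bookkeeping below is standard and the variable names are illustrative) showing how the quoted M, N, K follow from this conv shape:
```cpp
#include <cstdio>

int main() {
  // conv_param_t<2>(MB=1, IC=32, OC=16, IN_DIM={12,14}, G=4, K={3,3},
  //                 stride={1,1}, pad={0,0,0,0})
  int MB = 1, IC = 32, OC = 16, G = 4;
  int IH = 12, IW = 14, KH = 3, KW = 3; // stride 1, no padding
  int OH = IH - KH + 1;                 // 10
  int OW = IW - KW + 1;                 // 12
  int M = MB * OH * OW;                 // 120 rows of the im2col'ed A
  int N = OC / G;                       // 4 output channels per group
  int K = IC * KH * KW;                 // 288 im2col columns across all groups
  std::printf("M=%d N=%d K=%d\n", M, N, K);
  return 0;
}
```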
Reviewed By: jianyuh
Differential Revision: D16571367
fbshipit-source-id: 27c9b003d37c4d3d13767227e8343d44668823d6
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/103
In the same spirit of D16085552, we do the following in this Diff:
- Refactor the pack/unpack code for PackB: use the same `pack_unpack_` function for both `pack` and `unpack` (a sketch of the pattern follows this list).
- Add a unit test.
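A rough illustration of the shared-routine pattern described above (a toy transposed layout, not the actual FBGEMM PackB code; every name except `pack_unpack_` is made up):
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class PackedBSketch {
 public:
  PackedBSketch(int K, int N) : K_(K), N_(N), buf_(static_cast<std::size_t>(K) * N) {}

  void pack(const std::int8_t* src) {
    pack_unpack_(const_cast<std::int8_t*>(src), buf_.data(), /*ispack=*/true);
  }
  void unpack(std::int8_t* dst) {
    pack_unpack_(dst, buf_.data(), /*ispack=*/false);
  }

 private:
  // One traversal of the layout; `ispack` selects the copy direction, so the
  // address arithmetic is written (and tested) only once for both directions.
  void pack_unpack_(std::int8_t* unpacked, std::int8_t* packed, bool ispack) {
    for (int k = 0; k < K_; ++k) {
      for (int n = 0; n < N_; ++n) {
        std::size_t u = static_cast<std::size_t>(k) * N_ + n; // row-major index
        std::size_t p = static_cast<std::size_t>(n) * K_ + k; // "packed" (transposed) index
        if (ispack) {
          packed[p] = unpacked[u];
        } else {
          unpacked[u] = packed[p];
        }
      }
    }
  }

  int K_, N_;
  std::vector<std::int8_t> buf_;
};
```
A pack-then-unpack round trip that reproduces the original buffer is the natural unit test for this kind of refactor.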
Reviewed By: dskhudia
Differential Revision: D16160767
fbshipit-source-id: 7fb7006750537b0705a180f2014c786298a1c615
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/78
Increase test coverage, e.g., transposing A.
Reviewed By: protonu
Differential Revision: D14121297
fbshipit-source-id: a6e21442dc47e8cd725b795dbaf8614719f013fb
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/51
Use Clang formatting with "clang-format -i *.cc *.h".
Reviewed By: dskhudia
Differential Revision: D13532121
fbshipit-source-id: 6792d008f3295c128942f4896e8221aebbf2566e
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14340
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/25
Per-group and per-channel quantization in fbgemm
This diff also cleans up explicit template instantiations using macro expansion.
This diff also changes the randFill interface, which previously made it easy to mistakenly fill floating-point vectors with integer random numbers (a sketch of a type-aware interface is below).
Using this in DNNLOWP operators will be done in a separate diff.
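A minimal sketch of the kind of type-aware interface this refers to (illustrative only; the actual FBGEMM signature may differ): the distribution is chosen from the element type, so a floating-point vector can no longer be filled from an integer distribution by accident.
```cpp
#include <random>
#include <type_traits>
#include <vector>

// Integral element types get uniform_int_distribution; floating-point types
// get uniform_real_distribution.
template <typename T>
void randFill(std::vector<T>& vec, T low, T high) {
  static std::mt19937 gen(42); // fixed seed for reproducible benchmarks/tests
  using Dist = typename std::conditional<
      std::is_integral<T>::value,
      std::uniform_int_distribution<T>,
      std::uniform_real_distribution<T>>::type;
  Dist dist(low, high);
  for (auto& v : vec) {
    v = dist(gen);
  }
}
```
Narrow integer types such as int8_t would need to draw from an int distribution and cast, since uniform_int_distribution does not support char-sized types.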
Reviewed By: dskhudia
Differential Revision: D13176386
fbshipit-source-id: e46c53e31e21520bded71b8ed86e8b19e010e2dd
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/26
Set the convention that group is the leading (slowest moving) dimension of B.
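Concretely (a small illustrative example, not code from the diff): with group leading, element (g, k, n) of a G x K x N weight tensor B sits at flat offset g * K * N + k * N + n.
```cpp
#include <cstdio>

int main() {
  const int K = 72, N = 4; // per-group dims; G = 4 groups only bounds valid g
  auto offset = [&](int g, int k, int n) { return (g * K + k) * N + n; };
  // Group is the slowest moving index: bumping g jumps a whole K * N block,
  // while n is the fastest moving index.
  std::printf("B(1,0,0) -> %d, B(0,1,0) -> %d, B(0,0,1) -> %d\n",
              offset(1, 0, 0), offset(0, 1, 0), offset(0, 0, 1)); // 288, 4, 1
  return 0;
}
```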
Reviewed By: dskhudia
Differential Revision: D13176477
fbshipit-source-id: 64d5f168434e7fa0f90b46b0a8559569804c844b
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14323
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/24
As title says.
Reviewed By: dskhudia
Differential Revision: D13167073
fbshipit-source-id: 6d6c526fd6e29a14e97f71a0881f28ada8703107
|
|
Test (#14)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/14
This diff triggered a concurrency bug in the unit test.
Oddly, there are no errors for "SpMDMTest", while errors are reported for "NoRequantizeTest".
Update 1:
There might be a problem with the "memCopy" function. I changed "Cint32_buffer.data()" to "Cint32_fb.data()" (see my inline comment) so that the accumulation buffer and the output buffer are the same; with that change the output is correct.
After a discussion with Daya, I now understand why this unit test fails:
- For the purpose of this unit test, we should just use the same buffer "Cint32_fb.data()" for both accumulation and output. It is not clear why this issue was not caught in the original code.
- If the number of threads is not 1 and we use different buffers ("Cint32_buffer" for accumulation and "Cint32_fb" for output), then the pointer "Cint32_buffer.data()" is shared by all threads. When doing the accumulation inside "ExecuteKernelU8S8.cc", different threads will write to the same memory locations; check the code below:
int32_t* C_buffer_row_start = C_buffer_ +
    ((C_buffer_ == reinterpret_cast<int32_t*>(matC_)) ? row_start_A * ldc_
                                                      : 0);
- If the number of threads is not 1 and we use the same buffer "Cint32_fb.data()" for both accumulation and output, then, according to the above code, different threads will write to different memory locations.
Update 2:
I added a new test case "{1024, 512, 258}" to the Acc16 and Acc32 unit tests. "PackedRequantizeAcc16Test" passes, but "PackedRequantizeTest" is broken.
Update 3:
I change the above code snippet to
int32_t* C_buffer_row_start = C_buffer_ + row_start_A * ldc_;
Finally, both the Acc16 and Acc32 tests pass. Now different threads always write to different memory locations.
Update 4:
Jongsoo comments that reusing the first row block of C_buffer_ is mostly an optimization for cache, not for memory allocation size (this was making a big difference in xray ocr perf; exact number not remembered). The right thing to do is to have each thread use a different portion of C_buffer_.
So I optimize the above code snippet to
// If the accumulation buffer C_buffer_ is the same as matC_ (in-place output
// processing), then each thread uses a different part of the output buffer
// matC_; otherwise, each thread uses a different portion of the accumulation
// buffer C_buffer_. Note that each thread can use at most an MC * n portion of
// C_buffer_. If the number of threads is 1, the only thread (thread 0) will
// always reuse the first row block of C_buffer_.
int32_t* C_buffer_row_start = C_buffer_ +
    ((C_buffer_ == reinterpret_cast<int32_t*>(matC_))
         ? row_start_A * ldc_
         : std::min(thread_id_ * mbSize_ * ldc_, row_start_A * ldc_));
Note that `thread_id` and `num_threads` are passed as arguments into `ExecuteKernel`.
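To make the partitioning concrete, a standalone sketch (illustrative values, not FBGEMM code) of how the per-thread offsets into the scratch buffer stay disjoint when the accumulation buffer is separate from matC_:
```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Illustrative values: 4 threads, MC (mbSize) = 48 rows per block, ldc = 16.
  const int num_threads = 4, mbSize = 48, ldc = 16;
  for (int thread_id = 0; thread_id < num_threads; ++thread_id) {
    // Each thread owns at most an MC-row slice of the scratch C_buffer_,
    // starting at thread_id * mbSize * ldc, so the slices never overlap.
    const std::int64_t begin =
        static_cast<std::int64_t>(thread_id) * mbSize * ldc;
    const std::int64_t end = begin + static_cast<std::int64_t>(mbSize) * ldc;
    std::printf("thread %d -> C_buffer_[%lld, %lld)\n", thread_id,
                static_cast<long long>(begin), static_cast<long long>(end));
  }
  return 0;
}
```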
Update 5:
Rebase. Also add the parts of D12937408 to remove the dependency.
Reviewed By: jspark1105
Differential Revision: D13001149
fbshipit-source-id: b16c20863dc467de6faaefcaf1134cf1036f8a65
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/11
clang-format of fbgemm
Reviewed By: dskhudia
Differential Revision: D13115202
fbshipit-source-id: 6dab29cb8b5f4fabcc165019663351567a2a2952
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/7
This diff allows groups > 1 (a reference sketch of grouped GEMM follows). A separate diff will handle im2col + gemm fusion and conv with group > 1.
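For orientation, a plain reference loop for what groups > 1 means at the GEMM level (an illustrative reference implementation, not the optimized FBGEMM path; layout assumptions are stated in the comments):
```cpp
#include <vector>

// With G groups, A is M x (G*K_g) row-major with the groups contiguous along K,
// B holds G blocks of K_g x N_g with group as the leading dimension, and
// C is M x (G*N_g) row-major. Each group is an independent small GEMM.
void grouped_gemm_ref(int M, int N_g, int K_g, int G,
                      const std::vector<float>& A,
                      const std::vector<float>& B,
                      std::vector<float>& C) {
  for (int g = 0; g < G; ++g) {
    for (int m = 0; m < M; ++m) {
      for (int n = 0; n < N_g; ++n) {
        float acc = 0.f;
        for (int k = 0; k < K_g; ++k) {
          acc += A[m * G * K_g + g * K_g + k] * B[(g * K_g + k) * N_g + n];
        }
        C[m * G * N_g + g * N_g + n] = acc;
      }
    }
  }
}
```
Since each group is an independent GEMM, packing and buffer sizing have to be done per group.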
Reviewed By: jianyuh
Differential Revision: D13039210
fbshipit-source-id: f7b3b0dbdb67fc6bc865de88292f034b252d029d
|
|