
github.com/marian-nmt/FBGEMM.git
Age | Commit message | Author
2018-11-29 | sparse convolution output processing (#27) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/27

DoSpmdmOnInpBuffer can't be used together with PackAWithIm2Col because DoSpmdmOnInpBuffer expects an im2col'ed A matrix. This diff implements DoSConvOnInpBuffer, which performs sparse convolution directly on the A input without im2col. The performance is not yet well optimized; we need to see whether this implementation is good enough to get good ResNet-50 performance.

Reviewed By: dskhudia
Differential Revision: D13192336
fbshipit-source-id: 2076555ba9749e111afbaec408a2bfa0f55bd5bc
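For context on why the existing path needs an im2col'ed A: it multiplies a sparse weight matrix against a dense activation matrix whose columns are already unrolled convolution patches. A minimal CSR-based SpMDM sketch of that shape (the struct and function names are illustrative, not FBGEMM's actual API):

    #include <cstdint>
    #include <vector>

    // Illustrative CSR sparse matrix: values plus column indices per row.
    struct CsrMatrix {
      int rows, cols;
      std::vector<int> row_ptr;   // size rows + 1
      std::vector<int> col_idx;   // size nnz
      std::vector<int8_t> values; // size nnz
    };

    // C += S * A, where A is a dense (k x n) matrix such as an im2col'ed
    // input. This is the shape a SpMDM-on-input-buffer routine expects:
    // A must already be im2col'ed.
    void spmdm(const CsrMatrix& S, const uint8_t* A, int n, int32_t* C) {
      for (int i = 0; i < S.rows; ++i) {
        for (int p = S.row_ptr[i]; p < S.row_ptr[i + 1]; ++p) {
          const int k = S.col_idx[p];
          const int32_t w = S.values[p];
          for (int j = 0; j < n; ++j) {
            C[i * n + j] += w * static_cast<int32_t>(A[k * n + j]);
          }
        }
      }
    }

DoSConvOnInpBuffer, per the summary above, instead walks the convolution geometry itself so the A input never has to be expanded.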
2018-11-27 | per-group and per-channel quantization (#14340) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14340
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/25

Per-group and per-channel quantization in fbgemm. This diff also cleans up explicit template instantiation using macro expansion, and changes the randFill interface, which previously made it easy to mistakenly generate integer random numbers for floating-point vectors. Using this in DNNLOWP operators will be done in a separate diff.

Reviewed By: dskhudia
Differential Revision: D13176386
fbshipit-source-id: e46c53e31e21520bded71b8ed86e8b19e010e2dd
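The idea behind per-channel quantization is that each output channel gets its own scale and zero point computed from that channel's own value range, rather than one pair for the whole tensor. A minimal sketch under that assumption (names are hypothetical; fbgemm's real interfaces differ):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct ChannelQParams {
      float scale;
      int32_t zero_point;
    };

    // Pick uint8 quantization parameters from one channel's min/max range,
    // forcing the range to include 0 so it is exactly representable.
    ChannelQParams chooseQParams(const float* x, int n) {
      float lo = 0.f, hi = 0.f;
      for (int i = 0; i < n; ++i) {
        lo = std::min(lo, x[i]);
        hi = std::max(hi, x[i]);
      }
      const float scale = (hi > lo) ? (hi - lo) / 255.f : 1.f;
      const int32_t zp = static_cast<int32_t>(std::round(-lo / scale));
      return {scale, zp};
    }

    // Quantize a row-major weight matrix W[channels][k], one qparam pair
    // per channel (per-group quantization is the same idea with one pair
    // per group of channels).
    void quantizePerChannel(const float* W, int channels, int k,
                            uint8_t* Wq, std::vector<ChannelQParams>& qp) {
      qp.resize(channels);
      for (int c = 0; c < channels; ++c) {
        qp[c] = chooseQParams(W + c * k, k);
        for (int i = 0; i < k; ++i) {
          const float q = W[c * k + i] / qp[c].scale + qp[c].zero_point;
          Wq[c * k + i] =
              static_cast<uint8_t>(std::min(255.f, std::max(0.f, std::round(q))));
        }
      }
    }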
2018-11-23 | parallelization over groups (#23) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/23

First parallelize over groups, then parallelize within each group.

Reviewed By: jianyuh
Differential Revision: D13166764
fbshipit-source-id: 58da644ec5fbd5d6e3e87d46790b9199dded6889
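A sketch of that two-level decomposition with OpenMP (the partitioning scheme here is illustrative; fbgemm's internal scheduling differs in detail): threads are first split into teams across groups, and the threads that land on the same group then split that group's rows.

    #include <omp.h>
    #include <algorithm>

    // Process `rows` rows in each of `num_groups` independent groups.
    void parallelOverGroups(int num_groups, int rows,
                            void (*processRow)(int group, int row)) {
    #pragma omp parallel
      {
        const int nthreads = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        // First level: split threads into one team per group (capped at
        // the thread count); the last team absorbs any leftover threads.
        const int teams = std::min(nthreads, num_groups);
        const int per_team = nthreads / teams;
        const int team = std::min(tid / per_team, teams - 1);
        const int rank = tid - team * per_team;
        const int team_size =
            (team == teams - 1) ? nthreads - team * per_team : per_team;
        // Second level: each team walks its groups in a strided fashion,
        // and within a group its threads split the rows.
        for (int g = team; g < num_groups; g += teams) {
          const int begin = (rows * rank) / team_size;
          const int end = (rows * (rank + 1)) / team_size;
          for (int row = begin; row < end; ++row) {
            processRow(g, row);
          }
        }
      }
    }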
2018-11-20 | Optimize parallelization performance (#15) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/15

Better load-balance the workload among different threads.

Reviewed By: jspark1105
Differential Revision: D13108873
fbshipit-source-id: ae75971b5ff2cc7cf19907eb95cf2df071f7bbe3
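One common way to balance a 1-D row partition (a generic sketch, not necessarily the exact scheme in this diff) is to spread the remainder one row at a time over the first threads, instead of dumping all leftover rows on the last thread:

    #include <algorithm>

    // Compute thread `tid`'s [begin, end) row range out of `rows` total,
    // so no thread gets more than one row beyond the others.
    void balancedRange(int rows, int nthreads, int tid, int* begin, int* end) {
      const int base = rows / nthreads;
      const int rem = rows % nthreads;
      *begin = tid * base + std::min(tid, rem);
      *end = *begin + base + (tid < rem ? 1 : 0);
    }

With rows = 10 and nthreads = 4 this yields ranges of 3, 3, 2, 2 rows rather than 2, 2, 2, 4.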
2018-11-20 | Simple parallelism, add -openmp flags and omp parallel for Acc16/32 Unit Test (#14) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/14

This diff triggered a concurrency bug in the unit test. It is weird that there are no errors for "SpMDMTest", while errors are reported for "NoRequantizeTest".

Update 1: There might be problems with the "memCopy" function. I changed "Cint32_buffer.data()" to "Cint32_fb.data()" (see my inline comment) so that the accumulation buffer and the output buffer are the same; with that, we get the correct result. After a discussion with Daya, I now understand why this unit test fails:
- For the purpose of this unit test, we should just use the same buffer "Cint32_fb.data()" for both accumulation and output. It is not clear why this issue was not found in the original code.
- If the thread count is not 1 and we use different buffers ("Cint32_buffer" for accumulation and "Cint32_fb" for output), then the pointer "Cint32_buffer.data()" is shared by different threads. When accumulating inside "ExecuteKernelU8S8.cc", different threads write to the same memory location. Check the code below:

    int32_t* C_buffer_row_start = C_buffer_ +
        ((C_buffer_ == reinterpret_cast<int32_t*>(matC_)) ? row_start_A * ldc_ : 0);

- If the thread count is not 1 and we use the same buffer "Cint32_fb.data()" for both accumulation and output, then according to the above code different threads write to different memory locations.

Update 2: I added a new test case "{1024, 512, 258}" to the Acc16 and Acc32 unit tests. "PackedRequantizeAcc16Test" runs well, but "PackedRequantizeTest" is broken.

Update 3: I changed the above code snippet to

    int32_t* C_buffer_row_start = C_buffer_ + row_start_A * ldc_;

and now both the Acc16 and Acc32 tests pass: different threads always write to different memory locations.

Update 4: Jongsoo comments that reusing the first row block of C_buffer_ is mostly an optimization for cache, not for memory allocation size (it was making a big difference in xray ocr perf; he doesn't remember the exact number). The right thing to do is to have each thread use a different portion of C_buffer_. So I optimized the above code snippet to

    // If the accumulation buffer C_buffer_ is the same as matC_ (inplace output
    // processing), then each thread uses a different part of the output buffer
    // matC_; otherwise, each thread uses a different portion of the accumulation
    // buffer C_buffer_. Note that each thread can use at most an MC * n portion
    // of C_buffer_. If the number of threads is 1, the only thread (thread 0)
    // will always reuse the first row block of C_buffer_.
    int32_t* C_buffer_row_start = C_buffer_ +
        ((C_buffer_ == reinterpret_cast<int32_t*>(matC_))
             ? row_start_A * ldc_
             : std::min(thread_id_ * mbSize_ * ldc_, row_start_A * ldc_));

Note that `thread_id` and `num_threads` are passed as arguments into `ExecuteKernel`.

Update 5: Rebase; also incorporate the relevant parts of D12937408 to remove the dependency.

Reviewed By: jspark1105
Differential Revision: D13001149
fbshipit-source-id: b16c20863dc467de6faaefcaf1134cf1036f8a65
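To see why the Update 4 formula keeps threads apart, here is a small standalone harness (illustrative: mbSize stands in for the MC row-block size, and the per-thread row assignment is assumed, not taken from fbgemm) that mirrors the non-inplace branch of the offset computation and checks that the per-thread windows of C_buffer_ do not overlap:

    #include <algorithm>
    #include <cassert>
    #include <cstdio>

    int main() {
      const int ldc = 512;    // output row stride (illustrative)
      const int mbSize = 120; // MC row-block size per thread (illustrative)
      const int nthreads = 4;

      // Non-inplace branch of the Update 4 logic: thread t whose row block
      // starts at row_start_A uses offset min(t * mbSize * ldc, row_start_A * ldc),
      // i.e. it stays inside its own mbSize-row window of C_buffer_.
      int prev_end = 0;
      for (int t = 0; t < nthreads; ++t) {
        const int row_start_A = t * mbSize; // each thread owns one row block
        const int off = std::min(t * mbSize * ldc, row_start_A * ldc);
        assert(off >= prev_end); // windows are disjoint
        prev_end = off + mbSize * ldc;
        std::printf("thread %d: C_buffer offset %d..%d\n", t, off, prev_end);
      }
      return 0;
    }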
2018-11-20 | A function to check if we are running on a fbgemm supported cpu (#13) | Daya S Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/13

See title.

Reviewed By: jianyuh
Differential Revision: D13131301
fbshipit-source-id: 2dafdf0fe3dfd26f1b944d550d6cce29f3653a74
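Such a check typically boils down to asking the CPU at runtime for the instruction sets the kernels need. A minimal sketch using GCC/Clang builtins (fbgemm's actual function and the exact feature set it tests may differ):

    // Runtime check that the CPU has the SIMD support an AVX2 int8 GEMM
    // needs. A sketch only; the real check may probe additional flags.
    bool cpuSupportsFbgemmLikeKernels() {
      __builtin_cpu_init();
      return __builtin_cpu_supports("avx2");
    }

Callers can use it to fall back to a reference implementation on older hardware instead of crashing on an illegal instruction.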
2018-11-19 | clang-format (#11) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/11

clang-format of fbgemm.

Reviewed By: dskhudia
Differential Revision: D13115202
fbshipit-source-id: 6dab29cb8b5f4fabcc165019663351567a2a2952
2018-11-16 | grouped (batched) gemm (#7) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/7

This diff allows groups > 1. A separate diff will cover im2col + gemm fusion and conv with groups > 1.

Reviewed By: jianyuh
Differential Revision: D13039210
fbshipit-source-id: f7b3b0dbdb67fc6bc865de88292f034b252d029d
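Grouped GEMM treats A, B, and C as `groups` independent slices along the reduction and output dimensions. A naive reference sketch of the semantics (the memory layout here is an assumption of this example, not fbgemm's packed layout):

    #include <cstdint>

    // For each group g, multiply the m x kg slice of A by the kg x ng block
    // of B into the m x ng slice of C. A is laid out as m x (groups * kg),
    // B as `groups` contiguous kg x ng blocks, C as m x (groups * ng).
    void groupedGemm(int groups, int m, int kg, int ng,
                     const uint8_t* A, const int8_t* B, int32_t* C) {
      for (int g = 0; g < groups; ++g) {
        const uint8_t* Ag = A + g * kg;     // column offset into A
        const int8_t* Bg = B + g * kg * ng; // g-th weight block
        int32_t* Cg = C + g * ng;           // column offset into C
        for (int i = 0; i < m; ++i) {
          for (int j = 0; j < ng; ++j) {
            int32_t acc = 0;
            for (int k = 0; k < kg; ++k) {
              acc += static_cast<int32_t>(Ag[i * groups * kg + k]) *
                     static_cast<int32_t>(Bg[k * ng + j]);
            }
            Cg[i * groups * ng + j] = acc;
          }
        }
      }
    }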
2018-11-08 | Sync with internal copy: Asymmetric padding; fbgemm2 -> fbgemm | Jianyu Huang
2018-11-06 | generalized conv_param_t and download third party libraries in build dir | dskhudia
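Generalizing the conv parameter struct lets padding differ per side, which the 2018-11-08 entry above (asymmetric padding) relies on. A sketch of what such a struct can look like (field names and layout are illustrative, not the exact conv_param_t definition):

    #include <array>

    // Illustrative convolution parameters with per-side padding, in the
    // spirit of a generalized conv_param_t.
    struct ConvParams {
      int MB;                    // mini-batch size
      int IC, OC;                // input / output channels
      std::array<int, 2> IN_DIM; // input height, width
      int G;                     // number of groups
      std::array<int, 2> K;      // kernel height, width
      std::array<int, 2> stride; // vertical, horizontal stride
      std::array<int, 4> pad;    // top, left, bottom, right (asymmetric)

      // Output spatial size along dimension d (0 = height, 1 = width)
      // follows directly from the two per-side pads on that dimension.
      int outDim(int d) const {
        return (IN_DIM[d] + pad[d] + pad[d + 2] - K[d]) / stride[d] + 1;
      }
    };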
2018-10-31 | Initial commit | Daya S Khudia