Welcome to mirror list, hosted at ThFree Co, Russian Federation.

github.com/marian-nmt/FBGEMM.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorJianyu Huang <jianyuhuang@fb.com>2018-11-20 10:31:33 +0300
committerFacebook Github Bot <facebook-github-bot@users.noreply.github.com>2018-11-20 10:33:34 +0300
commit7346431f9a368573257587dea8b5c74c21c609b9 (patch)
tree2f34a3fd219abe7f0856cd0732d71bf9b07c03c7 /CMakeLists.txt
parent3b7936e008865c6c59773664c20dd98c73b55c82 (diff)
Simple parallelism, add -openmp flags and omp parallel for Acc16/32 Unit Test (#14)
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/14 This DIFF triggered a concurrent bug in the unit test. It is weird that there are no errors for "SpMDMTest", while errors are reported for "NoRequantizeTest". Update 1: There might be problems with "memCopy" function. Then I change "Cint32_buffer.data()" to "Cint32_fb.data()" (see my inline comment) so that the accumulation buffer and the output buffer are the same. It appears that we can output the correct result. I have a discussion with Daya. Now I understand the reason for the failure of this unit test - For the purpose of this unit test, we should just use the same buffer "Cint32_fb.data()" for the accumulation and output. Not sure why this issue is not found in the original code. - If the thread number is not 1, and we we use different buffers: "Cint32_buffer" for the accumulation buffer and "Cint32_fb" for the output buffer, then the pointers of "Cint32_buffer.data()" is actually shared by different threads. When doing the accumulation inside "ExecuteKernelU8S8.cc", different threads will just write to the same memory location: Check the code below int32_t* C_buffer_row_start = C_buffer_ + ((C_buffer_ == reinterpret_cast<int32_t*>(matC_)) ? row_start_A * ldc_ : 0); - If the thread number is not 1, and we use the same buffers: "Cint32_fb.data()" for the accumulation and output. According to the above code, different threads will write to different memory locations. Update 2: I add a new test case "{1024, 512, 258}" in Acc16 and Acc32 unit tests. "PackedRequantizeAcc16Test" runs well, but "PackedRequantizeTest" is broken. Update 3: I change the above code snippet to int32_t* C_buffer_row_start = C_buffer_ + row_start_A * ldc_; Finally we get both Acc16 and Acc32 tests passed. Now different threads will always write to different memory locations. Update 4: Jongsoo comments that reusing the first row block of C_buffer_ is mostly to optimize for cache not for memory allocation size (this was making a big difference in xray ocr perf. don't remember exact number). A right thing to do is to have each thread to use different portion of C_buffer_. So I optimize the above code snippet to // If the accumulation buffer C_buffer_ is the same as matC_ (inplace output // processing), then each thread use the different parts of output buffer // matC_; // Otherwise, each thread uses different portions of the accumulation // buffer C_buffer_. Note that each thread can use at most MC * n portion of // C_buffer_. If the number of threads is 1, the only thread (thread 0) will // always reuse the first rowblock of C_buffer_. int32_t* C_buffer_row_start = C_buffer_ + ((C_buffer_ == reinterpret_cast<int32_t*>(matC_)) ? row_start_A * ldc_ : std::min(thread_id_ * mbSize_ * ldc_, row_start_A * ldc_)); Note that `thread_id` and `num_threads` is passed as the arguments into `ExecuteKernel`. Update 5: Rebase, Also add the parts of D12937408 to remove the dependency. Reviewed By: jspark1105 Differential Revision: D13001149 fbshipit-source-id: b16c20863dc467de6faaefcaf1134cf1036f8a65
Diffstat (limited to 'CMakeLists.txt')
-rw-r--r--CMakeLists.txt10
1 files changed, 10 insertions, 0 deletions
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d957bed..9b7a6c8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -51,6 +51,16 @@ if(NOT COMPILER_SUPPORTS_AVX512)
message(FATAL_ERROR "A compiler with AVX512 support is required.")
endif()
+#check if compiler supports openmp
+find_package(OpenMP)
+if (OPENMP_FOUND)
+ set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
+ set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
+ set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}")
+else()
+ message(WARNING "OpenMP is not supported by the compiler")
+endif()
+
#All the source files that use avx512 instructions statically
set(FBGEMM_AVX512_SRCS src/Utils_avx512.cc)