
github.com/marian-nmt/FBGEMM.git
path: root/src
Age | Commit message | Author
2019-08-06 | Integrate VNNI into FBGEMM master branch (#113) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/113 Adding the VNNI support in FBGEMM. Reviewed By: dskhudia Differential Revision: D16276574 fbshipit-source-id: 832ccdb27339489ebc138f3b2678e53d107c1b79
2019-08-02 | Pass blocking param pointer into packedBufferSize() in PackBMatrix.cc | Mike Tsai
Summary: Pass blocking params in to compute correct buffer size for each group. Fix the bug for this CONV shape: `conv_param_t<2>(1, 32, 16, {12, 14}, 4, {3, 3}, {1, 1}, {0, 0, 0, 0})` Corresponding M, N, K = 120, 4, 288 with these params: BlockingFactors params; params.MCB = 48; params.NCB = 16; params.KCB = 256; params.MR = 1; params.NR = 16; params.ROW_INTERLEAVE = 4; params.NR_MIN = 16; Reviewed By: jianyuh Differential Revision: D16571367 fbshipit-source-id: 27c9b003d37c4d3d13767227e8343d44668823d6
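For reference, the M, N, K quoted above follow directly from that conv shape. The sketch below is plain arithmetic with my own variable names; it does not call the FBGEMM API.
```cpp
// Minimal sketch (not FBGEMM code): how M, N, K = 120, 4, 288 follow from
// conv_param_t<2>(1, 32, 16, {12, 14}, 4, {3, 3}, {1, 1}, {0, 0, 0, 0}),
// i.e. batch 1, IC=32, OC=16, input 12x14, groups=4, 3x3 kernel, stride 1, no pad.
#include <cstdio>

int main() {
  const int IC = 32, OC = 16, G = 4;
  const int IH = 12, IW = 14, KH = 3, KW = 3; // stride 1, no padding
  const int OH = IH - KH + 1;                 // 10
  const int OW = IW - KW + 1;                 // 12
  const int M = OH * OW;                      // 120: im2col rows
  const int N = OC / G;                       // 4: output channels per group
  const int K = KH * KW * IC;                 // 288: im2col depth
  std::printf("M=%d N=%d K=%d\n", M, N, K);   // prints M=120 N=4 K=288
  return 0;
}
```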
2019-07-19 | Fix fbgemm OSS failure | Jianyu Huang
Summary: std::multiplier is not found. Reviewed By: jspark1105 Differential Revision: D16373256 fbshipit-source-id: ae273a3f447f95e4b26d3f1a43e7ddad288b78ab
2019-07-19 | Support pointwise with unified convolution interface as well (#108) | Daya Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/108 Pointwise gets converted to direct GEMM Reviewed By: jianyuh Differential Revision: D16296356 fbshipit-source-id: 68c88df90e5de669bfcddf426c6488e2a04d55d6
2019-07-18 | Fix missing blocking params in conv im2col code path. | Mike Tsai
Summary: Add blocking params as an argument of rowOffsetBufferSize() so the allocated vector will be sized correctly. Reviewed By: dskhudia, jianyuh Differential Revision: D16348913 fbshipit-source-id: c70a05f2f69db3ce71ec2c27a8db4d143649ddd6
2019-07-17 | While calling fbgemmConv with packed weights, packed weights should be compliant with convolution parameters | Daya Khudia
Summary: This is to detect inadvertent calls to fbgemmConv with one set of conv parameters while packing was done with another set of parameters. Reviewed By: jspark1105 Differential Revision: D16269293 fbshipit-source-id: 9a166f5298d8246047e40fc880dd87e1037e0456
2019-07-16 | changes to remove warnings when building in opt mode | Protonu Basu
Summary: Changes to remove warnings when building FBGEMM in opt mode. Cleanup to address initialization of MCB, KCB, NCBX Reviewed By: jianyuh Differential Revision: D16283443 fbshipit-source-id: 0829aee45ed1d262a18bcf4dd294393ef018a688
2019-07-16 | Add functions needed for unpacking in PackWeightsForConv (#106) | Daya Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/106 The values returned by these functions are needed while unpacking weights. Reviewed By: jianyuh Differential Revision: D16193425 fbshipit-source-id: 8ee3a0dc46768d7cb572bf383be1ce2b450c44c9
2019-07-16 | unpack through unified convolution interface (#105) | Daya Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/105 Support for calling unpack using unified interface for packing convolution weights Reviewed By: jianyuh Differential Revision: D16190534 fbshipit-source-id: daebd7b6d1846921232f8391c816e2f0678d813f
2019-07-16 | Assume input weights to be in transposed format for convUnified (#104) | Daya Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/104 For consistency, we always assume that weights to PackWeightsForConv are in the format K R S C/G, which is the same as G K/G R S C/G. cc Huihan Liu: please note this change. Reviewed By: jianyuh Differential Revision: D16186932 fbshipit-source-id: 9ca2562f213d6b296ef8bd2eca1e5b6e98c436ec
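(Why the two layouts coincide, in my notation rather than the commit's: since K = G * (K/G), the leading K axis factors into (G, K/G). Writing k = g*(K/G) + k', the flat offset of element (k, r, s, c) in the K R S C/G layout, ((k*R + r)*S + s)*(C/G) + c, is exactly the flat offset of (g, k', r, s, c) in the G K/G R S C/G layout, so the bytes are laid out identically.)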
2019-07-10 | Refactoring unpack weight function (#103) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/103 In the same spirit of D16085552, we do the following in this Diff: - Refactor the pack/unpack code for PackB: use the same ```pack_unpack_``` function for both ```pack``` and ```unpack``` function. - Add a unit test. Reviewed By: dskhudia Differential Revision: D16160767 fbshipit-source-id: 7fb7006750537b0705a180f2014c786298a1c615
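A generic sketch of the shared pack/unpack pattern mentioned above; the block layout, names, and signature here are illustrative, not FBGEMM's actual PackBMatrix internals.
```cpp
// One routine computes the index mapping; a boolean picks the copy direction,
// so pack and unpack cannot drift apart. Toy layout: column blocks of width NR
// laid out contiguously; the packed buffer must hold K * ceil(N/NR) * NR bytes.
#include <cstdint>

void pack_unpack_(std::int8_t* unpacked, std::int8_t* packed,
                  int K, int N, int NR, bool ispack) {
  for (int k = 0; k < K; ++k) {
    for (int n = 0; n < N; ++n) {
      const int block = n / NR;
      const int col_in_block = n % NR;
      const std::int64_t packed_idx =
          static_cast<std::int64_t>(block) * K * NR + k * NR + col_in_block;
      const std::int64_t unpacked_idx = static_cast<std::int64_t>(k) * N + n;
      if (ispack)
        packed[packed_idx] = unpacked[unpacked_idx];
      else
        unpacked[unpacked_idx] = packed[packed_idx];
    }
  }
}
```
Keeping a single index computation is what makes unpack automatically the exact inverse of pack, which is the point of the refactor described above.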
2019-07-06 | Unpack data for 3x3 (and 3x3x3) depthwise convolution | Daya Khudia
Summary: unpack weight for 3x3 depthwise and 3x3x3 depthwise convolutions. Reviewed By: jspark1105 Differential Revision: D16076463 fbshipit-source-id: 767749c1a10caefef4c76c2c51323d1a3041621a
2019-07-06 | Implement ::unpack() for PackWeightMatrixForGConv | Jaewon Lee
Summary: Implement ::unpack() for PackWeightMatrixForGConv. Unpack index calculation is the inverse of ::pack(). Reviewed By: dskhudia Differential Revision: D16085552 fbshipit-source-id: b8866365dc425fee2cb985b3e48c627198ebc29a
2019-07-01 | Refactor the code and avoid the duplication (#102) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/102 The Avx512 and Avx2 branches can be merged. Reviewed By: dskhudia Differential Revision: D16068952 fbshipit-source-id: b39beb32e80dc168d0c17db9dff8a67bb0fe976f
2019-07-01 | Clean up some code for JIT code generator (#101) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/101 Some code cleanup: - Both ```leadingDimCReg``` and ```leadingDimCRegAssign``` are used in ```GenerateKernelU8S8S32ACC32.c```. We should unify them to use only one variable name. - Remove the redundant register variable ```asmjit::X86Ymm tmpReg = x86::ymm14;```. Reviewed By: dskhudia Differential Revision: D15673269 fbshipit-source-id: 81eb3673d0ff97391557413a13f1972561a1f2db
2019-06-20 | Per channel and groupwise quantization (#99) | Daya Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/99 A function to do per channel and groupwise quantization Reviewed By: jspark1105 Differential Revision: D15567272 fbshipit-source-id: e2f326ea7c7463b5c47b3f590e003344a9e41960
2019-06-15 | Update the logic of checking valid parameters. | Mike Tsai
Summary: Add the check on NR_MIN and fix ymm/zmm register checks. Reviewed By: dskhudia Differential Revision: D15772144 fbshipit-source-id: 11e2c67fb3d47c5570b38ceaf9828ced0e60e65b
2019-06-12 | Print packed matrix for each group as well | Daya Khudia
Summary: Same as title. We were only printing the packed matrix for group 0. Reviewed By: jianyuh Differential Revision: D15775235 fbshipit-source-id: 747550c9ae229a2eeb912409897c1331ada81e2b
2019-06-07 | Remove duplicated header and undo some changes in D15399811 | Daya Khudia
Summary: Delete the duplicated header. Remove #ifndef include guards and replace them with #pragma once. Reviewed By: jianyuh Differential Revision: D15669744 fbshipit-source-id: 8895f6c867e626ac5813a8952837435e76b09370
2019-06-05 | Unified convolution interface | Daya Khudia
Summary: We want to combine three different convolution interfaces under one top level function. Reviewed By: protonu Differential Revision: D15399811 fbshipit-source-id: 7390616d92783506fc156f0f6017f10b5f7f8e30
2019-06-04 | Add quantized::fbgemm_linear_unpack operator for serialization (#97) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/97 Pull Request resolved: https://github.com/pytorch/pytorch/pull/20721 - FBGEMM: Add unpack function for PackBMatrix class: Unpack pmat buffer to the origin_buf (Used for the serialization to recover weight matrix). - PyTorch Quantizer: Add quantized::fbgemm_linear_unpack operator for serialization. Reviewed By: zafartahirov Differential Revision: D15314568 fbshipit-source-id: 12080c8887ce31dc849d23e132ae1766ac319407
2019-05-24 | Fix kernel logging | Mike Tsai
Summary: Remove the extra line in ifdef block for kernel logging. Reviewed By: jianyuh Differential Revision: D15483193 fbshipit-source-id: 8ee25b07ab0a45e6f3d366876241599c87ab0c2d
2019-05-16 | fixing compiler warnings for uninitialized MR, NCB, KCB | Protonu Basu
Summary: fixing compiler warnings for uninitialized MR, NCB, KCB Reviewed By: dskhudia Differential Revision: D15362047 fbshipit-source-id: 57428f0610c8c12f9ff1f07fe8e472e5ff56bc82
2019-04-19 | make sure cpuinfo_initialize called before fbgemmHasAvx2/512Support (#94) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/94 If we don't call cpuinfo_initialize beforehand, fbgemmHasAvx2/512Support will always return false. We should be really careful about this. Reviewed By: jianyuh Differential Revision: D14994129 fbshipit-source-id: b78028f0543d05595caaa627be2feb743d0694b1
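A minimal sketch of the calling pattern described above, assuming the header and function names from the FBGEMM sources of that period; it is illustrative, not a drop-in test.
```cpp
// Initialize cpuinfo before querying ISA support; without this the checks
// can report false even on capable hardware, as the commit above notes.
#include <cpuinfo.h>
#include <cstdio>
#include "fbgemm/Utils.h"

int main() {
  if (!cpuinfo_initialize()) {
    std::fprintf(stderr, "cpuinfo initialization failed\n");
    return 1;
  }
  std::printf("avx2: %d, avx512: %d\n",
              fbgemm::fbgemmHasAvx2Support() ? 1 : 0,
              fbgemm::fbgemmHasAvx512Support() ? 1 : 0);
  return 0;
}
```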
2019-04-03 | optimize dw conv for symmetric quant (#73) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/73 Skip computing row_offset if B uses symmetric quantization. Skip adding col_offset if A uses symmetric quantization. Reviewed By: jianyuh Differential Revision: D14055973 fbshipit-source-id: 91da8f0755b2f90175e94a893b5a3ad6342c506d
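The savings above follow from the standard decomposition of a quantized dot product (notation mine, not from the commit). For row i of A and column j of B, with zero points A_zp and B_zp and depth K:

sum_k (A[i,k] - A_zp) * (B[k,j] - B_zp)
  = sum_k A[i,k]*B[k,j] - B_zp*row_offset_A[i] - A_zp*col_offset_B[j] + K*A_zp*B_zp,

where row_offset_A[i] = sum_k A[i,k] and col_offset_B[j] = sum_k B[k,j]. With symmetric B (B_zp = 0) the row_offset term vanishes, and with symmetric A (A_zp = 0) the col_offset term vanishes, which is exactly the work being skipped.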
2019-04-02 | Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave) (#90) | Protonu Basu
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/90 Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave) Reviewed By: dskhudia Differential Revision: D14358148 fbshipit-source-id: 783fb4653fd696dbbd4075ad56cb8682db3011a5
2019-03-25 | Packing B documentation | Daya S Khudia
Summary: Packing B documentation Reviewed By: jianyuh Differential Revision: D14579163 fbshipit-source-id: e18cb1eea56024fbe54f654b15ca79d10c42e17c
2019-03-21 | Improves small N cases back to what they were | Daya S Khudia
Summary: In D14507536 and D14516232 small N cases suffered if we increased the NR. This fixes those cases. Reviewed By: jianyuh Differential Revision: D14529494 fbshipit-source-id: 6f53797948de760d6ed24b767cbbe8d27768660f
2019-03-21 | Allocate some registers for B matrix loading and reuse loaded results | Daya S Khudia
Summary: Instead of loading B matrix values with every vpmaddubsw instruction, load once and reuse. The downside is that we need to use some registers for holding these B matrix values, which could otherwise have been used for C accumulations. Reviewed By: jianyuh Differential Revision: D14529495 fbshipit-source-id: 54bd4bcdcf14ac2f25a433ac60bfc08b7359453f
2019-03-21 | Further optimize acc16 kernel and cache blocking dimension for B matrix is now free to be autotuned (#88) | Daya S Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/88 acc16 version. We have one more loop (over NR tiles in NCB block) in the generated assembly kernel. This change also frees NCB as an independent dimension that can be auto-tuned. Reviewed By: jianyuh Differential Revision: D14516232 fbshipit-source-id: f9bac9e7cdd3c89135d35a61c59a275c9a76562b
2019-03-21 | Further optimize acc32 kernel and cache blocking dimension for B matrix is now free to be autotuned (#89) | Daya S Khudia
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/89 We have one more loop (over NR tiles in NCB block) in the generated assembly kernel. This change also frees NCB as an independent dimension that can be auto-tuned. ~~TODO: Similar changes for acc16 kernel. ~~ D14516232 Reviewed By: jspark1105 Differential Revision: D14507536 fbshipit-source-id: 6843fffdd0bcf9bb7cd0231163fbefd6e52d5bf7
2019-03-19 | Dump generated kernels in files | Daya S Khudia
Summary: Dump generated kernels in files for debugging purposes. Reviewed By: jianyuh Differential Revision: D14449803 fbshipit-source-id: 58d2b5bc8402ef800a6eeaf573abd2a9ee4f95f4
2019-03-18 | Add the Naive bfloat16 implementation based on MKL | Jianyu Huang
Summary: Add the Naive bfloat16 implementation based on MKL. For this Naive bfloat16 implementation of C += A * B (A, B, and C are all of bfloat16 type), we do the following three steps: 1. Convert bfloat16 A, B, C to fp32; 2. Call cblas_sgemm from MKL/BLAS; 3. Convert fp32 C back to bfloat16 C. Reviewed By: jspark1105 Differential Revision: D14391444 fbshipit-source-id: 1147dd2a18c4bbdec6c15f1d0f15d698d3741afe
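A minimal sketch of the three-step scheme described above, assuming truncating bfloat16 conversion (upper 16 bits of the fp32 pattern, no rounding) and with a plain triple loop standing in for cblas_sgemm; it is not the FBGEMM benchmark code.
```cpp
#include <cstdint>
#include <cstring>
#include <vector>

using bfloat16 = std::uint16_t;

static float bf16_to_fp32(bfloat16 x) {
  std::uint32_t bits = static_cast<std::uint32_t>(x) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

static bfloat16 fp32_to_bf16(float f) { // truncation, no rounding
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<bfloat16>(bits >> 16);
}

// C += A * B, all bfloat16, computed via fp32 (row-major, no leading-dim handling).
void naive_bf16_gemm(int m, int n, int k, const bfloat16* A, const bfloat16* B,
                     bfloat16* C) {
  std::vector<float> a(m * k), b(k * n), c(m * n);
  for (int i = 0; i < m * k; ++i) a[i] = bf16_to_fp32(A[i]);   // 1. widen inputs
  for (int i = 0; i < k * n; ++i) b[i] = bf16_to_fp32(B[i]);
  for (int i = 0; i < m * n; ++i) c[i] = bf16_to_fp32(C[i]);
  for (int i = 0; i < m; ++i)                                  // 2. fp32 GEMM
    for (int p = 0; p < k; ++p)
      for (int j = 0; j < n; ++j)
        c[i * n + j] += a[i * k + p] * b[p * n + j];
  for (int i = 0; i < m * n; ++i) C[i] = fp32_to_bf16(c[i]);   // 3. narrow back
}
```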
2019-03-13 | optimize requantize for float out processing (#85) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/85 Optimizing performance of output processing when output is dequantized right away. Reviewed By: protonu Differential Revision: D14433141 fbshipit-source-id: f99a8d82000c43e554461acf036462a4e8f7e300
2019-03-08 | No need for PackA when m==1 (#83) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/83 When m = 1, PackA is actually not necessary: PackA operations for FP16 in these two libraries are both simply matrix transposition. In this case, we don’t need to do the transposition. We can just pass the pointer of the original A matrix buffer to the packed A buffer. Reviewed By: zhengwy888 Differential Revision: D14299246 fbshipit-source-id: 78a62c5ff3a396b59afb15462efe38461cb71e15
2019-03-08 | Fixes for FBGEMM FP16 performance (#82) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/82 This is a quick fix for matching FBGEMM FP16 performance with SKINNY GEMM FP16. Basically, this Diff switches the register layout in C accumulation buffer inside micro-kernel from MR * 1 to MR * 2. Check the reasons in T40816746. Reviewed By: zhengwy888 Differential Revision: D14278430 fbshipit-source-id: 961dd681deee69e2b7fec6bcdba7920e0b09134a
2019-03-06 | Add Avx512BW/VL/DQ check (#84) | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/84 Add AVX512BW check: AVX-512 Byte and Word Instructions add support for 8-bit and 16-bit integer operations such as vpmaddubsw. Similarly, add AVX512VL/DQ checks. Reviewed By: jspark1105 Differential Revision: D14321050 fbshipit-source-id: bd34745fd488ce4efe3248aeb78c54e1c2d91d47
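A hedged sketch of what such a capability gate looks like, using GCC/Clang builtins on x86 rather than FBGEMM's internal cpuid helpers.
```cpp
#include <cstdio>

int main() {
  __builtin_cpu_init();
  const bool bw = __builtin_cpu_supports("avx512bw"); // 8/16-bit integer ops, e.g. vpmaddubsw
  const bool vl = __builtin_cpu_supports("avx512vl"); // 128/256-bit forms of AVX-512 ops
  const bool dq = __builtin_cpu_supports("avx512dq"); // doubleword/quadword ops
  std::printf("avx512bw=%d avx512vl=%d avx512dq=%d\n", bw, vl, dq);
  return 0;
}
```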
2019-02-26 | barebone int8-acc16 and int8-acc32 benchmarks | Daya S Khudia
Summary: adding barebone gemm benchmarks for comparisons
**Performance on Skylake T6 (turbo off; single thread)**
M, N, K, Type, GOPS
64, 800, 320, MKL_fp32, 91.1
64, 800, 320, FBGEMM_i8_acc32, 118.7
64, 800, 320, FBGEMM_i8_acc16, 137.0
64, 768, 512, MKL_fp32, 102.0
64, 768, 512, FBGEMM_i8_acc32, 132.2
64, 768, 512, FBGEMM_i8_acc16, 160.1
16, 256, 512, MKL_fp32, 39.8
16, 256, 512, FBGEMM_i8_acc32, 55.3
16, 256, 512, FBGEMM_i8_acc16, 63.4
128, 128, 128, MKL_fp32, 49.2
128, 128, 128, FBGEMM_i8_acc32, 54.1
128, 128, 128, FBGEMM_i8_acc16, 54.4
256, 512, 256, MKL_fp32, 97.7
256, 512, 256, FBGEMM_i8_acc32, 126.2
256, 512, 256, FBGEMM_i8_acc16, 170.1
1024, 1024, 1024, MKL_fp32, 114.3
1024, 1024, 1024, FBGEMM_i8_acc32, 150.8
1024, 1024, 1024, FBGEMM_i8_acc16, 202.9
**Breakdown**
M, N, K, Type, Packing (us), Kernel (us), Postproc (us), Total (us), GOPs
64, 800, 320, MKL_fp32, 0, 0, 0, 0, 95.7
64, 800, 320, FBGEMM_i8_acc32, 5.9, 261.9, 2.0, 275.9, 115.5
64, 800, 320, FBGEMM_i8_acc16, 17.4, 210.6, 3.3, 238.2, 132.1
64, 768, 512, MKL_fp32, 0, 0, 0, 0, 103.2
64, 768, 512, FBGEMM_i8_acc32, 9.0, 366.2, 1.9, 383.2, 128.0
64, 768, 512, FBGEMM_i8_acc16, 9.9, 298.3, 1.5, 314.8, 155.4
16, 256, 512, MKL_fp32, 0, 0, 0, 0, 40.8
16, 256, 512, FBGEMM_i8_acc32, 3.3, 60.5, 1.0, 68.3, 54.3
16, 256, 512, FBGEMM_i8_acc16, 3.2, 55.2, 0.5, 61.2, 60.6
128, 128, 128, MKL_fp32, 0, 0, 0, 0, 51.3
128, 128, 128, FBGEMM_i8_acc32, 8.1, 60.4, 0.6, 71.0, 52.4
128, 128, 128, FBGEMM_i8_acc16, 16.0, 44.8, 0.4, 64.6, 56.4
256, 512, 256, MKL_fp32, 0, 0, 0, 0, 95.0
256, 512, 256, FBGEMM_i8_acc32, 12.9, 512.1, 3.9, 542.1, 122.1
256, 512, 256, FBGEMM_i8_acc16, 12.1, 376.4, 2.3, 396.2, 165.8
1024, 1024, 1024, MKL_fp32, 0, 0, 0, 0, 114.9
1024, 1024, 1024, FBGEMM_i8_acc32, 116.9, 13999.2, 47.9, 14276.1, 150.3
1024, 1024, 1024, FBGEMM_i8_acc16, 125.7, 10490.3, 31.8, 10730.1, 200.0
TODO: add mkl-dnn as well. Reviewed By: jianyuh Differential Revision: D14196397 fbshipit-source-id: 4cfb22374a6553a774d2f92ef37e295b7296de8d
2019-02-23 | specialization for first conv (#80) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/80 Specialize PackAWithIm2Col for common shapes of strided convolution. TODO: will also add specialization for resnext101. I see PackAWithIm2Col can also be a good target for JIT. Reviewed By: protonu Differential Revision: D14197118 fbshipit-source-id: 77201ce17d0e4e2e33a80b4c99b757c378a61018
2019-02-22 | Optimize PackB routine by removing addr function | Jianyu Huang
Summary: James Reed had a use case where the B matrix must be packed online and proposed the Diff here: https://github.com/pytorch/FBGEMM/issues/79 We previously had a Hackmonth task: T35337506. The benchmark routine is here: D13828191.
Before this Diff: P60866503
M, N, K, Packing A (ms), Packing B (ms), Kernel (ms), Postprocessing (ms), Total (ms), GOPs
1, 4096, 1024, 0.0, 11.7, 0.5, 0.0, 0.6, FBGEMM_i8_acc32, 13.0
For this case, 11.7 ms is spent on PackBMatrix, while only 0.5 ms is spent on the kernels.
After this Diff: P60975064
M, N, K, Packing A (ms), Packing B (ms), Kernel (ms), Postprocessing (ms), Total (ms), GOPs
1, 4096, 1024, 0.0, 2.3, 0.6, 0.0, 0.7, FBGEMM_i8_acc32, 10.9
For this case, only 2.3 ms is spent on PackBMatrix, while 0.6 ms is spent on the kernels. Note that you should only care about the time for "Packing B (ms)". The final reported GOPs doesn't take PackB into account because for traditional inference PackB can always be prepacked (one-time overhead). Reviewed By: jamesr66a Differential Revision: D14159749 fbshipit-source-id: e61c68380fc3db729c300bc5b65ac9cec99adc8c
2019-02-20 | optimize PackAWithIm2Col for symmetric b quant | Jongsoo Park
Summary: Add additional option b_symmetric and skip row offset computation if it's true Reviewed By: jianyuh Differential Revision: D14119128 fbshipit-source-id: fa079347562b7f75727b3a1414e9bdda3f9c65dd
2019-02-20 | increase test coverage (#78) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/78 Increase test coverage like transposing A Reviewed By: protonu Differential Revision: D14121297 fbshipit-source-id: a6e21442dc47e8cd725b795dbaf8614719f013fb
2019-02-19 | remove unused member var kBlock_ (#77) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/77 As title Reviewed By: protonu Differential Revision: D14124479 fbshipit-source-id: 3a44a1de8bf5da75e0d69d98d93f55b6b058b7ce
2019-02-15 | simple spmdm optimization (#76) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/76 Create a temp buffer for accumulating results instead of directly accessing the C matrix with strides. This speeds up the hyper-sparse case implemented w/o transpose, so we adjust the threshold between the implementations w/o transpose and w/ transpose accordingly. Reviewed By: jianyuh Differential Revision: D14097154 fbshipit-source-id: 22e37d0a9f38ccb3d15813edcd96f3d341eacf1c
2019-02-14 | clean up depthwise conv interface (#72) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/72 depthwise conv without requantization is not really useful and was generating more template parameter options Reviewed By: jianyuh Differential Revision: D14021514 fbshipit-source-id: 61f646373fcd902fdb2854a96d003a548f29f8eb
2019-02-14 | fix bug in group conv + avx512 (#75) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/75 In group convolution we always use avx2, but there was still code that assumed avx512 would be used if cpuid reports avx512. Reviewed By: protonu Differential Revision: D14073329 fbshipit-source-id: cd66075f90930c09ba6eb099c1f2146c2761b2bc
2019-02-14 | JIT kernel should only handle a small portion of NCB for the last block: multiple of NR | Jianyu Huang
Summary: Before this Diff: we pass nc = NCB ( packedB_.blockColSize() ) into the JIT kernel instead of nc = the leftover size ( packedB_.lastBcol() ) for the last block of B (diffusion/FBS/browse/master/fbcode/deeplearning/fbgemm/src/ExecuteKernelU8S8.cc;1adfe7977ef7ea2a1aee0ed785bd3fed5b7c4a20$102), which causes additional computation when n is small. After this Diff: we pass a small portion of NCB (still a multiple of NR) into the JIT kernel for the last block of B. The main performance gain is for Acc16, because NCB = 4 * NR for Acc16 and NCB = NR for Acc32 in our current settings (AVX2 and AVX512). Reviewed By: jspark1105 Differential Revision: D14063628 fbshipit-source-id: 5829d06553daf617e2fefa7d26cb2d761af402c1
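Illustrative arithmetic only (variable names are mine): the column width handed to the kernel for the last block of B, before and after this change, under an acc16-style setting where NCB = 4 * NR.
```cpp
#include <cstdio>

int main() {
  const int N = 70, NR = 16, NCB = 4 * NR;            // NCB = 64
  const int leftover = N % NCB;                       // 6 columns remain in the last block
  const int before = NCB;                             // old behavior: full block width
  const int after = ((leftover + NR - 1) / NR) * NR;  // round up to a multiple of NR
  std::printf("before=%d after=%d\n", before, after); // prints before=64 after=16
  return 0;
}
```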
2019-02-14 | Fix PackBMatrix<T, accT>::printPackedMatrix issues | Jianyu Huang
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/74 Reviewed By: jspark1105 Differential Revision: D14063593 fbshipit-source-id: 4c4a23df21e2d66eb3b6d3bee7196c6ad1935362
2019-02-13 | optimize gconv for b symmetric quantization (#70) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/70 Skip row offset computation if B_zero_point == 0. Reviewed By: jianyuh Differential Revision: D14020675 fbshipit-source-id: 88a6e225671762c67afefc15538b79f879d125a6
2019-02-13 | no need to subtract col offset if a_zp is 0 (#69) | Jongsoo Park
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/69 This diff prepares for D14013931, which folds column offsets into bias. In depthwise convolution, we allow passing column_offsets == nullptr, which means column_offsets are folded into bias. We bypass adding column_offset * A_zero_point if either column_offset == nullptr or A_zero_point == 0. Reviewed By: jianyuh Differential Revision: D14017772 fbshipit-source-id: ad4a79402f43cbf78dbad68e1bff6d07c19dded0
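An illustrative-only sketch of the bypass described above (names and sign convention are mine, not the FBGEMM depthwise kernel): the column-offset correction is applied only when a column_offsets buffer is provided and the activation zero point is nonzero.
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

void apply_col_offset_correction(std::vector<std::int32_t>& acc,
                                 const std::int32_t* column_offsets, // nullptr => folded into bias
                                 std::int32_t A_zero_point) {
  if (column_offsets == nullptr || A_zero_point == 0) {
    return; // nothing to do: offsets folded into bias, or a_zp is 0
  }
  for (std::size_t j = 0; j < acc.size(); ++j) {
    acc[j] -= A_zero_point * column_offsets[j]; // per-output-channel correction term
  }
}
```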