|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/113
Add VNNI support to FBGEMM.
Reviewed By: dskhudia
Differential Revision: D16276574
fbshipit-source-id: 832ccdb27339489ebc138f3b2678e53d107c1b79
|
|
Summary:
Pass blocking params in to compute correct buffer size for each group.
Fix the bug for this CONV shape:
`conv_param_t<2>(1, 32, 16, {12, 14}, 4, {3, 3}, {1, 1}, {0, 0, 0, 0})`
Corresponding M, N, K = 120, 4, 288
with these params:
BlockingFactors params;
params.MCB = 48;
params.NCB = 16;
params.KCB = 256;
params.MR = 1;
params.NR = 16;
params.ROW_INTERLEAVE = 4;
params.NR_MIN = 16;
Reviewed By: jianyuh
Differential Revision: D16571367
fbshipit-source-id: 27c9b003d37c4d3d13767227e8343d44668823d6
|
|
Summary: Fix compilation error: std::multiplier is not found.
Reviewed By: jspark1105
Differential Revision: D16373256
fbshipit-source-id: ae273a3f447f95e4b26d3f1a43e7ddad288b78ab
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/108
Pointwise gets converted to direct GEMM
Reviewed By: jianyuh
Differential Revision: D16296356
fbshipit-source-id: 68c88df90e5de669bfcddf426c6488e2a04d55d6
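The equivalence exploited here: a pointwise (1x1) convolution needs no im2col at all, because with channels-last activations it is already a plain GEMM. A minimal sketch of that equivalence (illustrative only, not FBGEMM's actual code path):

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch (not FBGEMM code): a 1x1 convolution with stride 1 and
// no padding over channels-last activations is exactly the GEMM
// C[M x N] = A[M x K] * B[K x N] with M = H*W, K = Cin, N = Cout.
std::vector<float> pointwise_as_gemm(const std::vector<float>& act,  // (H*W) x Cin
                                     const std::vector<float>& wts,  // Cin x Cout
                                     int HW, int Cin, int Cout) {
  std::vector<float> out(HW * Cout, 0.0f);
  for (int m = 0; m < HW; ++m)      // each spatial position is one GEMM row
    for (int k = 0; k < Cin; ++k)
      for (int n = 0; n < Cout; ++n)
        out[m * Cout + n] += act[m * Cin + k] * wts[k * Cout + n];
  return out;
}
```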
|
|
Summary: Add blocking params as an argument of rowOffsetBufferSize() so the allocated vector will be sized correctly.
Reviewed By: dskhudia, jianyuh
Differential Revision: D16348913
fbshipit-source-id: c70a05f2f69db3ce71ec2c27a8db4d143649ddd6
|
|
compliant with convolution parameters
Summary: This is to detect inadvertent calls to fbgemmConv with one set of conv parameters while packing was done with another set of parameters.
Reviewed By: jspark1105
Differential Revision: D16269293
fbshipit-source-id: 9a166f5298d8246047e40fc880dd87e1037e0456
|
|
Summary:
Changes to remove warnings when building FBGEMM in opt mode.
Cleanup to address initialization of MCB, KCB, NCBX
Reviewed By: jianyuh
Differential Revision: D16283443
fbshipit-source-id: 0829aee45ed1d262a18bcf4dd294393ef018a688
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/106
The values returned by these functions are needed while unpacking weights.
Reviewed By: jianyuh
Differential Revision: D16193425
fbshipit-source-id: 8ee3a0dc46768d7cb572bf383be1ce2b450c44c9
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/105
Support for calling unpack using unified interface for packing convolution weights
Reviewed By: jianyuh
Differential Revision: D16190534
fbshipit-source-id: daebd7b6d1846921232f8391c816e2f0678d813f
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/104
For consistency, we always assume that weights to PackWeightsForConv are in the format K R S C/G, which is the same as G K/G R S C/G.
cc: Huihan Liu: Please note this change.
Reviewed By: jianyuh
Differential Revision: D16186932
fbshipit-source-id: 9ca2562f213d6b296ef8bd2eca1e5b6e98c436ec
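The two layouts name the same linear buffer: since K = G * (K/G) and the group index is the high-order part of k, the flat offsets coincide. A small sketch of the index math (helper names are made up for illustration):

```cpp
#include <cassert>

// Why "K R S C/G" is the same buffer as "G K/G R S C/G" (illustration):
// with K = G * (K/G) and k = g * (K/G) + kg, the two flat offsets coincide.
int idx_krsc(int k, int r, int s, int c, int R, int S, int Cg) {
  return ((k * R + r) * S + s) * Cg + c;
}
int idx_gkrsc(int g, int kg, int r, int s, int c,
              int Kg, int R, int S, int Cg) {
  return (((g * Kg + kg) * R + r) * S + s) * Cg + c;
}
```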
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/103
In the same spirit of D16085552, we do the following in this Diff:
- Refactor the pack/unpack code for PackB: use the same ```pack_unpack_``` function for both ```pack``` and ```unpack``` function.
- Add a unit test.
Reviewed By: dskhudia
Differential Revision: D16160767
fbshipit-source-id: 7fb7006750537b0705a180f2014c786298a1c615
|
|
Summary: unpack weight for 3x3 depthwise and 3x3x3 depthwise convolutions.
Reviewed By: jspark1105
Differential Revision: D16076463
fbshipit-source-id: 767749c1a10caefef4c76c2c51323d1a3041621a
|
|
Summary: Implement ::unpack() for PackWeightMatrixForGConv. Unpack index calculation is the inverse of ::pack().
Reviewed By: dskhudia
Differential Revision: D16085552
fbshipit-source-id: b8866365dc425fee2cb985b3e48c627198ebc29a
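The pack/unpack inverse relationship can be shown with a toy column-blocked layout (a sketch, not PackWeightMatrixForGConv's actual layout): unpack walks the packed buffer in the same order pack wrote it but inverts the assignment, so the round trip is exact.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative layout: pack a row-major R x C matrix into column blocks of
// width BC; unpack applies the inverse index map.
std::vector<int> pack(const std::vector<int>& src, int R, int C, int BC) {
  std::vector<int> dst(src.size());
  int idx = 0;
  for (int cb = 0; cb < C; cb += BC)       // for each column block
    for (int r = 0; r < R; ++r)
      for (int c = cb; c < std::min(cb + BC, C); ++c)
        dst[idx++] = src[r * C + c];
  return dst;
}

std::vector<int> unpack(const std::vector<int>& packed, int R, int C, int BC) {
  std::vector<int> dst(packed.size());
  int idx = 0;
  for (int cb = 0; cb < C; cb += BC)       // same traversal order...
    for (int r = 0; r < R; ++r)
      for (int c = cb; c < std::min(cb + BC, C); ++c)
        dst[r * C + c] = packed[idx++];    // ...but inverted assignment
  return dst;
}
```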
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/102
The Avx512 and Avx2 branches can be merged.
Reviewed By: dskhudia
Differential Revision: D16068952
fbshipit-source-id: b39beb32e80dc168d0c17db9dff8a67bb0fe976f
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/101
Some code cleanup:
- Both ```leadingDimCReg``` and ```leadingDimCRegAssign``` are used in ```GenerateKernelU8S8S32ACC32.c```. We should unify them to only use one variable name.
- Remove some redundant register variable ```asmjit::X86Ymm tmpReg = x86::ymm14;```.
Reviewed By: dskhudia
Differential Revision: D15673269
fbshipit-source-id: 81eb3673d0ff97391557413a13f1972561a1f2db
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/99
A function to do per channel and groupwise quantization
Reviewed By: jspark1105
Differential Revision: D15567272
fbshipit-source-id: e2f326ea7c7463b5c47b3f590e003344a9e41960
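The idea behind per-channel (or groupwise) quantization is that each channel gets its own scale and zero point derived from that channel's min/max. A simplified sketch assuming uint8 asymmetric quantization (it mirrors the concept, not FBGEMM's exact routine):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct QParams { float scale; std::int32_t zero_point; };

// Per-channel quantization params: min/max of one channel mapped onto the
// uint8 range [0, 255]. The range is widened to include 0 so that zero is
// exactly representable.
QParams choose_qparams(const std::vector<float>& ch) {
  float mn = std::min(0.0f, *std::min_element(ch.begin(), ch.end()));
  float mx = std::max(0.0f, *std::max_element(ch.begin(), ch.end()));
  float scale = (mx - mn) / 255.0f;
  if (scale == 0.0f) scale = 1.0f;                 // all-zero channel
  std::int32_t zp = static_cast<std::int32_t>(std::round(-mn / scale));
  return {scale, zp};
}
```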
|
|
Summary: Add the check on NR_MIN and fix ymm/zmm register checks.
Reviewed By: dskhudia
Differential Revision: D15772144
fbshipit-source-id: 11e2c67fb3d47c5570b38ceaf9828ced0e60e65b
|
|
Summary: Same as title. We were only printing the packed matrix for group 0.
Reviewed By: jianyuh
Differential Revision: D15775235
fbshipit-source-id: 747550c9ae229a2eeb912409897c1331ada81e2b
|
|
Summary:
Delete a duplicated header.
Remove #ifndef include guards and replace them with #pragma once.
Reviewed By: jianyuh
Differential Revision: D15669744
fbshipit-source-id: 8895f6c867e626ac5813a8952837435e76b09370
|
|
Summary: We want to combine three different convolution interfaces under one top level function.
Reviewed By: protonu
Differential Revision: D15399811
fbshipit-source-id: 7390616d92783506fc156f0f6017f10b5f7f8e30
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/97
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20721
- FBGEMM: Add an unpack function to the PackBMatrix class: unpack the pmat buffer into origin_buf (used for serialization to recover the weight matrix).
- PyTorch Quantizer: Add quantized::fbgemm_linear_unpack operator for serialization.
Reviewed By: zafartahirov
Differential Revision: D15314568
fbshipit-source-id: 12080c8887ce31dc849d23e132ae1766ac319407
|
|
Summary: Remove the extra line in ifdef block for kernel logging.
Reviewed By: jianyuh
Differential Revision: D15483193
fbshipit-source-id: 8ee25b07ab0a45e6f3d366876241599c87ab0c2d
|
|
Summary: fixing compiler warnings for uninitialized MR, NCB, KCB
Reviewed By: dskhudia
Differential Revision: D15362047
fbshipit-source-id: 57428f0610c8c12f9ff1f07fe8e472e5ff56bc82
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/94
If we don't call cpuinfo_initialize beforehand, fbgemmHasAvx2/512Support will always return false. We should be really careful about this.
Reviewed By: jianyuh
Differential Revision: D14994129
fbshipit-source-id: b78028f0543d05595caaa627be2feb743d0694b1
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/73
Skip computing row_offset if B uses symmetric quantization. Skip adding col_offset if A uses symmetric quantization.
Reviewed By: jianyuh
Differential Revision: D14055973
fbshipit-source-id: 91da8f0755b2f90175e94a893b5a3ad6342c506d
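The reason symmetric quantization allows skipping these terms falls out of expanding the quantized dot product: sum_k (A[k] - Az)(B[k] - Bz) = sum_k A*B - Bz*rowsum(A) - Az*colsum(B) + K*Az*Bz. With Bz == 0 the row-offset term vanishes; with Az == 0 the column-offset term vanishes. A self-contained illustration (not FBGEMM code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Quantized dot product via the offset expansion:
//   raw - Bz * rowsum(A) - Az * colsum(B) + K * Az * Bz
// rowsum(A) is the "row offset", colsum(B) the "column offset"; each can be
// skipped when the corresponding zero point is 0 (symmetric quantization).
std::int32_t quant_dot(const std::vector<std::int32_t>& a,
                       const std::vector<std::int32_t>& b,
                       std::int32_t Az, std::int32_t Bz) {
  std::int32_t raw = 0, rowsum = 0, colsum = 0;
  int K = static_cast<int>(a.size());
  for (int k = 0; k < K; ++k) {
    raw += a[k] * b[k];
    rowsum += a[k];   // only needed when Bz != 0
    colsum += b[k];   // only needed when Az != 0
  }
  return raw - Bz * rowsum - Az * colsum + K * Az * Bz;
}
```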
|
|
(#90)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/90
Exposing tuning parameters in FBGEMM (MCB, NCB, KCB, MR, NR, Row Interleave)
Reviewed By: dskhudia
Differential Revision: D14358148
fbshipit-source-id: 783fb4653fd696dbbd4075ad56cb8682db3011a5
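These parameters are the classic cache/register blocking factors of a tiled GEMM. A hypothetical sketch of the loop nest they control (the real MR x NR register microkernel is JIT-generated and elided here; this is an illustration, not FBGEMM's generated code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// MCB/NCB/KCB pick the cache-block sizes; the innermost loops stand in for
// the MR x NR microkernel that would process register tiles.
void blocked_gemm(const std::vector<int>& A, const std::vector<int>& B,
                  std::vector<int>& C, int M, int N, int K,
                  int MCB, int NCB, int KCB) {
  for (int kc = 0; kc < K; kc += KCB)
    for (int mc = 0; mc < M; mc += MCB)
      for (int nc = 0; nc < N; nc += NCB)
        // microkernel over one MCB x NCB x KCB block:
        for (int m = mc; m < std::min(mc + MCB, M); ++m)
          for (int k = kc; k < std::min(kc + KCB, K); ++k)
            for (int n = nc; n < std::min(nc + NCB, N); ++n)
              C[m * N + n] += A[m * K + k] * B[k * N + n];
}
```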
|
|
Summary: Packing B documentation
Reviewed By: jianyuh
Differential Revision: D14579163
fbshipit-source-id: e18cb1eea56024fbe54f654b15ca79d10c42e17c
|
|
Summary: In D14507536 and D14516232 small N cases suffered if we increased the NR. This fixes those cases.
Reviewed By: jianyuh
Differential Revision: D14529494
fbshipit-source-id: 6f53797948de760d6ed24b767cbbe8d27768660f
|
|
Summary: Instead of loading B matrix values with every vpmaddubsw instruction, load once and reuse. The downside is we need to use some register for holding these B matrix values which could have been otherwise used for C accumulations.
Reviewed By: jianyuh
Differential Revision: D14529495
fbshipit-source-id: 54bd4bcdcf14ac2f25a433ac60bfc08b7359453f
|
|
now free to be autotuned (#88)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/88
acc16 version
We have one more loop (over NR tiles in NCB block) in the generated assembly kernel. This change also frees NCB as an independent dimension that can be auto-tuned.
Reviewed By: jianyuh
Differential Revision: D14516232
fbshipit-source-id: f9bac9e7cdd3c89135d35a61c59a275c9a76562b
|
|
now free to be autotuned (#89)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/89
We have one more loop (over NR tiles in NCB block) in the generated assembly kernel. This change also frees NCB as an independent dimension that can be auto-tuned.
~~TODO: Similar changes for acc16 kernel. ~~ D14516232
Reviewed By: jspark1105
Differential Revision: D14507536
fbshipit-source-id: 6843fffdd0bcf9bb7cd0231163fbefd6e52d5bf7
|
|
Summary: Dump generated kernels in files for debugging purposes.
Reviewed By: jianyuh
Differential Revision: D14449803
fbshipit-source-id: 58d2b5bc8402ef800a6eeaf573abd2a9ee4f95f4
|
|
Summary:
Add a naive bfloat16 implementation based on MKL.
For this naive bfloat16 implementation of C += A * B (A, B, and C are all bfloat16), we do the following three steps:
1. Convert bfloat16 A, B, C to fp32;
2. Call cblas_sgemm from MKL/BLAS;
3. Convert fp32 C back to bfloat16 C.
Reviewed By: jspark1105
Differential Revision: D14391444
fbshipit-source-id: 1147dd2a18c4bbdec6c15f1d0f15d698d3741afe
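The fp32 <-> bfloat16 conversions in steps 1 and 3 are cheap because bfloat16 is just the upper 16 bits of an IEEE-754 float32. A sketch of the truncation variant (real implementations often round to nearest even instead):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bfloat16 keeps float32's sign, exponent, and top 7 mantissa bits, so it
// preserves fp32's range but only ~3 decimal digits of precision.
std::uint16_t fp32_to_bf16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<std::uint16_t>(bits >> 16);   // drop low 16 mantissa bits
}

float bf16_to_fp32(std::uint16_t h) {
  std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```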
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/85
Optimizing performance of output processing when output is dequantized right away.
Reviewed By: protonu
Differential Revision: D14433141
fbshipit-source-id: f99a8d82000c43e554461acf036462a4e8f7e300
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/83
When m = 1, PackA is actually not necessary: the PackA operations for FP16 in these two libraries are both simply a matrix transposition. In this case, we don't need to do the transposition; we can just pass the pointer to the original A matrix buffer as the packed A buffer.
Reviewed By: zhengwy888
Differential Revision: D14299246
fbshipit-source-id: 78a62c5ff3a396b59afb15462efe38461cb71e15
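A sketch of the m = 1 shortcut (the function and names are illustrative, not the FBGEMM FP16 API): packing A here is a transpose, and the transpose of a 1 x K row is the same K contiguous values, so the original pointer can be used directly.

```cpp
#include <cassert>
#include <vector>

// "Packing" A as a k x m column-major transpose; for m == 1 the packed layout
// is byte-identical to the source row, so we alias the original buffer.
const float* pack_a(const std::vector<float>& A, std::vector<float>& scratch,
                    int m, int k) {
  if (m == 1) return A.data();          // no copy: packed layout == original
  scratch.resize(A.size());
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < k; ++j)
      scratch[j * m + i] = A[i * k + j];
  return scratch.data();
}
```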
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/82
This is a quick fix for matching FBGEMM FP16 performance with SKINNY GEMM FP16.
Basically, this Diff switches the register layout in C accumulation buffer inside micro-kernel from MR * 1 to MR * 2. Check the reasons in T40816746.
Reviewed By: zhengwy888
Differential Revision: D14278430
fbshipit-source-id: 961dd681deee69e2b7fec6bcdba7920e0b09134a
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/84
Add AVX512BW Check:
AVX-512 Byte and Word Instructions add support for 8-bit and 16-bit integer operations such as vpmaddubsw.
Similarly, add AVX512VL/DQ check.
Reviewed By: jspark1105
Differential Revision: D14321050
fbshipit-source-id: bd34745fd488ce4efe3248aeb78c54e1c2d91d47
|
|
Summary:
Add barebone GEMM benchmarks for comparisons.
**Performance on Skylake T6 (turbo off; single thread)**
M, N, K, Type, GOPS
64, 800, 320, MKL_fp32, 91.1
64, 800, 320, FBGEMM_i8_acc32, 118.7
64, 800, 320, FBGEMM_i8_acc16, 137.0
64, 768, 512, MKL_fp32, 102.0
64, 768, 512, FBGEMM_i8_acc32, 132.2
64, 768, 512, FBGEMM_i8_acc16, 160.1
16, 256, 512, MKL_fp32, 39.8
16, 256, 512, FBGEMM_i8_acc32, 55.3
16, 256, 512, FBGEMM_i8_acc16, 63.4
128, 128, 128, MKL_fp32, 49.2
128, 128, 128, FBGEMM_i8_acc32, 54.1
128, 128, 128, FBGEMM_i8_acc16, 54.4
256, 512, 256, MKL_fp32, 97.7
256, 512, 256, FBGEMM_i8_acc32, 126.2
256, 512, 256, FBGEMM_i8_acc16, 170.1
1024, 1024, 1024, MKL_fp32, 114.3
1024, 1024, 1024, FBGEMM_i8_acc32, 150.8
1024, 1024, 1024, FBGEMM_i8_acc16, 202.9
**Breakdown**
M, N, K, Type, Packing (us), Kernel (us), Postproc (us), Total (us), GOPs
64, 800, 320, MKL_fp32, 0, 0, 0, 0, 95.7
64, 800, 320, FBGEMM_i8_acc32, 5.9, 261.9, 2.0, 275.9, 115.5
64, 800, 320, FBGEMM_i8_acc16, 17.4, 210.6, 3.3, 238.2, 132.1
64, 768, 512, MKL_fp32, 0, 0, 0, 0, 103.2
64, 768, 512, FBGEMM_i8_acc32, 9.0, 366.2, 1.9, 383.2, 128.0
64, 768, 512, FBGEMM_i8_acc16, 9.9, 298.3, 1.5, 314.8, 155.4
16, 256, 512, MKL_fp32, 0, 0, 0, 0, 40.8
16, 256, 512, FBGEMM_i8_acc32, 3.3, 60.5, 1.0, 68.3, 54.3
16, 256, 512, FBGEMM_i8_acc16, 3.2, 55.2, 0.5, 61.2, 60.6
128, 128, 128, MKL_fp32, 0, 0, 0, 0, 51.3
128, 128, 128, FBGEMM_i8_acc32, 8.1, 60.4, 0.6, 71.0, 52.4
128, 128, 128, FBGEMM_i8_acc16, 16.0, 44.8, 0.4, 64.6, 56.4
256, 512, 256, MKL_fp32, 0, 0, 0, 0, 95.0
256, 512, 256, FBGEMM_i8_acc32, 12.9, 512.1, 3.9, 542.1, 122.1
256, 512, 256, FBGEMM_i8_acc16, 12.1, 376.4, 2.3, 396.2, 165.8
1024, 1024, 1024, MKL_fp32, 0, 0, 0, 0, 114.9
1024, 1024, 1024, FBGEMM_i8_acc32, 116.9, 13999.2, 47.9, 14276.1, 150.3
1024, 1024, 1024, FBGEMM_i8_acc16, 125.7, 10490.3, 31.8, 10730.1, 200.0
TODO: add mkl-dnn as well.
Reviewed By: jianyuh
Differential Revision: D14196397
fbshipit-source-id: 4cfb22374a6553a774d2f92ef37e295b7296de8d
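As a sanity check on the tables: assuming GOPS is computed as 2*M*N*K / time (the usual convention; an assumption, since the log doesn't state it), the 64x800x320 acc32 breakdown row at 275.9 us works out to roughly 119 GOPS, in line with the figures above.

```cpp
#include <cassert>
#include <cmath>

// GOPS under the standard convention: each multiply-accumulate counts as
// 2 ops, so a GEMM performs 2*M*N*K ops total.
double gops(double M, double N, double K, double total_us) {
  return 2.0 * M * N * K / (total_us * 1e-6) / 1e9;
}
```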
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/80
Specialize PackAWithIm2Col for common shapes of strided convolution.
TODO: will also add specialization for resnext101
I see PackAWithIm2Col can also be a good target for JIT.
Reviewed By: protonu
Differential Revision: D14197118
fbshipit-source-id: 77201ce17d0e4e2e33a80b4c99b757c378a61018
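For context, im2col lowers convolution to GEMM by turning each output position into one GEMM row of receptive-field values; PackAWithIm2Col fuses this with packing. A minimal single-channel, no-padding sketch (illustrative, not the specialized code this diff adds):

```cpp
#include <cassert>
#include <vector>

// im2col for one channel: each output pixel becomes a row of KH*KW input
// values, so conv(img, kernel) == GEMM(im2col(img), flattened kernel).
std::vector<int> im2col(const std::vector<int>& img, int H, int W,
                        int KH, int KW, int stride) {
  int OH = (H - KH) / stride + 1, OW = (W - KW) / stride + 1;
  std::vector<int> out;
  out.reserve(OH * OW * KH * KW);
  for (int oh = 0; oh < OH; ++oh)
    for (int ow = 0; ow < OW; ++ow)      // one row per output position
      for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw)
          out.push_back(img[(oh * stride + kh) * W + (ow * stride + kw)]);
  return out;
}
```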
|
|
Summary:
James Reed had a use case where the B matrix must be packed online and proposed the Diff here:
https://github.com/pytorch/FBGEMM/issues/79
We previously had a Hackmonth task: T35337506.
The benchmark routine is here: D13828191.
Before this Diff:
P60866503
M, N, K, Packing A (ms), Packing B (ms), Kernel (ms), Postprocessing (ms), Total (ms), GOPs
1, 4096, 1024, 0.0, 11.7, 0.5, 0.0, 0.6, FBGEMM_i8_acc32, 13.0
For this case, 11.7 ms is spent on PackBMatrix, while only 0.5 ms is spent on the kernels.
After this Diff:
P60975064
M, N, K, Packing A (ms), Packing B (ms), Kernel (ms), Postprocessing (ms), Total (ms), GOPs
1, 4096, 1024, 0.0, 2.3, 0.6, 0.0, 0.7, FBGEMM_i8_acc32, 10.9
For this case, only 2.3 ms is spent on PackBMatrix, while only 0.6 ms is spent on the kernels.
Note that you should only care about the time for "Packing B (ms)". The final reported GOPs doesn't take PackB into account, because for traditional inference B can always be prepacked (a one-time overhead).
Reviewed By: jamesr66a
Differential Revision: D14159749
fbshipit-source-id: e61c68380fc3db729c300bc5b65ac9cec99adc8c
|
|
Summary: Add additional option b_symmetric and skip row offset computation if it's true
Reviewed By: jianyuh
Differential Revision: D14119128
fbshipit-source-id: fa079347562b7f75727b3a1414e9bdda3f9c65dd
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/78
Increase test coverage, e.g., transposing A.
Reviewed By: protonu
Differential Revision: D14121297
fbshipit-source-id: a6e21442dc47e8cd725b795dbaf8614719f013fb
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/77
As title
Reviewed By: protonu
Differential Revision: D14124479
fbshipit-source-id: 3a44a1de8bf5da75e0d69d98d93f55b6b058b7ce
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/76
Create a temp buffer for accumulating results instead of directly accessing C matrix with strides.
This speeds up the hyper-sparse case implemented w/o transpose, so we adjust the threshold between the implementations w/o transpose and w/ transpose accordingly.
Reviewed By: jianyuh
Differential Revision: D14097154
fbshipit-source-id: 22e37d0a9f38ccb3d15813edcd96f3d341eacf1c
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/72
Depthwise conv without requantization is not really useful and was generating more template parameter options.
Reviewed By: jianyuh
Differential Revision: D14021514
fbshipit-source-id: 61f646373fcd902fdb2854a96d003a548f29f8eb
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/75
In group convolution we're always using avx2, but there was still code that assumed avx512 would be used if cpuid reports avx512.
Reviewed By: protonu
Differential Revision: D14073329
fbshipit-source-id: cd66075f90930c09ba6eb099c1f2146c2761b2bc
|
|
multiple of NR
Summary:
Before this Diff:
we pass nc = NCB ( packedB_.blockColSize() ) into the JIT kernel instead of nc = the leftover size ( packedB_.lastBcol() ) for the last block of B (diffusion/FBS/browse/master/fbcode/deeplearning/fbgemm/src/ExecuteKernelU8S8.cc;1adfe7977ef7ea2a1aee0ed785bd3fed5b7c4a20$102), which causes additional computation when n is small.
After this Diff:
we pass a small portion of NCB (still a multiple of NR) into the JIT kernel for the last block of B.
The main performance gain is for Acc16, because NCB = 4 * NR for Acc16 and NCB = NR for Acc32 in our current settings (AVX2 and AVX512).
Reviewed By: jspark1105
Differential Revision: D14063628
fbshipit-source-id: 5829d06553daf617e2fefa7d26cb2d761af402c1
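The change can be summarized by how nc is chosen for the last block of B: round the leftover width up to a multiple of NR instead of passing the full NCB. A sketch with made-up helper names (not FBGEMM's internals):

```cpp
#include <cassert>

// Interior blocks get the full NCB; the last block gets the leftover width
// rounded up to a multiple of NR. The win is largest for acc16, where
// NCB = 4 * NR, so up to 3 NR-tiles of wasted work are avoided.
int effective_nc(int n, int block_start, int NCB, int NR) {
  int leftover = n - block_start;
  if (leftover >= NCB) return NCB;            // interior block: full NCB
  return ((leftover + NR - 1) / NR) * NR;     // last block: round up to NR
}
```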
|
|
Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/74
Reviewed By: jspark1105
Differential Revision: D14063593
fbshipit-source-id: 4c4a23df21e2d66eb3b6d3bee7196c6ad1935362
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/70
Skip row offset computation if B_zero_point == 0.
Reviewed By: jianyuh
Differential Revision: D14020675
fbshipit-source-id: 88a6e225671762c67afefc15538b79f879d125a6
|
|
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/69
This diff prepares for D14013931 that folds column offsets into bias.
In depthwise convolution, we allow passing column_offsets == nullptr, which means column_offsets are folded into bias. We bypass adding column_offset * A_zero_point if either column_offsets == nullptr or A_zero_point == 0.
Reviewed By: jianyuh
Differential Revision: D14017772
fbshipit-source-id: ad4a79402f43cbf78dbad68e1bff6d07c19dded0
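The folding works because the requantization pipeline applies bias and the A_zero_point * column_offset correction additively, so the correction can be precomputed into the bias once. A sketch (names hypothetical, not FBGEMM's signatures):

```cpp
#include <cassert>
#include <cstdint>

// Per-output-channel requantization correction, before scaling:
//   acc - A_zero_point * col_offset + bias
std::int32_t requant_acc(std::int32_t acc, std::int32_t bias,
                         std::int32_t col_offset, std::int32_t A_zp) {
  return acc - A_zp * col_offset + bias;
}

// Folding: bias'[j] = bias[j] - A_zero_point * col_offset[j], computed once,
// lets the kernel skip the col_offset pass entirely (pass col_offset = 0).
std::int32_t folded_bias(std::int32_t bias, std::int32_t col_offset,
                         std::int32_t A_zp) {
  return bias - A_zp * col_offset;
}
```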
|