|
This fixes conformance with the argon test samples, in particular
with these samples:
profile0_core/streams/test10100_579_8614.obu
profile0_core/streams/test10218_6914.obu
This gives a pretty notable slowdown to these transforms - some
examples:
Before: Cortex A53 A72 A73 Apple M1
inv_txfm_add_8x8_dct_dct_1_10bpc_neon: 365.7 290.2 299.8 0.3
inv_txfm_add_16x16_dct_dct_2_10bpc_neon: 1865.2 1384.1 1457.5 2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon: 33976.3 26817.0 24864.2 40.4
After:
inv_txfm_add_8x8_dct_dct_1_10bpc_neon: 397.7 322.2 335.1 0.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon: 2121.9 1336.7 1664.6 2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon: 38569.4 27622.6 28176.0 51.0
Thus, for the transforms alone, it makes them around 10-13% slower
(the Apple M1 measurements are too noisy to be conclusive here).
Measured on actual full decoding, it makes decoding of 10 bpc
Chimera roughly 1% slower on an Apple M1, which is close to measurement
noise anyway.
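The per-core slowdown can be recomputed from the figures quoted above. The following is a small sketch, not part of the patch; the helper names are made up for illustration, and the Apple M1 column is left out because the message describes it as too noisy.

```python
# Figures copied from the Before/After tables in this message (lower is faster).
CORES = ["A53", "A72", "A73"]
BEFORE = {
    "8x8": [365.7, 290.2, 299.8],
    "16x16": [1865.2, 1384.1, 1457.5],
    "64x64": [33976.3, 26817.0, 24864.2],
}
AFTER = {
    "8x8": [397.7, 322.2, 335.1],
    "16x16": [2121.9, 1336.7, 1664.6],
    "64x64": [38569.4, 27622.6, 28176.0],
}


def slowdown_percent(before, after):
    """Relative change in percent; positive means the new code is slower."""
    return (after / before - 1.0) * 100.0


def slowdown_table():
    """Map each transform size to its per-core slowdown in percent."""
    return {
        size: [slowdown_percent(b, a) for b, a in zip(BEFORE[size], AFTER[size])]
        for size in BEFORE
    }


if __name__ == "__main__":
    for size, row in slowdown_table().items():
        cells = "  ".join(f"{core}: {pct:+.1f}%" for core, pct in zip(CORES, row))
        print(f"{size:>6}  {cells}")
```

Individual entries scatter around the quoted range, which is expected for microbenchmarks of this kind.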
|
|
In 8-bit ADST, the final Round2(x[0], 12) can exceed the signed
16-bit range.
Specifically, in 7.13.2.6. Inverse ADST4 process, the precision requirement is:
"It is a requirement of bitstream conformance that all values stored in the
s and x arrays by this process are representable by a signed integer using
r + 12 bits of precision."
For 8 bits, r is 16 for both row and column, so x[] can be 28-bit signed.
For values [134215680, 134217727] (within 2047 of the maximum 28-bit value),
the final Round2(x[0], 12) evaluates to 32768, which exceeds the signed 16-bit range.
So switch to using sqrshrn, which saturates to signed 16 bits.
This is a continuation of: Commit b53ff29d80a21180e5ad9bbe39a02541151f4f53
arm: itx: Do clipping in all narrowing downshifts
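The arithmetic above can be checked with a scalar model. This is an illustrative sketch only: `round2`, `saturate_int16` and `narrow_saturating` are hypothetical helper names, and the last one models a single lane of a rounding, saturating narrowing shift rather than reproducing the actual NEON code.

```python
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1
MAX_28BIT = (1 << 27) - 1          # 134217727, largest signed 28-bit value
FIRST_OVERFLOW = 134215680         # lower end of the range quoted above


def round2(x, n):
    """Rounded right shift: add half of 2**n, then arithmetic shift right by n."""
    if n == 0:
        return x
    return (x + (1 << (n - 1))) >> n


def saturate_int16(v):
    """Clamp v to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, v))


def narrow_saturating(x, n):
    """Rounded shift followed by saturation, as a saturating narrowing shift would do."""
    return saturate_int16(round2(x, n))


if __name__ == "__main__":
    for x in (FIRST_OVERFLOW - 1, FIRST_OVERFLOW, MAX_28BIT):
        print(x, round2(x, 12), narrow_saturating(x, 12))
```

Without saturation, the value 32768 would not fit in a signed 16-bit lane; with saturation it is clamped to 32767.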
|
|
The width parameter is used directly as a pointer offset, so ensure
that it has an appropriately sized data type.
This has been done previously for luma, but chroma was overlooked.
|
|
|
|
The NEON loop filter's innermost asm function can return to a different
location than the address that called it. This confuses the return stack
predictor, causing returns to be mispredicted.
Instead, rework the function to always return to the address that called it,
and return the information the caller needs to short-circuit
storing pixels.
|
|
When compiling with asm enabled there's no point in compiling
C versions of DSP functions that have asm implementations using
instruction sets that the compiler can unconditionally use.
E.g. when compiling with -mssse3 we can remove the C version
of all functions with SSSE3 implementations.
This is accomplished using the compiler's dead code elimination
functionality.
Can be configured using the new 'trim_dsp' meson option, which
by default is enabled when compiling in release mode.
|
|
This avoids build errors if such features are enabled while targeting
another binary format. (Using such features on other platforms
might require some other form of signaling/setup, but
the ELF-specific .note section isn't applicable there in any case.)
|
|
|
|
This should avoid the risk of unexpected wraparound. This shouldn't
technically be needed for spec compliant bitstreams. In practice,
this fixes the mismatch observed in issue !388 (in checkasm generated
input data).
|
|
|
|
This patch adds optional support for Arm Pointer Authentication Codes.
PAC support is turned on or off at compile time using additional
compiler flags. Unless any of these is enabled explicitly, no additional
code will be emitted at all.
|
|
OBMC
Before: Cortex A7 A8 A9 A53 A72 A73
mc_8tap_regular_w2_v_16bpc_neon: 384.4 194.0 242.9 193.2 134.1 140.0
mc_8tap_regular_w4_v_16bpc_neon: 578.2 242.2 282.7 263.1 171.2 168.9
After:
mc_8tap_regular_w2_v_16bpc_neon: 397.1 207.7 250.6 212.9 136.9 140.8
mc_8tap_regular_w4_v_16bpc_neon: 575.2 240.4 277.9 263.0 171.9 167.4
|
|
For 8tap, unroll the vertical filters slightly less (by 4 instead of
8 elements) and add a special case trailer that handles only 2 elements
(for 2x6 and 4x6). By unrolling less, performance on in-order cores is
somewhat impacted.
Before: Cortex A7 A8 A9 A53 A72 A73
mc_8tap_regular_w2_v_8bpc_neon: 340.0 305.4 336.5 196.5 160.5 167.8
mc_8tap_regular_w4_v_8bpc_neon: 400.4 319.5 391.5 210.3 189.7 188.8
After:
mc_8tap_regular_w2_v_8bpc_neon: 364.6 268.5 340.1 223.7 161.7 175.2
mc_8tap_regular_w4_v_8bpc_neon: 408.7 328.4 380.4 219.8 190.7 183.8
|
|
OBMC
Before: Cortex A53 A72 A73
mc_8tap_regular_w2_v_16bpc_neon: 164.0 125.3 122.6
mc_8tap_regular_w4_v_16bpc_neon: 232.5 164.0 166.6
After:
mc_8tap_regular_w2_v_16bpc_neon: 192.4 131.0 121.4
mc_8tap_regular_w4_v_16bpc_neon: 235.6 162.9 163.7
|
|
For 8tap, unroll the vertical filters slightly less (by 4 instead of
8 elements) and add a special case trailer that handles only 2 elements
(for 2x6 and 4x6). By unrolling less, performance on in-order cores is
somewhat impacted.
Before: Cortex A53 A72 A73
mc_8tap_regular_w2_v_8bpc_neon: 146.5 141.3 145.6
mc_8tap_regular_w4_v_8bpc_neon: 175.2 180.3 162.4
After:
mc_8tap_regular_w2_v_8bpc_neon: 175.7 142.7 150.5
mc_8tap_regular_w4_v_8bpc_neon: 183.3 176.0 154.6
|
|
It is often necessary to narrow the elements in a pair of Neon
vectors to half the current width, before combining the results. This
is usually achieved with a pair of XTN/XTN2 instructions. However, it
is possible to achieve the same outcome with a single 'unzip' (UZP1)
instruction.
This patch changes all sequential AArch64 Neon XTN, XTN2 instruction
pairs to use a single UZP1 instruction.
Change-Id: I2a9fad3082d2cf363b1edce9ef0b8d547ec6c41a
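The equivalence can be illustrated with a byte-level model. This is a sketch under stated assumptions: it models two registers of eight 16-bit lanes narrowed to bytes, with little-endian lane layout, and the function names are invented for the example.

```python
import struct


def to_bytes_16(lanes):
    """Lay out eight 16-bit lanes as the 16 bytes of a 128-bit register."""
    return struct.pack("<8H", *[v & 0xFFFF for v in lanes])


def xtn_pair(a, b):
    """Model XTN followed by XTN2: low byte of each lane of a, then of b."""
    return bytes([v & 0xFF for v in a] + [v & 0xFF for v in b])


def uzp1_bytes(vn, vm):
    """Model UZP1 on byte elements: even-indexed bytes of vn, then of vm."""
    return vn[0::2] + vm[0::2]


if __name__ == "__main__":
    a = [0x1234, 0x00FF, 0xABCD, 0x8000, 0x0001, 0x7F7F, 0xFFFF, 0x0100]
    b = [0x0A0B, 0x0C0D, 0x0E0F, 0x1011, 0x1213, 0x1415, 0x1617, 0x1819]
    print(xtn_pair(a, b).hex())
    print(uzp1_bytes(to_bytes_16(a), to_bytes_16(b)).hex())
```

Because the low byte of each little-endian 16-bit lane sits at an even byte index, picking the even bytes of both inputs gives the same 16 bytes as the two truncating narrows.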
|
|
The CMLT instruction has twice the throughput of SSHR on all modern
out-of-order Arm cores. The Software Optimization Guides (SWOG) for
the Cortex-A76, Cortex-A77 and Neoverse-N1 cores are being updated to
reflect this. (The current version of the SWOG for these cores states
that CMLT and SSHR both have the same execution throughput.)
This patch changes all instances of sign computation to use CMLT
instead of SSHR.
Change-Id: Ice5747fee4e3bdd98ae8fbc036d735f55e492249
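That the two instructions produce the same sign mask can be checked with a scalar model of one signed 16-bit lane. This is only an illustration; the function names are made up, and the lane width of 16 bits is an assumption chosen for the example.

```python
def sshr_sign(x, bits=16):
    """Arithmetic shift right by bits-1: replicates the sign bit across the lane."""
    return x >> (bits - 1)


def cmlt_zero(x):
    """Compare-less-than-zero: all ones (-1 as a signed value) if x < 0, else 0."""
    return -1 if x < 0 else 0


def masks_agree(bits=16):
    """Exhaustively compare both forms over every signed value of the lane width."""
    lo, hi = -(1 << (bits - 1)), 1 << (bits - 1)
    return all(sshr_sign(x, bits) == cmlt_zero(x) for x in range(lo, hi))


if __name__ == "__main__":
    print(masks_agree())
```

Since both forms yield 0 for non-negative lanes and all ones for negative lanes, swapping one for the other only changes throughput, not results.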
|
|
|
|
|
|
Add Branch Target Identifiers (BTIs) to all functions defined in
AArch64 assembly files.
BTI support is turned on or off at compile time based on the presence
of the __ARM_FEATURE_BTI_DEFAULT feature macro.
A binary compiled with BTI support can be executed on an Armv8-A
processor without BTI support because the instructions are defined in
NOP space.
Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Matthew Dalzell <matthew.dalzell@arm.com>
Signed-off-by: Salome Thirot <salome.thirot@arm.com>
|
|
Using ret x<n> instead of br x<n> removes the need for a BTI landing pad
at the target address in x<n>.
Using 'ret' instead of 'br' does not have any performance implications.
Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
Signed-off-by: Matthew Dalzell <matthew.dalzell@arm.com>
Signed-off-by: Salome Thirot <salome.thirot@arm.com>
|
|
Relative speedup over C code:
Cortex A7 A8 A9 A53 A72 A73
gen_grain_uv_ar0_16bpc_420_neon: 5.05 6.71 5.42 4.95 6.45 9.59
gen_grain_uv_ar0_16bpc_422_neon: 5.54 7.18 6.29 5.45 6.55 8.80
gen_grain_uv_ar0_16bpc_444_neon: 6.64 8.07 6.70 6.89 7.16 9.98
gen_grain_uv_ar1_16bpc_420_neon: 3.22 2.16 2.58 3.51 3.16 4.68
gen_grain_uv_ar1_16bpc_422_neon: 3.24 2.26 2.73 3.83 3.36 4.65
gen_grain_uv_ar1_16bpc_444_neon: 3.48 2.41 2.85 4.32 3.69 4.90
gen_grain_uv_ar2_16bpc_420_neon: 3.29 2.90 2.92 4.14 3.48 4.59
gen_grain_uv_ar2_16bpc_422_neon: 3.35 3.01 3.13 4.50 3.61 4.50
gen_grain_uv_ar2_16bpc_444_neon: 3.66 3.55 3.32 5.15 3.87 4.93
gen_grain_uv_ar3_16bpc_420_neon: 3.39 3.79 3.60 4.67 4.04 4.70
gen_grain_uv_ar3_16bpc_422_neon: 3.39 4.04 3.96 4.93 4.16 4.65
gen_grain_uv_ar3_16bpc_444_neon: 3.79 4.47 4.36 5.54 4.59 5.07
gen_grain_y_ar0_16bpc_neon: 5.05 5.26 6.97 5.47 5.95 8.59
gen_grain_y_ar1_16bpc_neon: 2.35 1.72 2.07 3.53 3.16 3.47
gen_grain_y_ar2_16bpc_neon: 3.02 2.70 2.88 4.19 3.57 4.03
gen_grain_y_ar3_16bpc_neon: 3.49 3.18 3.69 5.01 3.99 4.50
|
|
|
|
This makes it correctly hit some conditions that avoid duplicated code,
shrinking the text section by 1524 bytes.
|
|
Relative speedup over C code:
Cortex A7 A8 A9 A53 A72 A73
gen_grain_uv_ar0_8bpc_420_neon: 6.13 7.81 8.17 6.78 6.62 11.13
gen_grain_uv_ar0_8bpc_422_neon: 6.34 7.64 8.00 6.83 6.93 10.31
gen_grain_uv_ar0_8bpc_444_neon: 7.09 8.29 8.55 7.95 7.89 11.05
gen_grain_uv_ar1_8bpc_420_neon: 3.39 2.26 3.06 4.13 3.41 4.95
gen_grain_uv_ar1_8bpc_422_neon: 3.40 2.23 3.02 4.18 3.36 4.73
gen_grain_uv_ar1_8bpc_444_neon: 3.46 2.18 2.95 4.46 3.57 4.91
gen_grain_uv_ar2_8bpc_420_neon: 3.88 3.00 3.32 4.74 3.57 5.31
gen_grain_uv_ar2_8bpc_422_neon: 3.92 3.04 3.36 4.82 3.57 5.06
gen_grain_uv_ar2_8bpc_444_neon: 4.32 3.14 3.62 5.56 3.90 5.43
gen_grain_uv_ar3_8bpc_420_neon: 4.35 3.53 4.05 5.35 4.44 5.56
gen_grain_uv_ar3_8bpc_422_neon: 4.38 3.49 4.17 5.41 4.48 5.36
gen_grain_uv_ar3_8bpc_444_neon: 4.84 3.70 4.36 5.95 4.87 5.82
gen_grain_y_ar0_8bpc_neon: 5.18 5.57 7.65 5.93 7.13 9.01
gen_grain_y_ar1_8bpc_neon: 2.64 1.66 2.48 3.32 3.15 3.77
gen_grain_y_ar2_8bpc_neon: 3.57 2.64 3.21 4.59 3.68 4.64
gen_grain_y_ar3_8bpc_neon: 4.27 3.93 4.12 5.41 4.63 5.17
(A73 is benched against C code compiled with a different C compiler,
which can explain the slightly differing numbers there.)
Absolute numbers:
Cortex A7 A8 A9 A53 A72 A73
gen_grain_uv_ar0_8bpc_420_neon: 19614.6 13396.4 12320.4 15030.7 8288.1 8754.4
gen_grain_uv_ar0_8bpc_422_neon: 34660.9 24315.5 22225.3 26809.2 14549.8 15804.6
gen_grain_uv_ar0_8bpc_444_neon: 55625.6 39914.5 37100.2 44658.3 22917.3 27369.6
gen_grain_uv_ar1_8bpc_420_neon: 50049.5 63179.4 44793.1 36406.7 22690.3 25401.9
gen_grain_uv_ar1_8bpc_422_neon: 93289.5 117755.0 82815.4 67081.4 43133.1 46698.0
gen_grain_uv_ar1_8bpc_444_neon: 170880.0 223259.2 156241.5 122760.0 78655.6 85604.9
gen_grain_uv_ar2_8bpc_420_neon: 68185.5 78123.2 61457.3 47886.7 31526.2 36519.6
gen_grain_uv_ar2_8bpc_422_neon: 129195.2 148653.9 114133.2 89822.7 60242.6 70160.1
gen_grain_uv_ar2_8bpc_444_neon: 233133.7 272277.4 214108.7 161589.5 109069.3 127763.7
gen_grain_uv_ar3_8bpc_420_neon: 96374.4 94372.2 79663.8 70832.0 43065.3 50593.9
gen_grain_uv_ar3_8bpc_422_neon: 186324.8 184321.8 151490.1 136200.1 83758.0 98378.7
gen_grain_uv_ar3_8bpc_444_neon: 335596.6 336811.6 279755.5 247251.5 151657.2 178906.0
gen_grain_y_ar0_8bpc_neon: 46109.3 36022.2 28476.2 36478.5 18740.1 20660.4
gen_grain_y_ar1_8bpc_neon: 165054.2 217090.4 152578.9 118409.4 74357.2 83794.5
gen_grain_y_ar2_8bpc_neon: 226576.9 268320.3 210924.6 157829.4 105956.5 124293.2
gen_grain_y_ar3_8bpc_neon: 328337.2 330421.3 275110.1 242097.3 148538.7 177270.8
Corresponding numbers for the original arm64 version:
Cortex A53 A72 A73
gen_grain_uv_ar0_8bpc_420_neon: 14874.7 7765.5 8536.0
gen_grain_uv_ar0_8bpc_422_neon: 26510.9 13685.3 15308.2
gen_grain_uv_ar0_8bpc_444_neon: 43189.6 21565.3 24312.0
gen_grain_uv_ar1_8bpc_420_neon: 33715.7 21669.8 22758.3
gen_grain_uv_ar1_8bpc_422_neon: 63955.3 41581.4 42852.5
gen_grain_uv_ar1_8bpc_444_neon: 117390.1 76503.5 78446.4
gen_grain_uv_ar2_8bpc_420_neon: 42779.0 27794.3 29677.9
gen_grain_uv_ar2_8bpc_422_neon: 82283.8 53446.7 58232.2
gen_grain_uv_ar2_8bpc_444_neon: 147773.8 98492.7 103754.1
gen_grain_uv_ar3_8bpc_420_neon: 56698.8 35697.1 40695.9
gen_grain_uv_ar3_8bpc_422_neon: 110132.4 69829.1 79196.8
gen_grain_uv_ar3_8bpc_444_neon: 196642.7 124174.9 141812.5
gen_grain_y_ar0_8bpc_neon: 36461.0 17782.0 19827.0
gen_grain_y_ar1_8bpc_neon: 113202.7 72457.7 75995.8
gen_grain_y_ar2_8bpc_neon: 142894.0 94450.9 100304.5
gen_grain_y_ar3_8bpc_neon: 191697.7 120674.9 137223.8
|
|
|
|
|
|
This should improve scheduling on in-order cores.
|
|
Relative speedup over C code, for arm64:
Cortex A53 A72 A73 Apple M1
splat_mv_w1_neon: 1.09 0.95 1.22 -
splat_mv_w2_neon: 1.76 1.32 1.74 -
splat_mv_w4_neon: 2.78 2.19 2.19 15.00
splat_mv_w8_neon: 3.59 2.06 2.59 12.00
splat_mv_w16_neon: 4.12 1.72 2.53 3.14
splat_mv_w32_neon: 4.07 1.60 2.40 3.00
(The resolution of the timer used on Apple M1 isn't enough to
measure the small versions of this function.)
Relative speedup over C code, for arm32:
Cortex A7 A8 A9 A53 A72 A73
splat_mv_w1_neon: 0.70 1.12 0.91 0.65 1.01 1.06
splat_mv_w2_neon: 0.94 2.16 2.01 0.99 2.52 1.63
splat_mv_w4_neon: 1.27 2.04 1.49 1.52 1.75 2.18
splat_mv_w8_neon: 1.75 2.47 1.16 2.88 1.95 2.58
splat_mv_w16_neon: 2.00 2.44 1.12 3.25 1.85 2.65
splat_mv_w32_neon: 1.43 2.28 1.19 3.55 1.77 2.65
|
|
Relative speedup over C code:
Cortex A53 A72 A73 Apple M1
gen_grain_uv_ar0_16bpc_420_neon: 2.90 4.13 5.43 5.80
gen_grain_uv_ar0_16bpc_422_neon: 3.23 4.51 5.52 5.83
gen_grain_uv_ar0_16bpc_444_neon: 4.01 4.97 6.08 5.87
gen_grain_uv_ar1_16bpc_420_neon: 2.94 2.80 3.56 3.48
gen_grain_uv_ar1_16bpc_422_neon: 3.14 3.07 3.68 3.47
gen_grain_uv_ar1_16bpc_444_neon: 3.54 3.51 3.93 2.61
gen_grain_uv_ar2_16bpc_420_neon: 3.92 3.69 4.40 3.98
gen_grain_uv_ar2_16bpc_422_neon: 4.13 3.96 4.42 3.92
gen_grain_uv_ar2_16bpc_444_neon: 4.69 4.33 4.84 3.25
gen_grain_uv_ar3_16bpc_420_neon: 5.05 5.39 5.42 4.74
gen_grain_uv_ar3_16bpc_422_neon: 5.25 5.68 5.57 4.67
gen_grain_uv_ar3_16bpc_444_neon: 6.02 6.33 6.35 4.38
gen_grain_y_ar0_16bpc_neon: 4.67 5.23 5.22 10.11
gen_grain_y_ar1_16bpc_neon: 3.32 3.03 3.28 2.24
gen_grain_y_ar2_16bpc_neon: 4.59 3.95 4.64 3.52
gen_grain_y_ar3_16bpc_neon: 5.89 5.93 6.36 4.79
Absolute numbers:
Cortex A53 A72 A73 Apple M1
gen_grain_uv_ar0_16bpc_420_neon: 19797.2 9725.0 9234.0 29.7
gen_grain_uv_ar0_16bpc_422_neon: 34899.4 16875.3 17021.6 57.7
gen_grain_uv_ar0_16bpc_444_neon: 53776.6 28470.1 28773.1 107.8
gen_grain_uv_ar1_16bpc_420_neon: 37998.2 24631.2 24754.0 84.2
gen_grain_uv_ar1_16bpc_422_neon: 70817.5 44642.5 46323.1 166.3
gen_grain_uv_ar1_16bpc_444_neon: 123333.0 77316.4 83523.1 427.5
gen_grain_uv_ar2_16bpc_420_neon: 49115.8 33053.7 33249.9 93.6
gen_grain_uv_ar2_16bpc_422_neon: 92965.3 59663.8 64741.9 187.9
gen_grain_uv_ar2_16bpc_444_neon: 160899.7 108845.6 115422.4 441.8
gen_grain_uv_ar3_16bpc_420_neon: 65786.6 41924.3 45562.1 108.1
gen_grain_uv_ar3_16bpc_422_neon: 126232.3 78691.6 87351.5 217.6
gen_grain_uv_ar3_16bpc_444_neon: 218702.6 140197.8 151294.8 454.3
gen_grain_y_ar0_16bpc_neon: 35867.9 17653.6 20770.7 108.0
gen_grain_y_ar1_16bpc_neon: 118781.8 74777.1 81338.6 426.0
gen_grain_y_ar2_16bpc_neon: 155919.9 102145.8 109698.1 438.5
gen_grain_y_ar3_16bpc_neon: 213348.1 133054.8 144726.0 447.9
Corresponding numbers for 8bpc:
Cortex A53 A72 A73 Apple M1
gen_grain_uv_ar0_8bpc_420_neon: 15086.1 8384.7 8556.6 29.4
gen_grain_uv_ar0_8bpc_422_neon: 26800.6 14354.4 15526.5 56.6
gen_grain_uv_ar0_8bpc_444_neon: 43749.6 22408.6 24627.9 108.3
gen_grain_uv_ar1_8bpc_420_neon: 33706.3 21892.6 22835.9 87.1
gen_grain_uv_ar1_8bpc_422_neon: 63897.0 41820.1 43468.9 171.8
gen_grain_uv_ar1_8bpc_444_neon: 117345.1 76372.5 79938.3 370.0
gen_grain_uv_ar2_8bpc_420_neon: 42808.8 28493.8 29932.8 92.2
gen_grain_uv_ar2_8bpc_422_neon: 82282.5 53969.4 58191.1 181.8
gen_grain_uv_ar2_8bpc_444_neon: 147641.4 98136.4 103157.6 430.2
gen_grain_uv_ar3_8bpc_420_neon: 56784.3 36342.0 40812.3 102.2
gen_grain_uv_ar3_8bpc_422_neon: 110249.7 70215.6 79716.0 200.5
gen_grain_uv_ar3_8bpc_444_neon: 196461.7 125802.8 141781.5 440.1
gen_grain_y_ar0_8bpc_neon: 36451.7 17794.4 19839.3 109.5
gen_grain_y_ar1_8bpc_neon: 113155.6 71811.9 77296.8 370.2
gen_grain_y_ar2_8bpc_neon: 142812.3 95042.4 100434.4 431.8
gen_grain_y_ar3_8bpc_neon: 191608.6 121199.5 136946.4 437.2
|
|
No difference in generated code, but more than 210 fewer lines of duplicated
source code.
|
|
No practical difference in generated code (or the size of it), but
less source code to handle.
|
|
These are never executed as they come after an unconditional branch.
|
|
This shrinks the code section by 288 bytes.
|
|
|
|
|
|
Relative speedup over C code:
Cortex A7 A8 A9 A53 A72 A73
fguv_32x32xn_16bpc_420_csfl0_neon: 3.47 1.72 2.99 4.18 2.68 6.19
fguv_32x32xn_16bpc_420_csfl1_neon: 3.24 1.36 2.58 3.78 2.73 5.27
fguv_32x32xn_16bpc_422_csfl0_neon: 3.57 2.07 3.05 4.32 2.74 6.20
fguv_32x32xn_16bpc_422_csfl1_neon: 3.33 1.44 2.62 3.89 2.71 5.28
fguv_32x32xn_16bpc_444_csfl0_neon: 3.48 1.69 3.06 4.48 2.97 6.69
fguv_32x32xn_16bpc_444_csfl1_neon: 3.06 1.16 2.36 3.85 2.75 5.19
fgy_32x32xn_16bpc_neon: 2.89 1.05 2.29 3.49 2.49 3.15
Absolute numbers:
Cortex A7 A8 A9 A53 A72 A73
fguv_32x32xn_16bpc_420_csfl0_neon: 6237.3 12701.0 6687.1 4525.8 3220.8 3195.4
fguv_32x32xn_16bpc_420_csfl1_neon: 5143.2 11684.8 5926.4 3857.2 2604.7 2556.5
fguv_32x32xn_16bpc_422_csfl0_neon: 6347.3 11005.2 6797.5 4582.4 3300.4 3250.5
fguv_32x32xn_16bpc_422_csfl1_neon: 5275.2 11594.8 5992.6 3931.1 2668.7 2607.3
fguv_32x32xn_16bpc_444_csfl0_neon: 5181.6 11310.0 5575.4 3629.7 2383.8 2530.0
fguv_32x32xn_16bpc_444_csfl1_neon: 4081.9 10958.8 4868.5 2962.9 1870.3 2034.2
fgy_32x32xn_16bpc_neon: 15439.1 43129.0 19406.6 11542.3 7463.9 7827.8
Corresponding numbers for arm64:
Cortex A53 A72 A73
fguv_32x32xn_16bpc_420_csfl0_neon: 4019.2 3247.4 3259.6
fguv_32x32xn_16bpc_420_csfl1_neon: 3460.1 2628.7 2640.8
fguv_32x32xn_16bpc_422_csfl0_neon: 4034.4 3329.9 3287.5
fguv_32x32xn_16bpc_422_csfl1_neon: 3468.3 2749.3 2686.6
fguv_32x32xn_16bpc_444_csfl0_neon: 3117.7 2447.4 2539.8
fguv_32x32xn_16bpc_444_csfl1_neon: 2641.2 1977.2 2132.8
fgy_32x32xn_16bpc_neon: 9873.5 7605.7 7656.2
|
|
Relative speedup over C code:
Cortex A7 A8 A9 A53 A72 A73
fguv_32x32xn_8bpc_420_csfl0_neon: 4.20 2.19 3.48 4.93 3.60 5.93
fguv_32x32xn_8bpc_420_csfl1_neon: 3.92 1.52 2.84 4.34 3.82 5.93
fguv_32x32xn_8bpc_422_csfl0_neon: 4.27 2.13 3.58 5.02 4.04 5.95
fguv_32x32xn_8bpc_422_csfl1_neon: 3.99 1.56 2.91 4.43 3.89 6.00
fguv_32x32xn_8bpc_444_csfl0_neon: 4.48 2.08 3.89 5.66 4.07 6.51
fguv_32x32xn_8bpc_444_csfl1_neon: 4.45 1.41 2.99 5.28 3.63 6.09
fgy_32x32xn_8bpc_neon: 3.61 1.10 2.62 4.35 3.06 3.74
Absolute numbers:
Cortex A7 A8 A9 A53 A72 A73
fguv_32x32xn_8bpc_420_csfl0_neon: 5318.8 11167.7 6024.6 3909.9 2945.2 2993.5
fguv_32x32xn_8bpc_420_csfl1_neon: 4351.0 10929.7 5269.5 3316.8 2166.5 2256.9
fguv_32x32xn_8bpc_422_csfl0_neon: 5387.9 11746.7 6080.0 3945.8 2988.1 3046.3
fguv_32x32xn_8bpc_422_csfl1_neon: 4396.0 11083.2 5300.8 3354.9 2216.4 2291.4
fguv_32x32xn_8bpc_444_csfl0_neon: 4347.9 10595.0 5134.4 3079.1 2277.7 2392.9
fguv_32x32xn_8bpc_444_csfl1_neon: 3295.0 10518.2 4442.6 2476.3 1716.3 1829.2
fgy_32x32xn_8bpc_neon: 12376.2 41046.9 17259.7 9153.1 6610.4 7005.3
Corresponding numbers for arm64:
Cortex A53 A72 A73
fguv_32x32xn_8bpc_420_csfl0_neon: 3822.9 2920.0 2935.7
fguv_32x32xn_8bpc_420_csfl1_neon: 3209.7 2231.7 2335.4
fguv_32x32xn_8bpc_422_csfl0_neon: 3807.9 2886.5 2966.7
fguv_32x32xn_8bpc_422_csfl1_neon: 3197.1 2187.9 2355.9
fguv_32x32xn_8bpc_444_csfl0_neon: 2757.8 2227.4 2334.4
fguv_32x32xn_8bpc_444_csfl1_neon: 2244.6 1719.1 1786.7
fgy_32x32xn_8bpc_neon: 8192.2 6563.3 6969.1
|
|
Relative speedup over C code:
Cortex A53 A72 A73 Apple M1
fguv_32x32xn_16bpc_420_csfl0_neon: 4.57 2.08 3.57 7.61
fguv_32x32xn_16bpc_420_csfl1_neon: 4.92 2.89 3.96 4.26
fguv_32x32xn_16bpc_422_csfl0_neon: 4.59 2.14 3.61 5.88
fguv_32x32xn_16bpc_422_csfl1_neon: 4.92 2.90 3.90 5.00
fguv_32x32xn_16bpc_444_csfl0_neon: 3.64 1.89 2.86 4.72
fguv_32x32xn_16bpc_444_csfl1_neon: 3.59 2.26 2.76 3.22
|
|
|
|
|
|
Previously we did 32 gathers even though only 16 are
needed.
Before: Cortex A53 A72 A73 Apple M1
fguv_32x32xn_8bpc_420_csfl0_neon: 5352.1 3985.0 4068.9 8.3
fguv_32x32xn_8bpc_420_csfl1_neon: 4738.2 3297.8 3633.0 8.2
fguv_32x32xn_8bpc_422_csfl0_neon: 5386.0 4036.8 4093.5 8.3
fguv_32x32xn_8bpc_422_csfl1_neon: 4779.9 3392.6 3641.6 8.2
fguv_32x32xn_8bpc_444_csfl0_neon: 3068.4 2422.0 2436.5 4.9
fguv_32x32xn_8bpc_444_csfl1_neon: 2558.3 1908.4 1926.6 4.4
After:
fguv_32x32xn_8bpc_420_csfl0_neon: 4330.4 3118.5 3224.6 5.3
fguv_32x32xn_8bpc_420_csfl1_neon: 3731.8 2416.9 2619.6 4.7
fguv_32x32xn_8bpc_422_csfl0_neon: 4364.7 3129.3 3247.6 5.4
fguv_32x32xn_8bpc_422_csfl1_neon: 3762.5 2450.2 2661.8 4.7
fguv_32x32xn_8bpc_444_csfl0_neon: 3075.1 2376.4 2429.4 4.9
fguv_32x32xn_8bpc_444_csfl1_neon: 2564.5 1865.9 1952.8 4.4
|
|
Before: Cortex A53 A72 A73 Apple M1
fgy_32x32xn_16bpc_neon: 10396.8 8150.8 8718.3 19.5
After:
fgy_32x32xn_16bpc_neon: 9665.1 7558.8 7652.8 19.5
|
|
After multiplying two int8_t, the maximum possible output is
-128*-128 = 16384. One can't add two such values in an int16_t (even if
all the products of all other int8_t combinations can be).
Previously the summing used 16 bit intermediates for the sum of two
products and only lengthened the result to 32 bit when accumulating
three or more products.
Before: Cortex A53 A72 A73 Apple M1
gen_grain_y_ar1_8bpc_neon: 112598.5 71309.2 74889.8 372.2
gen_grain_y_ar2_8bpc_neon: 139932.4 91442.3 95788.4 387.3
gen_grain_y_ar3_8bpc_neon: 185607.6 115691.6 131655.8 403.0
After:
gen_grain_y_ar1_8bpc_neon: 112968.8 71897.9 76171.2 371.2
gen_grain_y_ar2_8bpc_neon: 142768.8 94517.9 97934.4 387.5
gen_grain_y_ar3_8bpc_neon: 191625.2 121083.0 135975.3 405.6
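The overflow described above can be reproduced with a scalar model. This is a sketch, assuming a wrapping (non-saturating) 16-bit add for the old behaviour; the helper names are hypothetical and do not correspond to anything in the assembly.

```python
def wrap_int16(v):
    """Reduce v to a signed 16-bit value with two's complement wraparound."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def sum_products_16bit(pairs):
    """Accumulate int8 products in a 16-bit intermediate (models the old code path)."""
    acc = 0
    for a, b in pairs:
        acc = wrap_int16(acc + a * b)
    return acc


def sum_products_32bit(pairs):
    """Accumulate int8 products in a wide intermediate (models the fixed code path)."""
    return sum(a * b for a, b in pairs)


if __name__ == "__main__":
    worst = [(-128, -128), (-128, -128)]
    print(sum_products_16bit(worst), sum_products_32bit(worst))
```

Only the pair of -128 * -128 products triggers the problem; every other combination of two int8 products still fits in 16 bits, which is why the bug went unnoticed.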
|
|
Use the mvni instruction instead of setting the constant in a GPR
first.
|
|
In 16 bpc, the pixels are 16 bit integers, but valid pixels
only use up to 12 bits, and the scaling buffer only contains 4096
elements.
The src pixels are, normally, supposed to be valid pixels, but when
processing blocks of 32 pixels at a time, it can operate on
uninitialized pixels past the right edge.
Before: Cortex A53 A72 A73 Apple M1
fgy_32x32xn_16bpc_neon: 10372.5 8194.4 8612.1 24.2
After:
fgy_32x32xn_16bpc_neon: 10837.9 8469.5 8885.1 24.6
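The message does not spell out how the lookup is guarded. One possible approach, shown here purely as an assumption, is to mask each pixel to the valid bit depth before indexing the scaling buffer; `safe_scaling_lookup` and `bitdepth_max` are illustrative names.

```python
SCALING_SIZE = 4096  # from the message: the scaling buffer holds 4096 entries


def safe_scaling_lookup(scaling, pixel, bitdepth_max=0xFFF):
    """Mask the pixel to the valid range so the index stays inside the buffer.

    Masking is an assumed guard for illustration; clamping would also work.
    """
    return scaling[pixel & bitdepth_max]


if __name__ == "__main__":
    scaling = list(range(SCALING_SIZE))
    # A valid 12-bit pixel is looked up unchanged.
    print(safe_scaling_lookup(scaling, 100))
    # An uninitialized 16-bit value past the right edge stays in bounds.
    print(safe_scaling_lookup(scaling, 0xFFFF))
```

The extra instruction per lookup would be consistent with the small slowdown in the numbers above, though the message itself does not attribute it.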
|
|
Relative speedup over C code:
Cortex A53 A72 A73 Apple M1
fgy_32x32xn_16bpc_neon: 3.87 2.28 2.78 3.45
|
|
The existing functions/macros for generate_grain_y are templated
for adding in a final coefficient from the y buffer, while
trying to keep the binary size down.
Relative speedup over C code:
Cortex A53 A72 A73 Apple M1
gen_grain_uv_ar0_8bpc_420_neon: 4.62 4.55 5.27 9.08
gen_grain_uv_ar0_8bpc_422_neon: 4.81 4.90 5.33 7.25
gen_grain_uv_ar0_8bpc_444_neon: 5.05 5.17 5.69 7.04
gen_grain_uv_ar1_8bpc_420_neon: 3.61 3.09 3.68 3.92
gen_grain_uv_ar1_8bpc_422_neon: 3.71 3.22 3.64 3.46
gen_grain_uv_ar1_8bpc_444_neon: 3.59 3.40 3.67 3.11
gen_grain_uv_ar2_8bpc_420_neon: 4.77 3.85 4.81 4.55
gen_grain_uv_ar2_8bpc_422_neon: 4.88 3.96 4.85 4.15
gen_grain_uv_ar2_8bpc_444_neon: 5.18 4.65 5.18 3.83
gen_grain_uv_ar3_8bpc_420_neon: 6.14 5.25 6.14 5.64
gen_grain_uv_ar3_8bpc_422_neon: 6.27 5.27 6.28 5.42
gen_grain_uv_ar3_8bpc_444_neon: 6.84 6.40 6.79 5.18
|
|
Relative speedup over C code:
Cortex A53 A72 A73 Apple M1
gen_grain_y_ar0_8bpc_neon: 5.03 5.17 5.59 5.55
gen_grain_y_ar1_8bpc_neon: 3.38 3.38 3.56 2.77
gen_grain_y_ar2_8bpc_neon: 5.00 4.64 5.06 3.38
gen_grain_y_ar3_8bpc_neon: 6.74 6.53 6.67 4.66
|
|
|