github.com/videolan/dav1d.git
path: root/src/arm

Age  Commit message  Author
2022-09-19  arm: itx: Add clipping to row_clip_min/max in the 10 bpc codepaths  Martin Storsjö
    This fixes conformance with the argon test samples, in particular with these samples:

    profile0_core/streams/test10100_579_8614.obu
    profile0_core/streams/test10218_6914.obu

    This gives a pretty notable slowdown to these transforms - some examples:

    Before:
                                              Cortex A53      A72      A73  Apple M1
    inv_txfm_add_8x8_dct_dct_1_10bpc_neon:         365.7    290.2    299.8       0.3
    inv_txfm_add_16x16_dct_dct_2_10bpc_neon:      1865.2   1384.1   1457.5       2.6
    inv_txfm_add_64x64_dct_dct_4_10bpc_neon:     33976.3  26817.0  24864.2      40.4
    After:
    inv_txfm_add_8x8_dct_dct_1_10bpc_neon:         397.7    322.2    335.1       0.4
    inv_txfm_add_16x16_dct_dct_2_10bpc_neon:      2121.9   1336.7   1664.6       2.6
    inv_txfm_add_64x64_dct_dct_4_10bpc_neon:     38569.4  27622.6  28176.0      51.0

    Thus, for the transforms alone, it makes them around 10-13% slower (the Apple M1 measurements are too noisy to be conclusive here).

    Measured on actual full decoding, it makes decoding of 10 bpc Chimera around maybe 1% slower on an Apple M1 - close to measurement noise anyway.
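The clipping this commit adds can be pictured in scalar C. The following is only a sketch: the commit message does not state the clip range, so the code assumes it is the range of a signed integer with max(bitdepth + 8, 16) bits, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp v to the range of a signed integer that has `bits` bits. */
static int32_t clip_to_signed_bits(int32_t v, int bits)
{
    assert(bits >= 2 && bits <= 31);
    const int32_t max = (int32_t)((UINT32_C(1) << (bits - 1)) - 1);
    const int32_t min = -max - 1;
    return v < min ? min : v > max ? max : v;
}

/* Hypothetical helper: clamp one row-transform intermediate.
 * Assumption: the range is max(bitdepth + 8, 16) signed bits. */
static int32_t clip_row_intermediate(int32_t v, int bitdepth)
{
    const int bits = bitdepth + 8 > 16 ? bitdepth + 8 : 16;
    return clip_to_signed_bits(v, bits);
}
```

Under that assumption, 10 bpc values are clamped to [-131072, 131071], which no longer fits in 16-bit lanes; that would be consistent with the extra cost measured above.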
2022-09-15  Fix overflow in 8-bit NEON ADST  David Conrad
    In 8-bit ADST, it's possible that the final Round2(x[0], 12) can exceed the 16-bit signed range.

    Specifically, in 7.13.2.6. Inverse ADST4 process, the precision requirement is:

    "It is a requirement of bitstream conformance that all values stored in the s and x arrays by this process are representable by a signed integer using r + 12 bits of precision."

    For 8 bits, r is 16 for both row and column, so x[] can be 28-bit signed. For values [134215680, 134217727] (within 2047 of the maximum 28-bit value), the final Round2(x[0], 12) evaluates to 32768, exceeding the 16-bit signed range.

    So switch to using sqrshrn, which saturates to 16-bit signed.

    This is a continuation of:
    Commit b53ff29d80a21180e5ad9bbe39a02541151f4f53
    arm: itx: Do clipping in all narrowing downshifts
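The arithmetic in this message can be checked with a small scalar model. This is a sketch for non-negative inputs only; `round2` follows the Round2 definition quoted above, and `round2_sat_s16` is a hypothetical stand-in for what a saturating rounding narrow such as sqrshrn does to one lane.

```c
#include <assert.h>
#include <stdint.h>

/* Round2(x, n): add half of 2^n, then shift right (non-negative x only). */
static int32_t round2(int32_t x, int n)
{
    assert(x >= 0 && n >= 1 && n <= 16);
    return (x + (1 << (n - 1))) >> n;
}

/* Same rounding shift, with the result saturated to the int16_t maximum. */
static int16_t round2_sat_s16(int32_t x, int n)
{
    const int32_t r = round2(x, n);
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);
}
```

With these helpers, `round2(134215680, 12)` is 32768, one past INT16_MAX, while `round2_sat_s16(134215680, 12)` is 32767.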
2022-09-08  x86: Fix rare crash in chroma film grain asm  Henrik Gramner
    The width parameter is used directly as a pointer offset, so ensure that it has an appropriately sized data type. This has been done previously for luma, but chroma was overlooked.
2022-07-14  Enable pointer authentication in assembly when building arm64e  David Conrad
2022-07-11  Don't trash the return stack buffer in the NEON loop filter  David Conrad
    The NEON loop filter's innermost asm function can return to a different location than the address that called it. This messes up the return stack predictor, causing returns to be mispredicted.

    Instead, rework the function to always return to the address that calls it, and return the information needed for the caller to short-circuit storing pixels.
2022-07-06  Eliminate unused C DSP functions at compile time  Henrik Gramner
    When compiling with asm enabled there's no point in compiling C versions of DSP functions that have asm implementations using instruction sets that the compiler can unconditionally use.

    E.g. when compiling with -mssse3 we can remove the C version of all functions with SSSE3 implementations.

    This is accomplished using the compiler's dead code elimination functionality.

    Can be configured using the new 'trim_dsp' meson option, which by default is enabled when compiling in release mode.
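How dead code elimination can drop the C versions is easiest to see in a toy dispatcher. This is a sketch only: the `ALWAYS_HAVE_SIMD` macro, the `sum_fn` type and all function names are hypothetical and are not taken from dav1d.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag: 1 when the compiler may use the SIMD extension
 * unconditionally (for example because of a flag like -mssse3). */
#ifndef ALWAYS_HAVE_SIMD
#define ALWAYS_HAVE_SIMD 1
#endif

typedef int (*sum_fn)(const int *v, size_t n);

/* Plain C version. */
static int sum_c(const int *v, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for the asm version; a real build would link assembly here. */
static int sum_simd(const int *v, size_t n)
{
    int s = 0;
    for (size_t i = n; i-- > 0;)
        s += v[i];
    return s;
}

/* Select an implementation. When ALWAYS_HAVE_SIMD is 1 the first branch is
 * constant-false, so sum_c ends up unreferenced and can be discarded. */
static sum_fn pick_sum(int cpu_has_simd)
{
    sum_fn fn = NULL;
    if (!ALWAYS_HAVE_SIMD)
        fn = sum_c;
    if (ALWAYS_HAVE_SIMD || cpu_has_simd)
        fn = sum_simd;
    assert(fn != NULL);
    return fn;
}
```

Because `sum_c` is static and its only use sits behind a condition the compiler can fold to false, an optimizing build is free to leave it out of the binary.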
2022-03-10  arm: Only produce the PAC/BTI .note section when targeting ELF  Martin Storsjö
    This avoids build errors if such features are enabled while targeting another binary format. (Using such features on other platforms might require some other form of signaling/setup though, but the ELF specific .note section isn't applicable at least.)
2022-03-10  arm: Add comments to #endif and #else in nonobvious cases  Martin Storsjö
2022-03-02  arm: itx: Do clipping in all narrowing downshifts  Martin Storsjö
    This should avoid the risk of unexpected wraparound. This shouldn't technically be needed for spec compliant bitstreams.

    In practice, this fixes the mismatch observed in issue !388 (in checkasm generated input data).
2022-02-28  build: Make "film_grain" vs "filmgrain" DSP file names consistent  Henrik Gramner
2022-02-09  arm64: Add Armv8.3-A PAC support to assembly files  André Kempe
    This patch adds optional support for Arm Pointer Authentication Codes.

    PAC support is turned on or off at compile time using additional compiler flags. Unless any of these is enabled explicitly, no additional code will be emitted at all.
2022-01-13  arm32: mc16: Fix out of bounds reads/writes in 8tap/bilin w2/w4 for vertical OBMC  Martin Storsjö
    Before:
                                       Cortex A7       A8       A9      A53      A72      A73
    mc_8tap_regular_w2_v_16bpc_neon:       384.4    194.0    242.9    193.2    134.1    140.0
    mc_8tap_regular_w4_v_16bpc_neon:       578.2    242.2    282.7    263.1    171.2    168.9
    After:
    mc_8tap_regular_w2_v_16bpc_neon:       397.1    207.7    250.6    212.9    136.9    140.8
    mc_8tap_regular_w4_v_16bpc_neon:       575.2    240.4    277.9    263.0    171.9    167.4
2022-01-13  arm32: mc: Fix out of bounds reads/writes in 8tap/bilin w2/w4 for vertical OBMC  Martin Storsjö
    For 8tap, unroll the vertical filters slightly less (by 4 instead of 8 elements) and add a special case trailer that handles only 2 elements (for 2x6 and 4x6).

    By unrolling less, performance on in-order cores is somewhat impacted.

    Before:
                                      Cortex A7       A8       A9      A53      A72      A73
    mc_8tap_regular_w2_v_8bpc_neon:       340.0    305.4    336.5    196.5    160.5    167.8
    mc_8tap_regular_w4_v_8bpc_neon:       400.4    319.5    391.5    210.3    189.7    188.8
    After:
    mc_8tap_regular_w2_v_8bpc_neon:       364.6    268.5    340.1    223.7    161.7    175.2
    mc_8tap_regular_w4_v_8bpc_neon:       408.7    328.4    380.4    219.8    190.7    183.8
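The loop structure described here (4 rows per iteration plus a 2-row trailer) can be sketched in C. This is an illustration under assumptions, not the assembly: `do_row` is a hypothetical stand-in that copies a row instead of applying the 8-tap filter, and all names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for producing one output row; the real code filters vertically. */
static void do_row(uint8_t *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = src[x];
}

/* Process h rows (h even, h >= 2): 4 rows per iteration, then a trailer
 * that handles the last 2 rows when h is not a multiple of 4.
 * Returns the number of rows written. */
static int filter_rows(uint8_t *dst, const uint8_t *src,
                       int w, int h, ptrdiff_t stride)
{
    assert(h >= 2 && h % 2 == 0);
    int y = 0;
    for (; y + 4 <= h; y += 4)
        for (int i = 0; i < 4; i++)
            do_row(dst + (y + i) * stride, src + (y + i) * stride, w);
    if (y < h) { /* e.g. h == 2 or h == 6 */
        for (int i = 0; i < 2; i++)
            do_row(dst + (y + i) * stride, src + (y + i) * stride, w);
        y += 2;
    }
    return y;
}

/* Usage check: returns 1 if exactly h rows were written and every byte
 * after them is untouched. */
static int writes_exactly(int h)
{
    enum { W = 4, MAX_H = 16 };
    uint8_t src[(MAX_H + 1) * W], dst[(MAX_H + 1) * W];
    assert(h <= MAX_H);
    memset(src, 1, sizeof(src));
    memset(dst, 0, sizeof(dst));
    if (filter_rows(dst, src, W, h, W) != h)
        return 0;
    for (int i = 0; i < (MAX_H + 1) * W; i++)
        if (dst[i] != (i < h * W ? 1 : 0))
            return 0;
    return 1;
}
```

A loop unrolled by 8 would touch rows 6 and 7 of a 6-row block; with an unroll of 4 and a 2-row trailer, `writes_exactly(6)` holds.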
2022-01-13  arm64: mc16: Fix out of bounds reads/writes in 8tap/bilin w2/w4 for vertical OBMC  Martin Storsjö
    Before:
                                      Cortex A53      A72      A73
    mc_8tap_regular_w2_v_16bpc_neon:       164.0    125.3    122.6
    mc_8tap_regular_w4_v_16bpc_neon:       232.5    164.0    166.6
    After:
    mc_8tap_regular_w2_v_16bpc_neon:       192.4    131.0    121.4
    mc_8tap_regular_w4_v_16bpc_neon:       235.6    162.9    163.7
2022-01-13  arm64: mc: Fix out of bounds reads/writes in 8tap/bilin w2/w4 for vertical OBMC  Martin Storsjö
    For 8tap, unroll the vertical filters slightly less (by 4 instead of 8 elements) and add a special case trailer that handles only 2 elements (for 2x6 and 4x6).

    By unrolling less, performance on in-order cores is somewhat impacted.

    Before:
                                     Cortex A53      A72      A73
    mc_8tap_regular_w2_v_8bpc_neon:       146.5    141.3    145.6
    mc_8tap_regular_w4_v_8bpc_neon:       175.2    180.3    162.4
    After:
    mc_8tap_regular_w2_v_8bpc_neon:       175.7    142.7    150.5
    mc_8tap_regular_w4_v_8bpc_neon:       183.3    176.0    154.6
2021-12-03  AArch64 Neon: Replace XTN, XTN2 pairs with single UZP1  Jonathan Wright
    It is often necessary to narrow the elements in a pair of Neon vectors to half the current width, before combining the results. This is usually achieved with a pair of XTN/XTN2 instructions. However, it is possible to achieve the same outcome with a single 'unzip' (UZP1) instruction.

    This patch changes all sequential AArch64 Neon XTN, XTN2 instruction pairs to use a single UZP1 instruction.

    Change-Id: I2a9fad3082d2cf363b1edce9ef0b8d547ec6c41a
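The equivalence claimed here can be modelled on plain bytes. The sketch below treats a 128-bit register as 16 bytes with byte 0 least significant and models the 16-bit to 8-bit case; `vec128` and the function names are hypothetical, and the code is a scalar model, not NEON intrinsics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 128-bit register as bytes; b[0] is the least significant byte. */
typedef struct { uint8_t b[16]; } vec128;

/* Read 16-bit lane i (little-endian lane layout). */
static uint16_t lane16(const vec128 *v, int i)
{
    assert(i >= 0 && i < 8);
    return (uint16_t)(v->b[2 * i] | (v->b[2 * i + 1] << 8));
}

/* Model of XTN (narrow a into the low half) followed by XTN2 (narrow b
 * into the high half): each 16-bit lane is truncated to its low 8 bits. */
static vec128 xtn_xtn2(const vec128 *a, const vec128 *b)
{
    vec128 r;
    for (int i = 0; i < 8; i++) {
        r.b[i]     = (uint8_t)lane16(a, i);
        r.b[i + 8] = (uint8_t)lane16(b, i);
    }
    return r;
}

/* Model of UZP1 on byte lanes: even-indexed bytes of a, then of b. */
static vec128 uzp1_16b(const vec128 *a, const vec128 *b)
{
    vec128 r;
    for (int i = 0; i < 8; i++) {
        r.b[i]     = a->b[2 * i];
        r.b[i + 8] = b->b[2 * i];
    }
    return r;
}

/* Returns 1 if both sequences give the same bytes for a sample input. */
static int same_result(void)
{
    vec128 a, b;
    for (int i = 0; i < 16; i++) {
        a.b[i] = (uint8_t)(i * 17 + 3);
        b.b[i] = (uint8_t)(255 - i * 11);
    }
    const vec128 x = xtn_xtn2(&a, &b), y = uzp1_16b(&a, &b);
    return memcmp(x.b, y.b, sizeof(x.b)) == 0;
}
```

The two models agree because the low byte of 16-bit lane i is byte 2i of the register, which is exactly the set of bytes UZP1 keeps.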
2021-12-03  AArch64 Neon: Use CMLT instead of SSHR to compute sign  Jonathan Wright
    The CMLT instruction has twice the throughput of SSHR on all modern out-of-order Arm cores. The Software Optimization Guides (SWOG) for the Cortex-A76, Cortex-A77 and Neoverse-N1 cores are being updated to reflect this. (The current version of the SWOG for these cores states that CMLT and SSHR both have the same execution throughput.)

    This patch changes all instances of sign computation to use CMLT instead of SSHR.

    Change-Id: Ice5747fee4e3bdd98ae8fbc036d735f55e492249
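That the two instructions produce the same sign mask can be checked with a scalar model of one 16-bit lane. This is a sketch: the helper names are hypothetical, and the arithmetic shift is written out by hand so the C code does not rely on implementation-defined shifts of negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right of a 16-bit lane (model of SSHR). */
static int16_t sshr16(int16_t x, int n)
{
    assert(n >= 1 && n <= 15);
    return (int16_t)(x < 0 ? ~(~x >> n) : x >> n);
}

/* Model of CMLT against zero: all ones if the lane is negative, else zero. */
static int16_t cmlt_zero16(int16_t x)
{
    return (int16_t)(x < 0 ? -1 : 0);
}

/* Returns 1 if shifting by 15 and comparing against zero agree for
 * every possible 16-bit input. */
static int sign_masks_agree(void)
{
    for (int32_t v = INT16_MIN; v <= INT16_MAX; v++)
        if (sshr16((int16_t)v, 15) != cmlt_zero16((int16_t)v))
            return 0;
    return 1;
}
```

The same reasoning applies to other lane widths with the shift amount set to the lane width minus one.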
2021-10-29  Remove lpf_stride parameter from LR filters  Victorien Le Couviour--Tuffet
2021-10-29  Allow CDEF and LR to run sbrows in parallel  Victorien Le Couviour--Tuffet
2021-10-27  arm64: Add Armv8.5-A BTI support to assembly files  Salome Thirot
    Add Branch Target Identifiers (BTIs) to all functions defined in AArch64 assembly files.

    BTI support is turned on or off at compile time based on the presence of the __ARM_FEATURE_BTI_DEFAULT feature macro.

    A binary compiled with BTI support can be executed on an Armv8-A processor without BTI support because the instructions are defined in NOP space.

    Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
    Signed-off-by: Matthew Dalzell <matthew.dalzell@arm.com>
    Signed-off-by: Salome Thirot <salome.thirot@arm.com>
2021-10-27  arm64: Change br instructions to ret for function returns  Salome Thirot
    Using ret x<n> instead of br x<n> removes the need for a BTI landing pad at the target address in x<n>.

    Using 'ret' instead of 'br' does not have any performance implications.

    Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
    Signed-off-by: Matthew Dalzell <matthew.dalzell@arm.com>
    Signed-off-by: Salome Thirot <salome.thirot@arm.com>
2021-09-03  arm32: filmgrain: Add NEON implementation of gen_grain for 16 bpc  Martin Storsjö
    Relative speedup over C code:
                                      Cortex A7     A8     A9    A53    A72    A73
    gen_grain_uv_ar0_16bpc_420_neon:       5.05   6.71   5.42   4.95   6.45   9.59
    gen_grain_uv_ar0_16bpc_422_neon:       5.54   7.18   6.29   5.45   6.55   8.80
    gen_grain_uv_ar0_16bpc_444_neon:       6.64   8.07   6.70   6.89   7.16   9.98
    gen_grain_uv_ar1_16bpc_420_neon:       3.22   2.16   2.58   3.51   3.16   4.68
    gen_grain_uv_ar1_16bpc_422_neon:       3.24   2.26   2.73   3.83   3.36   4.65
    gen_grain_uv_ar1_16bpc_444_neon:       3.48   2.41   2.85   4.32   3.69   4.90
    gen_grain_uv_ar2_16bpc_420_neon:       3.29   2.90   2.92   4.14   3.48   4.59
    gen_grain_uv_ar2_16bpc_422_neon:       3.35   3.01   3.13   4.50   3.61   4.50
    gen_grain_uv_ar2_16bpc_444_neon:       3.66   3.55   3.32   5.15   3.87   4.93
    gen_grain_uv_ar3_16bpc_420_neon:       3.39   3.79   3.60   4.67   4.04   4.70
    gen_grain_uv_ar3_16bpc_422_neon:       3.39   4.04   3.96   4.93   4.16   4.65
    gen_grain_uv_ar3_16bpc_444_neon:       3.79   4.47   4.36   5.54   4.59   5.07
    gen_grain_y_ar0_16bpc_neon:            5.05   5.26   6.97   5.47   5.95   8.59
    gen_grain_y_ar1_16bpc_neon:            2.35   1.72   2.07   3.53   3.16   3.47
    gen_grain_y_ar2_16bpc_neon:            3.02   2.70   2.88   4.19   3.57   4.03
    gen_grain_y_ar3_16bpc_neon:            3.49   3.18   3.69   5.01   3.99   4.50
2021-09-03  arm64: filmgrain16: Remove a leftover unused macro  Martin Storsjö
2021-09-03  arm64: filmgrain16: Fix the default elems parameter of sum_lag2/3_func  Martin Storsjö
    This makes it correctly hit some conditions that avoid duplicated code, shrinking the text section by 1524 bytes.
2021-09-01  arm32: filmgrain: Add NEON implementation of gen_grain for 8 bpc  Martin Storsjö
    Relative speedup over C code:
                                     Cortex A7     A8     A9    A53    A72    A73
    gen_grain_uv_ar0_8bpc_420_neon:       6.13   7.81   8.17   6.78   6.62  11.13
    gen_grain_uv_ar0_8bpc_422_neon:       6.34   7.64   8.00   6.83   6.93  10.31
    gen_grain_uv_ar0_8bpc_444_neon:       7.09   8.29   8.55   7.95   7.89  11.05
    gen_grain_uv_ar1_8bpc_420_neon:       3.39   2.26   3.06   4.13   3.41   4.95
    gen_grain_uv_ar1_8bpc_422_neon:       3.40   2.23   3.02   4.18   3.36   4.73
    gen_grain_uv_ar1_8bpc_444_neon:       3.46   2.18   2.95   4.46   3.57   4.91
    gen_grain_uv_ar2_8bpc_420_neon:       3.88   3.00   3.32   4.74   3.57   5.31
    gen_grain_uv_ar2_8bpc_422_neon:       3.92   3.04   3.36   4.82   3.57   5.06
    gen_grain_uv_ar2_8bpc_444_neon:       4.32   3.14   3.62   5.56   3.90   5.43
    gen_grain_uv_ar3_8bpc_420_neon:       4.35   3.53   4.05   5.35   4.44   5.56
    gen_grain_uv_ar3_8bpc_422_neon:       4.38   3.49   4.17   5.41   4.48   5.36
    gen_grain_uv_ar3_8bpc_444_neon:       4.84   3.70   4.36   5.95   4.87   5.82
    gen_grain_y_ar0_8bpc_neon:            5.18   5.57   7.65   5.93   7.13   9.01
    gen_grain_y_ar1_8bpc_neon:            2.64   1.66   2.48   3.32   3.15   3.77
    gen_grain_y_ar2_8bpc_neon:            3.57   2.64   3.21   4.59   3.68   4.64
    gen_grain_y_ar3_8bpc_neon:            4.27   3.93   4.12   5.41   4.63   5.17

    (A73 is benched against C code compiled with a different C compiler, which can explain the slightly differing numbers there.)

    Absolute numbers:
                                     Cortex A7        A8        A9       A53       A72       A73
    gen_grain_uv_ar0_8bpc_420_neon:    19614.6   13396.4   12320.4   15030.7    8288.1    8754.4
    gen_grain_uv_ar0_8bpc_422_neon:    34660.9   24315.5   22225.3   26809.2   14549.8   15804.6
    gen_grain_uv_ar0_8bpc_444_neon:    55625.6   39914.5   37100.2   44658.3   22917.3   27369.6
    gen_grain_uv_ar1_8bpc_420_neon:    50049.5   63179.4   44793.1   36406.7   22690.3   25401.9
    gen_grain_uv_ar1_8bpc_422_neon:    93289.5  117755.0   82815.4   67081.4   43133.1   46698.0
    gen_grain_uv_ar1_8bpc_444_neon:   170880.0  223259.2  156241.5  122760.0   78655.6   85604.9
    gen_grain_uv_ar2_8bpc_420_neon:    68185.5   78123.2   61457.3   47886.7   31526.2   36519.6
    gen_grain_uv_ar2_8bpc_422_neon:   129195.2  148653.9  114133.2   89822.7   60242.6   70160.1
    gen_grain_uv_ar2_8bpc_444_neon:   233133.7  272277.4  214108.7  161589.5  109069.3  127763.7
    gen_grain_uv_ar3_8bpc_420_neon:    96374.4   94372.2   79663.8   70832.0   43065.3   50593.9
    gen_grain_uv_ar3_8bpc_422_neon:   186324.8  184321.8  151490.1  136200.1   83758.0   98378.7
    gen_grain_uv_ar3_8bpc_444_neon:   335596.6  336811.6  279755.5  247251.5  151657.2  178906.0
    gen_grain_y_ar0_8bpc_neon:         46109.3   36022.2   28476.2   36478.5   18740.1   20660.4
    gen_grain_y_ar1_8bpc_neon:        165054.2  217090.4  152578.9  118409.4   74357.2   83794.5
    gen_grain_y_ar2_8bpc_neon:        226576.9  268320.3  210924.6  157829.4  105956.5  124293.2
    gen_grain_y_ar3_8bpc_neon:        328337.2  330421.3  275110.1  242097.3  148538.7  177270.8

    Corresponding numbers for the original arm64 version:
                                    Cortex A53       A72       A73
    gen_grain_uv_ar0_8bpc_420_neon:    14874.7    7765.5    8536.0
    gen_grain_uv_ar0_8bpc_422_neon:    26510.9   13685.3   15308.2
    gen_grain_uv_ar0_8bpc_444_neon:    43189.6   21565.3   24312.0
    gen_grain_uv_ar1_8bpc_420_neon:    33715.7   21669.8   22758.3
    gen_grain_uv_ar1_8bpc_422_neon:    63955.3   41581.4   42852.5
    gen_grain_uv_ar1_8bpc_444_neon:   117390.1   76503.5   78446.4
    gen_grain_uv_ar2_8bpc_420_neon:    42779.0   27794.3   29677.9
    gen_grain_uv_ar2_8bpc_422_neon:    82283.8   53446.7   58232.2
    gen_grain_uv_ar2_8bpc_444_neon:   147773.8   98492.7  103754.1
    gen_grain_uv_ar3_8bpc_420_neon:    56698.8   35697.1   40695.9
    gen_grain_uv_ar3_8bpc_422_neon:   110132.4   69829.1   79196.8
    gen_grain_uv_ar3_8bpc_444_neon:   196642.7  124174.9  141812.5
    gen_grain_y_ar0_8bpc_neon:         36461.0   17782.0   19827.0
    gen_grain_y_ar1_8bpc_neon:        113202.7   72457.7   75995.8
    gen_grain_y_ar2_8bpc_neon:        142894.0   94450.9  100304.5
    gen_grain_y_ar3_8bpc_neon:        191697.7  120674.9  137223.8
2021-09-01  arm64: filmgrain: Remove some unnecessary backups/restores of x30  Martin Storsjö
2021-09-01  arm64: filmgrain: Simplify loading coefficients for the lag3 variant  Martin Storsjö
2021-09-01  arm64: filmgrain: Reorder two instructions in the inner loop  Martin Storsjö
    This should improve scheduling on in-order cores.
2021-08-24  arm: Add NEON implementations of splat_mv  Martin Storsjö
    Relative speedup over C code, for arm64:
                        Cortex A53     A72     A73  Apple M1
    splat_mv_w1_neon:         1.09    0.95    1.22         -
    splat_mv_w2_neon:         1.76    1.32    1.74         -
    splat_mv_w4_neon:         2.78    2.19    2.19     15.00
    splat_mv_w8_neon:         3.59    2.06    2.59     12.00
    splat_mv_w16_neon:        4.12    1.72    2.53      3.14
    splat_mv_w32_neon:        4.07    1.60    2.40      3.00

    (The resolution of the timer used on Apple M1 isn't enough to measure the small versions of this function.)

    Relative speedup over C code, for arm32:
                        Cortex A7      A8      A9     A53     A72     A73
    splat_mv_w1_neon:        0.70    1.12    0.91    0.65    1.01    1.06
    splat_mv_w2_neon:        0.94    2.16    2.01    0.99    2.52    1.63
    splat_mv_w4_neon:        1.27    2.04    1.49    1.52    1.75    2.18
    splat_mv_w8_neon:        1.75    2.47    1.16    2.88    1.95    2.58
    splat_mv_w16_neon:       2.00    2.44    1.12    3.25    1.85    2.65
    splat_mv_w32_neon:       1.43    2.28    1.19    3.55    1.77    2.65
2021-08-13  arm64: filmgrain16: Add NEON implementation of gen_grain for 16 bpc  Martin Storsjö
    Relative speedup over C code:
                                      Cortex A53     A72     A73  Apple M1
    gen_grain_uv_ar0_16bpc_420_neon:        2.90    4.13    5.43      5.80
    gen_grain_uv_ar0_16bpc_422_neon:        3.23    4.51    5.52      5.83
    gen_grain_uv_ar0_16bpc_444_neon:        4.01    4.97    6.08      5.87
    gen_grain_uv_ar1_16bpc_420_neon:        2.94    2.80    3.56      3.48
    gen_grain_uv_ar1_16bpc_422_neon:        3.14    3.07    3.68      3.47
    gen_grain_uv_ar1_16bpc_444_neon:        3.54    3.51    3.93      2.61
    gen_grain_uv_ar2_16bpc_420_neon:        3.92    3.69    4.40      3.98
    gen_grain_uv_ar2_16bpc_422_neon:        4.13    3.96    4.42      3.92
    gen_grain_uv_ar2_16bpc_444_neon:        4.69    4.33    4.84      3.25
    gen_grain_uv_ar3_16bpc_420_neon:        5.05    5.39    5.42      4.74
    gen_grain_uv_ar3_16bpc_422_neon:        5.25    5.68    5.57      4.67
    gen_grain_uv_ar3_16bpc_444_neon:        6.02    6.33    6.35      4.38
    gen_grain_y_ar0_16bpc_neon:             4.67    5.23    5.22     10.11
    gen_grain_y_ar1_16bpc_neon:             3.32    3.03    3.28      2.24
    gen_grain_y_ar2_16bpc_neon:             4.59    3.95    4.64      3.52
    gen_grain_y_ar3_16bpc_neon:             5.89    5.93    6.36      4.79

    Absolute numbers:
                                      Cortex A53       A72       A73  Apple M1
    gen_grain_uv_ar0_16bpc_420_neon:     19797.2    9725.0    9234.0      29.7
    gen_grain_uv_ar0_16bpc_422_neon:     34899.4   16875.3   17021.6      57.7
    gen_grain_uv_ar0_16bpc_444_neon:     53776.6   28470.1   28773.1     107.8
    gen_grain_uv_ar1_16bpc_420_neon:     37998.2   24631.2   24754.0      84.2
    gen_grain_uv_ar1_16bpc_422_neon:     70817.5   44642.5   46323.1     166.3
    gen_grain_uv_ar1_16bpc_444_neon:    123333.0   77316.4   83523.1     427.5
    gen_grain_uv_ar2_16bpc_420_neon:     49115.8   33053.7   33249.9      93.6
    gen_grain_uv_ar2_16bpc_422_neon:     92965.3   59663.8   64741.9     187.9
    gen_grain_uv_ar2_16bpc_444_neon:    160899.7  108845.6  115422.4     441.8
    gen_grain_uv_ar3_16bpc_420_neon:     65786.6   41924.3   45562.1     108.1
    gen_grain_uv_ar3_16bpc_422_neon:    126232.3   78691.6   87351.5     217.6
    gen_grain_uv_ar3_16bpc_444_neon:    218702.6  140197.8  151294.8     454.3
    gen_grain_y_ar0_16bpc_neon:          35867.9   17653.6   20770.7     108.0
    gen_grain_y_ar1_16bpc_neon:         118781.8   74777.1   81338.6     426.0
    gen_grain_y_ar2_16bpc_neon:         155919.9  102145.8  109698.1     438.5
    gen_grain_y_ar3_16bpc_neon:         213348.1  133054.8  144726.0     447.9

    Corresponding numbers for 8bpc:
                                     Cortex A53       A72       A73  Apple M1
    gen_grain_uv_ar0_8bpc_420_neon:     15086.1    8384.7    8556.6      29.4
    gen_grain_uv_ar0_8bpc_422_neon:     26800.6   14354.4   15526.5      56.6
    gen_grain_uv_ar0_8bpc_444_neon:     43749.6   22408.6   24627.9     108.3
    gen_grain_uv_ar1_8bpc_420_neon:     33706.3   21892.6   22835.9      87.1
    gen_grain_uv_ar1_8bpc_422_neon:     63897.0   41820.1   43468.9     171.8
    gen_grain_uv_ar1_8bpc_444_neon:    117345.1   76372.5   79938.3     370.0
    gen_grain_uv_ar2_8bpc_420_neon:     42808.8   28493.8   29932.8      92.2
    gen_grain_uv_ar2_8bpc_422_neon:     82282.5   53969.4   58191.1     181.8
    gen_grain_uv_ar2_8bpc_444_neon:    147641.4   98136.4  103157.6     430.2
    gen_grain_uv_ar3_8bpc_420_neon:     56784.3   36342.0   40812.3     102.2
    gen_grain_uv_ar3_8bpc_422_neon:    110249.7   70215.6   79716.0     200.5
    gen_grain_uv_ar3_8bpc_444_neon:    196461.7  125802.8  141781.5     440.1
    gen_grain_y_ar0_8bpc_neon:          36451.7   17794.4   19839.3     109.5
    gen_grain_y_ar1_8bpc_neon:         113155.6   71811.9   77296.8     370.2
    gen_grain_y_ar2_8bpc_neon:         142812.3   95042.4  100434.4     431.8
    gen_grain_y_ar3_8bpc_neon:         191608.6  121199.5  136946.4     437.2
2021-08-13  arm64: filmgrain: Deduplicate the sum_lagN functions  Martin Storsjö
    No difference in generated code, but >210 lines less of duplicated source code.
2021-08-13  arm64: filmgrain: Deduplicate the output_lag functions  Martin Storsjö
    No practical difference in generated code (or the size of it), but less source code to handle.
2021-08-13  arm64: filmgrain: Remove two stray ret instructions  Martin Storsjö
    These are never executed as they come after an unconditional branch.
2021-08-13  arm64: filmgrain: Uninline the get_grain_2 macro  Martin Storsjö
    This shrinks the code section by 288 bytes.
2021-08-13  arm64: filmgrain: Fix some cases of vertical whitespace alignment  Martin Storsjö
2021-08-13  arm64: filmgrain: Fix some comments in gen_grain  Martin Storsjö
2021-06-12  arm32: filmgrain: Add NEON implementation of fgy and fguv for 16 bpc  Martin Storsjö
    Relative speedup over C code:
                                        Cortex A7     A8     A9    A53    A72    A73
    fguv_32x32xn_16bpc_420_csfl0_neon:       3.47   1.72   2.99   4.18   2.68   6.19
    fguv_32x32xn_16bpc_420_csfl1_neon:       3.24   1.36   2.58   3.78   2.73   5.27
    fguv_32x32xn_16bpc_422_csfl0_neon:       3.57   2.07   3.05   4.32   2.74   6.20
    fguv_32x32xn_16bpc_422_csfl1_neon:       3.33   1.44   2.62   3.89   2.71   5.28
    fguv_32x32xn_16bpc_444_csfl0_neon:       3.48   1.69   3.06   4.48   2.97   6.69
    fguv_32x32xn_16bpc_444_csfl1_neon:       3.06   1.16   2.36   3.85   2.75   5.19
    fgy_32x32xn_16bpc_neon:                  2.89   1.05   2.29   3.49   2.49   3.15

    Absolute numbers:
                                        Cortex A7        A8        A9       A53       A72       A73
    fguv_32x32xn_16bpc_420_csfl0_neon:     6237.3   12701.0    6687.1    4525.8    3220.8    3195.4
    fguv_32x32xn_16bpc_420_csfl1_neon:     5143.2   11684.8    5926.4    3857.2    2604.7    2556.5
    fguv_32x32xn_16bpc_422_csfl0_neon:     6347.3   11005.2    6797.5    4582.4    3300.4    3250.5
    fguv_32x32xn_16bpc_422_csfl1_neon:     5275.2   11594.8    5992.6    3931.1    2668.7    2607.3
    fguv_32x32xn_16bpc_444_csfl0_neon:     5181.6   11310.0    5575.4    3629.7    2383.8    2530.0
    fguv_32x32xn_16bpc_444_csfl1_neon:     4081.9   10958.8    4868.5    2962.9    1870.3    2034.2
    fgy_32x32xn_16bpc_neon:               15439.1   43129.0   19406.6   11542.3    7463.9    7827.8

    Corresponding numbers for arm64:
                                       Cortex A53       A72       A73
    fguv_32x32xn_16bpc_420_csfl0_neon:     4019.2    3247.4    3259.6
    fguv_32x32xn_16bpc_420_csfl1_neon:     3460.1    2628.7    2640.8
    fguv_32x32xn_16bpc_422_csfl0_neon:     4034.4    3329.9    3287.5
    fguv_32x32xn_16bpc_422_csfl1_neon:     3468.3    2749.3    2686.6
    fguv_32x32xn_16bpc_444_csfl0_neon:     3117.7    2447.4    2539.8
    fguv_32x32xn_16bpc_444_csfl1_neon:     2641.2    1977.2    2132.8
    fgy_32x32xn_16bpc_neon:                9873.5    7605.7    7656.2
2021-06-11  arm32: filmgrain: Add NEON implementations of fgy and fguv for 8 bpc  Martin Storsjö
    Relative speedup over C code:
                                       Cortex A7     A8     A9    A53    A72    A73
    fguv_32x32xn_8bpc_420_csfl0_neon:       4.20   2.19   3.48   4.93   3.60   5.93
    fguv_32x32xn_8bpc_420_csfl1_neon:       3.92   1.52   2.84   4.34   3.82   5.93
    fguv_32x32xn_8bpc_422_csfl0_neon:       4.27   2.13   3.58   5.02   4.04   5.95
    fguv_32x32xn_8bpc_422_csfl1_neon:       3.99   1.56   2.91   4.43   3.89   6.00
    fguv_32x32xn_8bpc_444_csfl0_neon:       4.48   2.08   3.89   5.66   4.07   6.51
    fguv_32x32xn_8bpc_444_csfl1_neon:       4.45   1.41   2.99   5.28   3.63   6.09
    fgy_32x32xn_8bpc_neon:                  3.61   1.10   2.62   4.35   3.06   3.74

    Absolute numbers:
                                       Cortex A7        A8        A9       A53       A72       A73
    fguv_32x32xn_8bpc_420_csfl0_neon:     5318.8   11167.7    6024.6    3909.9    2945.2    2993.5
    fguv_32x32xn_8bpc_420_csfl1_neon:     4351.0   10929.7    5269.5    3316.8    2166.5    2256.9
    fguv_32x32xn_8bpc_422_csfl0_neon:     5387.9   11746.7    6080.0    3945.8    2988.1    3046.3
    fguv_32x32xn_8bpc_422_csfl1_neon:     4396.0   11083.2    5300.8    3354.9    2216.4    2291.4
    fguv_32x32xn_8bpc_444_csfl0_neon:     4347.9   10595.0    5134.4    3079.1    2277.7    2392.9
    fguv_32x32xn_8bpc_444_csfl1_neon:     3295.0   10518.2    4442.6    2476.3    1716.3    1829.2
    fgy_32x32xn_8bpc_neon:               12376.2   41046.9   17259.7    9153.1    6610.4    7005.3

    Corresponding numbers for arm64:
                                      Cortex A53       A72       A73
    fguv_32x32xn_8bpc_420_csfl0_neon:     3822.9    2920.0    2935.7
    fguv_32x32xn_8bpc_420_csfl1_neon:     3209.7    2231.7    2335.4
    fguv_32x32xn_8bpc_422_csfl0_neon:     3807.9    2886.5    2966.7
    fguv_32x32xn_8bpc_422_csfl1_neon:     3197.1    2187.9    2355.9
    fguv_32x32xn_8bpc_444_csfl0_neon:     2757.8    2227.4    2334.4
    fguv_32x32xn_8bpc_444_csfl1_neon:     2244.6    1719.1    1786.7
    fgy_32x32xn_8bpc_neon:                8192.2    6563.3    6969.1
2021-06-10  arm64: filmgrain16: Add a NEON implementation of fguv_32x32xn for 16 bpc  Martin Storsjö
    Relative speedup over C code:
                                       Cortex A53     A72     A73  Apple M1
    fguv_32x32xn_16bpc_420_csfl0_neon:       4.57    2.08    3.57      7.61
    fguv_32x32xn_16bpc_420_csfl1_neon:       4.92    2.89    3.96      4.26
    fguv_32x32xn_16bpc_422_csfl0_neon:       4.59    2.14    3.61      5.88
    fguv_32x32xn_16bpc_422_csfl1_neon:       4.92    2.90    3.90      5.00
    fguv_32x32xn_16bpc_444_csfl0_neon:       3.64    1.89    2.86      4.72
    fguv_32x32xn_16bpc_444_csfl1_neon:       3.59    2.26    2.76      3.22
2021-06-10  arm64: filmgrain: Back up and restore one register fewer in fguv 8bpc  Martin Storsjö
2021-06-10  arm64: filmgrain: Stray cosmetic fixes  Martin Storsjö
2021-06-10  arm64: filmgrain: Do the right amount of gathers for subsampled fguv  Martin Storsjö
    Previously we did 32 gathers even though only 16 are needed.

    Before:
                                      Cortex A53      A72      A73  Apple M1
    fguv_32x32xn_8bpc_420_csfl0_neon:     5352.1   3985.0   4068.9       8.3
    fguv_32x32xn_8bpc_420_csfl1_neon:     4738.2   3297.8   3633.0       8.2
    fguv_32x32xn_8bpc_422_csfl0_neon:     5386.0   4036.8   4093.5       8.3
    fguv_32x32xn_8bpc_422_csfl1_neon:     4779.9   3392.6   3641.6       8.2
    fguv_32x32xn_8bpc_444_csfl0_neon:     3068.4   2422.0   2436.5       4.9
    fguv_32x32xn_8bpc_444_csfl1_neon:     2558.3   1908.4   1926.6       4.4
    After:
    fguv_32x32xn_8bpc_420_csfl0_neon:     4330.4   3118.5   3224.6       5.3
    fguv_32x32xn_8bpc_420_csfl1_neon:     3731.8   2416.9   2619.6       4.7
    fguv_32x32xn_8bpc_422_csfl0_neon:     4364.7   3129.3   3247.6       5.4
    fguv_32x32xn_8bpc_422_csfl1_neon:     3762.5   2450.2   2661.8       4.7
    fguv_32x32xn_8bpc_444_csfl0_neon:     3075.1   2376.4   2429.4       4.9
    fguv_32x32xn_8bpc_444_csfl1_neon:     2564.5   1865.9   1952.8       4.4
2021-06-05  arm64: filmgrain16: Use sqrdmulh for the scaling*grain multiplication  Martin Storsjö
    Before:
                            Cortex A53      A72      A73  Apple M1
    fgy_32x32xn_16bpc_neon:    10396.8   8150.8   8718.3      19.5
    After:
    fgy_32x32xn_16bpc_neon:     9665.1   7558.8   7652.8      19.5
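The commit message does not say how the operands are arranged, so the following is only one arrangement under which a single sqrdmulh replaces a multiply plus rounding shift. It assumes an 8-bit scaling value, a grain value in [-2048, 2047] and a shift of 8 to 11; all function names are hypothetical scalar models.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right (floor division by 2^n), written portably. */
static int64_t asr64(int64_t x, int n)
{
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Scalar model of SQRDMULH on 16-bit lanes:
 * saturate16((2 * a * b + 2^15) >> 16). */
static int16_t sqrdmulh16(int16_t a, int16_t b)
{
    const int64_t r = asr64(2 * (int64_t)a * b + (1 << 15), 16);
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);
}

/* Reference: multiply, then a separate rounding shift. */
static int32_t round2_mul(int scaling, int grain, int shift)
{
    assert(shift >= 1 && shift <= 15);
    return (int32_t)asr64((int64_t)scaling * grain + (1 << (shift - 1)), shift);
}

/* Same value from one sqrdmulh, with scaling pre-shifted by 15 - shift. */
static int32_t round2_mul_sqrdmulh(int scaling, int grain, int shift)
{
    assert(scaling >= 0 && scaling <= 255 && shift >= 8 && shift <= 11);
    return sqrdmulh16((int16_t)(scaling << (15 - shift)), (int16_t)grain);
}

/* Returns 1 if both forms agree over the whole assumed input range. */
static int forms_agree(void)
{
    for (int shift = 8; shift <= 11; shift++)
        for (int scaling = 0; scaling <= 255; scaling++)
            for (int grain = -2048; grain <= 2047; grain++)
                if (round2_mul(scaling, grain, shift) !=
                    round2_mul_sqrdmulh(scaling, grain, shift))
                    return 0;
    return 1;
}
```

The identity holds because (2 * (s << (15 - shift)) * g + 2^15) >> 16 and (s * g + 2^(shift - 1)) >> shift are floors of the same rational number.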
2021-05-25  arm64: filmgrain: Fix overflows in gen_grain  Martin Storsjö
    After multiplying two int8_t, the maximum possible output is -128*-128 = 16384. One can't add two such values in an int16_t (even if all the products of all other int8_t combinations can be).

    Previously the summing used 16 bit intermediates for the sum of two products and only lengthened the result to 32 bit when accumulating three or more products.

    Before:
                                Cortex A53       A72       A73  Apple M1
    gen_grain_y_ar1_8bpc_neon:    112598.5   71309.2   74889.8     372.2
    gen_grain_y_ar2_8bpc_neon:    139932.4   91442.3   95788.4     387.3
    gen_grain_y_ar3_8bpc_neon:    185607.6  115691.6  131655.8     403.0
    After:
    gen_grain_y_ar1_8bpc_neon:    112968.8   71897.9   76171.2     371.2
    gen_grain_y_ar2_8bpc_neon:    142768.8   94517.9   97934.4     387.5
    gen_grain_y_ar3_8bpc_neon:    191625.2  121083.0  135975.3     405.6
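The overflow case is small enough to check directly. This sketch uses hypothetical helper names and accumulates in 32 bits, which is the widening the fix relies on.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of two int8_t products, accumulated in 32 bits so that the worst
 * case (two products of -128 * -128) is representable. */
static int32_t sum_two_products(int8_t a0, int8_t b0, int8_t a1, int8_t b1)
{
    const int32_t p0 = (int32_t)a0 * b0; /* at most 16384 */
    const int32_t p1 = (int32_t)a1 * b1;
    assert(p0 <= 16384 && p1 <= 16384);
    return p0 + p1;
}

/* Returns 1 if v is representable as int16_t. */
static int fits_in_s16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}
```

Only the all -128 case produces 32768, one past INT16_MAX; every other pair of products, such as 16384 + 127 * 127 = 32513, still fits.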
2021-05-14  arm64: filmgrain16: Simplify constructing the constant 0x0fff  Martin Storsjö
    Use the mvni instruction instead of setting the constant in a GPR first.
2021-05-13  arm64: filmgrain16: Guard against out of range pixels in the gather function  Martin Storsjö
    In 16 bpc, the pixels are 16 bit integers, but valid pixels are only up to 12 bits, and the scaling buffer only contains 4096 elements.

    The src pixels are, normally, supposed to be valid pixels, but when processing blocks of 32 pixels at a time, it can operate on uninitialized pixels past the right edge.

    Before:
                            Cortex A53      A72      A73  Apple M1
    fgy_32x32xn_16bpc_neon:    10372.5   8194.4   8612.1      24.2
    After:
    fgy_32x32xn_16bpc_neon:    10837.9   8469.5   8885.1      24.6
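One way to guard the lookup is to mask the index before using it. The commit message does not say which guard was chosen, so the 0x0fff mask below is an assumption, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SCALING_SIZE = 4096 }; /* 12-bit index range, per the commit message */

/* Look up the scaling value for one pixel. The index is masked to 12 bits
 * (0x0fff) so that a garbage pixel can never read past the table.
 * Assumption: masking is the guard; a clamp would work as well. */
static uint8_t gather_scaling(const uint8_t *scaling, uint16_t px)
{
    assert(scaling != NULL);
    return scaling[px & (SCALING_SIZE - 1)];
}

/* Usage check: an out-of-range pixel maps into the table and an in-range
 * pixel is unaffected. Returns 1 on success. */
static int masked_lookup_in_range(void)
{
    static uint8_t scaling[SCALING_SIZE];
    for (int i = 0; i < SCALING_SIZE; i++)
        scaling[i] = (uint8_t)(i & 0xff);
    return gather_scaling(scaling, 0xffff) == scaling[0x0fff] &&
           gather_scaling(scaling, 0x0123) == scaling[0x0123];
}
```

The value read for an uninitialized pixel is still meaningless; the guard only keeps the read inside the 4096-element buffer.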
2021-05-12  arm64: filmgrain: Add a NEON implementation of fgy_32x32xn for 16 bpc  Martin Storsjö
    Relative speedup over C code:
                            Cortex A53     A72     A73  Apple M1
    fgy_32x32xn_16bpc_neon:       3.87    2.28    2.78      3.45
2021-05-04  arm64: filmgrain: Add NEON implementation of the generate_grain_uv functions  Martin Storsjö
    The existing functions/macros for generate_grain_y are templated for adding in a final coefficient from the y buffer, while trying to keep the binary size down.

    Relative speedup over C code:
                                    Cortex A53     A72     A73  Apple M1
    gen_grain_uv_ar0_8bpc_420_neon:       4.62    4.55    5.27      9.08
    gen_grain_uv_ar0_8bpc_422_neon:       4.81    4.90    5.33      7.25
    gen_grain_uv_ar0_8bpc_444_neon:       5.05    5.17    5.69      7.04
    gen_grain_uv_ar1_8bpc_420_neon:       3.61    3.09    3.68      3.92
    gen_grain_uv_ar1_8bpc_422_neon:       3.71    3.22    3.64      3.46
    gen_grain_uv_ar1_8bpc_444_neon:       3.59    3.40    3.67      3.11
    gen_grain_uv_ar2_8bpc_420_neon:       4.77    3.85    4.81      4.55
    gen_grain_uv_ar2_8bpc_422_neon:       4.88    3.96    4.85      4.15
    gen_grain_uv_ar2_8bpc_444_neon:       5.18    4.65    5.18      3.83
    gen_grain_uv_ar3_8bpc_420_neon:       6.14    5.25    6.14      5.64
    gen_grain_uv_ar3_8bpc_422_neon:       6.27    5.27    6.28      5.42
    gen_grain_uv_ar3_8bpc_444_neon:       6.84    6.40    6.79      5.18
2021-04-27  arm64: filmgrain: Add NEON implementation of the generate_grain_y function  Martin Storsjö
    Relative speedup over C code:
                               Cortex A53     A72     A73  Apple M1
    gen_grain_y_ar0_8bpc_neon:       5.03    5.17    5.59      5.55
    gen_grain_y_ar1_8bpc_neon:       3.38    3.38    3.56      2.77
    gen_grain_y_ar2_8bpc_neon:       5.00    4.64    5.06      3.38
    gen_grain_y_ar3_8bpc_neon:       6.74    6.53    6.67      4.66
2021-04-27  arm64: filmgrain: Add the missing HIGHBD_DECL_SUFFIX for the fguv functions  Martin Storsjö