github.com/videolan/dav1d.git

Age	Commit message	Author
2022-09-19	arm: itx: Add clipping to row_clip_min/max in the 10 bpc codepaths	Martin Storsjö
	This fixes conformance with the argon test samples, in particular with these samples:
	profile0_core/streams/test10100_579_8614.obu
	profile0_core/streams/test10218_6914.obu
	This gives a pretty notable slowdown to these transforms - some examples:
	Before:                                  Cortex A53       A72       A73  Apple M1
	inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        365.7     290.2     299.8       0.3
	inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     1865.2    1384.1    1457.5       2.6
	inv_txfm_add_64x64_dct_dct_4_10bpc_neon:    33976.3   26817.0   24864.2      40.4
	After:
	inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        397.7     322.2     335.1       0.4
	inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     2121.9    1336.7    1664.6       2.6
	inv_txfm_add_64x64_dct_dct_4_10bpc_neon:    38569.4   27622.6   28176.0      51.0
	Thus, for the transforms alone, it makes them around 10-13% slower (the Apple M1 measurements are too noisy to be conclusive here). Measured on actual full decoding, it makes decoding of 10 bpc Chimera around maybe 1% slower on an Apple M1 - close to measurement noise anyway.
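The kind of clipping described above can be sketched in scalar C. This is only an illustration: the helper names are hypothetical, and the assumption that the 10 bpc row range is a signed (bitdepth + 8)-bit value is not stated in the commit message. The actual change is in the NEON assembly.

```c
#include <stdint.h>

/* Hypothetical helper: clamp v to the range of a signed `bits`-bit integer. */
static int32_t clip_signed_bits(int32_t v, int bits)
{
    const int32_t max = (INT32_C(1) << (bits - 1)) - 1;
    const int32_t min = -max - 1;
    return v < min ? min : v > max ? max : v;
}

/* Assumption for illustration: intermediate row values in the 10 bpc
 * codepath are limited to bitdepth + 8 = 18 bits. */
static int32_t clip_row_10bpc(int32_t v)
{
    return clip_signed_bits(v, 10 + 8);
}
```

Under that assumption, in-range values pass through unchanged and out-of-range values are pinned to the nearest bound instead of wrapping.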
2022-09-15	Fix overflow in 8-bit NEON ADST	David Conrad
	In 8-bit ADST, it's possible for the final Round2(x[0], 12) to exceed the range of a 16-bit signed integer.
	Specifically, in 7.13.2.6. Inverse ADST4 process, the precision requirement is: "It is a requirement of bitstream conformance that all values stored in the s and x arrays by this process are representable by a signed integer using r + 12 bits of precision."
	For 8 bits, r is 16 for both row and column, so x[] can be 28-bit signed. For values [134215680, 134217727] (within 2047 of the maximum 28-bit value), the final Round2(x[0], 12) evaluates to 32768, exceeding 16-bits signed.
	So switch to using sqrshrn, which saturates to 16-bits signed.
	This is a continuation of commit b53ff29d80a21180e5ad9bbe39a02541151f4f53 ("arm: itx: Do clipping in all narrowing downshifts").
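The arithmetic above can be checked with a small scalar model. The function names below are illustrative only; `narrow_wrap` models a plain narrowing that keeps the low 16 bits, and `narrow_saturate` models the clamping behaviour that a saturating narrow such as sqrshrn provides.

```c
#include <stdint.h>

/* Round2(x, n): rounding right shift, written as a floor division so it
 * does not depend on how the compiler shifts negative values. */
static int32_t round2(int32_t x, int n)
{
    const int64_t v = (int64_t)x + (INT64_C(1) << (n - 1));
    const int64_t d = INT64_C(1) << n;
    return (int32_t)(v >= 0 ? v / d : -((-v + d - 1) / d));
}

/* Plain narrowing: keep only the low 16 bits, so out-of-range values wrap. */
static int16_t narrow_wrap(int32_t v)
{
    const int32_t low = (int32_t)((uint32_t)v & 0xffffu);
    return (int16_t)(low >= 0x8000 ? low - 0x10000 : low);
}

/* Saturating narrowing: clamp to the int16_t range instead of wrapping. */
static int16_t narrow_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}
```

With the lowest problematic input, 134215680, `round2(x, 12)` is 32768; wrapping turns that into -32768, while saturation yields 32767.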
2022-09-08	x86: Fix rare crash in chroma film grain asm	Henrik Gramner
	The width parameter is used directly as a pointer offset, so ensure that it has an appropriately sized data type. This has been done previously for luma, but chroma was overlooked.
2022-07-14	Enable pointer authentication in assembly when building arm64e	David Conrad

2022-07-11	Don't trash the return stack buffer in the NEON loop filter	David Conrad
	The NEON loop filter's innermost asm function can return to a different location than the address that called it. This messes up the return stack predictor, causing returns to be mispredicted.
	Instead, rework the function to always return to the address that calls it, and return the information needed for the caller to short-circuit storing pixels.
2022-07-06	Eliminate unused C DSP functions at compile time	Henrik Gramner
	When compiling with asm enabled there's no point in compiling C versions of DSP functions that have asm implementations using instruction sets that the compiler can unconditionally use. E.g. when compiling with -mssse3 we can remove the C version of all functions with SSSE3 implementations. This is accomplished using the compiler's dead code elimination functionality. Can be configured using the new 'trim_dsp' meson option, which by default is enabled when compiling in release mode.
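A minimal sketch of how dead code elimination can drop an unused C fallback. The macro `ALWAYS_HAVE_SSSE3`, the `sum_fn` type and all function names are hypothetical and are not dav1d's actual macros or DSP entry points. The "asm" version is a C placeholder so the sketch builds on any target.

```c
#include <stdint.h>

/* Hypothetical compile-time check: is SSSE3 part of the baseline the
 * compiler may use unconditionally (e.g. when building with -mssse3)? */
#if defined(__SSSE3__)
#define ALWAYS_HAVE_SSSE3 1
#else
#define ALWAYS_HAVE_SSSE3 0
#endif

typedef int (*sum_fn)(const uint8_t *src, int n);

/* Portable C version of an illustrative DSP function. */
static int sum_c(const uint8_t *src, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += src[i];
    return s;
}

/* Stand-in for an asm implementation (plain C here, for illustration). */
static int sum_ssse3_placeholder(const uint8_t *src, int n)
{
    int s = 0;
    while (n-- > 0)
        s += *src++;
    return s;
}

static void dsp_init(sum_fn *fn, int cpu_has_ssse3)
{
    /* With ALWAYS_HAVE_SSSE3 == 1 this branch is constant-false, so the
     * only reference to sum_c disappears and the compiler can discard it. */
    if (!ALWAYS_HAVE_SSSE3)
        *fn = sum_c;
    if (ALWAYS_HAVE_SSSE3 || cpu_has_ssse3)
        *fn = sum_ssse3_placeholder;
}
```

The design point is that no `#ifdef` is needed around each C function: a constant condition at the single point of use is enough for the optimizer to remove the function body.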
2022-03-10	arm: Only produce the PAC/BTI .note section when targeting ELF	Martin Storsjö
	This avoids build errors if such features are enabled while targeting another binary format. (Using such features on other platforms might require some other form of signaling/setup though, but the ELF specific .note section isn't applicable at least.)
2022-03-10	arm: Add comments to #endif and #else in nonobvious cases	Martin Storsjö

2022-03-02	arm: itx: Do clipping in all narrowing downshifts	Martin Storsjö
	This should avoid the risk of unexpected wraparound. This shouldn't technically be needed for spec compliant bitstreams. In practice, this fixes the mismatch observed in issue !388 (in checkasm generated input data).
2022-02-28	build: Make "film_grain" vs "filmgrain" DSP file names consistent	Henrik Gramner

2022-02-09	arm64: Add Armv8.3-A PAC support to assembly files	André Kempe
	This patch adds optional support for Arm Pointer Authentication Codes. PAC support is turned on or off at compile time using additional compiler flags. Unless any of these is enabled explicitly, no additional code will be emitted at all.
2022-01-13	arm32: mc16: Fix out of bounds reads/writes in 8tap/bilin w2/w4 for vertical OBMC	Martin Storsjö
	Before:                           Cortex A7        A8        A9       A53       A72       A73
	mc_8tap_regular_w2_v_16bpc_neon:      384.4     194.0     242.9     193.2     134.1     140.0
	mc_8tap_regular_w4_v_16bpc_neon:      578.2     242.2     282.7     263.1     171.2     168.9
	After:
	mc_8tap_regular_w2_v_16bpc_neon:      397.1     207.7     250.6     212.9     136.9     140.8
	mc_8tap_regular_w4_v_16bpc_neon:      575.2     240.4     277.9     263.0     171.9     167.4
2022-01-13	arm32: mc: Fix out of bounds reads/writes in 8tap/bilin w2/w4 for vertical OBMC	Martin Storsjö
	For 8tap, unroll the vertical filters slightly less (by 4 instead of 8 elements) and add a special case trailer that handles only 2 elements (for 2x6 and 4x6). By unrolling less, performance on in-order cores is somewhat impacted.
	Before:                          Cortex A7        A8        A9       A53       A72       A73
	mc_8tap_regular_w2_v_8bpc_neon:      340.0     305.4     336.5     196.5     160.5     167.8
	mc_8tap_regular_w4_v_8bpc_neon:      400.4     319.5     391.5     210.3     189.7     188.8
	After:
	mc_8tap_regular_w2_v_8bpc_neon:      364.6     268.5     340.1     223.7     161.7     175.2
	mc_8tap_regular_w4_v_8bpc_neon:      408.7     328.4     380.4     219.8     190.7     183.8
2022-01-13	arm64: mc16: Fix out of bounds reads/writes in 8tap/bilin w2/w4 for vertical OBMC	Martin Storsjö
	Before:                          Cortex A53       A72       A73
	mc_8tap_regular_w2_v_16bpc_neon:      164.0     125.3     122.6
	mc_8tap_regular_w4_v_16bpc_neon:      232.5     164.0     166.6
	After:
	mc_8tap_regular_w2_v_16bpc_neon:      192.4     131.0     121.4
	mc_8tap_regular_w4_v_16bpc_neon:      235.6     162.9     163.7
2022-01-13	arm64: mc: Fix out of bounds reads/writes in 8tap/bilin w2/w4 for vertical OBMC	Martin Storsjö
	For 8tap, unroll the vertical filters slightly less (by 4 instead of 8 elements) and add a special case trailer that handles only 2 elements (for 2x6 and 4x6). By unrolling less, performance on in-order cores is somewhat impacted.
	Before:                         Cortex A53       A72       A73
	mc_8tap_regular_w2_v_8bpc_neon:      146.5     141.3     145.6
	mc_8tap_regular_w4_v_8bpc_neon:      175.2     180.3     162.4
	After:
	mc_8tap_regular_w2_v_8bpc_neon:      175.7     142.7     150.5
	mc_8tap_regular_w4_v_8bpc_neon:      183.3     176.0     154.6
2021-12-03	AArch64 Neon: Replace XTN, XTN2 pairs with single UZP1	Jonathan Wright
	It is often necessary to narrow the elements in a pair of Neon vectors to half the current width, before combining the results. This is usually achieved with a pair of XTN/XTN2 instructions. However, it is possible to achieve the same outcome with a single 'unzip' (UZP1) instruction.
	This patch changes all sequential AArch64 Neon XTN, XTN2 instruction pairs to use a single UZP1 instruction.
	Change-Id: I2a9fad3082d2cf363b1edce9ef0b8d547ec6c41a
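The equivalence can be illustrated with a scalar model of the two instruction sequences. This is a sketch assuming the usual little-endian lane layout, where the low byte of 16-bit lane i sits at byte index 2*i; the `vec128` type and function names are made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Byte view of a 128-bit vector register. */
typedef struct { uint8_t b[16]; } vec128;

/* Store eight 16-bit lanes with the low byte of lane i at index 2*i. */
static vec128 from_u16(const uint16_t h[8])
{
    vec128 v;
    for (int i = 0; i < 8; i++) {
        v.b[2 * i]     = (uint8_t)(h[i] & 0xff);
        v.b[2 * i + 1] = (uint8_t)(h[i] >> 8);
    }
    return v;
}

/* Model of XTN (low half from a) followed by XTN2 (high half from b):
 * each 16-bit lane is truncated to its low 8 bits. */
static vec128 xtn_xtn2(const uint16_t a[8], const uint16_t b[8])
{
    vec128 d;
    for (int i = 0; i < 8; i++) {
        d.b[i]     = (uint8_t)(a[i] & 0xff);
        d.b[8 + i] = (uint8_t)(b[i] & 0xff);
    }
    return d;
}

/* Model of UZP1 on bytes: take the even-indexed bytes of n, then of m. */
static vec128 uzp1_16b(vec128 n, vec128 m)
{
    vec128 d;
    for (int i = 0; i < 8; i++) {
        d.b[i]     = n.b[2 * i];
        d.b[8 + i] = m.b[2 * i];
    }
    return d;
}

/* Returns 1 when both sequences produce the same 16 bytes. */
static int narrow_pair_matches(const uint16_t a[8], const uint16_t b[8])
{
    const vec128 x = xtn_xtn2(a, b);
    const vec128 u = uzp1_16b(from_u16(a), from_u16(b));
    return memcmp(x.b, u.b, sizeof(x.b)) == 0;
}
```

The even-indexed bytes of a vector of 16-bit lanes are exactly the low bytes of those lanes, which is why one UZP1 over both inputs reproduces the XTN/XTN2 result.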
2021-12-03	AArch64 Neon: Use CMLT instead of SSHR to compute sign	Jonathan Wright
	The CMLT instruction has twice the throughput of SSHR on all modern out-of-order Arm cores. The Software Optimization Guides (SWOG) for the Cortex-A76, Cortex-A77 and Neoverse-N1 cores are being updated to reflect this. (The current version of the SWOG for these cores states that CMLT and SSHR both have the same execution throughput.)
	This patch changes all instances of sign computation to use CMLT instead of SSHR.
	Change-Id: Ice5747fee4e3bdd98ae8fbc036d735f55e492249
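Both instructions yield the same sign mask, which a scalar model over every 16-bit value can confirm. The function names are illustrative; `sign_sshr` models an arithmetic shift right by 15 and `sign_cmlt` models a compare-less-than-zero that sets all bits on true.

```c
#include <stdint.h>

/* Arithmetic shift right by 15, written without relying on how the
 * compiler shifts negative values: result is -1 for negative x, else 0. */
static int16_t sign_sshr(int16_t x)
{
    const int v = x;
    return (int16_t)(v < 0 ? ~((~v) >> 15) : v >> 15);
}

/* Compare "less than zero": all ones (-1) when true, 0 otherwise. */
static int16_t sign_cmlt(int16_t x)
{
    return (int16_t)(x < 0 ? -1 : 0);
}

/* Exhaustively compare both models over the int16_t range. */
static int sign_methods_agree(void)
{
    for (int v = INT16_MIN; v <= INT16_MAX; v++)
        if (sign_sshr((int16_t)v) != sign_cmlt((int16_t)v))
            return 0;
    return 1;
}
```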
2021-10-29	Remove lpf_stride parameter from LR filters	Victorien Le Couviour--Tuffet

2021-10-29	Allow CDEF and LR to run sbrows in parallel	Victorien Le Couviour--Tuffet

2021-10-27	arm64: Add Armv8.5-A BTI support to assembly files	Salome Thirot
	Add Branch Target Identifiers (BTIs) to all functions defined in AArch64 assembly files. BTI support is turned on or off at compile time based on the presence of the __ARM_FEATURE_BTI_DEFAULT feature macro.
	A binary compiled with BTI support can be executed on an Armv8-A processor without BTI support because the instructions are defined in NOP space.
	Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
	Signed-off-by: Matthew Dalzell <matthew.dalzell@arm.com>
	Signed-off-by: Salome Thirot <salome.thirot@arm.com>
2021-10-27	arm64: Change br instructions to ret for function returns	Salome Thirot
	Using ret x<n> instead of br x<n> removes the need for a BTI landing pad at the target address in x<n>. Using 'ret' instead of 'br' does not have any performance implications.
	Signed-off-by: Jonathan Wright <jonathan.wright@arm.com>
	Signed-off-by: Matthew Dalzell <matthew.dalzell@arm.com>
	Signed-off-by: Salome Thirot <salome.thirot@arm.com>
2021-09-03	arm32: filmgrain: Add NEON implementation of gen_grain for 16 bpc	Martin Storsjö
	Relative speedup over C code:     Cortex A7        A8        A9       A53       A72       A73
	gen_grain_uv_ar0_16bpc_420_neon:       5.05      6.71      5.42      4.95      6.45      9.59
	gen_grain_uv_ar0_16bpc_422_neon:       5.54      7.18      6.29      5.45      6.55      8.80
	gen_grain_uv_ar0_16bpc_444_neon:       6.64      8.07      6.70      6.89      7.16      9.98
	gen_grain_uv_ar1_16bpc_420_neon:       3.22      2.16      2.58      3.51      3.16      4.68
	gen_grain_uv_ar1_16bpc_422_neon:       3.24      2.26      2.73      3.83      3.36      4.65
	gen_grain_uv_ar1_16bpc_444_neon:       3.48      2.41      2.85      4.32      3.69      4.90
	gen_grain_uv_ar2_16bpc_420_neon:       3.29      2.90      2.92      4.14      3.48      4.59
	gen_grain_uv_ar2_16bpc_422_neon:       3.35      3.01      3.13      4.50      3.61      4.50
	gen_grain_uv_ar2_16bpc_444_neon:       3.66      3.55      3.32      5.15      3.87      4.93
	gen_grain_uv_ar3_16bpc_420_neon:       3.39      3.79      3.60      4.67      4.04      4.70
	gen_grain_uv_ar3_16bpc_422_neon:       3.39      4.04      3.96      4.93      4.16      4.65
	gen_grain_uv_ar3_16bpc_444_neon:       3.79      4.47      4.36      5.54      4.59      5.07
	gen_grain_y_ar0_16bpc_neon:            5.05      5.26      6.97      5.47      5.95      8.59
	gen_grain_y_ar1_16bpc_neon:            2.35      1.72      2.07      3.53      3.16      3.47
	gen_grain_y_ar2_16bpc_neon:            3.02      2.70      2.88      4.19      3.57      4.03
	gen_grain_y_ar3_16bpc_neon:            3.49      3.18      3.69      5.01      3.99      4.50
2021-09-03	arm64: filmgrain16: Remove a leftover unused macro	Martin Storsjö

2021-09-03	arm64: filmgrain16: Fix the default elems parameter of sum_lag2/3_func	Martin Storsjö
	This makes it correctly hit some conditions that avoid duplicated code, shrinking the text section by 1524 bytes.
2021-09-01	arm32: filmgrain: Add NEON implementation of gen_grain for 8 bpc	Martin Storsjö
	Relative speedup over C code:    Cortex A7        A8        A9       A53       A72       A73
	gen_grain_uv_ar0_8bpc_420_neon:       6.13      7.81      8.17      6.78      6.62     11.13
	gen_grain_uv_ar0_8bpc_422_neon:       6.34      7.64      8.00      6.83      6.93     10.31
	gen_grain_uv_ar0_8bpc_444_neon:       7.09      8.29      8.55      7.95      7.89     11.05
	gen_grain_uv_ar1_8bpc_420_neon:       3.39      2.26      3.06      4.13      3.41      4.95
	gen_grain_uv_ar1_8bpc_422_neon:       3.40      2.23      3.02      4.18      3.36      4.73
	gen_grain_uv_ar1_8bpc_444_neon:       3.46      2.18      2.95      4.46      3.57      4.91
	gen_grain_uv_ar2_8bpc_420_neon:       3.88      3.00      3.32      4.74      3.57      5.31
	gen_grain_uv_ar2_8bpc_422_neon:       3.92      3.04      3.36      4.82      3.57      5.06
	gen_grain_uv_ar2_8bpc_444_neon:       4.32      3.14      3.62      5.56      3.90      5.43
	gen_grain_uv_ar3_8bpc_420_neon:       4.35      3.53      4.05      5.35      4.44      5.56
	gen_grain_uv_ar3_8bpc_422_neon:       4.38      3.49      4.17      5.41      4.48      5.36
	gen_grain_uv_ar3_8bpc_444_neon:       4.84      3.70      4.36      5.95      4.87      5.82
	gen_grain_y_ar0_8bpc_neon:            5.18      5.57      7.65      5.93      7.13      9.01
	gen_grain_y_ar1_8bpc_neon:            2.64      1.66      2.48      3.32      3.15      3.77
	gen_grain_y_ar2_8bpc_neon:            3.57      2.64      3.21      4.59      3.68      4.64
	gen_grain_y_ar3_8bpc_neon:            4.27      3.93      4.12      5.41      4.63      5.17
	(A73 is benched against C code compiled with a different C compiler, which can explain the slightly differing numbers there.)
	Absolute numbers:                Cortex A7        A8        A9       A53       A72       A73
	gen_grain_uv_ar0_8bpc_420_neon:    19614.6   13396.4   12320.4   15030.7    8288.1    8754.4
	gen_grain_uv_ar0_8bpc_422_neon:    34660.9   24315.5   22225.3   26809.2   14549.8   15804.6
	gen_grain_uv_ar0_8bpc_444_neon:    55625.6   39914.5   37100.2   44658.3   22917.3   27369.6
	gen_grain_uv_ar1_8bpc_420_neon:    50049.5   63179.4   44793.1   36406.7   22690.3   25401.9
	gen_grain_uv_ar1_8bpc_422_neon:    93289.5  117755.0   82815.4   67081.4   43133.1   46698.0
	gen_grain_uv_ar1_8bpc_444_neon:   170880.0  223259.2  156241.5  122760.0   78655.6   85604.9
	gen_grain_uv_ar2_8bpc_420_neon:    68185.5   78123.2   61457.3   47886.7   31526.2   36519.6
	gen_grain_uv_ar2_8bpc_422_neon:   129195.2  148653.9  114133.2   89822.7   60242.6   70160.1
	gen_grain_uv_ar2_8bpc_444_neon:   233133.7  272277.4  214108.7  161589.5  109069.3  127763.7
	gen_grain_uv_ar3_8bpc_420_neon:    96374.4   94372.2   79663.8   70832.0   43065.3   50593.9
	gen_grain_uv_ar3_8bpc_422_neon:   186324.8  184321.8  151490.1  136200.1   83758.0   98378.7
	gen_grain_uv_ar3_8bpc_444_neon:   335596.6  336811.6  279755.5  247251.5  151657.2  178906.0
	gen_grain_y_ar0_8bpc_neon:         46109.3   36022.2   28476.2   36478.5   18740.1   20660.4
	gen_grain_y_ar1_8bpc_neon:        165054.2  217090.4  152578.9  118409.4   74357.2   83794.5
	gen_grain_y_ar2_8bpc_neon:        226576.9  268320.3  210924.6  157829.4  105956.5  124293.2
	gen_grain_y_ar3_8bpc_neon:        328337.2  330421.3  275110.1  242097.3  148538.7  177270.8
	Corresponding numbers for the original arm64 version:
	                                Cortex A53       A72       A73
	gen_grain_uv_ar0_8bpc_420_neon:    14874.7    7765.5    8536.0
	gen_grain_uv_ar0_8bpc_422_neon:    26510.9   13685.3   15308.2
	gen_grain_uv_ar0_8bpc_444_neon:    43189.6   21565.3   24312.0
	gen_grain_uv_ar1_8bpc_420_neon:    33715.7   21669.8   22758.3
	gen_grain_uv_ar1_8bpc_422_neon:    63955.3   41581.4   42852.5
	gen_grain_uv_ar1_8bpc_444_neon:   117390.1   76503.5   78446.4
	gen_grain_uv_ar2_8bpc_420_neon:    42779.0   27794.3   29677.9
	gen_grain_uv_ar2_8bpc_422_neon:    82283.8   53446.7   58232.2
	gen_grain_uv_ar2_8bpc_444_neon:   147773.8   98492.7  103754.1
	gen_grain_uv_ar3_8bpc_420_neon:    56698.8   35697.1   40695.9
	gen_grain_uv_ar3_8bpc_422_neon:   110132.4   69829.1   79196.8
	gen_grain_uv_ar3_8bpc_444_neon:   196642.7  124174.9  141812.5
	gen_grain_y_ar0_8bpc_neon:         36461.0   17782.0   19827.0
	gen_grain_y_ar1_8bpc_neon:        113202.7   72457.7   75995.8
	gen_grain_y_ar2_8bpc_neon:        142894.0   94450.9  100304.5
	gen_grain_y_ar3_8bpc_neon:        191697.7  120674.9  137223.8
2021-09-01	arm64: filmgrain: Remove some unnecessary backups/restores of x30	Martin Storsjö

2021-09-01	arm64: filmgrain: Simplify loading coefficients for the lag3 variant	Martin Storsjö

2021-09-01	arm64: filmgrain: Reorder two instructions in the inner loop	Martin Storsjö
	This should improve scheduling on in-order cores.
2021-08-24	arm: Add NEON implementations of splat_mv	Martin Storsjö
	Relative speedup over C code, for arm64:
	                   Cortex A53       A72       A73  Apple M1
	splat_mv_w1_neon:        1.09      0.95      1.22         -
	splat_mv_w2_neon:        1.76      1.32      1.74         -
	splat_mv_w4_neon:        2.78      2.19      2.19     15.00
	splat_mv_w8_neon:        3.59      2.06      2.59     12.00
	splat_mv_w16_neon:       4.12      1.72      2.53      3.14
	splat_mv_w32_neon:       4.07      1.60      2.40      3.00
	(The resolution of the timer used on Apple M1 isn't enough to measure the small versions of this function.)
	Relative speedup over C code, for arm32:
	                    Cortex A7        A8        A9       A53       A72       A73
	splat_mv_w1_neon:        0.70      1.12      0.91      0.65      1.01      1.06
	splat_mv_w2_neon:        0.94      2.16      2.01      0.99      2.52      1.63
	splat_mv_w4_neon:        1.27      2.04      1.49      1.52      1.75      2.18
	splat_mv_w8_neon:        1.75      2.47      1.16      2.88      1.95      2.58
	splat_mv_w16_neon:       2.00      2.44      1.12      3.25      1.85      2.65
	splat_mv_w32_neon:       1.43      2.28      1.19      3.55      1.77      2.65
2021-08-13	arm64: filmgrain16: Add NEON implementation of gen_grain for 16 bpc	Martin Storsjö
	Relative speedup over C code:    Cortex A53       A72       A73  Apple M1
	gen_grain_uv_ar0_16bpc_420_neon:       2.90      4.13      5.43      5.80
	gen_grain_uv_ar0_16bpc_422_neon:       3.23      4.51      5.52      5.83
	gen_grain_uv_ar0_16bpc_444_neon:       4.01      4.97      6.08      5.87
	gen_grain_uv_ar1_16bpc_420_neon:       2.94      2.80      3.56      3.48
	gen_grain_uv_ar1_16bpc_422_neon:       3.14      3.07      3.68      3.47
	gen_grain_uv_ar1_16bpc_444_neon:       3.54      3.51      3.93      2.61
	gen_grain_uv_ar2_16bpc_420_neon:       3.92      3.69      4.40      3.98
	gen_grain_uv_ar2_16bpc_422_neon:       4.13      3.96      4.42      3.92
	gen_grain_uv_ar2_16bpc_444_neon:       4.69      4.33      4.84      3.25
	gen_grain_uv_ar3_16bpc_420_neon:       5.05      5.39      5.42      4.74
	gen_grain_uv_ar3_16bpc_422_neon:       5.25      5.68      5.57      4.67
	gen_grain_uv_ar3_16bpc_444_neon:       6.02      6.33      6.35      4.38
	gen_grain_y_ar0_16bpc_neon:            4.67      5.23      5.22     10.11
	gen_grain_y_ar1_16bpc_neon:            3.32      3.03      3.28      2.24
	gen_grain_y_ar2_16bpc_neon:            4.59      3.95      4.64      3.52
	gen_grain_y_ar3_16bpc_neon:            5.89      5.93      6.36      4.79
	Absolute numbers:                Cortex A53       A72       A73  Apple M1
	gen_grain_uv_ar0_16bpc_420_neon:    19797.2    9725.0    9234.0      29.7
	gen_grain_uv_ar0_16bpc_422_neon:    34899.4   16875.3   17021.6      57.7
	gen_grain_uv_ar0_16bpc_444_neon:    53776.6   28470.1   28773.1     107.8
	gen_grain_uv_ar1_16bpc_420_neon:    37998.2   24631.2   24754.0      84.2
	gen_grain_uv_ar1_16bpc_422_neon:    70817.5   44642.5   46323.1     166.3
	gen_grain_uv_ar1_16bpc_444_neon:   123333.0   77316.4   83523.1     427.5
	gen_grain_uv_ar2_16bpc_420_neon:    49115.8   33053.7   33249.9      93.6
	gen_grain_uv_ar2_16bpc_422_neon:    92965.3   59663.8   64741.9     187.9
	gen_grain_uv_ar2_16bpc_444_neon:   160899.7  108845.6  115422.4     441.8
	gen_grain_uv_ar3_16bpc_420_neon:    65786.6   41924.3   45562.1     108.1
	gen_grain_uv_ar3_16bpc_422_neon:   126232.3   78691.6   87351.5     217.6
	gen_grain_uv_ar3_16bpc_444_neon:   218702.6  140197.8  151294.8     454.3
	gen_grain_y_ar0_16bpc_neon:         35867.9   17653.6   20770.7     108.0
	gen_grain_y_ar1_16bpc_neon:        118781.8   74777.1   81338.6     426.0
	gen_grain_y_ar2_16bpc_neon:        155919.9  102145.8  109698.1     438.5
	gen_grain_y_ar3_16bpc_neon:        213348.1  133054.8  144726.0     447.9
	Corresponding numbers for 8bpc: Cortex A53       A72       A73  Apple M1
	gen_grain_uv_ar0_8bpc_420_neon:    15086.1    8384.7    8556.6      29.4
	gen_grain_uv_ar0_8bpc_422_neon:    26800.6   14354.4   15526.5      56.6
	gen_grain_uv_ar0_8bpc_444_neon:    43749.6   22408.6   24627.9     108.3
	gen_grain_uv_ar1_8bpc_420_neon:    33706.3   21892.6   22835.9      87.1
	gen_grain_uv_ar1_8bpc_422_neon:    63897.0   41820.1   43468.9     171.8
	gen_grain_uv_ar1_8bpc_444_neon:   117345.1   76372.5   79938.3     370.0
	gen_grain_uv_ar2_8bpc_420_neon:    42808.8   28493.8   29932.8      92.2
	gen_grain_uv_ar2_8bpc_422_neon:    82282.5   53969.4   58191.1     181.8
	gen_grain_uv_ar2_8bpc_444_neon:   147641.4   98136.4  103157.6     430.2
	gen_grain_uv_ar3_8bpc_420_neon:    56784.3   36342.0   40812.3     102.2
	gen_grain_uv_ar3_8bpc_422_neon:   110249.7   70215.6   79716.0     200.5
	gen_grain_uv_ar3_8bpc_444_neon:   196461.7  125802.8  141781.5     440.1
	gen_grain_y_ar0_8bpc_neon:         36451.7   17794.4   19839.3     109.5
	gen_grain_y_ar1_8bpc_neon:        113155.6   71811.9   77296.8     370.2
	gen_grain_y_ar2_8bpc_neon:        142812.3   95042.4  100434.4     431.8
	gen_grain_y_ar3_8bpc_neon:        191608.6  121199.5  136946.4     437.2
2021-08-13	arm64: filmgrain: Deduplicate the sum_lagN functions	Martin Storsjö
	No difference in generated code, but >210 lines less of duplicated source code.
2021-08-13	arm64: filmgrain: Deduplicate the output_lag functions	Martin Storsjö
	No practical difference in generated code (or the size of it), but less source code to handle.
2021-08-13	arm64: filmgrain: Remove two stray ret instructions	Martin Storsjö
	These are never executed as they come after an unconditional branch.
2021-08-13	arm64: filmgrain: Uninline the get_grain_2 macro	Martin Storsjö
	This shrinks the code section by 288 bytes.
2021-08-13	arm64: filmgrain: Fix some cases of vertical whitespace alignment	Martin Storsjö

2021-08-13	arm64: filmgrain: Fix some comments in gen_grain	Martin Storsjö

2021-06-12	arm32: filmgrain: Add NEON implementation of fgy and fguv for 16 bpc	Martin Storsjö
	Relative speedup over C code:       Cortex A7        A8        A9       A53       A72       A73
	fguv_32x32xn_16bpc_420_csfl0_neon:       3.47      1.72      2.99      4.18      2.68      6.19
	fguv_32x32xn_16bpc_420_csfl1_neon:       3.24      1.36      2.58      3.78      2.73      5.27
	fguv_32x32xn_16bpc_422_csfl0_neon:       3.57      2.07      3.05      4.32      2.74      6.20
	fguv_32x32xn_16bpc_422_csfl1_neon:       3.33      1.44      2.62      3.89      2.71      5.28
	fguv_32x32xn_16bpc_444_csfl0_neon:       3.48      1.69      3.06      4.48      2.97      6.69
	fguv_32x32xn_16bpc_444_csfl1_neon:       3.06      1.16      2.36      3.85      2.75      5.19
	fgy_32x32xn_16bpc_neon:                  2.89      1.05      2.29      3.49      2.49      3.15
	Absolute numbers:                   Cortex A7        A8        A9       A53       A72       A73
	fguv_32x32xn_16bpc_420_csfl0_neon:     6237.3   12701.0    6687.1    4525.8    3220.8    3195.4
	fguv_32x32xn_16bpc_420_csfl1_neon:     5143.2   11684.8    5926.4    3857.2    2604.7    2556.5
	fguv_32x32xn_16bpc_422_csfl0_neon:     6347.3   11005.2    6797.5    4582.4    3300.4    3250.5
	fguv_32x32xn_16bpc_422_csfl1_neon:     5275.2   11594.8    5992.6    3931.1    2668.7    2607.3
	fguv_32x32xn_16bpc_444_csfl0_neon:     5181.6   11310.0    5575.4    3629.7    2383.8    2530.0
	fguv_32x32xn_16bpc_444_csfl1_neon:     4081.9   10958.8    4868.5    2962.9    1870.3    2034.2
	fgy_32x32xn_16bpc_neon:               15439.1   43129.0   19406.6   11542.3    7463.9    7827.8
	Corresponding numbers for arm64:   Cortex A53       A72       A73
	fguv_32x32xn_16bpc_420_csfl0_neon:     4019.2    3247.4    3259.6
	fguv_32x32xn_16bpc_420_csfl1_neon:     3460.1    2628.7    2640.8
	fguv_32x32xn_16bpc_422_csfl0_neon:     4034.4    3329.9    3287.5
	fguv_32x32xn_16bpc_422_csfl1_neon:     3468.3    2749.3    2686.6
	fguv_32x32xn_16bpc_444_csfl0_neon:     3117.7    2447.4    2539.8
	fguv_32x32xn_16bpc_444_csfl1_neon:     2641.2    1977.2    2132.8
	fgy_32x32xn_16bpc_neon:                9873.5    7605.7    7656.2
2021-06-11	arm32: filmgrain: Add NEON implementations of fgy and fguv for 8 bpc	Martin Storsjö
	Relative speedup over C code:      Cortex A7        A8        A9       A53       A72       A73
	fguv_32x32xn_8bpc_420_csfl0_neon:       4.20      2.19      3.48      4.93      3.60      5.93
	fguv_32x32xn_8bpc_420_csfl1_neon:       3.92      1.52      2.84      4.34      3.82      5.93
	fguv_32x32xn_8bpc_422_csfl0_neon:       4.27      2.13      3.58      5.02      4.04      5.95
	fguv_32x32xn_8bpc_422_csfl1_neon:       3.99      1.56      2.91      4.43      3.89      6.00
	fguv_32x32xn_8bpc_444_csfl0_neon:       4.48      2.08      3.89      5.66      4.07      6.51
	fguv_32x32xn_8bpc_444_csfl1_neon:       4.45      1.41      2.99      5.28      3.63      6.09
	fgy_32x32xn_8bpc_neon:                  3.61      1.10      2.62      4.35      3.06      3.74
	Absolute numbers:                  Cortex A7        A8        A9       A53       A72       A73
	fguv_32x32xn_8bpc_420_csfl0_neon:     5318.8   11167.7    6024.6    3909.9    2945.2    2993.5
	fguv_32x32xn_8bpc_420_csfl1_neon:     4351.0   10929.7    5269.5    3316.8    2166.5    2256.9
	fguv_32x32xn_8bpc_422_csfl0_neon:     5387.9   11746.7    6080.0    3945.8    2988.1    3046.3
	fguv_32x32xn_8bpc_422_csfl1_neon:     4396.0   11083.2    5300.8    3354.9    2216.4    2291.4
	fguv_32x32xn_8bpc_444_csfl0_neon:     4347.9   10595.0    5134.4    3079.1    2277.7    2392.9
	fguv_32x32xn_8bpc_444_csfl1_neon:     3295.0   10518.2    4442.6    2476.3    1716.3    1829.2
	fgy_32x32xn_8bpc_neon:               12376.2   41046.9   17259.7    9153.1    6610.4    7005.3
	Corresponding numbers for arm64:  Cortex A53       A72       A73
	fguv_32x32xn_8bpc_420_csfl0_neon:     3822.9    2920.0    2935.7
	fguv_32x32xn_8bpc_420_csfl1_neon:     3209.7    2231.7    2335.4
	fguv_32x32xn_8bpc_422_csfl0_neon:     3807.9    2886.5    2966.7
	fguv_32x32xn_8bpc_422_csfl1_neon:     3197.1    2187.9    2355.9
	fguv_32x32xn_8bpc_444_csfl0_neon:     2757.8    2227.4    2334.4
	fguv_32x32xn_8bpc_444_csfl1_neon:     2244.6    1719.1    1786.7
	fgy_32x32xn_8bpc_neon:                8192.2    6563.3    6969.1
2021-06-10	arm64: filmgrain16: Add a NEON implementation of fguv_32x32xn for 16 bpc	Martin Storsjö
	Relative speedup over C code:      Cortex A53       A72       A73  Apple M1
	fguv_32x32xn_16bpc_420_csfl0_neon:       4.57      2.08      3.57      7.61
	fguv_32x32xn_16bpc_420_csfl1_neon:       4.92      2.89      3.96      4.26
	fguv_32x32xn_16bpc_422_csfl0_neon:       4.59      2.14      3.61      5.88
	fguv_32x32xn_16bpc_422_csfl1_neon:       4.92      2.90      3.90      5.00
	fguv_32x32xn_16bpc_444_csfl0_neon:       3.64      1.89      2.86      4.72
	fguv_32x32xn_16bpc_444_csfl1_neon:       3.59      2.26      2.76      3.22
2021-06-10	arm64: filmgrain: Back up and restore one register fewer in fguv 8bpc	Martin Storsjö

2021-06-10	arm64: filmgrain: Stray cosmetic fixes	Martin Storsjö

2021-06-10	arm64: filmgrain: Do the right amount of gathers for subsampled fguv	Martin Storsjö
	Previously we did 32 gathers even though only 16 are needed.
	Before:                           Cortex A53       A72       A73  Apple M1
	fguv_32x32xn_8bpc_420_csfl0_neon:     5352.1    3985.0    4068.9       8.3
	fguv_32x32xn_8bpc_420_csfl1_neon:     4738.2    3297.8    3633.0       8.2
	fguv_32x32xn_8bpc_422_csfl0_neon:     5386.0    4036.8    4093.5       8.3
	fguv_32x32xn_8bpc_422_csfl1_neon:     4779.9    3392.6    3641.6       8.2
	fguv_32x32xn_8bpc_444_csfl0_neon:     3068.4    2422.0    2436.5       4.9
	fguv_32x32xn_8bpc_444_csfl1_neon:     2558.3    1908.4    1926.6       4.4
	After:
	fguv_32x32xn_8bpc_420_csfl0_neon:     4330.4    3118.5    3224.6       5.3
	fguv_32x32xn_8bpc_420_csfl1_neon:     3731.8    2416.9    2619.6       4.7
	fguv_32x32xn_8bpc_422_csfl0_neon:     4364.7    3129.3    3247.6       5.4
	fguv_32x32xn_8bpc_422_csfl1_neon:     3762.5    2450.2    2661.8       4.7
	fguv_32x32xn_8bpc_444_csfl0_neon:     3075.1    2376.4    2429.4       4.9
	fguv_32x32xn_8bpc_444_csfl1_neon:     2564.5    1865.9    1952.8       4.4
2021-06-05	arm64: filmgrain16: Use sqrdmulh for the scaling*grain multiplication	Martin Storsjö
	Before:                 Cortex A53       A72       A73  Apple M1
	fgy_32x32xn_16bpc_neon:    10396.8    8150.8    8718.3      19.5
	After:
	fgy_32x32xn_16bpc_neon:     9665.1    7558.8    7652.8      19.5
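One way a rounded `scaling * grain >> shift` can map onto sqrdmulh is sketched below as a scalar model. Everything here is an assumption for illustration rather than the commit's actual code: an 8-bit scaling value, a shift in the range 8..11, grain values in a signed 12-bit range, and the trick of pre-multiplying scaling by 2^(15 - shift) so that the fixed shift built into sqrdmulh produces the desired rounding.

```c
#include <stdint.h>

/* Floor division by 2^n without relying on signed right shifts. */
static int64_t floor_shr(int64_t v, int n)
{
    const int64_t d = INT64_C(1) << n;
    return v >= 0 ? v / d : -((-v + d - 1) / d);
}

/* Scalar model of one 16-bit SQRDMULH lane:
 * saturate((2 * a * b + 0x8000) >> 16). */
static int16_t sqrdmulh16(int16_t a, int16_t b)
{
    const int64_t r = floor_shr(2 * (int64_t)a * b + 0x8000, 16);
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r < INT16_MIN ? INT16_MIN : r);
}

/* Reference: rounding right shift of the full product. */
static int32_t scale_ref(int scaling, int grain, int shift)
{
    const int64_t prod = (int64_t)scaling * grain;
    return (int32_t)floor_shr(prod + (INT64_C(1) << (shift - 1)), shift);
}

/* Same value via the SQRDMULH model, with scaling pre-multiplied by
 * 2^(15 - shift); 255 << 7 = 32640 still fits in int16_t. */
static int32_t scale_sqrdmulh(int scaling, int grain, int shift)
{
    const int16_t pre = (int16_t)(scaling * (1 << (15 - shift)));
    return sqrdmulh16(pre, (int16_t)grain);
}

/* Exhaustively compare both forms over the assumed input ranges. */
static int scale_methods_agree(void)
{
    for (int shift = 8; shift <= 11; shift++)
        for (int scaling = 0; scaling <= 255; scaling++)
            for (int grain = -2048; grain <= 2047; grain++)
                if (scale_ref(scaling, grain, shift) !=
                    scale_sqrdmulh(scaling, grain, shift))
                    return 0;
    return 1;
}
```

The two forms agree because (2 * scaling * 2^(15 - shift) * grain + 2^15) / 2^16 equals (scaling * grain + 2^(shift - 1)) / 2^shift before flooring.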
2021-05-25	arm64: filmgrain: Fix overflows in gen_grain	Martin Storsjö
	After multiplying two int8_t, the maximum possible output is -128*-128 = 16384. One can't add two such values in an int16_t (even if all the products of all other int8_t combinations can be). Previously the summing used 16 bit intermediates for the sum of two products and only lengthened the result to 32 bit when accumulating three or more products.
	Before:                    Cortex A53       A72       A73  Apple M1
	gen_grain_y_ar1_8bpc_neon:   112598.5   71309.2   74889.8     372.2
	gen_grain_y_ar2_8bpc_neon:   139932.4   91442.3   95788.4     387.3
	gen_grain_y_ar3_8bpc_neon:   185607.6  115691.6  131655.8     403.0
	After:
	gen_grain_y_ar1_8bpc_neon:   112968.8   71897.9   76171.2     371.2
	gen_grain_y_ar2_8bpc_neon:   142768.8   94517.9   97934.4     387.5
	gen_grain_y_ar3_8bpc_neon:   191625.2  121083.0  135975.3     405.6
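The overflow described above is easy to reproduce in scalar C. The function names are illustrative; `sum2_narrow` models accumulating two products in a 16-bit lane that wraps, and `sum2_wide` models widening to 32 bits first.

```c
#include <stdint.h>

/* Keep only the low 16 bits of v, interpreted as a signed value. */
static int16_t wrap16(int32_t v)
{
    const int32_t low = (int32_t)((uint32_t)v & 0xffffu);
    return (int16_t)(low >= 0x8000 ? low - 0x10000 : low);
}

/* Sum of two int8_t products accumulated in 32 bits: always exact. */
static int32_t sum2_wide(int8_t a0, int8_t b0, int8_t a1, int8_t b1)
{
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

/* Same sum held in a 16-bit intermediate, which can wrap. */
static int16_t sum2_narrow(int8_t a0, int8_t b0, int8_t a1, int8_t b1)
{
    return wrap16(sum2_wide(a0, b0, a1, b1));
}
```

Only the case where both products are -128 * -128 exceeds the int16_t range: the exact sum is 32768, one more than INT16_MAX.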
2021-05-14	arm64: filmgrain16: Simplify constructing the constant 0x0fff	Martin Storsjö
	Use the mvni instruction instead of setting the constant in a GPR first.
2021-05-13	arm64: filmgrain16: Guard against out of range pixels in the gather function	Martin Storsjö
	In 16 bpc, the pixels are 16 bit integers, but valid pixels are only up to 12 bits, and the scaling buffer only contains 4096 elements. The src pixels are, normally, supposed to be valid pixels, but when processing blocks of 32 pixels at a time, it can operate on uninitialized pixels past the right edge.
	Before:                 Cortex A53       A72       A73  Apple M1
	fgy_32x32xn_16bpc_neon:    10372.5    8194.4    8612.1      24.2
	After:
	fgy_32x32xn_16bpc_neon:    10837.9    8469.5    8885.1      24.6
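A guarded table lookup of this kind can be sketched as follows. Masking the index with 0x0fff is an assumption here (the commit message does not say whether the assembly masks or clamps), and `SCALING_SIZE` and `lookup_scaling` are hypothetical names.

```c
#include <stdint.h>

#define SCALING_SIZE 4096  /* one entry per valid 12-bit pixel value */

/* Look up the scaling value for a pixel, masking the index so that an
 * out-of-range (e.g. uninitialized) 16-bit pixel cannot read past the
 * end of the table. */
static uint8_t lookup_scaling(const uint8_t scaling[SCALING_SIZE], uint16_t px)
{
    return scaling[px & 0x0fff];
}
```

Valid pixels are unaffected by the mask; garbage pixels past the right edge produce a garbage but in-bounds lookup, which is harmless as long as the corresponding output is not used.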
2021-05-12	arm64: filmgrain: Add a NEON implementation of fgy_32x32xn for 16 bpc	Martin Storsjö
	Relative speedup over C code: Cortex A53       A72       A73  Apple M1
	fgy_32x32xn_16bpc_neon:             3.87      2.28      2.78      3.45
2021-05-04	arm64: filmgrain: Add NEON implementation of the generate_grain_uv functions	Martin Storsjö
	The existing functions/macros for generate_grain_y are templated for adding in a final coefficient from the y buffer, while trying to keep the binary size down.
	Relative speedup over C code:   Cortex A53       A72       A73  Apple M1
	gen_grain_uv_ar0_8bpc_420_neon:       4.62      4.55      5.27      9.08
	gen_grain_uv_ar0_8bpc_422_neon:       4.81      4.90      5.33      7.25
	gen_grain_uv_ar0_8bpc_444_neon:       5.05      5.17      5.69      7.04
	gen_grain_uv_ar1_8bpc_420_neon:       3.61      3.09      3.68      3.92
	gen_grain_uv_ar1_8bpc_422_neon:       3.71      3.22      3.64      3.46
	gen_grain_uv_ar1_8bpc_444_neon:       3.59      3.40      3.67      3.11
	gen_grain_uv_ar2_8bpc_420_neon:       4.77      3.85      4.81      4.55
	gen_grain_uv_ar2_8bpc_422_neon:       4.88      3.96      4.85      4.15
	gen_grain_uv_ar2_8bpc_444_neon:       5.18      4.65      5.18      3.83
	gen_grain_uv_ar3_8bpc_420_neon:       6.14      5.25      6.14      5.64
	gen_grain_uv_ar3_8bpc_422_neon:       6.27      5.27      6.28      5.42
	gen_grain_uv_ar3_8bpc_444_neon:       6.84      6.40      6.79      5.18
2021-04-27	arm64: filmgrain: Add NEON implementation of the generate_grain_y function	Martin Storsjö
	Relative speedup over C code: Cortex A53       A72       A73  Apple M1
	gen_grain_y_ar0_8bpc_neon:          5.03      5.17      5.59      5.55
	gen_grain_y_ar1_8bpc_neon:          3.38      3.38      3.56      2.77
	gen_grain_y_ar2_8bpc_neon:          5.00      4.64      5.06      3.38
	gen_grain_y_ar3_8bpc_neon:          6.74      6.53      6.67      4.66
2021-04-27	arm64: filmgrain: Add the missing HIGHBD_DECL_SUFFIX for the fguv functions	Martin Storsjö