
Commit log for github.com/videolan/dav1d.git
2020-05-10  arm64: itx: Add NEON implementation of itx for 10 bpc  (Martin Storsjö)
Add an element size specifier to the existing individual transform functions for 8 bpc, naming them e.g. inv_dct_8h_x8_neon, to clarify that they operate on input vectors of 8h elements, and make the symbols public so that the 10 bpc case can call them from a different object file. The same convention is used in the new itx16.S, e.g. inv_dct_4s_x8_neon.

Compile the existing itx.S regardless of whether 8 bpc support is enabled. For builds with 8 bpc support disabled, this does pull in the unused frontend functions, but that is hopefully tolerable compared to splitting the file into a sharable file for the transforms and a separate one for the frontends.

This only implements the 10 bpc case, as that case can use transforms operating on 16 bit coefficients in the second pass.

Relative speedup vs C for a few functions:

                                          Cortex A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:          4.14   4.06   4.49
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:          6.51   6.49   6.42
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:          5.02   4.63   6.23
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:          8.54   7.13  11.96
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:        5.52   6.60   8.03
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:       11.27   9.62  12.22
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:        9.60   6.97   8.59
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:        2.60   3.48   3.19
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:       14.65  12.64  16.86
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:       11.57   8.80  12.68
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:        8.79   8.00   9.21
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:        7.58   6.21   7.80
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:        2.41   2.85   2.75
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:       12.91  10.27  12.24
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:       10.96   7.97  10.31
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:        8.95   7.42   9.55
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:        7.97   6.12   7.82
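With dav1d's function macro from asm.S, the renamed, exported 8 bpc transform is declared roughly like this (a sketch only; the transform body is elided):

```asm
// Single 8-point dct pass for 8 bpc, operating on vectors of 8h
// elements. export=1 makes the symbol public, so the 10 bpc code in
// itx16.S can call it for its 16 bit coefficient second pass.
function inv_dct_8h_x8_neon, export=1
        // ... butterflies on the eight input vectors ...
        ret
endfunc
```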
2019-06-26  arm64: itx: Add NEON optimized inverse transforms  (Martin Storsjö)
The speedup for most non-dc-only dct functions is around 9-12x over the C code generated by GCC 7.3.

Relative speedups vs C for a few functions:

                                          Cortex A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_8bpc_neon:           3.90   4.16   5.65
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:           7.20   8.05  11.19
inv_txfm_add_8x8_dct_dct_0_8bpc_neon:           5.09   6.73   6.45
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:          12.18  10.80  13.05
inv_txfm_add_16x16_dct_dct_0_8bpc_neon:         7.31   9.35  11.17
inv_txfm_add_16x16_dct_dct_1_8bpc_neon:        14.36  13.06  15.93
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:        11.00  10.09  12.05
inv_txfm_add_32x32_dct_dct_0_8bpc_neon:         4.41   5.40   5.77
inv_txfm_add_32x32_dct_dct_1_8bpc_neon:        13.84  13.81  18.04
inv_txfm_add_32x32_dct_dct_2_8bpc_neon:        11.75  11.87  15.22
inv_txfm_add_32x32_dct_dct_3_8bpc_neon:        10.20  10.40  13.13
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:         9.01   9.21  11.56
inv_txfm_add_64x64_dct_dct_0_8bpc_neon:         3.84   4.82   5.28
inv_txfm_add_64x64_dct_dct_1_8bpc_neon:        14.40  12.69  16.71
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:        10.91   9.63  12.67

Some of the specialcased identity_identity transforms for 32x32 give insane speedups over the generic C code:

inv_txfm_add_32x32_identity_identity_0_8bpc_neon:  225.26  238.11  247.07
inv_txfm_add_32x32_identity_identity_1_8bpc_neon:  225.33  238.53  247.69
inv_txfm_add_32x32_identity_identity_2_8bpc_neon:   59.60   61.94   64.63
inv_txfm_add_32x32_identity_identity_3_8bpc_neon:   26.98   27.99   29.21
inv_txfm_add_32x32_identity_identity_4_8bpc_neon:   15.08   15.93   16.56
2019-06-20  arm64: Consistently name macro arguments tX for temporaries in transposes  (Martin Storsjö)
2019-04-16  arm64: loopfilter: Implement NEON loop filters  (Martin Storsjö)
The exact relative speedup compared to C code is a bit vague and hard to measure, depending on exactly how many filtered blocks are skipped, as the NEON version always filters 16 pixels at a time, while the C code can skip processing individual 4 pixel blocks.

Additionally, the checkasm benchmarking code runs the same function repeatedly on the same buffer, which can make the filter take different codepaths on each run, as the function updates the buffer, and the updated buffer is used as input for the next run.

When the checkasm test data is tweaked to avoid skipped blocks, the relative speedup compared to C is between 2x and 5x; with the current checkasm test as such, it is around 1x to 4x.

Benchmark numbers from a tweaked checkasm that avoids skipped blocks:

                           Cortex A53      A72      A73
lpf_h_sb_uv_w4_8bpc_c:         2954.7   1399.3   1655.3
lpf_h_sb_uv_w4_8bpc_neon:       895.5    650.8    692.0
lpf_h_sb_uv_w6_8bpc_c:         3879.2   1917.2   2257.7
lpf_h_sb_uv_w6_8bpc_neon:      1125.6    759.5    838.4
lpf_h_sb_y_w4_8bpc_c:          6711.0   3275.5   3913.7
lpf_h_sb_y_w4_8bpc_neon:       1744.0   1342.1   1351.5
lpf_h_sb_y_w8_8bpc_c:         10695.7   6155.8   6638.9
lpf_h_sb_y_w8_8bpc_neon:       2146.5   1560.4   1609.1
lpf_h_sb_y_w16_8bpc_c:        11355.8   6292.0   6995.9
lpf_h_sb_y_w16_8bpc_neon:      2475.4   1949.6   1968.4
lpf_v_sb_uv_w4_8bpc_c:         2639.7   1204.8   1425.9
lpf_v_sb_uv_w4_8bpc_neon:       510.7    351.4    334.7
lpf_v_sb_uv_w6_8bpc_c:         3468.3   1757.1   2021.5
lpf_v_sb_uv_w6_8bpc_neon:       625.0    415.0    397.8
lpf_v_sb_y_w4_8bpc_c:          5428.7   2731.7   3068.5
lpf_v_sb_y_w4_8bpc_neon:       1172.6    792.1    768.0
lpf_v_sb_y_w8_8bpc_c:          8946.1   4412.8   5121.0
lpf_v_sb_y_w8_8bpc_neon:       1565.5   1063.6   1062.7
lpf_v_sb_y_w16_8bpc_c:         8978.9   4411.7   5112.0
lpf_v_sb_y_w16_8bpc_neon:      1775.0   1288.1   1236.7
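The buffer-updating effect described above can be sketched in C (a toy 1-D filter, not the real dav1d loop filter): because the function writes its output back into its input buffer, a benchmark loop that calls it repeatedly on the same buffer measures the cheap skip path on every iteration after the first.

```c
/* Toy in-place "filter": smooth any step larger than 1.
 * Returns how many pixels were actually filtered; pixels whose
 * step is already small are skipped, which is the fast path. */
static int filter_once(int *px, int n) {
    int filtered = 0;
    for (int i = 1; i < n; i++) {
        if (px[i] - px[i - 1] > 1) { /* edge strong enough to filter */
            px[i] = px[i - 1] + 1;   /* write result back into input */
            filtered++;
        }
    }
    return filtered;
}
```

On a buffer like {0, 4, 8}, the first call filters both steps; a second call on the now-smoothed buffer filters nothing, so a naive timing loop mostly measures the skip path.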
2019-04-04  arm: Consistently use 8/24 columns indentation for assembly  (Martin Storsjö)
For cases with indented, nested .if/.macro in asm.S, indent those by 4 chars. Some initial assembly files were indented to 4/16 columns, while all the actual implementation files, starting with src/arm/64/mc.S, have used 8/24 for indentation.
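As a sketch of the resulting layout (shift_and_add is a hypothetical macro, shown only to illustrate the columns): directives and mnemonics start at column 8, operands at column 24, and the body of a nested .if is indented a further 4 characters.

```asm
        .macro shift_and_add d, s, shift
        .if \shift > 0
            ushr        v6.8h, \s\().8h, #\shift
            add         \d\().8h, \d\().8h, v6.8h
        .else
            add         \d\().8h, \d\().8h, \s\().8h
        .endif
        .endm
```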
2019-02-26  fix dav1d spelling  (Janne Grunau)
2019-02-14  arm64: mc: NEON implementation of warp8x8{,t}  (Martin Storsjö)
Relative speedup vs C code:

                       Cortex A53   A72   A73
warp_8x8_8bpc_neon:          3.19  2.60  3.66
warp_8x8t_8bpc_neon:         3.09  2.50  3.58
2019-02-13  Remove leading double underscores from include guard defines  (Martin Storsjö)
An identifier starting with two leading underscores is reserved for the compiler/standard library implementation. Also remove the two trailing underscores, for consistency and symmetry.
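The resulting guard style can be sketched as follows (DAV1D_SRC_FOO_H and foo_twice are hypothetical names for illustration; the point is that a guard like __DAV1D_SRC_FOO_H__ would use identifiers reserved by the C standard):

```c
/* Hypothetical contents of an imaginary src/foo.h: the guard macro has
 * no leading or trailing underscores, since names beginning with two
 * underscores are reserved for the compiler and standard library
 * (C11 section 7.1.3, "Reserved identifiers"). */
#ifndef DAV1D_SRC_FOO_H
#define DAV1D_SRC_FOO_H

static inline int foo_twice(int x) { return 2 * x; }

#endif /* DAV1D_SRC_FOO_H */
```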
2018-09-30  aarch64: Always use the PIC version of movrel for iOS  (Martin Storsjö)
Building without PIC isn't allowed for iOS. This fixes linker errors like these:

ld: Absolute addressing not allowed in arm64 code but used in '_checkasm_checked_call' referencing 'error_message' for architecture arm64
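The difference can be sketched with a simplified movrel macro (a sketch only; the real macro has more variants and Mach-O uses @PAGE/@PAGEOFF relocations instead of :lo12:): the PIC form builds the address PC-relatively with adrp/add, while the non-PIC form loads an absolute address from a literal pool, which is the absolute relocation the iOS linker rejects.

```asm
        .macro movrel rd, val
#if defined(PIC)
        adrp            \rd, \val             // PC-relative page address
        add             \rd, \rd, :lo12:\val  // plus offset within page
#else
        ldr             \rd, =\val            // literal pool: absolute address
#endif
        .endm
```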
2018-09-29  build: add support for arm/aarch64 asm and integrate checkasm  (Janne Grunau)