github.com/videolan/dav1d.git
path: root/src/arm/64
2020-05-10  msac: Avoid attempting to refill after eob has already been reached  (Henrik Gramner)
Utilize the unsigned representation of a signed integer to skip the refill code if the count was already negative to begin with, which saves a few clock cycles at the end of each tile.
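One way to express the idea in C (an illustrative sketch with a hypothetical helper, not dav1d's actual msac code):

    #include <stdint.h>

    /* `cnt` is a signed bit counter: consuming d bits (0 <= d <= 15)
     * decrements it, and a refill is due when it crosses from nonnegative
     * to negative. Once eob has been reached, cnt stays parked at a large
     * negative value, so a plain `cnt < 0` test would keep re-entering
     * the refill path for nothing. */
    static int needs_refill(int32_t cnt, int d) {
        /* Refill only if cnt lies in [-d, -1], i.e. it just now went
         * negative. For cnt in that range, ~cnt = -cnt - 1 lies in
         * [0, d - 1], so a single unsigned comparison covers both bounds,
         * and a count that was already negative before the decrement
         * fails the test and skips the refill. */
        return (uint32_t)~cnt < (uint32_t)d;
    }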
2020-05-10  arm64: itx: Add NEON implementation of itx for 10 bpc  (Martin Storsjö)

Add an element size specifier to the existing individual transform functions for 8 bpc, naming them e.g. inv_dct_8h_x8_neon, to clarify that they operate on input vectors of 8h, and make the symbols public, to let the 10 bpc case call them from a different object file. The same convention is used in the new itx16.S, e.g. inv_dct_4s_x8_neon.

Make the existing itx.S compiled regardless of whether 8 bpc support is enabled. For builds with 8 bpc support disabled, this does include the unused frontend functions, but this is hopefully tolerable to avoid having to split the file into a sharable file for transforms and a separate one for frontends.

This only implements the 10 bpc case, as that case can use transforms operating on 16 bit coefficients in the second pass.

Relative speedup vs C for a few functions:

                                          Cortex A53     A72     A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:          4.14    4.06    4.49
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:          6.51    6.49    6.42
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:          5.02    4.63    6.23
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:          8.54    7.13   11.96
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:        5.52    6.60    8.03
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:       11.27    9.62   12.22
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:        9.60    6.97    8.59
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:        2.60    3.48    3.19
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:       14.65   12.64   16.86
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:       11.57    8.80   12.68
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:        8.79    8.00    9.21
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:        7.58    6.21    7.80
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:        2.41    2.85    2.75
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:       12.91   10.27   12.24
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:       10.96    7.97   10.31
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:        8.95    7.42    9.55
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:        7.97    6.12    7.82
2020-05-10  arm64: itx: Prepare for other bitdepths  (Martin Storsjö)
2020-05-10  arm64: itx: Share code for the three horz_16x8 functions  (Martin Storsjö)
2020-05-10  arm64: itx: Fix the eob checking for dct_dct_64x16  (Martin Storsjö)

Before this, we never did the early exit from the first pass.

Before:                                   Cortex A53     A72     A73
inv_txfm_add_64x16_dct_dct_1_8bpc_neon:       7275.7  5198.3  5250.9
inv_txfm_add_64x16_dct_dct_2_8bpc_neon:       7276.1  5197.0  5251.3
inv_txfm_add_64x16_dct_dct_3_8bpc_neon:       7275.8  5196.2  5254.5
inv_txfm_add_64x16_dct_dct_4_8bpc_neon:       7273.6  5198.8  5254.2
After:
inv_txfm_add_64x16_dct_dct_1_8bpc_neon:       5187.8  3763.8  3735.0
inv_txfm_add_64x16_dct_dct_2_8bpc_neon:       7280.6  5185.6  5256.3
inv_txfm_add_64x16_dct_dct_3_8bpc_neon:       7270.7  5179.8  5250.3
inv_txfm_add_64x16_dct_dct_4_8bpc_neon:       7271.7  5212.4  5256.4

The other related variants didn't have this bug and properly exited early when possible.
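A hedged C sketch of this kind of eob-gated early exit (hypothetical structure and names; the real code is AArch64 assembly and the scan/threshold details differ):

    #include <stdint.h>

    /* eob (the index of the last nonzero coefficient in scan order)
     * bounds how many rows of input can be nonzero, so the first pass
     * can stop early; the dct_dct_64x16 check for this was broken and
     * the loop always ran to completion. */
    static void first_pass(int16_t *coef, int total_rows, int eob,
                           const int *eob_threshold /* per 4-row block */) {
        for (int row = 0; row < total_rows; row += 4) {
            /* ... horizontal transform of rows row..row+3 of coef ... */
            if (eob <= eob_threshold[row / 4])
                return;  /* remaining rows hold no nonzero coefficients */
        }
        (void)coef;
    }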
2020-05-10  arm64: itx: Simplify inv_txfm_horz_dct_32x8  (Martin Storsjö)
Unify some loads and stores, avoiding some extra pointer updates.
2020-05-10  arm64: itx: Minor optimizations for the 8x32 functions  (Martin Storsjö)
This gives a speedup of a couple of cycles.
2020-05-10  arm64: itx: Cosmetic fix up  (Martin Storsjö)
2020-05-10  arm64: itx: Remove an unused constant  (Martin Storsjö)
This isn't used for a sqrdmulh in its current form here. The one left in idct_coeffs[1] isn't used within the idct itself, but inv_txfm_horz_scale_dct_32x8 relies on it being left there for use with sqrdmulh scaling later.
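For reference, sqrdmulh is a saturating, rounding, doubling multiply that returns the high half of the product, so a Q15-style constant kept in a coefficient slot lets a later pass rescale a whole vector in one instruction. A small intrinsics sketch of the operation (illustrative only; the constant choice here is an example, not the value in idct_coeffs):

    #include <arm_neon.h>

    /* Per int16 lane this computes roughly (2*a*b + (1 << 15)) >> 16,
     * saturated, which is how a Q15 fixed-point constant scales data. */
    static int16x8_t scale_by_inv_sqrt2(int16x8_t v) {
        /* 23170 ~= 0.70711 * 2^15, i.e. multiply by 1/sqrt(2) */
        return vqrdmulhq_n_s16(v, 23170);
    }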
2020-05-10  arm64: itx: Remove a todo comment about more special cased functions  (Martin Storsjö)
These cases were removed from x86 to save space and simplify the code in e0b88bd2b2c97a2695edcc498485e1cb3003e7f1, as those cases were essentially unused in real world bitstreams.
2020-05-10  arm64: itx: Remove a now unused macro  (Martin Storsjö)
The macro became unused in 9f084b0d2.
2020-04-05  arm64: mc: NEON implementation of emu_edge for 16bpc  (Martin Storsjö)

Relative speedup over C code:

                            Cortex A53     A72     A73
emu_edge_w4_16bpc_neon:           2.49    1.53    1.91
emu_edge_w8_16bpc_neon:           2.27    1.55    1.90
emu_edge_w16_16bpc_neon:          2.46    1.46    2.09
emu_edge_w32_16bpc_neon:          2.20    1.39    1.73
emu_edge_w64_16bpc_neon:          1.65    1.00    1.46
emu_edge_w128_16bpc_neon:         1.55    1.44    1.54
2020-04-04  arm64: mc: NEON implementation of emu_edge for 8bpc  (Martin Storsjö)

Relative speedups over C code:

                            Cortex A53     A72     A73
emu_edge_w4_8bpc_neon:            3.82    2.93    2.41
emu_edge_w8_8bpc_neon:            3.28    2.86    2.51
emu_edge_w16_8bpc_neon:           3.58    3.27    2.63
emu_edge_w32_8bpc_neon:           3.04    1.68    2.12
emu_edge_w64_8bpc_neon:           2.58    1.45    1.48
emu_edge_w128_8bpc_neon:          1.79    1.02    1.57

The benchmark numbers for the larger sizes on A72 fluctuate a whole lot and thus seem very unreliable.
2020-03-26  arm64: ipred: Add NEON implementation of ipred for 16 bpc  (Martin Storsjö)
The FILTER_PRED function is templated, with two separate instantiations for 10 and 12 bit. (They're switched between using a runtime check on entry to the function.)
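A hedged C sketch of that dispatch pattern (hypothetical names; the real switching happens inside the assembly function):

    #include <stdint.h>
    #include <stddef.h>

    static void filter_pred_10bit(uint16_t *dst, ptrdiff_t stride) { (void)dst; (void)stride; /* ... */ }
    static void filter_pred_12bit(uint16_t *dst, ptrdiff_t stride) { (void)dst; (void)stride; /* ... */ }

    /* One entry point, two instantiations: a cheap check of the pixel
     * maximum on entry picks the 10- or 12-bit variant. */
    static void filter_pred(uint16_t *dst, ptrdiff_t stride, int bitdepth_max) {
        if (bitdepth_max == 0x3ff)        /* (1 << 10) - 1 */
            filter_pred_10bit(dst, stride);
        else                              /* 0xfff for 12 bit */
            filter_pred_12bit(dst, stride);
    }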
2020-03-26  arm: ipred: Prepare for 16 bpc  (Martin Storsjö)
2020-03-26  arm: ipred: Remove stray leftover instructions  (Martin Storsjö)
2020-03-26  arm64: ipred: Integrate aggregation into the first pass of cfl_ac  (Martin Storsjö)

Before:                      Cortex A53     A72     A73
cfl_ac_420_w4_8bpc_neon:          131.8    75.6    70.8
cfl_ac_420_w8_8bpc_neon:          199.4   106.4   117.8
cfl_ac_420_w16_8bpc_neon:         370.6   194.6   213.3
cfl_ac_422_w4_8bpc_neon:           98.4    61.4    56.6
cfl_ac_422_w8_8bpc_neon:          237.7   134.2   141.0
cfl_ac_422_w16_8bpc_neon:         456.5   256.2   279.5
After:
cfl_ac_420_w4_8bpc_neon:          121.1    76.3    67.2
cfl_ac_420_w8_8bpc_neon:          188.7   106.6   115.3
cfl_ac_420_w16_8bpc_neon:         331.7   177.4   199.8
cfl_ac_422_w4_8bpc_neon:           88.7    57.3    51.6
cfl_ac_422_w8_8bpc_neon:          208.2   121.2   130.7
cfl_ac_422_w16_8bpc_neon:         393.8   226.3   239.3
2020-03-26  arm64: ipred: Use rounded shifts instead of a separate addition  (Martin Storsjö)
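The idea, as an intrinsics sketch (illustrative; the actual code is assembly): NEON's rounding shift folds the `+ (1 << (n - 1))` bias into the shift itself.

    #include <arm_neon.h>

    /* Two instructions: add the rounding bias, then shift. */
    static int16x8_t round_shift4_two_ops(int16x8_t v) {
        return vshrq_n_s16(vaddq_s16(v, vdupq_n_s16(1 << 3)), 4);
    }

    /* One instruction: srshr performs the rounded shift directly. */
    static int16x8_t round_shift4_one_op(int16x8_t v) {
        return vrshrq_n_s16(v, 4);
    }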
2020-03-26  arm64: ipred: Do shifts on only half the register width when possible  (Martin Storsjö)
In these cases, we only need the value of the first element.
2020-03-26  arm64: ipred: Avoid data dependencies with consecutive dup instructions  (Martin Storsjö)
This is around one cycle faster.
2020-03-26  arm64: ipred: Remove a superfluous postincrement  (Martin Storsjö)
2020-03-05  arm64: mc: NEON implementation of w_mask for 16 bpc  (Martin Storsjö)

Checkasm numbers:

                              Cortex A53      A72      A73
w_mask_420_w4_16bpc_neon:          173.6    123.5    120.3
w_mask_420_w8_16bpc_neon:          484.2    344.1    329.5
w_mask_420_w16_16bpc_neon:        1411.2   1027.4   1035.1
w_mask_420_w32_16bpc_neon:        5561.5   4093.2   3980.1
w_mask_420_w64_16bpc_neon:       13809.6   9856.5   9581.0
w_mask_420_w128_16bpc_neon:      35614.7  25553.8  24284.4
w_mask_422_w4_16bpc_neon:          159.4    112.2    114.2
w_mask_422_w8_16bpc_neon:          453.4    326.1    326.7
w_mask_422_w16_16bpc_neon:        1394.6   1062.3   1050.2
w_mask_422_w32_16bpc_neon:        5485.8   4219.6   4027.3
w_mask_422_w64_16bpc_neon:       13701.2  10079.6   9692.6
w_mask_422_w128_16bpc_neon:      35455.3  25892.5  24625.9
w_mask_444_w4_16bpc_neon:          153.0    112.3    112.7
w_mask_444_w8_16bpc_neon:          437.2    331.8    325.8
w_mask_444_w16_16bpc_neon:        1395.1   1069.1   1041.7
w_mask_444_w32_16bpc_neon:        5370.1   4213.5   4138.1
w_mask_444_w64_16bpc_neon:       13482.6  10190.5  10004.6
w_mask_444_w128_16bpc_neon:      35583.7  26911.2  25638.8

Corresponding numbers for 8 bpc for comparison:

w_mask_420_w4_8bpc_neon:           126.6     79.1     87.7
w_mask_420_w8_8bpc_neon:           343.9    195.0    211.5
w_mask_420_w16_8bpc_neon:          886.3    540.3    577.7
w_mask_420_w32_8bpc_neon:         3558.6   2152.4   2216.7
w_mask_420_w64_8bpc_neon:         8894.9   5161.2   5297.0
w_mask_420_w128_8bpc_neon:       22520.1  13514.5  13887.2
w_mask_422_w4_8bpc_neon:           112.9     68.2     77.0
w_mask_422_w8_8bpc_neon:           314.4    175.5    208.7
w_mask_422_w16_8bpc_neon:          835.5    565.0    608.3
w_mask_422_w32_8bpc_neon:         3381.3   2231.8   2287.6
w_mask_422_w64_8bpc_neon:         8499.4   5343.6   5460.8
w_mask_422_w128_8bpc_neon:       21823.3  14206.5  14249.1
w_mask_444_w4_8bpc_neon:           104.6     65.8     72.7
w_mask_444_w8_8bpc_neon:           290.4    173.7    196.6
w_mask_444_w16_8bpc_neon:          831.4    586.7    591.7
w_mask_444_w32_8bpc_neon:         3320.8   2300.6   2251.0
w_mask_444_w64_8bpc_neon:         8300.0   5480.5   5346.8
w_mask_444_w128_8bpc_neon:       21633.8  15981.3  14384.8
2020-03-04  arm64: mc: NEON implementation of blend for 16bpc  (Martin Storsjö)

Checkasm numbers:

                            Cortex A53     A72     A73
blend_h_w2_16bpc_neon:           109.3    83.1    56.7
blend_h_w4_16bpc_neon:           114.1    61.4    62.3
blend_h_w8_16bpc_neon:           133.3    80.8    81.1
blend_h_w16_16bpc_neon:          215.6   132.7   149.5
blend_h_w32_16bpc_neon:          390.4   254.2   235.8
blend_h_w64_16bpc_neon:          719.1   456.3   453.8
blend_h_w128_16bpc_neon:        1646.1  1112.3  1065.9
blend_v_w2_16bpc_neon:           185.9   175.9   180.0
blend_v_w4_16bpc_neon:           338.0   183.4   232.1
blend_v_w8_16bpc_neon:           426.5   213.8   250.6
blend_v_w16_16bpc_neon:          678.2   357.8   382.6
blend_v_w32_16bpc_neon:         1098.3   686.2   695.6
blend_w4_16bpc_neon:              75.7    31.5    32.0
blend_w8_16bpc_neon:             134.0    75.0    75.8
blend_w16_16bpc_neon:            467.9   267.3   310.0
blend_w32_16bpc_neon:           1201.9   658.7   779.7

Corresponding numbers for 8bpc for comparison:

blend_h_w2_8bpc_neon:            104.1    55.9    60.8
blend_h_w4_8bpc_neon:            108.9    58.7    48.2
blend_h_w8_8bpc_neon:             99.3    64.4    67.4
blend_h_w16_8bpc_neon:           145.2    93.4    85.1
blend_h_w32_8bpc_neon:           262.2   157.5   148.6
blend_h_w64_8bpc_neon:           466.7   278.9   256.6
blend_h_w128_8bpc_neon:         1054.2   624.7   571.0
blend_v_w2_8bpc_neon:            170.5   106.6   113.4
blend_v_w4_8bpc_neon:            333.0   189.9   225.9
blend_v_w8_8bpc_neon:            314.9   199.0   203.5
blend_v_w16_8bpc_neon:           476.9   300.8   241.1
blend_v_w32_8bpc_neon:           766.9   430.4   415.1
blend_w4_8bpc_neon:               66.7    35.4    26.0
blend_w8_8bpc_neon:              110.7    47.9    48.1
blend_w16_8bpc_neon:             299.4   161.8   162.3
blend_w32_8bpc_neon:             725.8   417.0   432.8
2020-03-04  arm: mc: Optimize blend_v  (Martin Storsjö)

Use a register post-increment for the last increment, avoiding a separate increment instruction. Avoid processing the last 8 pixels in the w32 case when we only output 24 pixels.

Before:
ARM32:                  Cortex A7      A8      A9     A53     A72     A73
blend_v_w4_8bpc_neon:       450.4   574.7   538.7   374.6   199.3   260.5
blend_v_w8_8bpc_neon:       559.6   351.3   552.5   357.6   214.8   204.3
blend_v_w16_8bpc_neon:      926.3   511.6   787.9   593.0   271.0   246.8
blend_v_w32_8bpc_neon:     1482.5   917.0  1149.5   991.9   354.0   368.9
ARM64:
blend_v_w4_8bpc_neon:                               351.1   200.0   224.1
blend_v_w8_8bpc_neon:                               333.0   212.4   203.8
blend_v_w16_8bpc_neon:                              495.2   302.0   247.0
blend_v_w32_8bpc_neon:                              840.0   557.8   514.0
After:
ARM32:
blend_v_w4_8bpc_neon:       435.5   575.8   537.6   356.2   198.3   259.5
blend_v_w8_8bpc_neon:       545.2   347.9   553.5   339.1   207.8   204.2
blend_v_w16_8bpc_neon:      913.7   511.0   788.1   573.7   275.4   243.3
blend_v_w32_8bpc_neon:     1445.3   951.2  1079.1   920.4   352.2   361.6
ARM64:
blend_v_w4_8bpc_neon:                               333.0   191.3   225.9
blend_v_w8_8bpc_neon:                               314.9   199.3   203.5
blend_v_w16_8bpc_neon:                              476.9   301.3   241.1
blend_v_w32_8bpc_neon:                              766.9   432.8   416.9
2020-03-04  arm64: mc: Treat the stride as a full 64 bit (potentially signed) value in blend_8bpc_neon  (Martin Storsjö)
2020-03-04  arm64: mc: Fix indentation  (Martin Storsjö)
2020-03-04  arm64: mc: Use more intuitive lane specifications for loads/stores  (Martin Storsjö)
For instructions that load/store a full or half register (instead of doing a lanewise load/store), the lane specification in itself doesn't matter, only its size. This doesn't change the generated code, but makes it more readable.
2020-03-02  arm64: loopfilter: NEON implementation of loopfilter for 16 bpc  (Martin Storsjö)

Checkasm runtimes:

                             Cortex A53     A72     A73
lpf_h_sb_uv_w4_16bpc_neon:        919.0   795.0   714.9
lpf_h_sb_uv_w6_16bpc_neon:       1267.7  1116.2  1081.9
lpf_h_sb_y_w4_16bpc_neon:        1500.2  1543.9  1778.5
lpf_h_sb_y_w8_16bpc_neon:        2216.1  2183.0  2568.1
lpf_h_sb_y_w16_16bpc_neon:       2641.8  2630.4  2639.4
lpf_v_sb_uv_w4_16bpc_neon:        836.5   572.7   667.3
lpf_v_sb_uv_w6_16bpc_neon:       1130.8   709.1   955.5
lpf_v_sb_y_w4_16bpc_neon:        1271.6  1434.4  1272.1
lpf_v_sb_y_w8_16bpc_neon:        1818.0  1759.1  1664.6
lpf_v_sb_y_w16_16bpc_neon:       1998.6  2115.8  1586.6

Corresponding numbers for 8 bpc for comparison:

lpf_h_sb_uv_w4_8bpc_neon:         799.4   632.8   695.4
lpf_h_sb_uv_w6_8bpc_neon:        1067.3   613.6   767.5
lpf_h_sb_y_w4_8bpc_neon:         1490.5  1179.1  1018.9
lpf_h_sb_y_w8_8bpc_neon:         1892.9  1382.0  1172.0
lpf_h_sb_y_w16_8bpc_neon:        2117.4  1625.4  1739.0
lpf_v_sb_uv_w4_8bpc_neon:         447.1   447.7   446.0
lpf_v_sb_uv_w6_8bpc_neon:         522.1   529.0   513.1
lpf_v_sb_y_w4_8bpc_neon:         1043.7   785.0   775.9
lpf_v_sb_y_w8_8bpc_neon:         1500.4  1115.9   881.2
lpf_v_sb_y_w16_8bpc_neon:        1493.5  1371.4  1248.5
2020-03-02  arm: loopfilter: Prepare for 16 bpc  (Martin Storsjö)
2020-03-02  arm: loopfilter: Fix a comment  (Martin Storsjö)
2020-02-17  arm: cdef: Do an 8 bit implementation for cases with all edges present  (Martin Storsjö)

This increases the code size by around 3 KB on arm64.

Before:
ARM32:                     Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:     807.1   517.0   617.7   506.6   429.9   357.8
cdef_filter_4x8_8bpc_neon:    1407.9   899.3  1054.6   862.3   726.5   628.1
cdef_filter_8x8_8bpc_neon:    2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
ARM64:
cdef_filter_4x4_8bpc_neon:                             460.7   301.8   308.0
cdef_filter_4x8_8bpc_neon:                             831.6   547.0   555.2
cdef_filter_8x8_8bpc_neon:                            1454.6   935.6   960.4
After:
ARM32:
cdef_filter_4x4_8bpc_neon:     669.3   541.3   524.4   424.9   322.7   298.1
cdef_filter_4x8_8bpc_neon:    1159.1   922.9   881.1   709.2   538.3   514.1
cdef_filter_8x8_8bpc_neon:    1888.8  1285.4  1358.5  1152.9   839.3   871.2
ARM64:
cdef_filter_4x4_8bpc_neon:                             383.6   262.1   259.9
cdef_filter_4x8_8bpc_neon:                             684.9   472.2   464.7
cdef_filter_8x8_8bpc_neon:                            1160.0   756.8   788.0

(The checkasm benchmark averages three different cases; the fully edged case is one of those three, while it's the most common case in actual video. The difference is much bigger if only benchmarking that particular case.)

This apparently makes the code a little slower for the w=4 cases on Cortex A8, while being a significant speedup on all other cores.
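A hedged C sketch of the dispatch (hypothetical names; the general path keeps pixels as 16 bit because missing edges are padded with a large sentinel value that doesn't fit in 8 bits):

    #include <stdint.h>
    #include <stddef.h>

    enum {
        EDGE_LEFT = 1, EDGE_TOP = 2, EDGE_RIGHT = 4, EDGE_BOTTOM = 8,
        EDGE_ALL  = EDGE_LEFT | EDGE_TOP | EDGE_RIGHT | EDGE_BOTTOM,
    };

    static void filter_8bit(uint8_t *dst, ptrdiff_t stride)  { (void)dst; (void)stride; /* ... */ }
    static void filter_16bit(uint8_t *dst, ptrdiff_t stride) { (void)dst; (void)stride; /* ... */ }

    static void cdef_filter_block(uint8_t *dst, ptrdiff_t stride, int edges) {
        if (edges == EDGE_ALL)
            filter_8bit(dst, stride);   /* no sentinel needed: twice the
                                         * pixels per vector */
        else
            filter_16bit(dst, stride);  /* general path with padding */
    }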
2020-02-13  arm: cdef: Remove leftover unused labels and macro parameters  (Martin Storsjö)
These were missed in 361a3c8ee2d03f87f42a76213ee0f93e49fa9ec3.
2020-02-11  arm64: looprestoration: NEON implementation of SGR for 10 bpc  (Martin Storsjö)

This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can be int16_t for 10 bpc, but need to be int32_t for 12 bpc.

Make actual templates out of the functions in looprestoration_tmpl.S, and add box3/5_h to looprestoration16.S.

Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter (which is passed even in 8bpc mode), add a define to bitdepth.h for passing such a parameter in all modes. This makes this function a few instructions slower in 8bpc mode than it was before (overall impact seems to be around 1% of the total runtime of SGR), but allows using the same actual function instantiation for all modes, saving a bit of code size.

Examples of checkasm runtimes:

                              Cortex A53       A72       A73
selfguided_3x3_10bpc_neon:      516755.8  389412.7  349058.7
selfguided_5x5_10bpc_neon:      380699.9  293486.6  254591.6
selfguided_mix_10bpc_neon:      878142.3  667495.9  587844.6

Corresponding 8 bpc numbers for comparison:

selfguided_3x3_8bpc_neon:       491058.1  361473.4  347705.9
selfguided_5x5_8bpc_neon:       352655.0  266423.7  248192.2
selfguided_mix_8bpc_neon:       826094.1  612372.2  581943.1
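A hedged sketch of what such a bitdepth.h define could look like (the macro name is hypothetical; only dav1d_sgr_calc_abX_neon comes from the text above):

    /* In 8 bpc mode the value is a constant, in 16 bpc mode it comes from
     * the stream, but either way the argument is always passed, so both
     * builds can share one concrete function instantiation. */
    #if BITDEPTH == 8
    #define BITDEPTH_MAX_ARG , 0xff
    #else
    #define BITDEPTH_MAX_ARG , bitdepth_max
    #endif

    /* hypothetical call site:
     * dav1d_sgr_calc_ab1_neon(a, b, w, h, strength BITDEPTH_MAX_ARG); */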
2020-02-11  arm64: looprestoration: Prepare for 16 bpc by splitting code to separate files  (Martin Storsjö)

looprestoration_common.S contains functions that can be used as is, with one single instantiation of the functions for both 8 and 16 bpc. This file will be built once, regardless of which bitdepths are enabled.

looprestoration_tmpl.S contains functions where the source can be shared and templated between 8 and 16 bpc. This will be included by the separate 8/16 bpc implementation files.
2020-02-11  arm: looprestoration: Add 8bpc to existing function names, add HIGHBD_*_SUFFIX  (Martin Storsjö)
Don't add it to dav1d_sgr_calc_ab1/2_neon and box3/5_v, as the same concrete function implementations can be shared for both 8 and 16 bpc for those functions.
2020-02-11  arm: looprestoration: Improve scheduling in box3/5_h slightly  (Martin Storsjö)
Set flags further from the branch instructions that use them.
2020-02-11  arm: Use int16_t for the tmp intermediate buffer  (Martin Storsjö)
For 8bpc and 10bpc, int16_t is enough here, and for 12bpc, other intermediate int16_t buffers also need to be made of size coef anyway.
2020-02-11  arm: looprestoration: Fix a comment  (Martin Storsjö)
2020-02-11  arm64: mc: Reduce the width of a register copy  (Martin Storsjö)
Only copy as much as really is needed/used.
2020-02-11  arm64: mc: Use two regs for alternating output rows for w4/8 in avg/w_avg/mask  (Martin Storsjö)

It was already done this way for w32/64. Not doing it for w16 as it didn't help there (and instead gave a small slowdown due to the two setup instructions). This gives a small speedup on in-order cores like A53.

Before:               Cortex A53     A72     A73
avg_w4_8bpc_neon:           60.9    25.6    29.0
avg_w8_8bpc_neon:          143.6    52.8    64.0
After:
avg_w4_8bpc_neon:           56.7    26.7    28.5
avg_w8_8bpc_neon:          137.2    54.5    64.4
2020-02-11  arm64: mc: Simplify avg/w_avg/mask by always using the w16 macro  (Martin Storsjö)

This shortens the source by 40 lines, and gives a significant speedup on A53, a small speedup on A72 and a very minor slowdown for avg/w_avg on A73.

Before:               Cortex A53     A72     A73
avg_w4_8bpc_neon:           67.4    26.1    25.4
avg_w8_8bpc_neon:          158.7    56.3    59.1
avg_w16_8bpc_neon:         382.9   154.1   160.7
w_avg_w4_8bpc_neon:         99.9    43.6    39.4
w_avg_w8_8bpc_neon:        253.2    98.3    99.0
w_avg_w16_8bpc_neon:       543.1   285.0   301.8
mask_w4_8bpc_neon:         110.6    51.4    45.1
mask_w8_8bpc_neon:         295.0   129.9   114.0
mask_w16_8bpc_neon:        654.6   365.8   369.7
After:
avg_w4_8bpc_neon:           60.8    26.3    29.0
avg_w8_8bpc_neon:          142.8    52.9    64.1
avg_w16_8bpc_neon:         378.2   153.4   160.8
w_avg_w4_8bpc_neon:         78.7    41.0    40.9
w_avg_w8_8bpc_neon:        190.6    90.1   105.1
w_avg_w16_8bpc_neon:       531.1   279.3   301.4
mask_w4_8bpc_neon:          86.6    47.2    44.9
mask_w8_8bpc_neon:         222.0   114.3   114.9
mask_w16_8bpc_neon:        639.5   356.0   369.8
2020-02-08  arm64: mc: NEON implementation of warp for 16 bpc  (Martin Storsjö)

Checkasm benchmark numbers:

                         Cortex A53     A72     A73
warp_8x8_16bpc_neon:         2029.9  1150.5  1225.2
warp_8x8t_16bpc_neon:        2007.6  1129.0  1192.3

Corresponding numbers for 8bpc for comparison:

warp_8x8_8bpc_neon:          1863.8  1052.8  1106.2
warp_8x8t_8bpc_neon:         1847.4  1048.3  1099.8
2020-02-07  arm64: cdef: Add NEON implementations of CDEF for 16 bpc  (Martin Storsjö)

As some functions are made for both 8bpc and 16bpc from a shared template, those functions are moved to a separate assembly file which is included. That assembly file (cdef_tmpl.S) isn't intended to be assembled on its own (just like utils.S), but if it is assembled, it should produce an empty object file.

Checkasm benchmarks:

                              Cortex A53     A72     A73
cdef_dir_16bpc_neon:               422.7   305.5   314.0
cdef_filter_4x4_16bpc_neon:        452.9   282.7   296.6
cdef_filter_4x8_16bpc_neon:        800.9   515.3   534.1
cdef_filter_8x8_16bpc_neon:       1417.1   922.7   942.6

Corresponding numbers for 8bpc for comparison:

cdef_dir_8bpc_neon:                394.7   268.8   281.8
cdef_filter_4x4_8bpc_neon:         461.5   300.9   307.7
cdef_filter_4x8_8bpc_neon:         831.6   546.1   555.6
cdef_filter_8x8_8bpc_neon:        1454.6   934.0   960.0
2020-02-07  arm: cdef: Prepare for 16bpc  (Martin Storsjö)
2020-02-06  arm64: looprestoration: NEON implementation of wiener filter for 16 bpc  (Martin Storsjö)

Checkasm benchmarks:

                              Cortex A53       A72       A73
wiener_chroma_16bpc_neon:       190288.4  129369.5  127284.1
wiener_luma_16bpc_neon:         195108.4  129387.8  127042.7

The corresponding numbers for 8 bpc for comparison:

wiener_chroma_8bpc_neon:        150586.9  101647.1   97709.9
wiener_luma_8bpc_neon:          146297.4  101593.2   97670.5
2020-02-06  arm: looprestoration: Prepare for 16bpc wiener filter by adding _8bpc to function names  (Martin Storsjö)
2020-02-06  arm64: mc: NEON implementation of put/prep 8tap/bilin for 16 bpc  (Martin Storsjö)

Examples of checkasm benchmarks:

                                   Cortex A53     A72     A73
mc_8tap_regular_w8_0_16bpc_neon:         96.8    49.6    62.5
mc_8tap_regular_w8_h_16bpc_neon:        570.3   388.0   467.2
mc_8tap_regular_w8_hv_16bpc_neon:      1035.8   776.7   891.3
mc_8tap_regular_w8_v_16bpc_neon:        400.6   285.0   278.3
mc_bilinear_w8_0_16bpc_neon:             90.0    44.8    57.8
mc_bilinear_w8_h_16bpc_neon:            191.2   158.7   156.4
mc_bilinear_w8_hv_16bpc_neon:           295.9   234.6   244.9
mc_bilinear_w8_v_16bpc_neon:            147.2    98.7    89.2
mct_8tap_regular_w8_0_16bpc_neon:       139.4    78.7    84.9
mct_8tap_regular_w8_h_16bpc_neon:       612.5   396.8   479.1
mct_8tap_regular_w8_hv_16bpc_neon:     1112.4   814.6   963.2
mct_8tap_regular_w8_v_16bpc_neon:       461.8   370.8   353.4
mct_bilinear_w8_0_16bpc_neon:           135.6    76.2    80.5
mct_bilinear_w8_h_16bpc_neon:           211.3   159.4   141.7
mct_bilinear_w8_hv_16bpc_neon:          325.7   237.2   227.2
mct_bilinear_w8_v_16bpc_neon:           180.7   135.9   129.5

For comparison, the corresponding numbers for 8 bpc:

mc_8tap_regular_w8_0_8bpc_neon:          78.6    41.0    39.5
mc_8tap_regular_w8_h_8bpc_neon:         371.2   299.6   348.3
mc_8tap_regular_w8_hv_8bpc_neon:        817.1   675.0   726.5
mc_8tap_regular_w8_v_8bpc_neon:         243.7   260.4   253.0
mc_bilinear_w8_0_8bpc_neon:              74.8    35.4    36.1
mc_bilinear_w8_h_8bpc_neon:             179.9    69.9    79.2
mc_bilinear_w8_hv_8bpc_neon:            210.8   132.4   144.8
mc_bilinear_w8_v_8bpc_neon:             141.6    64.9    65.4
mct_8tap_regular_w8_0_8bpc_neon:        101.7    54.4    59.5
mct_8tap_regular_w8_h_8bpc_neon:        391.3   329.1   358.3
mct_8tap_regular_w8_hv_8bpc_neon:       880.4   754.9   829.4
mct_8tap_regular_w8_v_8bpc_neon:        270.8   300.8   277.4
mct_bilinear_w8_0_8bpc_neon:             97.6    54.0    55.4
mct_bilinear_w8_h_8bpc_neon:            173.3    73.5    79.5
mct_bilinear_w8_hv_8bpc_neon:           228.3   163.0   174.0
mct_bilinear_w8_v_8bpc_neon:            128.9    72.5    63.3
2020-02-04  arm64: mc: NEON implementation of avg/mask/w_avg for 16 bpc  (Martin Storsjö)

                         Cortex A53     A72     A73
avg_w4_16bpc_neon:             78.2    43.2    48.9
avg_w8_16bpc_neon:            199.1   108.7   123.1
avg_w16_16bpc_neon:           615.6   339.9   373.9
avg_w32_16bpc_neon:          2313.0  1390.6  1490.6
avg_w64_16bpc_neon:          5783.6  3119.5  3653.0
avg_w128_16bpc_neon:        15444.6  8168.7  8907.9
w_avg_w4_16bpc_neon:          120.1    87.8    92.4
w_avg_w8_16bpc_neon:          321.6   252.4   263.1
w_avg_w16_16bpc_neon:        1017.5   794.5   831.2
w_avg_w32_16bpc_neon:        3911.4  3154.7  3306.5
w_avg_w64_16bpc_neon:        9977.9  7794.9  8022.3
w_avg_w128_16bpc_neon:      25721.5 19274.6 20041.7
mask_w4_16bpc_neon:           139.5    96.5   104.3
mask_w8_16bpc_neon:           376.0   283.9   300.1
mask_w16_16bpc_neon:         1217.2   906.7   950.0
mask_w32_16bpc_neon:         4811.1  3669.0  3901.3
mask_w64_16bpc_neon:        12036.4  8918.4  9244.8
mask_w128_16bpc_neon:       30888.8 21999.0 23206.7

For comparison, these are the corresponding numbers for 8bpc:

avg_w4_8bpc_neon:              56.7    26.2    28.5
avg_w8_8bpc_neon:             137.2    52.8    64.3
avg_w16_8bpc_neon:            377.9   151.5   161.6
avg_w32_8bpc_neon:           1528.9   614.5   633.9
avg_w64_8bpc_neon:           3792.5  1814.3  1518.3
avg_w128_8bpc_neon:         10685.3  5220.4  3879.9
w_avg_w4_8bpc_neon:            75.2    53.0    41.1
w_avg_w8_8bpc_neon:           186.7   120.1   105.2
w_avg_w16_8bpc_neon:          531.6   314.1   302.1
w_avg_w32_8bpc_neon:         2138.4  1120.4  1171.5
w_avg_w64_8bpc_neon:         5151.9  2910.5  2857.1
w_avg_w128_8bpc_neon:       13945.0  7330.5  7389.1
mask_w4_8bpc_neon:             82.0    47.2    45.1
mask_w8_8bpc_neon:            213.5   115.4   115.8
mask_w16_8bpc_neon:           639.8   356.2   370.1
mask_w32_8bpc_neon:          2566.9  1489.8  1435.0
mask_w64_8bpc_neon:          6727.6  3822.8  3425.2
mask_w128_8bpc_neon:        17893.0  9622.6  9161.3
2020-02-01  Rework the CDEF top edge handling  (Henrik Gramner)
Avoids some pointer chasing and simplifies the DSP code, at the cost of making the initialization a little bit more complicated. Also reduces memory usage by a small amount due to properly sizing the buffers instead of always allocating enough space for 4:4:4.
2020-01-29  arm: cdef: Add special cased versions for pri_strength/sec_strength being zero  (Martin Storsjö)

Before:
ARM32:                     Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:     964.6   599.5   707.9   601.2   465.1   405.2
cdef_filter_4x8_8bpc_neon:    1726.0  1066.2  1238.7  1041.7   798.6   725.3
cdef_filter_8x8_8bpc_neon:    2974.4  1671.8  1943.9  1806.1  1229.8  1242.1
ARM64:
cdef_filter_4x4_8bpc_neon:                             569.2   337.8   348.7
cdef_filter_4x8_8bpc_neon:                            1031.1   623.3   633.6
cdef_filter_8x8_8bpc_neon:                            1847.5  1097.7  1117.5
After:
ARM32:
cdef_filter_4x4_8bpc_neon:     798.4   524.2   617.3   506.8   432.4   361.1
cdef_filter_4x8_8bpc_neon:    1394.7   910.4  1054.0   863.6   730.2   632.2
cdef_filter_8x8_8bpc_neon:    2364.6  1453.8  1675.1  1466.0  1086.4  1107.7
ARM64:
cdef_filter_4x4_8bpc_neon:                             461.7   303.1   308.6
cdef_filter_4x8_8bpc_neon:                             833.0   547.5   556.0
cdef_filter_8x8_8bpc_neon:                            1459.3   934.1   967.9