Welcome to mirror list, hosted at ThFree Co, Russian Federation.

github.com/FFmpeg/FFmpeg.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-10-25lavc/aarch64: add hevc horizontal qpel/uni/biJ. Dekker
checkasm --benchmark on Ampere Altra (Neoverse N1): put_hevc_qpel_bi_h4_8_c: 170.7 put_hevc_qpel_bi_h4_8_neon: 64.5 put_hevc_qpel_bi_h6_8_c: 373.7 put_hevc_qpel_bi_h6_8_neon: 130.2 put_hevc_qpel_bi_h8_8_c: 662.0 put_hevc_qpel_bi_h8_8_neon: 138.5 put_hevc_qpel_bi_h12_8_c: 1529.5 put_hevc_qpel_bi_h12_8_neon: 422.0 put_hevc_qpel_bi_h16_8_c: 2735.5 put_hevc_qpel_bi_h16_8_neon: 560.5 put_hevc_qpel_bi_h24_8_c: 6015.7 put_hevc_qpel_bi_h24_8_neon: 1636.0 put_hevc_qpel_bi_h32_8_c: 10779.0 put_hevc_qpel_bi_h32_8_neon: 2204.5 put_hevc_qpel_bi_h48_8_c: 24375.0 put_hevc_qpel_bi_h48_8_neon: 4984.0 put_hevc_qpel_bi_h64_8_c: 42768.0 put_hevc_qpel_bi_h64_8_neon: 8795.7 put_hevc_qpel_h4_8_c: 149.0 put_hevc_qpel_h4_8_neon: 55.7 put_hevc_qpel_h6_8_c: 321.2 put_hevc_qpel_h6_8_neon: 106.0 put_hevc_qpel_h8_8_c: 578.7 put_hevc_qpel_h8_8_neon: 133.2 put_hevc_qpel_h12_8_c: 1279.0 put_hevc_qpel_h12_8_neon: 391.7 put_hevc_qpel_h16_8_c: 2286.2 put_hevc_qpel_h16_8_neon: 519.7 put_hevc_qpel_h24_8_c: 5100.7 put_hevc_qpel_h24_8_neon: 1546.2 put_hevc_qpel_h32_8_c: 9022.0 put_hevc_qpel_h32_8_neon: 2060.2 put_hevc_qpel_h48_8_c: 20293.5 put_hevc_qpel_h48_8_neon: 4656.7 put_hevc_qpel_h64_8_c: 36037.0 put_hevc_qpel_h64_8_neon: 8262.7 put_hevc_qpel_uni_h4_8_c: 162.2 put_hevc_qpel_uni_h4_8_neon: 61.7 put_hevc_qpel_uni_h6_8_c: 355.2 put_hevc_qpel_uni_h6_8_neon: 114.2 put_hevc_qpel_uni_h8_8_c: 651.0 put_hevc_qpel_uni_h8_8_neon: 135.7 put_hevc_qpel_uni_h12_8_c: 1412.5 put_hevc_qpel_uni_h12_8_neon: 402.7 put_hevc_qpel_uni_h16_8_c: 2551.0 put_hevc_qpel_uni_h16_8_neon: 533.5 put_hevc_qpel_uni_h24_8_c: 5782.2 put_hevc_qpel_uni_h24_8_neon: 1578.7 put_hevc_qpel_uni_h32_8_c: 10586.5 put_hevc_qpel_uni_h32_8_neon: 2102.2 put_hevc_qpel_uni_h48_8_c: 23812.0 put_hevc_qpel_uni_h48_8_neon: 4739.5 put_hevc_qpel_uni_h64_8_c: 42958.7 put_hevc_qpel_uni_h64_8_neon: 8366.5 Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-06-28lavc/aarch64: motion estimation functions in neonSwinney, Jonathan
- ff_pix_abs16_neon - ff_pix_abs16_xy2_neon In direct micro benchmarks of these ff functions verses their C implementations, these functions performed as follows on AWS Graviton 3. ff_pix_abs16_neon: pix_abs_0_0_c: 141.1 pix_abs_0_0_neon: 19.6 ff_pix_abs16_xy2_neon: pix_abs_0_3_c: 269.1 pix_abs_0_3_neon: 39.3 Tested with: ./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast pathsBen Avison
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. idctdsp.add_pixels_clamped_c: 313.3 idctdsp.add_pixels_clamped_neon: 24.3 idctdsp.put_pixels_clamped_c: 220.3 idctdsp.put_pixels_clamped_neon: 15.5 idctdsp.put_signed_pixels_clamped_c: 210.5 idctdsp.put_signed_pixels_clamped_neon: 19.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-01avcodec/vc1: Arm 64-bit NEON deblocking filter fast pathsBen Avison
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C version can still outperform the NEON version in specific cases. The balance between different code paths is stream-dependent, but in practice the best case happens about 5% of the time, the worst case happens about 40% of the time, and the complexity of the remaining cases fall somewhere in between. Therefore, taking the average of the best and worst case timings is probably a conservative estimate of the degree by which the NEON code improves performance. vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7 vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5 vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5 vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7 vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2 vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2 vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2 vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2 vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0 vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7 vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7 vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5 vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7 vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0 vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7 vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0 vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2 vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7 vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0 vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2 vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0 vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0 vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
2021-02-18lavc/aarch64: add HEVC sao_band NEONJosh Dekker
Only works for 8x8. Signed-off-by: Josh Dekker <josh@itanimul.li>
2021-02-18lavc/aarch64: port HEVC SIMD idct NEONReimar Döffinger
Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth available on aarch64. For a UHD HDR (10 bit) sample video these were consuming the most time and this optimization reduced overall decode time from 19.4s to 16.4s, approximately 15% speedup. Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts", running on Apple M1. Signed-off-by: Josh Dekker <josh@itanimul.li>
2020-05-15libavcodec: aarch64: Add a NEON implementation of pixblockdspMartin Storsjö
Cortex A53 A72 A73 get_pixels_c: 140.7 87.7 72.5 get_pixels_neon: 46.0 20.0 19.5 get_pixels_unaligned_c: 140.7 87.7 73.0 get_pixels_unaligned_neon: 49.2 20.2 26.2 diff_pixels_c: 209.7 133.7 138.7 diff_pixels_neon: 54.2 31.7 23.5 diff_pixels_unaligned_c: 209.7 134.2 139.0 diff_pixels_unaligned_neon: 68.0 27.7 41.7 Signed-off-by: Martin Storsjö <martin@martin.st>
2020-03-11lavc/aarch64: Fix compilation with --disable-neonCarl Eugen Hoyos
Fixes ticket #8565.
2020-03-11lavc/aarch64: Move non-neon vp9 copy functions out of neon source file.Carl Eugen Hoyos
Fixes part of ticket #8565.
2019-04-10aarch64/opusdsp: implement NEON accelerated postfilter and deemphasisLynne
153372 UNITS in postfilter_c, 65536 runs, 0 skips 73164 UNITS in postfilter_neon, 65536 runs, 0 skips -> 2.1x speedup 80591 UNITS in deemphasis_c, 131072 runs, 0 skips 43969 UNITS in deemphasis_neon, 131072 runs, 0 skips -> 1.83x speedup Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime) Deemphasis SIMD based on the following unrolling: const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1; float state = coeff; for (int i = 0; i < len; i += 4) { y[0] = x[0] + c1*state; y[1] = x[1] + c2*state + c1*x[0]; y[2] = x[2] + c3*state + c1*x[1] + c2*x[0]; y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0]; state = y[3]; y += 4; x += 4; } Unlike the x86 version, duplication is used instead of pslldq so the structure and tables are different.
2019-02-20aarch64: vp8: Move the vp8dsp makefile entries to the right placesMartin Storsjö
Even if NEON would be disabled, the init functions should be built as they are called as long as ARCH_AARCH64 is set. These functions are part of a generic DSP subsytem, not tied directly to one decoder. (They should be built if the vp7 decoder is enabled, even if the vp8 decoder is disabled.) Signed-off-by: Martin Storsjö <martin@martin.st> (cherry picked from commit b4b27dce95a6d40bfcd78043d3abec7d80dae143)
2019-01-31libavcodec: vp8 neon optimizations for aarch64Magnus Röös
Partial port of the ARM Neon for aarch64. Benchmarks from fate: benchmarking with Linux Perf Monitoring API nop: 58.6 checkasm: using random seed 1760970128 NEON: - vp8dsp.idct [OK] - vp8dsp.mc [OK] - vp8dsp.loopfilter [OK] checkasm: all 21 tests passed vp8_idct_add_c: 201.6 vp8_idct_add_neon: 83.1 vp8_idct_dc_add_c: 107.6 vp8_idct_dc_add_neon: 33.8 vp8_idct_dc_add4y_c: 426.4 vp8_idct_dc_add4y_neon: 59.4 vp8_loop_filter8uv_h_c: 688.1 vp8_loop_filter8uv_h_neon: 216.3 vp8_loop_filter8uv_inner_h_c: 649.3 vp8_loop_filter8uv_inner_h_neon: 195.3 vp8_loop_filter8uv_inner_v_c: 544.8 vp8_loop_filter8uv_inner_v_neon: 131.3 vp8_loop_filter8uv_v_c: 706.1 vp8_loop_filter8uv_v_neon: 141.1 vp8_loop_filter16y_h_c: 668.8 vp8_loop_filter16y_h_neon: 242.8 vp8_loop_filter16y_inner_h_c: 647.3 vp8_loop_filter16y_inner_h_neon: 224.6 vp8_loop_filter16y_inner_v_c: 647.8 vp8_loop_filter16y_inner_v_neon: 128.8 vp8_loop_filter16y_v_c: 721.8 vp8_loop_filter16y_v_neon: 154.3 vp8_loop_filter_simple_h_c: 387.8 vp8_loop_filter_simple_h_neon: 187.6 vp8_loop_filter_simple_v_c: 384.1 vp8_loop_filter_simple_v_neon: 78.6 vp8_put_epel8_h4v4_c: 3971.1 vp8_put_epel8_h4v4_neon: 855.1 vp8_put_epel8_h4v6_c: 5060.1 vp8_put_epel8_h4v6_neon: 989.6 vp8_put_epel8_h6v4_c: 4320.8 vp8_put_epel8_h6v4_neon: 1007.3 vp8_put_epel8_h6v6_c: 5449.3 vp8_put_epel8_h6v6_neon: 1158.1 vp8_put_epel16_h6_c: 6683.8 vp8_put_epel16_h6_neon: 831.8 vp8_put_epel16_h6v6_c: 11110.8 vp8_put_epel16_h6v6_neon: 2214.8 vp8_put_epel16_v6_c: 7024.8 vp8_put_epel16_v6_neon: 799.6 vp8_put_pixels8_c: 112.8 vp8_put_pixels8_neon: 78.1 vp8_put_pixels16_c: 131.3 vp8_put_pixels16_neon: 129.8 Signed-off-by: Magnus Röös <mla2.roos@gmail.com>
2017-07-03lavc/aarch64: add sbrdsp neon implementationMatthieu Bouron
autocorrelate_c: 644.0 autocorrelate_neon: 420.0 hf_apply_noise_0_c: 1688.5 hf_apply_noise_0_neon: 1498.6 hf_apply_noise_1_c: 1691.2 hf_apply_noise_1_neon: 1500.6 hf_apply_noise_2_c: 1688.1 hf_apply_noise_2_neon: 1500.3 hf_apply_noise_3_c: 1696.6 hf_apply_noise_3_neon: 1502.2 hf_g_filt_c: 2117.8 hf_g_filt_neon: 1218.7 hf_gen_c: 4573.4 hf_gen_neon: 2461.0 neg_odd_64_c: 72.0 neg_odd_64_neon: 64.7 qmf_deint_bfly_c: 1107.6 qmf_deint_bfly_neon: 291.6 qmf_deint_neg_c: 210.4 qmf_deint_neg_neon: 107.4 qmf_post_shuffle_c: 163.0 qmf_post_shuffle_neon: 107.7 qmf_pre_shuffle_c: 120.5 qmf_pre_shuffle_neon: 110.7 sum64x5_c: 1361.6 sum64x5_neon: 435.4 sum_square_c: 1686.4 sum_square_neon: 787.2
2017-06-28lavc/aarch64: add a few SIMD functions for AAC PSClément Bœsch
☭ tests/checkasm/checkasm --bench --test=aacpsdsp checkasm: using random seed 3318985180 MMX implied by specified flags MMX implied by specified flags NEON: - aacpsdsp.add_squares [OK] - aacpsdsp.mul_pair_single [OK] - aacpsdsp.hybrid_analysis [OK] - aacpsdsp.stereo_interpolate [OK] checkasm: all 5 tests passed nop: 10.0 ps_add_squares_c: 63221.2 ps_add_squares_neon: 22311.7 ps_hybrid_analysis_c: 2466.6 ps_hybrid_analysis_neon: 1521.9 ps_mul_pair_single_c: 68592.0 ps_mul_pair_single_neon: 17426.6 ps_stereo_interpolate_c: 72344.3 ps_stereo_interpolate_neon: 72308.8 ps_stereo_interpolate_ipdopd_c: 117415.2 ps_stereo_interpolate_ipdopd_neon: 113386.3
2017-03-16lavc/aarch64: add ff_simple_idct{,_add,_put}_neon functionsMatthieu Bouron
2017-01-24aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filterMartin Storsjö
This work is sponsored by, and copyright, Google. This is similar to the arm version, but due to the larger registers on aarch64, we can do 8 pixels at a time for all filter sizes. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_10bpp_neon: 213.2 172.6 vp9_loop_filter_h_8_8_10bpp_neon: 281.2 244.2 vp9_loop_filter_h_16_8_10bpp_neon: 657.0 444.5 vp9_loop_filter_h_16_16_10bpp_neon: 1280.4 877.7 vp9_loop_filter_mix2_h_44_16_10bpp_neon: 397.7 358.0 vp9_loop_filter_mix2_h_48_16_10bpp_neon: 465.7 429.0 vp9_loop_filter_mix2_h_84_16_10bpp_neon: 465.7 428.0 vp9_loop_filter_mix2_h_88_16_10bpp_neon: 533.7 499.0 vp9_loop_filter_mix2_v_44_16_10bpp_neon: 271.5 244.0 vp9_loop_filter_mix2_v_48_16_10bpp_neon: 330.0 305.0 vp9_loop_filter_mix2_v_84_16_10bpp_neon: 329.0 306.0 vp9_loop_filter_mix2_v_88_16_10bpp_neon: 386.0 365.0 vp9_loop_filter_v_4_8_10bpp_neon: 150.0 115.2 vp9_loop_filter_v_8_8_10bpp_neon: 209.0 175.5 vp9_loop_filter_v_16_8_10bpp_neon: 492.7 345.2 vp9_loop_filter_v_16_16_10bpp_neon: 951.0 682.7 This is significantly faster than the ARM version in almost all cases except for the mix2 functions. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 2-3x. Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfmMartin Storsjö
This work is sponsored by, and copyright, Google. Compared to the arm version, on aarch64 we can keep the full 8x8 transform in registers, and for 16x16 and 32x32, we can process it in slices of 4 pixels instead of 2. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_sub4_add_10_neon: 111.0 109.7 vp9_inv_adst_adst_8x8_sub8_add_10_neon: 914.0 733.5 vp9_inv_adst_adst_16x16_sub16_add_10_neon: 5184.0 3745.7 vp9_inv_dct_dct_4x4_sub1_add_10_neon: 65.0 65.7 vp9_inv_dct_dct_4x4_sub4_add_10_neon: 100.0 96.7 vp9_inv_dct_dct_8x8_sub1_add_10_neon: 111.0 119.7 vp9_inv_dct_dct_8x8_sub8_add_10_neon: 618.0 494.7 vp9_inv_dct_dct_16x16_sub1_add_10_neon: 295.1 284.6 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 2303.2 1883.9 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2984.8 2189.3 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 3890.0 2799.4 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1044.4 1012.7 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 13333.7 9695.1 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 18531.3 12459.8 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 24470.7 16160.2 vp9_inv_wht_wht_4x4_sub4_add_10_neon: 83.0 79.7 The larger transforms are significantly faster than the corresponding ARM versions. The speedup vs C code is smaller than in 32 bit mode, probably because the 64 bit intermediates in the C code can be expressed more efficiently in aarch64. Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24aarch64: Add NEON optimizations for 10 and 12 bit vp9 MCMartin Storsjö
This work is sponsored by, and copyright, Google. This has mostly got the same differences to the 8 bit version as in the arm version. For the horizontal filters, we do 16 pixels in parallel as well. For the 8 pixel wide vertical filters, we can accumulate 4 rows before storing, just as in the 8 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_10bpp_neon: 35.7 30.7 vp9_avg8_10bpp_neon: 93.5 84.7 vp9_avg16_10bpp_neon: 324.4 296.6 vp9_avg32_10bpp_neon: 1236.5 1148.2 vp9_avg64_10bpp_neon: 4639.6 4571.1 vp9_avg_8tap_smooth_4h_10bpp_neon: 130.0 128.0 vp9_avg_8tap_smooth_4hv_10bpp_neon: 440.0 440.5 vp9_avg_8tap_smooth_4v_10bpp_neon: 114.0 105.5 vp9_avg_8tap_smooth_8h_10bpp_neon: 327.0 314.0 vp9_avg_8tap_smooth_8hv_10bpp_neon: 918.7 865.4 vp9_avg_8tap_smooth_8v_10bpp_neon: 330.0 300.2 vp9_avg_8tap_smooth_16h_10bpp_neon: 1187.5 1155.5 vp9_avg_8tap_smooth_16hv_10bpp_neon: 2663.1 2591.0 vp9_avg_8tap_smooth_16v_10bpp_neon: 1107.4 1078.3 vp9_avg_8tap_smooth_64h_10bpp_neon: 17754.6 17454.7 vp9_avg_8tap_smooth_64hv_10bpp_neon: 33285.2 33001.5 vp9_avg_8tap_smooth_64v_10bpp_neon: 16066.9 16048.6 vp9_put4_10bpp_neon: 25.5 21.7 vp9_put8_10bpp_neon: 56.0 52.0 vp9_put16_10bpp_neon/armv8: 183.0 163.1 vp9_put32_10bpp_neon/armv8: 678.6 563.1 vp9_put64_10bpp_neon/armv8: 2679.9 2195.8 vp9_put_8tap_smooth_4h_10bpp_neon: 120.0 118.0 vp9_put_8tap_smooth_4hv_10bpp_neon: 435.2 435.0 vp9_put_8tap_smooth_4v_10bpp_neon: 107.0 98.2 vp9_put_8tap_smooth_8h_10bpp_neon: 303.0 290.0 vp9_put_8tap_smooth_8hv_10bpp_neon: 893.7 828.7 vp9_put_8tap_smooth_8v_10bpp_neon: 305.5 263.5 vp9_put_8tap_smooth_16h_10bpp_neon: 1089.1 1059.2 vp9_put_8tap_smooth_16hv_10bpp_neon: 2578.8 2452.4 vp9_put_8tap_smooth_16v_10bpp_neon: 1009.5 933.5 vp9_put_8tap_smooth_64h_10bpp_neon: 16223.4 15918.6 vp9_put_8tap_smooth_64hv_10bpp_neon: 32153.0 31016.2 vp9_put_8tap_smooth_64v_10bpp_neon: 14516.5 13748.1 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is around 4-9x. Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-06imdct15: remove the AArch64 assemblyRostislav Pehlivanov
Prep work for the next commit, which will add a new FFT algorithm which makes the iMDCT over 3x faster than it is currently (standalone, the FFT is with some framesizes over 10x faster). The new FFT algorithm uses the already thouroughly SIMD'd power of two FFT which already has SIMD for AArch64, so users of that platform will still see an improvement. The previous FFT+SIMD was barely 2.5x faster than the C versions on these platforms. Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2016-11-15aarch64: vp9: Implement NEON loop filtersMartin Storsjö
This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the loop filters with 16 pixels at a time. The implementation is fully templated, with a single macro which can generate versions for both 8 and 16 pixels wide, for both 4, 8 and 16 pixels loop filters (and the 4/8 mixed versions as well). For the 8 pixel wide versions, it is pretty close in speed (the v_4_8 and v_8_8 filters are the best examples of this; the h_4_8 and h_8_8 filters seem to get some gain in the load/transpose/store part). For the 16 pixels wide ones, we get a speedup of around 1.2-1.4x compared to the 32 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_neon: 144.0 127.2 vp9_loop_filter_h_8_8_neon: 207.0 182.5 vp9_loop_filter_h_16_8_neon: 415.0 328.7 vp9_loop_filter_h_16_16_neon: 672.0 558.6 vp9_loop_filter_mix2_h_44_16_neon: 302.0 203.5 vp9_loop_filter_mix2_h_48_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_84_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_88_16_neon: 376.0 305.2 vp9_loop_filter_mix2_v_44_16_neon: 193.2 128.2 vp9_loop_filter_mix2_v_48_16_neon: 246.7 218.4 vp9_loop_filter_mix2_v_84_16_neon: 248.0 218.5 vp9_loop_filter_mix2_v_88_16_neon: 302.0 218.2 vp9_loop_filter_v_4_8_neon: 89.0 88.7 vp9_loop_filter_v_8_8_neon: 141.0 137.7 vp9_loop_filter_v_16_8_neon: 295.0 272.7 vp9_loop_filter_v_16_16_neon: 546.0 453.7 The speedup vs C code in checkasm tests is around 2-7x, which is pretty much the same as for the 32 bit version. Even if these functions are faster than their 32 bit equivalent, the C version that we compare to also became around 1.3-1.7x faster than the C version in 32 bit. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-5x. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon loop_filter_h_4_8_neon: 256.6 93.4 loop_filter_h_8_8_neon: 307.3 139.1 loop_filter_h_16_8_neon: 340.1 254.1 loop_filter_h_16_16_neon: 827.0 407.9 loop_filter_mix2_h_44_16_neon: 524.5 155.4 loop_filter_mix2_h_48_16_neon: 644.5 173.3 loop_filter_mix2_h_84_16_neon: 630.5 222.0 loop_filter_mix2_h_88_16_neon: 697.3 222.0 loop_filter_mix2_v_44_16_neon: 598.5 100.6 loop_filter_mix2_v_48_16_neon: 651.5 127.0 loop_filter_mix2_v_84_16_neon: 591.5 167.1 loop_filter_mix2_v_88_16_neon: 855.1 166.7 loop_filter_v_4_8_neon: 271.7 65.3 loop_filter_v_8_8_neon: 312.5 106.9 loop_filter_v_16_8_neon: 473.3 206.5 loop_filter_v_16_16_neon: 976.1 327.8 The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57 is again 30-50% faster than the cortex-a53. This is an adapted cherry-pick from libav commits 9d2afd1eb8c5cc0633062430e66326dbf98c99e0 and 31756abe29eb039a11c59a42cb12e0cc2aef3b97. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2016-11-15aarch64: vp9: Add NEON itxfm routinesMartin Storsjö
This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the 16x16 and 32x32 transforms in slices 8 pixels wide instead of 4. This gives a speedup of around 1.4x compared to the 32 bit version. The fact that aarch64 doesn't have the same d/q register aliasing makes some of the macros quite a bit simpler as well. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_add_neon: 90.0 87.7 vp9_inv_adst_adst_8x8_add_neon: 400.0 354.7 vp9_inv_adst_adst_16x16_add_neon: 2526.5 1827.2 vp9_inv_dct_dct_4x4_add_neon: 74.0 72.7 vp9_inv_dct_dct_8x8_add_neon: 271.0 256.7 vp9_inv_dct_dct_16x16_add_neon: 1960.7 1372.7 vp9_inv_dct_dct_32x32_add_neon: 11988.9 8088.3 vp9_inv_wht_wht_4x4_add_neon: 63.0 57.7 The speedup vs C code (2-4x) is smaller than in the 32 bit case, mostly because the C code ends up significantly faster (around 1.6x faster, with GCC 5.4) when built for aarch64. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon vp9_inv_adst_adst_4x4_add_neon: 152.2 60.0 vp9_inv_adst_adst_8x8_add_neon: 948.2 288.0 vp9_inv_adst_adst_16x16_add_neon: 4830.4 1380.5 vp9_inv_dct_dct_4x4_add_neon: 153.0 58.6 vp9_inv_dct_dct_8x8_add_neon: 789.2 180.2 vp9_inv_dct_dct_16x16_add_neon: 3639.6 917.1 vp9_inv_dct_dct_32x32_add_neon: 20462.1 4985.0 vp9_inv_wht_wht_4x4_add_neon: 91.0 49.8 The asm is around factor 3-4 faster than C on the cortex-a57 and the asm is around 30-50% faster on the a57 compared to the a53. This is an adapted cherry-pick from libav commit 3c9546dfafcdfe8e7860aff9ebbf609318220f29. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2016-11-15aarch64: vp9: Add NEON optimizations of VP9 MC functionsMartin Storsjö
This work is sponsored by, and copyright, Google. These are ported from the ARM version; it is essentially a 1:1 port with no extra added features, but with some hand tuning (especially for the plain copy/avg functions). The ARM version isn't very register starved to begin with, so there's not much to be gained from having more spare registers here - we only avoid having to clobber callee-saved registers. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_neon: 27.2 23.7 vp9_avg8_neon: 56.5 54.7 vp9_avg16_neon: 169.9 167.4 vp9_avg32_neon: 585.8 585.2 vp9_avg64_neon: 2460.3 2294.7 vp9_avg_8tap_smooth_4h_neon: 132.7 125.2 vp9_avg_8tap_smooth_4hv_neon: 478.8 442.0 vp9_avg_8tap_smooth_4v_neon: 126.0 93.7 vp9_avg_8tap_smooth_8h_neon: 241.7 234.2 vp9_avg_8tap_smooth_8hv_neon: 690.9 646.5 vp9_avg_8tap_smooth_8v_neon: 245.0 205.5 vp9_avg_8tap_smooth_64h_neon: 11273.2 11280.1 vp9_avg_8tap_smooth_64hv_neon: 22980.6 22184.1 vp9_avg_8tap_smooth_64v_neon: 11549.7 10781.1 vp9_put4_neon: 18.0 17.2 vp9_put8_neon: 40.2 37.7 vp9_put16_neon: 97.4 99.5 vp9_put32_neon/armv8: 346.0 307.4 vp9_put64_neon/armv8: 1319.0 1107.5 vp9_put_8tap_smooth_4h_neon: 126.7 118.2 vp9_put_8tap_smooth_4hv_neon: 465.7 434.0 vp9_put_8tap_smooth_4v_neon: 113.0 86.5 vp9_put_8tap_smooth_8h_neon: 229.7 221.6 vp9_put_8tap_smooth_8hv_neon: 658.9 621.3 vp9_put_8tap_smooth_8v_neon: 215.0 187.5 vp9_put_8tap_smooth_64h_neon: 10636.7 10627.8 vp9_put_8tap_smooth_64hv_neon: 21076.8 21026.9 vp9_put_8tap_smooth_64v_neon: 9635.0 9632.4 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is pretty much the same as for the 32 bit case; on the A53 it's around 6-13x for ther larger 8tap filters. The exact speedup varies a little, since the C versions generally don't end up exactly as slow/fast as on 32 bit. This is an adapted cherry-pick from libav commit 383d96aa2229f644d9bd77b821ed3a309da5e9fc. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2016-09-25avcodec: fix vc1dsp dependenciesXiaolei Yu
2016-05-11aarch64/synth_filter: fix compilationJames Almer
Signed-off-by: James Almer <jamrial@gmail.com>
2016-05-09Merge commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec'Derek Buitenhuis
* commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec': build: miscellaneous cosmetics Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2016-04-07build: miscellaneous cosmeticsDiego Biurrun
Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically.
2016-03-01fft: Split MDCT bits off from FFTDiego Biurrun
2016-01-31avcodec/dca: add new decoder based on libdcadecfoo86
2016-01-31avcodec/dca: remove old decoderfoo86
Remove all files and functions which are not going to be reused, and disable all functions and FATE tests temporarily which will be.
2016-01-25avcodec/synth_filter: split off remaining code from dcadec filesJames Almer
Signed-off-by: James Almer <jamrial@gmail.com>
2015-12-14arm64: int32_to_float_fmul neon asmJanne Grunau
3% faster dts decoding on a cortex-a57. cortex-a57 cortex-a53 int32_to_float_fmul_array8_c: 1270.9 4475.6 int32_to_float_fmul_array8_neon: 328.6 569.2 int32_to_float_fmul_scalar_c: 928.5 4119.6 int32_to_float_fmul_scalar_neon: 309.1 524.1
2015-12-14arm64: port synth_filter_float_neon from armJanne Grunau
~25% faster dts decoding overall. The checkasm CPU cycles numbers are not that useful since synth_filter_float() calls FFTContext.imdct_half(). cortex-a57 cortex-a53 synth_filter_float_c: 1866.2 3490.9 synth_filter_float_neon: 915.0 1531.5 With fftc.imdct_half forced to imdct_half_neon: cortex-a57 cortex-a53 synth_filter_float_c: 1718.4 3025.3 synth_filter_float_neon: 926.2 1530.1
2015-12-14arm64: convert dcadsp neon asm from armJanne Grunau
~2% faster dts decoding overall. cortex-a57 cortex-a53 dca_decode_hf_c: 474.8 1659.9 dca_decode_hf_neon: 225.2 301.1 dca_lfe_fir0_c: 913.2 1537.7 dca_lfe_fir0_neon: 286.8 451.9 dca_lfe_fir1_c: 848.7 1711.5 dca_lfe_fir1_neon: 387.1 506.4
2015-07-21h264: aarch64: intra prediction optimisationsJanne Grunau
2015-02-02opus: Factor out imdct15 into a standalone componentDiego Biurrun
It will be reused by the AAC decoder.
2014-05-15aarch64: opus NEON iMDCT and FFTJanne Grunau
Opus celt decoding 11% faster and the iMDCT over 2.5 times faster on Apple's A7.
2014-04-23aarch64: NEON vorbis_inverse_couplingJanne Grunau
From the ARMv7 NEON version. 16 times faster as the C version, overall more than 12% faster vorbis decoding on Apple's A7.
2014-04-23aarch64: NEON fixed/floating point MPADSP apply_windowJanne Grunau
30%/25% (fixed/float) faster mp3 decoding on Apple's A7. The floating point decoder is approximately 7% faster.
2014-04-22aarch64: NEON float (i)MDCTJanne Grunau
Approximately as fast as the ARM NEON version on Apple's A7.
2014-04-22aarch64: NEON float FFTJanne Grunau
Approximately as fast as the ARM NEON version on Apple's A7.
2014-04-06aarch64: implement videodsp.prefetchJanne Grunau
8% faster h264 decoding on Apple A7.
2014-03-20build: Group general components separate from de/encoders in arch MakefilesDiego Biurrun
This is in line with how the top-level libavcodec Makefile is structured.
2014-01-15aarch64: port neon clobber test from armJanne Grunau
2014-01-15aarch64: h264 loop filter NEON optimizationsJanne Grunau
Ported from ARMv7 NEON.
2014-01-15aarch64: hpeldsp NEON optimizationsJanne Grunau
Ported from ARMv7 NEON.
2014-01-15aarch64: h264 qpel NEON optimizationsJanne Grunau
Ported from ARMv7 NEON.
2014-01-15aarch64: h264 idct NEON assembler optimizationsJanne Grunau
Ported from ARMv7 NEON.
2014-01-15aarch64: h264 chroma motion compensation NEON optimizationsJanne Grunau
Since RV40 and VC-1 use almost the same algorithm so optimizations for those two decoders are easy to do and included.