Relative speedups over the C code:
Cortex A53 A72 A73
intra_pred_dc_128_w4_8bpc_neon: 2.08 1.47 2.17
intra_pred_dc_128_w8_8bpc_neon: 3.33 2.49 4.03
intra_pred_dc_128_w16_8bpc_neon: 3.93 3.86 3.75
intra_pred_dc_128_w32_8bpc_neon: 3.14 3.79 2.90
intra_pred_dc_128_w64_8bpc_neon: 3.68 1.97 2.42
intra_pred_dc_left_w4_8bpc_neon: 2.41 1.70 2.23
intra_pred_dc_left_w8_8bpc_neon: 3.53 2.41 3.32
intra_pred_dc_left_w16_8bpc_neon: 3.87 3.54 3.34
intra_pred_dc_left_w32_8bpc_neon: 4.10 3.60 2.76
intra_pred_dc_left_w64_8bpc_neon: 3.72 2.00 2.39
intra_pred_dc_top_w4_8bpc_neon: 2.27 1.66 2.07
intra_pred_dc_top_w8_8bpc_neon: 3.83 2.69 3.43
intra_pred_dc_top_w16_8bpc_neon: 3.66 3.60 3.20
intra_pred_dc_top_w32_8bpc_neon: 3.92 3.54 2.66
intra_pred_dc_top_w64_8bpc_neon: 3.60 1.98 2.30
intra_pred_dc_w4_8bpc_neon: 2.29 1.42 2.16
intra_pred_dc_w8_8bpc_neon: 3.56 2.83 3.05
intra_pred_dc_w16_8bpc_neon: 3.46 3.37 3.15
intra_pred_dc_w32_8bpc_neon: 3.79 3.41 2.74
intra_pred_dc_w64_8bpc_neon: 3.52 2.01 2.41
intra_pred_h_w4_8bpc_neon: 10.34 5.74 5.94
intra_pred_h_w8_8bpc_neon: 12.13 6.33 6.43
intra_pred_h_w16_8bpc_neon: 10.66 7.31 5.85
intra_pred_h_w32_8bpc_neon: 6.28 4.18 2.88
intra_pred_h_w64_8bpc_neon: 3.96 1.85 1.75
intra_pred_v_w4_8bpc_neon: 11.44 6.12 7.57
intra_pred_v_w8_8bpc_neon: 14.76 7.58 7.95
intra_pred_v_w16_8bpc_neon: 11.34 6.28 5.88
intra_pred_v_w32_8bpc_neon: 6.56 3.33 3.34
intra_pred_v_w64_8bpc_neon: 4.57 1.24 1.97
------------------------------------------
x86_64: warp_8x8_8bpc_c: 1773.4
x86_32: warp_8x8_8bpc_c: 1740.4
----------
x86_64: warp_8x8_8bpc_ssse3: 317.5
x86_32: warp_8x8_8bpc_ssse3: 378.4
----------
x86_64: warp_8x8_8bpc_sse4: 303.7
x86_32: warp_8x8_8bpc_sse4: 367.7
----------
x86_64: warp_8x8_8bpc_avx2: 224.9
---------------------
---------------------
x86_64: warp_8x8t_8bpc_c: 1664.6
x86_32: warp_8x8t_8bpc_c: 1674.0
----------
x86_64: warp_8x8t_8bpc_ssse3: 320.7
x86_32: warp_8x8t_8bpc_ssse3: 379.5
----------
x86_64: warp_8x8t_8bpc_sse4: 304.8
x86_32: warp_8x8t_8bpc_sse4: 369.8
----------
x86_64: warp_8x8t_8bpc_avx2: 228.5
------------------------------------------
Don't add two 16-bit coefficients in 16 bits if the result isn't supposed
to be clipped.
This fixes mismatches for some samples; see issue #299.
Before: Cortex A53 A72 A73
inv_txfm_add_4x4_dct_dct_1_8bpc_neon: 93.0 52.8 49.5
inv_txfm_add_8x8_dct_dct_1_8bpc_neon: 260.0 186.0 196.4
inv_txfm_add_16x16_dct_dct_2_8bpc_neon: 1371.0 953.4 1028.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon: 7363.2 4887.5 5135.8
inv_txfm_add_64x64_dct_dct_4_8bpc_neon: 25029.0 17492.3 18404.5
After:
inv_txfm_add_4x4_dct_dct_1_8bpc_neon: 105.0 58.7 55.2
inv_txfm_add_8x8_dct_dct_1_8bpc_neon: 294.0 211.5 209.9
inv_txfm_add_16x16_dct_dct_2_8bpc_neon: 1495.8 1050.4 1070.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon: 7866.7 5197.8 5321.4
inv_txfm_add_64x64_dct_dct_4_8bpc_neon: 25807.2 18619.3 18526.9
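The overflow the fix addresses can be shown in plain C. A minimal sketch with hypothetical helper names (not dav1d's actual functions):

```c
#include <assert.h>
#include <stdint.h>

/* Wrapping 16-bit add: what a 16-bit SIMD add (e.g. NEON "add v0.8h")
 * produces when the mathematical sum leaves the int16_t range. */
int16_t add16_wrap(int16_t a, int16_t b) {
    return (int16_t)((uint16_t)a + (uint16_t)b);
}

/* Fixed pattern: widen to 32 bits before adding when the result is
 * not immediately clipped back into range. */
int32_t add16_widen(int16_t a, int16_t b) {
    return (int32_t)a + (int32_t)b;
}
```

With inputs like 30000 + 30000 the 16-bit add wraps to -5536 while the widened add yields the correct 60000 — the kind of coefficient mismatch issue #299 reported.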
The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
Even though smull+smlal does two multiplications instead of one,
the combination seems to be better handled by actual cores.
Before: Cortex A53 A72 A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon: 356.0 279.2 278.0
inv_txfm_add_16x16_adst_adst_2_8bpc_neon: 1785.0 1329.5 1308.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon: 360.0 253.2 269.3
inv_txfm_add_16x16_adst_adst_2_8bpc_neon: 1793.1 1300.9 1254.0
(In this particular case it seems to be a minor regression on A53,
probably because the instruction ordering had to change:
smull+smlal+smull2+smlal2 overwrites the second output register sooner
than an addl+addl2 would have. In general, though, smull+smlal seems
to be equally good or better than addl+mul on A53 as well.)
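In scalar C the two strategies are algebraically identical; a hedged sketch of the patterns the assembly maps to (illustrative code, not dav1d's):

```c
#include <assert.h>
#include <stdint.h>

/* addl+mul pattern: one widening add, then one 32-bit multiply. */
int32_t mul_of_sum(int16_t a, int16_t b, int16_t c) {
    return ((int32_t)a + b) * c;
}

/* smull+smlal pattern: two widening multiplies, accumulated. */
int32_t sum_of_muls(int16_t a, int16_t b, int16_t c) {
    return (int32_t)a * c + (int32_t)b * c;
}
```

Both compute the same value for any 16-bit inputs; the commit's point is purely that the two-multiply form schedules better on the tested cores.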
Right now this just allocates a new buffer for every frame, uses it,
then discards it immediately. This is not optimal, either dav1d should
start reusing buffers internally or we need to pool them in dav1dplay.
As it stands, this is not really a performance gain. I'll have to
investigate why, but my suspicion is that seeing any gains might require
reusing buffers somewhere.
Note: thrashing buffers is not as bad as it initially seems. Not only
does libplacebo pool and reuse GPU memory and buffer state objects
internally, but this also absolves us from having to do any manual
polling to figure out when the buffer is reusable again. Creating, using
and immediately destroying buffers actually isn't as bad an approach as
it might otherwise seem.
It's entirely possible that this is only bad because of lock contention.
As said, I'll have to investigate further...
Useful to test the effects of performance changes to the
decoding/rendering loop as a whole.
Only meaningful with libplacebo. The defaults are higher quality than
SDL so it's an unfair comparison and definitely too much for slow iGPUs
at 4K res. Make the defaults fast/dumb processing only, and guard the
debanding/dithering/upscaling/etc. behind a new --highquality flag.
------------------------------------------
x86_64: lpf_h_sb_uv_w4_8bpc_c: 430.6
x86_32: lpf_h_sb_uv_w4_8bpc_c: 788.6
x86_64: lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
x86_32: lpf_h_sb_uv_w4_8bpc_ssse3: 302.4
---------------------
x86_64: lpf_h_sb_uv_w6_8bpc_c: 981.9
x86_32: lpf_h_sb_uv_w6_8bpc_c: 1579.6
x86_64: lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
x86_32: lpf_h_sb_uv_w6_8bpc_ssse3: 431.6
---------------------
x86_64: lpf_h_sb_y_w4_8bpc_c: 3001.7
x86_32: lpf_h_sb_y_w4_8bpc_c: 7021.3
x86_64: lpf_h_sb_y_w4_8bpc_ssse3: 466.3
x86_32: lpf_h_sb_y_w4_8bpc_ssse3: 564.7
---------------------
x86_64: lpf_h_sb_y_w8_8bpc_c: 4457.7
x86_32: lpf_h_sb_y_w8_8bpc_c: 3657.8
x86_64: lpf_h_sb_y_w8_8bpc_ssse3: 818.9
x86_32: lpf_h_sb_y_w8_8bpc_ssse3: 927.9
---------------------
x86_64: lpf_h_sb_y_w16_8bpc_c: 1967.9
x86_32: lpf_h_sb_y_w16_8bpc_c: 3343.5
x86_64: lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
x86_32: lpf_h_sb_y_w16_8bpc_ssse3: 1975.0
---------------------
x86_64: lpf_v_sb_uv_w4_8bpc_c: 369.4
x86_32: lpf_v_sb_uv_w4_8bpc_c: 793.6
x86_64: lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
x86_32: lpf_v_sb_uv_w4_8bpc_ssse3: 133.0
---------------------
x86_64: lpf_v_sb_uv_w6_8bpc_c: 769.6
x86_32: lpf_v_sb_uv_w6_8bpc_c: 1576.7
x86_64: lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
x86_32: lpf_v_sb_uv_w6_8bpc_ssse3: 232.2
---------------------
x86_64: lpf_v_sb_y_w4_8bpc_c: 772.4
x86_32: lpf_v_sb_y_w4_8bpc_c: 2596.5
x86_64: lpf_v_sb_y_w4_8bpc_ssse3: 179.8
x86_32: lpf_v_sb_y_w4_8bpc_ssse3: 234.7
---------------------
x86_64: lpf_v_sb_y_w8_8bpc_c: 1660.2
x86_32: lpf_v_sb_y_w8_8bpc_c: 3979.9
x86_64: lpf_v_sb_y_w8_8bpc_ssse3: 468.3
x86_32: lpf_v_sb_y_w8_8bpc_ssse3: 580.9
---------------------
x86_64: lpf_v_sb_y_w16_8bpc_c: 1889.6
x86_32: lpf_v_sb_y_w16_8bpc_c: 4728.7
x86_64: lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
x86_32: lpf_v_sb_y_w16_8bpc_ssse3: 1174.8
------------------------------------------
---------------------
x86_64:
------------------------------------------
lpf_h_sb_uv_w4_8bpc_c: 430.6
lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
lpf_h_sb_uv_w4_8bpc_avx2: 200.4
---------------------
lpf_h_sb_uv_w6_8bpc_c: 981.9
lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
lpf_h_sb_uv_w6_8bpc_avx2: 270.0
---------------------
lpf_h_sb_y_w4_8bpc_c: 3001.7
lpf_h_sb_y_w4_8bpc_ssse3: 466.3
lpf_h_sb_y_w4_8bpc_avx2: 383.1
---------------------
lpf_h_sb_y_w8_8bpc_c: 4457.7
lpf_h_sb_y_w8_8bpc_ssse3: 818.9
lpf_h_sb_y_w8_8bpc_avx2: 537.0
---------------------
lpf_h_sb_y_w16_8bpc_c: 1967.9
lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
lpf_h_sb_y_w16_8bpc_avx2: 1078.2
---------------------
lpf_v_sb_uv_w4_8bpc_c: 369.4
lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
lpf_v_sb_uv_w4_8bpc_avx2: 58.1
---------------------
lpf_v_sb_uv_w6_8bpc_c: 769.6
lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
lpf_v_sb_uv_w6_8bpc_avx2: 117.8
---------------------
lpf_v_sb_y_w4_8bpc_c: 772.4
lpf_v_sb_y_w4_8bpc_ssse3: 179.8
lpf_v_sb_y_w4_8bpc_avx2: 173.6
---------------------
lpf_v_sb_y_w8_8bpc_c: 1660.2
lpf_v_sb_y_w8_8bpc_ssse3: 468.3
lpf_v_sb_y_w8_8bpc_avx2: 345.8
---------------------
lpf_v_sb_y_w16_8bpc_c: 1889.6
lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
lpf_v_sb_y_w16_8bpc_avx2: 568.1
------------------------------------------
fguv_32x32xn_8bpc_420_csfl0_c: 8945.4
fguv_32x32xn_8bpc_420_csfl0_avx2: 1001.6
fguv_32x32xn_8bpc_420_csfl1_c: 6363.4
fguv_32x32xn_8bpc_420_csfl1_avx2: 1299.5
This would affect the output in samples with an odd width and horizontal
chroma subsampling. The check does not exist in libaom, so keeping it
might cause mismatches.
This causes issues in the sample from #210, which uses super-resolution
and has odd width. To work around this, make super-resolution's resize()
always write an even number of pixels. This should not interfere with
SIMD in the future.
fgy_32x32xn_8bpc_c: 16181.8
fgy_32x32xn_8bpc_avx2: 3231.4
gen_grain_y_ar0_8bpc_c: 108857.6
gen_grain_y_ar0_8bpc_avx2: 22826.7
gen_grain_y_ar1_8bpc_c: 168239.8
gen_grain_y_ar1_8bpc_avx2: 72117.2
gen_grain_y_ar2_8bpc_c: 266165.9
gen_grain_y_ar2_8bpc_avx2: 126281.8
gen_grain_y_ar3_8bpc_c: 448139.4
gen_grain_y_ar3_8bpc_avx2: 137047.1
Both values can be independently coded in the bitstream, and are not
always equal to frame_width and frame_height.
For some reason the MSVC CRT _wassert() function is not flagged as
__declspec(noreturn), so when using those headers the compiler will
expect execution to continue after an assertion has been triggered
and will therefore complain about the use of uninitialized variables
when compiled in debug mode in certain code paths.
Reorder some case statements as a workaround.
For w <= 32 we can't process more than two rows per loop iteration.
Credit to OSS-Fuzz.
16-bit precision is sufficient for the second pass, but the first pass
requires 32-bit precision to correctly handle some esoteric edge cases.
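A rough illustration of why a first pass can exceed 16 bits: with 8-bit pixels and an 8-tap filter whose taps sum to 128, the worst-case intermediate already overflows int16_t. The taps below are made up for the example, not the actual AV1 coefficients.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 8-tap filter with 7-bit gain (taps sum to 128). */
const int taps[8] = { -2, 10, -20, 76, 76, -20, 10, -2 };

/* Worst-case first-pass accumulator for 8-bit input: put 255 under
 * every positive tap and 0 under every negative one. */
int32_t pass1_worst_case(void) {
    int32_t m = 0;
    for (int i = 0; i < 8; i++)
        if (taps[i] > 0)
            m += taps[i] * 255;
    return m;  /* exceeds INT16_MAX (32767) */
}
```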
Avoid too-narrow clipping.
See issue #295; this fixes it for arm64.
Before: Cortex A53 A72 A73
inv_txfm_add_4x4_adst_adst_1_8bpc_neon: 103.0 63.2 65.2
inv_txfm_add_4x8_adst_adst_1_8bpc_neon: 197.0 145.0 134.2
inv_txfm_add_8x8_adst_adst_1_8bpc_neon: 332.0 248.0 247.1
inv_txfm_add_16x16_adst_adst_2_8bpc_neon: 1676.8 1197.0 1186.8
After:
inv_txfm_add_4x4_adst_adst_1_8bpc_neon: 103.0 76.4 67.0
inv_txfm_add_4x8_adst_adst_1_8bpc_neon: 205.0 155.0 143.8
inv_txfm_add_8x8_adst_adst_1_8bpc_neon: 358.0 269.0 276.2
inv_txfm_add_16x16_adst_adst_2_8bpc_neon: 1785.2 1347.8 1312.1
This would probably only be needed for adst in the first pass, but
the additional code complexity from splitting the implementations
(as we currently don't have transforms differentiated between first
and second pass) isn't necessarily worth it (the speedup over C code
is still 8-10x).
__assume() doesn't work correctly in clang-cl versions prior to 7.0.0
which causes bogus warnings regarding use of uninitialized variables
to be printed. Avoid that by using __builtin_unreachable() instead.
clang-cl doesn't like function calls in __assume statements, even
trivial inline ones.
This large constant needs a movw instruction, which newer binutils can
figure out, but older versions need it stated explicitly.
This fixes #296.
The chroma part of pal_idx potentially conflicts during intra
reconstruction with edge_{8,16}bpc. Fixes out-of-range pixel values
caused by invalid palette indices in
clusterfuzz-testcase-minimized-dav1d_fuzzer_mt-5076736684851200.
Fixes #294. Reported as integer overflows in boxsum5sqr with undefined
behavior sanitizer. Credit to OSS-Fuzz.
Fixes libaom/dav1d mismatch in av1-1-b10-23-film_grain-50.ivf.
- calculate chroma grain based on src (not dst) luma pixels;
- division should precede multiplication in delta calculation.
Together, these fix differences in film grain reconstruction between
libaom and dav1d for various generated samples.
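The second point matters because integer division truncates, so the order of operations changes the result. A tiny hedged illustration with made-up numbers, not the actual delta formula:

```c
#include <assert.h>

/* Divide first, as the fix requires for the delta calculation... */
int div_then_mul(int a, int d, int m) { return (a / d) * m; }

/* ...versus multiplying first, which truncates differently. */
int mul_then_div(int a, int d, int m) { return (a * m) / d; }
```

For example, with a = 7, d = 2, m = 3 the two orderings give 9 and 10 respectively; either alone is self-consistent, but only one matches libaom.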
Use the so-far-unused lr register instead of r10.
Otherwise the table can get out of sync when the frame size and tile
count stay the same, but the tile coordinates change. Fixes #266.
Fixes integer overflows with very large frame sizes.
Credit to OSS-Fuzz.
Eliminates some sign extensions.
When compiling in release mode, instead of just deleting assertions,
use them to give hints to the compiler. This allows for slightly
better code generation in some cases.
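A sketch of the idea, assuming a GCC/Clang- or MSVC-style toolchain (hypothetical macro name, not dav1d's actual one):

```c
#include <assert.h>

/* In debug builds, check the condition; in release builds, promise it
 * to the optimizer instead of discarding it entirely. */
#ifdef NDEBUG
#  ifdef _MSC_VER
#    define ASSUME(cond) __assume(cond)
#  else
#    define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#  endif
#else
#  define ASSUME(cond) assert(cond)
#endif

/* Example: promising that n is non-negative lets the compiler lower
 * n % 4 to a simple bitwise mask. */
int mod4(int n) {
    ASSUME(n >= 0);
    return n % 4;
}
```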
A73 A53
w_mask_420_w4_8bpc_c: 818 1082.9
w_mask_420_w4_8bpc_neon: 79 126.6
w_mask_420_w8_8bpc_c: 2486 3399.8
w_mask_420_w8_8bpc_neon: 200.2 343.7
w_mask_420_w16_8bpc_c: 8022.3 10989.6
w_mask_420_w16_8bpc_neon: 528.1 889
w_mask_420_w32_8bpc_c: 31851.8 42808.6
w_mask_420_w32_8bpc_neon: 2062.5 3380.8
w_mask_420_w64_8bpc_c: 79268.5 102683.9
w_mask_420_w64_8bpc_neon: 5252.9 8575.4
w_mask_420_w128_8bpc_c: 193704.1 255586.5
w_mask_420_w128_8bpc_neon: 14602.3 22167.7
w_mask_422_w4_8bpc_c: 777.3 1038.5
w_mask_422_w4_8bpc_neon: 72.1 112.9
w_mask_422_w8_8bpc_c: 2405.7 3168
w_mask_422_w8_8bpc_neon: 191.9 314.1
w_mask_422_w16_8bpc_c: 7783.7 10543.9
w_mask_422_w16_8bpc_neon: 559.8 835.5
w_mask_422_w32_8bpc_c: 30895.7 41141.2
w_mask_422_w32_8bpc_neon: 2089.7 3187.2
w_mask_422_w64_8bpc_c: 75500.2 98766.3
w_mask_422_w64_8bpc_neon: 5379 8208.2
w_mask_422_w128_8bpc_c: 186967.1 245809.1
w_mask_422_w128_8bpc_neon: 15159.9 21474.5
w_mask_444_w4_8bpc_c: 850.1 1136.6
w_mask_444_w4_8bpc_neon: 66.5 104.7
w_mask_444_w8_8bpc_c: 2373.5 3262.9
w_mask_444_w8_8bpc_neon: 180.5 290.2
w_mask_444_w16_8bpc_c: 7291.6 10590.7
w_mask_444_w16_8bpc_neon: 550.9 809.7
w_mask_444_w32_8bpc_c: 8048.3 10140.8
w_mask_444_w32_8bpc_neon: 2136.2 3095
w_mask_444_w64_8bpc_c: 18055.3 23060
w_mask_444_w64_8bpc_neon: 5522.5 8124.8
w_mask_444_w128_8bpc_c: 42754.3 56072
w_mask_444_w128_8bpc_neon: 15569.5 21531.5
A73 A53
blend_h_w2_8bpc_c: 184.7 301.5
blend_h_w2_8bpc_neon: 58.8 104.1
blend_h_w4_8bpc_c: 291.4 507.3
blend_h_w4_8bpc_neon: 48.7 108.9
blend_h_w8_8bpc_c: 510.1 992.7
blend_h_w8_8bpc_neon: 66.5 99.3
blend_h_w16_8bpc_c: 972 1835.3
blend_h_w16_8bpc_neon: 82.7 145.2
blend_h_w32_8bpc_c: 776.7 912.9
blend_h_w32_8bpc_neon: 155.1 266.9
blend_h_w64_8bpc_c: 1424.3 1635.4
blend_h_w64_8bpc_neon: 273.4 480.9
blend_h_w128_8bpc_c: 3318.1 3774
blend_h_w128_8bpc_neon: 614.1 1097.9
blend_v_w2_8bpc_c: 278.8 427.5
blend_v_w2_8bpc_neon: 113.7 170.4
blend_v_w4_8bpc_c: 960.2 1597.7
blend_v_w4_8bpc_neon: 222.9 351.4
blend_v_w8_8bpc_c: 1694.2 3333.5
blend_v_w8_8bpc_neon: 200.9 333.6
blend_v_w16_8bpc_c: 3115.2 5971.6
blend_v_w16_8bpc_neon: 233.2 494.8
blend_v_w32_8bpc_c: 3949.7 6070.6
blend_v_w32_8bpc_neon: 460.4 841.6
blend_w4_8bpc_c: 244.2 388.3
blend_w4_8bpc_neon: 25.5 66.7
blend_w8_8bpc_c: 616.3 1120.8
blend_w8_8bpc_neon: 46 110.7
blend_w16_8bpc_c: 2193.1 4056.4
blend_w16_8bpc_neon: 140.7 299.3
blend_w32_8bpc_c: 2502.8 2998.5
blend_w32_8bpc_neon: 381.4 725.3
This particular sequence is executed often enough to justify having
a separate slightly more optimized code path instead of just chaining
multiple generic symbol decoding function calls together.
* Eliminate the trailing zero after the CDF probabilities. We can
reuse the count value as a terminator instead. This reduces the
size of the CDF context by around 8%.
* Align the CDF arrays.
* Various other minor optimizations.
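A hypothetical sketch of the first point (not dav1d's actual structs — dav1d's trick reuses the count value itself as the terminator, while this version bounds the scan explicitly): the slot that used to hold a trailing 0 sentinel now holds the adaptation count, saving one element per CDF.

```c
#include <assert.h>
#include <stdint.h>

/* 4-symbol CDF: three probability entries, plus the adaptation count
 * stored where the 0 terminator used to live. */
const uint16_t cdf_example[4] = { 24576, 16384, 8192, 42 /* count */ };

/* Walk the CDF until it drops to/below f; bounded by the symbol count
 * instead of scanning for a sentinel value. */
int decode_symbol(const uint16_t *cdf, int nsymbs, unsigned f) {
    int s = 0;
    while (s < nsymbs - 1 && cdf[s] > f)
        s++;
    return s;
}
```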
dav1dplay shouldn't be built by default; it's an example more than a tool.