|
Some specific Haswell CPUs have a hardware bug where the popcnt
instruction doesn't set the zero flag correctly, which causes the wrong
branch to be taken.
popcnt also has a 3-cycle latency on Intel CPUs, so branching
on the input value instead of the output reduces the amount of time
wasted going down the wrong code path in case of branch mispredictions.
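As a rough illustration only (this is not the assembly from the change), the
idea can be sketched in C. The function names are hypothetical, and
__builtin_popcount is the GCC/Clang builtin, used here as a stand-in for the
popcnt instruction. A population count is zero exactly when the input is
zero, so both variants take the same branch:

```c
#include <stdint.h>

/* Hypothetical example: branch on the popcount result. The branch has to
 * wait for the count, and on the affected CPUs it depends on flags set
 * by popcnt itself. */
static int count_or_empty_on_output(uint32_t mask) {
    const int n = __builtin_popcount(mask);
    if (n == 0)
        return -1; /* "empty" case */
    return n;
}

/* Hypothetical example: branch on the input instead. popcount(mask) == 0
 * holds if and only if mask == 0, so the result is identical, but the
 * branch no longer depends on the popcount output. */
static int count_or_empty_on_input(uint32_t mask) {
    if (mask == 0)
        return -1; /* "empty" case */
    return __builtin_popcount(mask);
}
```

Whether a compiler emits the same instruction sequence for both is not
guaranteed; the sketch only shows that the two conditions are equivalent.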
|
|
Only use this in the cases when NEON can be used unconditionally
without runtime detection (when __ARM_NEON is defined).
The speedup over the C code is very modest for the smaller functions
(and the NEON version actually is a little slower than the C code
on Cortex A7 for adapt4), but the speedup is around 2x for
adapt16.
                          Cortex      A7      A8      A9     A53     A72     A73
msac_decode_bool_c:                 41.1    43.0    43.0    37.3    26.2    31.3
msac_decode_bool_neon:              40.2    42.0    37.2    32.8    19.9    25.5
msac_decode_bool_adapt_c:           65.1    70.4    58.5    54.3    33.2    40.8
msac_decode_bool_adapt_neon:        56.8    52.4    49.3    42.6    27.1    33.7
msac_decode_bool_equi_c:            36.9    37.2    42.8    32.6    22.7    42.3
msac_decode_bool_equi_neon:         34.9    35.1    36.4    29.7    19.5    36.4
msac_decode_symbol_adapt4_c:       114.2   139.0   111.6    99.9    65.5    83.5
msac_decode_symbol_adapt4_neon:    119.2   128.3    95.7    82.2    58.2    57.5
msac_decode_symbol_adapt8_c:       176.0   207.9   164.0   154.4    88.0   117.0
msac_decode_symbol_adapt8_neon:    128.3   130.3   110.7    85.1    59.9    61.4
msac_decode_symbol_adapt16_c:      292.1   320.5   256.4   246.4   129.1   173.3
msac_decode_symbol_adapt16_neon:   162.2   144.3   129.0   104.2    69.2    69.9
(Omitting msac_decode_hi_tok from the benchmark, as the "C" version
measured there uses the NEON version of msac_decode_symbol_adapt4.)
|
|
The speedup (over the normal version, that just calls the existing
assembly version of symbol_adapt4) is not very impressive on
bigger cores, but looks decent on small cores. It's an improvement
though, in any case.
                  Cortex     A53     A72     A73
msac_decode_hi_tok_c:      175.7   136.2   138.1
msac_decode_hi_tok_neon:   146.8   129.4   125.9
|
|
|
|
Include the letter prefix when calling the macro, making it
slightly less obscure.
|
|
By multiplying the performance counter value (within its own
time base) by the intended target time base, and only then dividing,
we reduce the available numeric range by the factor of the
original time base times the new time base.
On Windows 10 on ARM64, the performance counter frequency is
19200000 (on x86_64 in a virtual machine, it's 10000000), making
the calculation overflow every (1 << 64) / (19200000 * 1000000000)
= 960 seconds, i.e. 16 minutes - long before the actual uint64_t
nanosecond return value wraps around.
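One way to avoid the overflow, shown here as a minimal sketch and not
necessarily the exact code of this change, is to convert whole seconds and
the sub-second remainder separately. The function names are hypothetical
and the counter frequency is passed in as a plain parameter:

```c
#include <assert.h>
#include <stdint.h>

/* Naive conversion: multiply first, then divide. The product
 * ticks * 1000000000 wraps around once ticks exceeds 2^64 / 10^9,
 * which at 19200000 Hz happens after roughly 960 seconds. */
static uint64_t ticks_to_ns_naive(uint64_t ticks, uint64_t freq) {
    assert(freq != 0);
    return ticks * UINT64_C(1000000000) / freq;
}

/* Split conversion: whole seconds and the remaining fraction are scaled
 * separately. The fraction is always below freq, so fraction * 10^9
 * stays in range for any realistic counter frequency. */
static uint64_t ticks_to_ns_split(uint64_t ticks, uint64_t freq) {
    assert(freq != 0);
    const uint64_t seconds  = ticks / freq;
    const uint64_t fraction = ticks % freq;
    return seconds * UINT64_C(1000000000) +
           fraction * UINT64_C(1000000000) / freq;
}
```

With a frequency of 19200000 and a counter value corresponding to 2000
seconds, the naive variant has already wrapped, while the split variant
returns 2000000000000 ns.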
|
|
Even if we don't want to throttle decoding to realtime, and
even if the file itself didn't contain a valid fps value, we
may want to call the synchronize function to fetch the current
elapsed decoding time, for displaying the fps value.
|
|
|
|
Since bilinear scaled is very rarely used, add a new table entry to
mc_subpel_filters and jump to the put/prep_8tap_scaled code.
AVX2 performance is obviously the same as for the 8tap code. The speedup is
much smaller, though, as the C code is a true bilinear codepath,
auto-vectorized. Still, the AVX2 performance is always better.
|
|
mct_scaled_8tap_regular_w4_8bpc_c: 872.1
mct_scaled_8tap_regular_w4_8bpc_avx2: 125.6
mct_scaled_8tap_regular_w4_dy1_8bpc_c: 886.3
mct_scaled_8tap_regular_w4_dy1_8bpc_avx2: 84.0
mct_scaled_8tap_regular_w4_dy2_8bpc_c: 1189.1
mct_scaled_8tap_regular_w4_dy2_8bpc_avx2: 84.7
mct_scaled_8tap_regular_w8_8bpc_c: 2261.0
mct_scaled_8tap_regular_w8_8bpc_avx2: 306.2
mct_scaled_8tap_regular_w8_dy1_8bpc_c: 2189.9
mct_scaled_8tap_regular_w8_dy1_8bpc_avx2: 233.8
mct_scaled_8tap_regular_w8_dy2_8bpc_c: 3060.3
mct_scaled_8tap_regular_w8_dy2_8bpc_avx2: 282.8
mct_scaled_8tap_regular_w16_8bpc_c: 4335.3
mct_scaled_8tap_regular_w16_8bpc_avx2: 680.7
mct_scaled_8tap_regular_w16_dy1_8bpc_c: 5137.2
mct_scaled_8tap_regular_w16_dy1_8bpc_avx2: 578.6
mct_scaled_8tap_regular_w16_dy2_8bpc_c: 7878.4
mct_scaled_8tap_regular_w16_dy2_8bpc_avx2: 774.6
mct_scaled_8tap_regular_w32_8bpc_c: 17871.9
mct_scaled_8tap_regular_w32_8bpc_avx2: 2954.8
mct_scaled_8tap_regular_w32_dy1_8bpc_c: 18594.7
mct_scaled_8tap_regular_w32_dy1_8bpc_avx2: 2073.9
mct_scaled_8tap_regular_w32_dy2_8bpc_c: 28696.0
mct_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2852.1
mct_scaled_8tap_regular_w64_8bpc_c: 46967.5
mct_scaled_8tap_regular_w64_8bpc_avx2: 7527.5
mct_scaled_8tap_regular_w64_dy1_8bpc_c: 45564.2
mct_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5262.9
mct_scaled_8tap_regular_w64_dy2_8bpc_c: 72793.3
mct_scaled_8tap_regular_w64_dy2_8bpc_avx2: 7535.9
mct_scaled_8tap_regular_w128_8bpc_c: 111190.8
mct_scaled_8tap_regular_w128_8bpc_avx2: 19386.8
mct_scaled_8tap_regular_w128_dy1_8bpc_c: 122625.0
mct_scaled_8tap_regular_w128_dy1_8bpc_avx2: 15376.1
mct_scaled_8tap_regular_w128_dy2_8bpc_c: 197120.6
mct_scaled_8tap_regular_w128_dy2_8bpc_avx2: 21871.0
|
|
|
|
Clamping in the motion vector projection calculation is required by the spec.
In commit aca57bf3db00c29e90605656f1015561d1d67c2d
a rewrite of the function omitted the clamping. This commit re-adds the
clamping.
|
|
Add an .error case for Windows if subtracting more than 8 KB, and simplify
the generic subtraction case.
|
|
The transforms process vectors of up to 8 elements at a time, for
transforms up to size 8; for larger transforms, they use vectors of
4 elements.
Overall, the speedup over C code seems to be around 8-14x for the
larger transforms, and 10-19x for the smaller ones.
Relative speedup over C code (built with GCC 7.5) for a few functions:
                                 Cortex      A7      A8      A9     A53     A72     A73
inv_txfm_add_4x4_dct_dct_0_8bpc_neon:      3.83    3.42    2.57    3.36    2.97    7.47
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:      7.25   13.53    8.38    8.82    7.96   12.37
inv_txfm_add_8x8_dct_dct_0_8bpc_neon:      4.78    6.61    4.82    4.65    5.27    9.76
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:     10.20   19.07   13.07   14.69   11.45   15.50
inv_txfm_add_16x16_dct_dct_0_8bpc_neon:    4.26    5.06    3.00    3.74    4.05    4.49
inv_txfm_add_16x16_dct_dct_1_8bpc_neon:   10.51   16.02   13.57   14.03   12.86   18.16
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:    7.95   11.75    9.09   10.64   10.06   14.07
inv_txfm_add_32x32_dct_dct_0_8bpc_neon:    5.31    5.58    3.14    4.18    4.80    4.57
inv_txfm_add_32x32_dct_dct_1_8bpc_neon:   12.66   16.07   14.34   16.00   15.24   21.32
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    8.25   10.69    8.90   10.59   10.41   14.39
inv_txfm_add_64x64_dct_dct_0_8bpc_neon:    4.69    5.97    3.17    3.96    4.57    4.34
inv_txfm_add_64x64_dct_dct_1_8bpc_neon:   11.47   12.68   10.18   14.73   14.20   17.95
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:    8.84   10.13    7.94   11.25   10.58   13.88
|
|
|
|
The struct is already zero-initialized when the function is called
except for the checkasm test, so move the zeroing there instead.
|
|
meson.source_root() returns the root of a parent project if dav1d is
embedded as a subproject.
|
|
---------------------
x86_64:
------------------------------------------
mct_8tap_regular_w4_h_8bpc_c: 302.3
mct_8tap_regular_w4_h_8bpc_sse2: 47.3
mct_8tap_regular_w4_h_8bpc_ssse3: 19.5
---------------------
mct_8tap_regular_w8_h_8bpc_c: 745.5
mct_8tap_regular_w8_h_8bpc_sse2: 235.2
mct_8tap_regular_w8_h_8bpc_ssse3: 70.4
---------------------
mct_8tap_regular_w16_h_8bpc_c: 1844.3
mct_8tap_regular_w16_h_8bpc_sse2: 755.6
mct_8tap_regular_w16_h_8bpc_ssse3: 225.9
---------------------
mct_8tap_regular_w32_h_8bpc_c: 6685.5
mct_8tap_regular_w32_h_8bpc_sse2: 2954.4
mct_8tap_regular_w32_h_8bpc_ssse3: 795.8
---------------------
mct_8tap_regular_w64_h_8bpc_c: 15633.5
mct_8tap_regular_w64_h_8bpc_sse2: 7120.4
mct_8tap_regular_w64_h_8bpc_ssse3: 1900.4
---------------------
mct_8tap_regular_w128_h_8bpc_c: 37772.1
mct_8tap_regular_w128_h_8bpc_sse2: 17698.1
mct_8tap_regular_w128_h_8bpc_ssse3: 4665.5
------------------------------------------
mct_8tap_regular_w4_v_8bpc_c: 306.5
mct_8tap_regular_w4_v_8bpc_sse2: 71.7
mct_8tap_regular_w4_v_8bpc_ssse3: 37.9
---------------------
mct_8tap_regular_w8_v_8bpc_c: 923.3
mct_8tap_regular_w8_v_8bpc_sse2: 168.7
mct_8tap_regular_w8_v_8bpc_ssse3: 71.3
---------------------
mct_8tap_regular_w16_v_8bpc_c: 3040.1
mct_8tap_regular_w16_v_8bpc_sse2: 505.1
mct_8tap_regular_w16_v_8bpc_ssse3: 199.7
---------------------
mct_8tap_regular_w32_v_8bpc_c: 12354.8
mct_8tap_regular_w32_v_8bpc_sse2: 1942.0
mct_8tap_regular_w32_v_8bpc_ssse3: 714.2
---------------------
mct_8tap_regular_w64_v_8bpc_c: 29427.9
mct_8tap_regular_w64_v_8bpc_sse2: 4637.4
mct_8tap_regular_w64_v_8bpc_ssse3: 1829.2
---------------------
mct_8tap_regular_w128_v_8bpc_c: 72756.9
mct_8tap_regular_w128_v_8bpc_sse2: 11301.0
mct_8tap_regular_w128_v_8bpc_ssse3: 5020.6
------------------------------------------
mct_8tap_regular_w4_hv_8bpc_c: 876.9
mct_8tap_regular_w4_hv_8bpc_sse2: 171.7
mct_8tap_regular_w4_hv_8bpc_ssse3: 112.2
---------------------
mct_8tap_regular_w8_hv_8bpc_c: 2215.1
mct_8tap_regular_w8_hv_8bpc_sse2: 730.2
mct_8tap_regular_w8_hv_8bpc_ssse3: 330.9
---------------------
mct_8tap_regular_w16_hv_8bpc_c: 6075.5
mct_8tap_regular_w16_hv_8bpc_sse2: 2252.1
mct_8tap_regular_w16_hv_8bpc_ssse3: 973.4
---------------------
mct_8tap_regular_w32_hv_8bpc_c: 22182.7
mct_8tap_regular_w32_hv_8bpc_sse2: 7692.6
mct_8tap_regular_w32_hv_8bpc_ssse3: 3599.8
---------------------
mct_8tap_regular_w64_hv_8bpc_c: 50876.8
mct_8tap_regular_w64_hv_8bpc_sse2: 18499.6
mct_8tap_regular_w64_hv_8bpc_ssse3: 8815.6
---------------------
mct_8tap_regular_w128_hv_8bpc_c: 122926.3
mct_8tap_regular_w128_hv_8bpc_sse2: 45120.0
mct_8tap_regular_w128_hv_8bpc_ssse3: 22085.7
------------------------------------------
|
|
---------------------
x86_64:
------------------------------------------
mct_bilinear_w4_h_8bpc_c: 98.9
mct_bilinear_w4_h_8bpc_sse2: 30.2
mct_bilinear_w4_h_8bpc_ssse3: 11.5
---------------------
mct_bilinear_w8_h_8bpc_c: 175.3
mct_bilinear_w8_h_8bpc_sse2: 57.0
mct_bilinear_w8_h_8bpc_ssse3: 19.7
---------------------
mct_bilinear_w16_h_8bpc_c: 396.2
mct_bilinear_w16_h_8bpc_sse2: 179.3
mct_bilinear_w16_h_8bpc_ssse3: 50.9
---------------------
mct_bilinear_w32_h_8bpc_c: 1311.2
mct_bilinear_w32_h_8bpc_sse2: 718.8
mct_bilinear_w32_h_8bpc_ssse3: 243.9
---------------------
mct_bilinear_w64_h_8bpc_c: 2892.7
mct_bilinear_w64_h_8bpc_sse2: 1746.0
mct_bilinear_w64_h_8bpc_ssse3: 568.0
---------------------
mct_bilinear_w128_h_8bpc_c: 7192.6
mct_bilinear_w128_h_8bpc_sse2: 4339.8
mct_bilinear_w128_h_8bpc_ssse3: 1619.2
------------------------------------------
mct_bilinear_w4_v_8bpc_c: 129.7
mct_bilinear_w4_v_8bpc_sse2: 26.6
mct_bilinear_w4_v_8bpc_ssse3: 16.7
---------------------
mct_bilinear_w8_v_8bpc_c: 233.3
mct_bilinear_w8_v_8bpc_sse2: 55.0
mct_bilinear_w8_v_8bpc_ssse3: 24.7
---------------------
mct_bilinear_w16_v_8bpc_c: 498.9
mct_bilinear_w16_v_8bpc_sse2: 146.0
mct_bilinear_w16_v_8bpc_ssse3: 54.2
---------------------
mct_bilinear_w32_v_8bpc_c: 1562.2
mct_bilinear_w32_v_8bpc_sse2: 560.6
mct_bilinear_w32_v_8bpc_ssse3: 201.0
---------------------
mct_bilinear_w64_v_8bpc_c: 3221.3
mct_bilinear_w64_v_8bpc_sse2: 1380.6
mct_bilinear_w64_v_8bpc_ssse3: 499.3
---------------------
mct_bilinear_w128_v_8bpc_c: 7357.7
mct_bilinear_w128_v_8bpc_sse2: 3439.0
mct_bilinear_w128_v_8bpc_ssse3: 1489.1
------------------------------------------
mct_bilinear_w4_hv_8bpc_c: 185.0
mct_bilinear_w4_hv_8bpc_sse2: 54.5
mct_bilinear_w4_hv_8bpc_ssse3: 22.1
---------------------
mct_bilinear_w8_hv_8bpc_c: 377.8
mct_bilinear_w8_hv_8bpc_sse2: 104.3
mct_bilinear_w8_hv_8bpc_ssse3: 35.8
---------------------
mct_bilinear_w16_hv_8bpc_c: 1159.4
mct_bilinear_w16_hv_8bpc_sse2: 311.0
mct_bilinear_w16_hv_8bpc_ssse3: 106.3
---------------------
mct_bilinear_w32_hv_8bpc_c: 4436.2
mct_bilinear_w32_hv_8bpc_sse2: 1230.7
mct_bilinear_w32_hv_8bpc_ssse3: 400.7
---------------------
mct_bilinear_w64_hv_8bpc_c: 10627.7
mct_bilinear_w64_hv_8bpc_sse2: 2934.2
mct_bilinear_w64_hv_8bpc_ssse3: 957.2
---------------------
mct_bilinear_w128_hv_8bpc_c: 26048.9
mct_bilinear_w128_hv_8bpc_sse2: 7590.3
mct_bilinear_w128_hv_8bpc_ssse3: 2947.0
------------------------------------------
|
|
This allows skipping half of the first transforms if the input
coefficients lie within the upper 4x4 (but checkasm only tests in
increments of 8x8 at the moment).
With checkasm modified to test in smaller increments, the speedup
is like this:
Before:                          Cortex     A53     A72     A73
inv_txfm_add_16x8_dct_dct_1_10bpc_neon:   874.4   709.0   707.3
After:
inv_txfm_add_16x8_dct_dct_1_10bpc_neon:   618.0   479.5   472.9
|
|
|
|
|
|
|
|
Blacklisted some files not directly relevant to the codebase (such as
tests, tools and debugging functions).
The coverage HTML report gets attached as a build artifact, although
unfortunately we can't link directly to the `index.html`. We also attach
the coverage XML as a cobertura report, although I'm not sure if it does
anything.
|
|
|
|
This is currently not used in dav1d (yet), but there's a need for
it in rav1e, which shares this header with dav1d.
|
|
mc_scaled_8tap_regular_w2_8bpc_c: 764.4
mc_scaled_8tap_regular_w2_8bpc_avx2: 191.3
mc_scaled_8tap_regular_w2_dy1_8bpc_c: 705.8
mc_scaled_8tap_regular_w2_dy1_8bpc_avx2: 89.5
mc_scaled_8tap_regular_w2_dy2_8bpc_c: 964.0
mc_scaled_8tap_regular_w2_dy2_8bpc_avx2: 120.3
mc_scaled_8tap_regular_w4_8bpc_c: 1355.7
mc_scaled_8tap_regular_w4_8bpc_avx2: 180.9
mc_scaled_8tap_regular_w4_dy1_8bpc_c: 1233.2
mc_scaled_8tap_regular_w4_dy1_8bpc_avx2: 115.3
mc_scaled_8tap_regular_w4_dy2_8bpc_c: 1707.6
mc_scaled_8tap_regular_w4_dy2_8bpc_avx2: 117.9
mc_scaled_8tap_regular_w8_8bpc_c: 2483.2
mc_scaled_8tap_regular_w8_8bpc_avx2: 294.8
mc_scaled_8tap_regular_w8_dy1_8bpc_c: 2166.4
mc_scaled_8tap_regular_w8_dy1_8bpc_avx2: 222.0
mc_scaled_8tap_regular_w8_dy2_8bpc_c: 3133.7
mc_scaled_8tap_regular_w8_dy2_8bpc_avx2: 292.6
mc_scaled_8tap_regular_w16_8bpc_c: 5239.2
mc_scaled_8tap_regular_w16_8bpc_avx2: 729.9
mc_scaled_8tap_regular_w16_dy1_8bpc_c: 5156.5
mc_scaled_8tap_regular_w16_dy1_8bpc_avx2: 602.2
mc_scaled_8tap_regular_w16_dy2_8bpc_c: 8018.4
mc_scaled_8tap_regular_w16_dy2_8bpc_avx2: 783.1
mc_scaled_8tap_regular_w32_8bpc_c: 14745.0
mc_scaled_8tap_regular_w32_8bpc_avx2: 2205.0
mc_scaled_8tap_regular_w32_dy1_8bpc_c: 14862.3
mc_scaled_8tap_regular_w32_dy1_8bpc_avx2: 1721.3
mc_scaled_8tap_regular_w32_dy2_8bpc_c: 23607.6
mc_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2325.7
mc_scaled_8tap_regular_w64_8bpc_c: 54891.7
mc_scaled_8tap_regular_w64_8bpc_avx2: 8351.4
mc_scaled_8tap_regular_w64_dy1_8bpc_c: 50249.0
mc_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5864.4
mc_scaled_8tap_regular_w64_dy2_8bpc_c: 79400.1
mc_scaled_8tap_regular_w64_dy2_8bpc_avx2: 8295.7
mc_scaled_8tap_regular_w128_8bpc_c: 121046.8
mc_scaled_8tap_regular_w128_8bpc_avx2: 21809.1
mc_scaled_8tap_regular_w128_dy1_8bpc_c: 133720.4
mc_scaled_8tap_regular_w128_dy1_8bpc_avx2: 16197.8
mc_scaled_8tap_regular_w128_dy2_8bpc_c: 218774.8
mc_scaled_8tap_regular_w128_dy2_8bpc_avx2: 22993.1
|
|
posix_memalign is defined as a built-in in gcc in msys2 but it's not available
when linking with the Universal C Runtime. _aligned_malloc is available in the
UCRT.
That should only affect builds targeting Windows since _aligned_malloc is a MS
thing.
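A minimal sketch of such a platform switch follows. The wrapper names are
hypothetical, and keying the choice on _WIN32 is an assumption of this
sketch, not a statement about the exact condition used in the change.
Memory from _aligned_malloc must be released with _aligned_free, so the
free path needs the same switch:

```c
#define _POSIX_C_SOURCE 200112L /* for posix_memalign on non-Windows targets */
#include <stddef.h>
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h> /* _aligned_malloc, _aligned_free */
#endif

/* Hypothetical wrapper: returns size bytes aligned to align, or NULL.
 * align must be a power of two (and, for posix_memalign, a multiple of
 * sizeof(void *)). */
static void *example_alloc_aligned(size_t size, size_t align) {
#ifdef _WIN32
    return _aligned_malloc(size, align);
#else
    void *ptr = NULL;
    if (posix_memalign(&ptr, align, size))
        return NULL;
    return ptr;
#endif
}

/* Hypothetical wrapper: releases memory from example_alloc_aligned(). */
static void example_free_aligned(void *ptr) {
#ifdef _WIN32
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
```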
|
|
Eliminate store forwarding stalls.
Use shorter instruction encodings where possible.
Misc. tweaks.
|
|
This one correctly sets the subsampling mode based on whether or not the
plane is actually subsampled, and also infers PL_CHROMA_UNKNOWN as
PL_CHROMA_TOP_LEFT in such cases.
|
|
libplacebo v66 got helper functions that make preserving the aspect
ratio in this case trivial. But we still need to make sure to clear the
FBO to black if the image doesn't cover it fully.
|
|
Returning out of this function when pl_render_image() fails is the wrong
thing to do, since that leaves the swapchain frame acquired but never
submitted. Instead, just clear the target FBO to blank red (to make it
clear that something went wrong) and continue on with presentation.
|
|
|
|
Annoying minor differences in this struct layout mean we can't just
memcpy the entire thing. Oh well.
Note: technically, PL_API_VER 33 added this API, but PL_API_VER 63 is
the minimum version of libplacebo that doesn't have glaring bugs when
generating chroma grain, so we require that as a minimum instead.
(I tested this version on some 4:2:2 and 4:2:0, 8-bit and 10-bit grain
samples I had lying around and made sure the output was identical up to
differences in rounding / dithering.)
|
|
Generalize the code to set the right pl_image metadata based on the
values signaled in the Dav1dPictureParameters / Dav1dSequenceHeader.
Some values are not mapped, in which case stdout will be spammed.
Whatever. Hopefully somebody sees that error spam and opens a bug report
for libplacebo to implement it.
|
|
Having the pl_image generation live in upload_planes() rather than
render() will make it easier to set the correct pl_image metadata based
on the Dav1dPicture headers moving forwards. Rename the function to make
more sense, semantically.
Reduce some code duplication by turning per-plane fields into arrays
wherever appropriate.
As an aside, also apply the correct chroma location rather than
hard-coding it as PL_CHROMA_LEFT.
|
|
This is turned into a const array in upstream libplacebo, which
generates warnings due to the implicit cast. Rewrite the code to have
the mutable array live inside a separate variable `extensions` and only
set `iparams.extensions` to this, rather than directly manipulating it.
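The pattern can be sketched roughly as follows. The struct below is a
hypothetical stand-in with a const array member, not libplacebo's actual
parameter struct, and the function name is made up for illustration:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a params struct whose array member is const. */
struct example_inst_params {
    const char *const *extensions;
    int num_extensions;
};

/* Fill a separate mutable array, then point the const member at it.
 * The returned pointer stays writable (and owned by the caller), so
 * nothing ever has to write through iparams->extensions. */
static const char **example_set_extensions(struct example_inst_params *iparams,
                                           const char *const *wanted, int num)
{
    if (num <= 0)
        return NULL;
    const char **extensions = malloc((size_t)num * sizeof(*extensions));
    if (!extensions)
        return NULL;
    for (int i = 0; i < num; i++)
        extensions[i] = wanted[i];
    iparams->extensions = extensions; /* adds const, no cast needed */
    iparams->num_extensions = num;
    return extensions;
}
```

Assigning the mutable array to the const member only adds a qualifier, so
it compiles without a cast or a warning, whereas writing through the const
member would not.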
|
|
Signed-off-by: Marvin Scholz <epirat07@gmail.com>
|
|
|
|
Add code to check that a function doesn't accidentally overwrite
anything in the area located just above the current stack frame.
|
|
|
|
This allows selecting at runtime if placebo should use OpenGL
or Vulkan for rendering.
|