|
Returning out of this function when pl_render_image() fails is the wrong
thing to do, since that leaves the swapchain frame acquired but never
submitted. Instead, just clear the target FBO to blank red (to make it
clear that something went wrong) and continue on with presentation.
|
|
|
|
Annoying minor differences in this struct layout mean we can't just
memcpy the entire thing. Oh well.
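A minimal sketch of the field-by-field copy this implies — the struct names and fields here are hypothetical stand-ins, not the real Dav1dFilmGrainData / libplacebo layouts, but they show why differing field order rules out a single memcpy:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins: the two structs carry the same data but in a
 * different order, so memcpy'ing one over the other would scramble it. */
struct src_grain { int num_y_points; uint8_t y_points[14][2]; int scaling_shift; };
struct dst_grain { int scaling_shift; int num_y_points; uint8_t y_points[14][2]; };

static void copy_grain(struct dst_grain *dst, const struct src_grain *src) {
    /* Copy each field explicitly instead of memcpy'ing the whole struct */
    dst->scaling_shift = src->scaling_shift;
    dst->num_y_points  = src->num_y_points;
    memcpy(dst->y_points, src->y_points, sizeof(dst->y_points));
}
```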
Note: technically, PL_API_VER 33 added this API, but PL_API_VER 63 is
the minimum version of libplacebo that doesn't have glaring bugs when
generating chroma grain, so we require that as a minimum instead.
(I tested this version on some 4:2:2 and 4:2:0, 8-bit and 10-bit grain
samples I had lying around and made sure the output was identical up to
differences in rounding / dithering.)
|
|
Generalize the code to set the right pl_image metadata based on the
values signaled in the Dav1dPictureParameters / Dav1dSequenceHeader.
Some values are not mapped, in which case stdout will be spammed.
Whatever. Hopefully somebody sees that error spam and opens a bug report
for libplacebo to implement it.
|
|
Having the pl_image generation live in upload_planes() rather than
render() will make it easier to set the correct pl_image metadata based
on the Dav1dPicture headers moving forwards. Rename the function to make
more sense, semantically.
Reduce some code duplication by turning per-plane fields into arrays
wherever appropriate.
As an aside, also apply the correct chroma location rather than
hard-coding it as PL_CHROMA_LEFT.
|
|
This is turned into a const array in upstream libplacebo, which
generates warnings due to the implicit cast. Rewrite the code to have
the mutable array live inside a separate variable `extensions` and only
set `iparams.extensions` to this, rather than directly manipulating it.
|
|
Signed-off-by: Marvin Scholz <epirat07@gmail.com>
|
|
|
|
Add code to check that a function doesn't accidentally overwrite
anything in the area located just above the current stack frame.
|
|
|
|
This allows selecting at runtime whether libplacebo should use OpenGL
or Vulkan for rendering.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
To un-clutter the main dav1dplay.c, move the fifo to its own
file and header.
|
|
If the maximum number of arguments (currently 15) is changed into
an even number, and a function actually takes the full number of
arguments, we would have the situation where the checked spot on
the stack is at the same place as we store an inverted copy of it.
We already allocate enough space for two values though (for stack
alignment purposes, 16 bytes on arm64 and 8 bytes on arm32) so by
storing the reference in the upper half of this, the lower half of
it works as canary and isn't overwritten.
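The layout described above can be illustrated in C (this is a conceptual sketch, not the actual checkasm assembly; the struct and function names are invented for illustration). One 16-byte, stack-aligned slot holds both values: the upper half keeps the inverted reference copy, and the lower half — the part a callee overrunning its frame would hit first — serves as the canary:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the guard slot: 16 bytes, split into a canary
 * in the lower half and an inverted reference copy in the upper half. */
struct guard_slot {
    uint64_t canary;    /* lower half: clobbered first by a stack overrun */
    uint64_t reference; /* upper half: stored inverted, used for the check */
};

static int frame_was_clobbered(const struct guard_slot *g, uint64_t magic) {
    /* the reference is stored inverted, so an intact slot has canary ==
     * magic and reference == ~magic; anything else means a clobber */
    return g->canary != magic || g->reference != ~magic;
}
```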
|
|
checking for stack clobbering
|
|
checking for stack clobbering
|
|
Use 'unsigned' instead of 'unsigned int' for consistency.
Add 'const' to a few variables.
Make proper use of C99 features.
|
|
Also skip the AVX warmup.
|
|
If a function returns a float value, that value is stored in this
register.
|
|
We should just use a normal bl here, and the linker will add the 'x'
bit if necessary.
This fixes calling the checkasm_fail_func on windows, where the
code is built in thumb mode (and the linker doesn't clear the 'x'
bit in the blx instruction).
|
|
|
|
|
|
* The build from 'build-debian' is reused. 'logging' is not disabled
since that would trigger an almost full rebuild.
* All ASM tests are merged into one job which is expected to
  fail only seldom, so ease of debugging is traded for
  efficiency.
|
|
|
|
|
|
When benchmarking, the functions are called with a fixed width
of 64x32 or 32x16, while the test itself is run with a random size
in the range up to 128x32.
In 16 bpc mode, the source pixels must be within the valid range,
because they otherwise cause accesses out of bounds in the scaling
array.
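A small sketch of the bounds issue (the constants and function here are illustrative, not dav1d's actual film grain code): in high bit-depth mode the source pixel indexes a scaling table sized for the nominal bit depth, so a 16-bit sample outside [0, 2^bitdepth) would read past the end of the table unless the test generates (or clamps to) in-range values:

```c
#include <assert.h>
#include <stdint.h>

enum { BITDEPTH = 10, SCALING_SIZE = 1 << BITDEPTH };

/* Hypothetical scaling table, sized for the nominal bit depth */
static uint8_t scaling[SCALING_SIZE];

/* Clamp a 16-bit sample so it stays a valid index into the table */
static int safe_index(uint16_t px) {
    return px < SCALING_SIZE ? px : SCALING_SIZE - 1;
}
```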
|
|
Also avoid integer overflows by using 64-bit intermediate precision.
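The widening this refers to follows the usual pattern (a generic sketch, not the exact code): cast one operand to a 64-bit type before the multiply, so the product is computed at 64-bit precision instead of overflowing at 32 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 32-bit values and shift the result, using a 64-bit
 * intermediate so the product cannot overflow before the shift. */
static int32_t mul_shift(int32_t a, int32_t b, int shift) {
    return (int32_t)(((int64_t)a * b) >> shift);
}
```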
|
|
Allows for macro-op fusion.
|
|
Eliminates the x86-64 check from the meson configuration file to be
consistent with how other x86-64-exclusive code is handled.
|
|
Allows for constant propagation and tail call elimination in the
msac initialization, which is performed in each tile.
|
|
Utilize the unsigned representation of a signed integer to skip
the refill code if the count was already negative to begin with,
which saves a few clock cycles at the end of each tile.
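The underlying trick is the classic one (shown here in generic form; the actual msac refill condition differs): reinterpreting a signed value as unsigned makes any negative input compare as a huge number, so a single unsigned comparison covers both the "negative" and the range check in one branch:

```c
#include <assert.h>

/* Equivalent to (x >= 0 && x < n) in a single comparison: a negative x,
 * viewed as unsigned, wraps to a value larger than any reasonable n. */
static int in_range(int x, int n) {
    return (unsigned)x < (unsigned)n;
}
```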
|
|
Add an element size specifier to the existing individual transform
functions for 8 bpc, naming them e.g. inv_dct_8h_x8_neon, to clarify
that they operate on input vectors of 8h, and make the symbols
public, to let the 10 bpc case call them from a different object file.
The same convention is used in the new itx16.S, like inv_dct_4s_x8_neon.
Compile the existing itx.S regardless of whether 8 bpc support
is enabled. For builds with 8 bpc support disabled, this does pull in
the unused frontend functions, but that is hopefully tolerable, as it
avoids having to split the file into a sharable file for transforms
and a separate one for frontends.
This only implements the 10 bpc case, as that case can use transforms
operating on 16 bit coefficients in the second pass.
Relative speedup vs C for a few functions:
Cortex A53 A72 A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon: 4.14 4.06 4.49
inv_txfm_add_4x4_dct_dct_1_10bpc_neon: 6.51 6.49 6.42
inv_txfm_add_8x8_dct_dct_0_10bpc_neon: 5.02 4.63 6.23
inv_txfm_add_8x8_dct_dct_1_10bpc_neon: 8.54 7.13 11.96
inv_txfm_add_16x16_dct_dct_0_10bpc_neon: 5.52 6.60 8.03
inv_txfm_add_16x16_dct_dct_1_10bpc_neon: 11.27 9.62 12.22
inv_txfm_add_16x16_dct_dct_2_10bpc_neon: 9.60 6.97 8.59
inv_txfm_add_32x32_dct_dct_0_10bpc_neon: 2.60 3.48 3.19
inv_txfm_add_32x32_dct_dct_1_10bpc_neon: 14.65 12.64 16.86
inv_txfm_add_32x32_dct_dct_2_10bpc_neon: 11.57 8.80 12.68
inv_txfm_add_32x32_dct_dct_3_10bpc_neon: 8.79 8.00 9.21
inv_txfm_add_32x32_dct_dct_4_10bpc_neon: 7.58 6.21 7.80
inv_txfm_add_64x64_dct_dct_0_10bpc_neon: 2.41 2.85 2.75
inv_txfm_add_64x64_dct_dct_1_10bpc_neon: 12.91 10.27 12.24
inv_txfm_add_64x64_dct_dct_2_10bpc_neon: 10.96 7.97 10.31
inv_txfm_add_64x64_dct_dct_3_10bpc_neon: 8.95 7.42 9.55
inv_txfm_add_64x64_dct_dct_4_10bpc_neon: 7.97 6.12 7.82
|
|
This matches what is done in C by -fvisibility=hidden.
This avoids issues with relocations against other symbols exported
from another assembly file.
|
|
|
|
|
|
|
|
Before this, we never did the early exit from the first pass.
Before: Cortex A53 A72 A73
inv_txfm_add_64x16_dct_dct_1_8bpc_neon: 7275.7 5198.3 5250.9
inv_txfm_add_64x16_dct_dct_2_8bpc_neon: 7276.1 5197.0 5251.3
inv_txfm_add_64x16_dct_dct_3_8bpc_neon: 7275.8 5196.2 5254.5
inv_txfm_add_64x16_dct_dct_4_8bpc_neon: 7273.6 5198.8 5254.2
After:
inv_txfm_add_64x16_dct_dct_1_8bpc_neon: 5187.8 3763.8 3735.0
inv_txfm_add_64x16_dct_dct_2_8bpc_neon: 7280.6 5185.6 5256.3
inv_txfm_add_64x16_dct_dct_3_8bpc_neon: 7270.7 5179.8 5250.3
inv_txfm_add_64x16_dct_dct_4_8bpc_neon: 7271.7 5212.4 5256.4
The other related variants didn't have this bug and properly exited
early when possible.
|
|
Unify some loads and stores, avoiding some extra pointer moving.
|
|
This gives a couple of cycles' speedup.
|
|
|
|
This isn't used for a sqrdmulh in its current form here.
The one left in idct_coeffs[1] isn't used within the idct itself,
but inv_txfm_horz_scale_dct_32x8 relies on it being left there for
use with sqrdmulh scaling later.
|
|
These cases were removed from x86 to save space and simplify the code
in e0b88bd2b2c97a2695edcc498485e1cb3003e7f1, as those cases
were essentially unused in real world bitstreams.
|
|
The macro became unused in 9f084b0d2.
|
|
On windows and darwin (and modern android), the x18 register is reserved
and shouldn't be modified by user code, while it is freely available on
linux. Strictly avoid it, to keep the assembly code portable.
|
|
|