Age | Commit message | Author |
|
When compiling with asm enabled there's no point in compiling
C versions of DSP functions that have asm implementations using
instruction sets that the compiler can unconditionally use.
E.g. when compiling with -mssse3 we can remove the C version
of all functions with SSSE3 implementations.
This is accomplished using the compiler's dead code elimination
functionality.
This can be configured using the new 'trim_dsp' meson option, which
is enabled by default when compiling in release mode.
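The dead code elimination idea above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the flag names, `COMPILE_TIME_FLAGS`, `select_sum()` and both `sum_*` functions are hypothetical, and the "SSSE3" routine is plain C standing in for an asm implementation. It assumes the compiler predefines `__SSSE3__` when built with -mssse3, as GCC and clang do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU flag bits; the names are illustrative only. */
enum {
    CPU_FLAG_SSE2  = 1 << 0,
    CPU_FLAG_SSSE3 = 1 << 1,
};

/* Instruction sets the compiler may use unconditionally, derived from
 * the macros GCC and clang predefine for -msse2 / -mssse3. */
#if defined(__SSSE3__)
#define COMPILE_TIME_FLAGS (CPU_FLAG_SSE2 | CPU_FLAG_SSSE3)
#elif defined(__SSE2__)
#define COMPILE_TIME_FLAGS CPU_FLAG_SSE2
#else
#define COMPILE_TIME_FLAGS 0
#endif

typedef int (*sum_fn)(const int *src, int n);

/* Portable C version of a toy "DSP" function. */
static int sum_c(const int *src, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += src[i];
    return sum;
}

/* Stand-in for a hand-written SSSE3 routine (plain C here so the
 * sketch builds on any target). */
static int sum_ssse3(const int *src, int n) {
    int sum = 0;
    for (int i = n - 1; i >= 0; i--)
        sum += src[i];
    return sum;
}

/* Pick an implementation from the runtime-detected flags. */
static sum_fn select_sum(unsigned runtime_flags) {
    const unsigned flags = runtime_flags | COMPILE_TIME_FLAGS;
    sum_fn fn = NULL;

    /* With -mssse3 this condition is a compile-time constant false, so
     * the only reference to sum_c() is dead code and the compiler is
     * free to drop the function from the binary. */
    if (!(COMPILE_TIME_FLAGS & CPU_FLAG_SSSE3))
        fn = sum_c;
    if (flags & CPU_FLAG_SSSE3)
        fn = sum_ssse3;

    assert(fn != NULL);
    return fn;
}
```

Because `sum_c()` is `static` and its only reference sits behind a constant condition, no extra preprocessor guards are needed around the C implementation itself.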
|
Split the 5x5, 3x3, and mix cases into separate functions.
Shrink some tables.
Move some scalar calculations out of the DSP function.
Make Wiener and SGR share the same function prototype to
eliminate a branch in lr_stripe().
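As a rough illustration of why a shared prototype removes a branch: once both filters take the same arguments, the caller can pick a function pointer once and call it without checking the filter type at each call site. Everything below (`LrParams`, `lr_fn`, the stub filters, `filter_stripe()`) is a hypothetical sketch and does not reflect the real parameter layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block shared by both filter types. */
typedef union {
    int16_t wiener[2][8];            /* horizontal and vertical taps */
    struct { int16_t w0, w1; } sgr;  /* self-guided filter weights */
} LrParams;

/* One prototype for every restoration filter. */
typedef void (*lr_fn)(uint8_t *dst, int w, const LrParams *params);

enum LrType { LR_WIENER, LR_SGR, LR_N_TYPES };

/* Stub filters: they only add one parameter to each pixel, which is
 * enough to show that the right parameters reach the right function. */
static void wiener_stub(uint8_t *dst, int w, const LrParams *p) {
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)(dst[x] + p->wiener[0][0]);
}

static void sgr_stub(uint8_t *dst, int w, const LrParams *p) {
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)(dst[x] + p->sgr.w0);
}

static const lr_fn lr_table[LR_N_TYPES] = { wiener_stub, sgr_stub };

/* The filter is selected once; the per-row call is branch-free. */
static void filter_stripe(uint8_t *dst, ptrdiff_t stride, int w, int rows,
                          enum LrType type, const LrParams *params) {
    assert(type >= 0 && type < LR_N_TYPES);
    const lr_fn fn = lr_table[type];
    for (int y = 0; y < rows; y++)
        fn(dst + y * stride, w, params);
}
```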
|
right edge
|
The previous implementation did two separate passes in the horizontal
and vertical directions, with the intermediate values being stored
in a buffer on the stack. This caused bad cache thrashing.
Interleaving the horizontal and vertical passes, combined with a
ring buffer that stores only a few rows at a time, improves
performance significantly.
Also split the function into 7-tap and 5-tap versions. The latter is
faster and fairly common (always for chroma, sometimes for luma).
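The difference between the two approaches can be sketched in C as below. This is a simplified illustration under stated assumptions, not the actual implementation: it uses a 3-tap kernel with made-up coefficients instead of the 7-tap and 5-tap Wiener filters, replicates edge pixels, and all names (`filter_two_pass`, `filter_ring`, `kTaps`, `MAX_W`, `MAX_H`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS  3
#define MAX_W 64
#define MAX_H 64

static const int kTaps[TAPS] = { 1, 2, 1 };  /* illustrative coefficients */

static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : v > hi ? hi : v;
}

/* Horizontal pass over one row, replicating the edge pixels. */
static void hfilter_row(int32_t *dst, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++)
        dst[x] = kTaps[0] * src[clampi(x - 1, 0, w - 1)] +
                 kTaps[1] * src[x] +
                 kTaps[2] * src[clampi(x + 1, 0, w - 1)];
}

/* Two separate passes: the intermediate buffer holds every row. */
static void filter_two_pass(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t tmp[MAX_H][MAX_W];
    assert(w > 0 && w <= MAX_W && h > 0 && h <= MAX_H);
    for (int y = 0; y < h; y++)
        hfilter_row(tmp[y], src + y * w, w);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int32_t sum = 0;
            for (int t = 0; t < TAPS; t++)
                sum += kTaps[t] * tmp[clampi(y - 1 + t, 0, h - 1)][x];
            dst[y * w + x] = sum;
        }
}

/* Interleaved passes: only TAPS intermediate rows are kept, so the
 * working set stays small. Row r lives in slot (r + TAPS) % TAPS. */
static void filter_ring(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t ring[TAPS][MAX_W];
    assert(w > 0 && w <= MAX_W && h > 0);

    /* Prime the ring with rows -1 (replicated from row 0) and 0. */
    for (int r = -1; r <= 0; r++)
        hfilter_row(ring[(r + TAPS) % TAPS], src + clampi(r, 0, h - 1) * w, w);

    for (int y = 0; y < h; y++) {
        /* Filter row y + 1 into the slot that held row y - 2. */
        const int r = y + 1;
        hfilter_row(ring[(r + TAPS) % TAPS], src + clampi(r, 0, h - 1) * w, w);

        /* Vertical pass over rows y - 1, y, y + 1 from the ring. */
        for (int x = 0; x < w; x++) {
            int32_t sum = 0;
            for (int t = 0; t < TAPS; t++)
                sum += kTaps[t] * ring[(y - 1 + t + TAPS) % TAPS][x];
            dst[y * w + x] = sum;
        }
    }
}
```

Both functions produce identical output; the ring-buffer version just touches `TAPS` intermediate rows instead of the whole image, which is the property the commit message credits for the reduced cache thrashing.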
|
Combine horizontal and vertical filter pointers into a single parameter
when calling the wiener DSP function.
Eliminate the +128 filter coefficient handling where possible.
|
This allows using completely different codepaths for 10 and 12 bpc,
or just adding SIMD functions for either of them.
|
A symbol starting with two underscores is reserved for
the compiler/standard library implementation.
Also remove the two trailing underscores for consistency
and symmetry.
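A minimal sketch of such a rename, assuming the affected symbols are include guards (the message above does not say which symbols were renamed). The guard name, file name and helper function are hypothetical.

```c
/* example_header.h (hypothetical file name) */
#ifndef EXAMPLE_HEADER_H  /* was: __EXAMPLE_HEADER_H__, a reserved name */
#define EXAMPLE_HEADER_H

#include <assert.h>

/* Clamp v to the range [lo, hi]. */
static inline int example_clip(int v, int lo, int hi) {
    assert(lo <= hi);
    return v < lo ? lo : v > hi ? hi : v;
}

#endif /* EXAMPLE_HEADER_H */
```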
|
The speedup over C code is around 4.2x on a Cortex A53 and 5.1x on a
Snapdragon 835 (compared to GCC's autovectorized code), 6-7x compared
to GCC's output without autovectorization, and ~8x compared to
clang's output (which doesn't seem to try to vectorize this
function).
|
wiener_luma_8bpc_c: 326272.1
wiener_luma_8bpc_avx2: 19841.5
Decoding time of first 1000 frames of Chimera-8bit-1920x1080.ivf goes
from 27.471 to 23.558 seconds.
|
Also copy 4 pixels so SIMD can use a padded write (movd).
|
With minor contributions from:
- Jean-Baptiste Kempf <jb@videolan.org>
- Marvin Scholz <epirat07@gmail.com>
- Hugo Beauzée-Luyssen <hugo@videolan.org>
|