Age | Commit message | Author |
|
The previous implementation did two separate passes in the horizontal
and vertical directions, with the intermediate values being stored
in a buffer on the stack. This caused bad cache thrashing.
By interleaving the horizontal and vertical passes and using a ring
buffer that holds only a few rows at a time, performance is
improved significantly.
Also split the function into 7-tap and 5-tap versions. The latter is
faster and fairly common (always for chroma, sometimes for luma).
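The interleaving idea can be sketched in plain C. This is only an illustration under stated assumptions, not the code from the commit: the function names, the 5-tap size, the `MAX_W`/`MAX_H` limits, and the edge replication are all hypothetical, and rounding and clipping are left out. `filter_2d_two_pass` mirrors the old approach with a full intermediate buffer, while `filter_2d_ring` keeps only `TAPS` horizontally filtered rows alive at any time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TAPS  5   /* hypothetical tap count for this sketch */
#define MAX_W 64  /* hypothetical size limits: callers must keep w <= MAX_W */
#define MAX_H 64  /* and h <= MAX_H */

static int clampi(int v, int lo, int hi) { return v < lo ? lo : v > hi ? hi : v; }

/* Horizontal pass over one row; edge pixels are replicated. */
static void filter_h_row(int32_t *dst, const uint8_t *src, int w, const int16_t *fh)
{
    for (int x = 0; x < w; x++) {
        int32_t sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += fh[k] * src[clampi(x + k - TAPS / 2, 0, w - 1)];
        dst[x] = sum;
    }
}

/* Old style: filter every row horizontally first, then run the vertical pass. */
static void filter_2d_two_pass(int32_t *dst, const uint8_t *src, int w, int h,
                               const int16_t *fh, const int16_t *fv)
{
    int32_t tmp[MAX_H][MAX_W]; /* whole intermediate image on the stack */
    for (int y = 0; y < h; y++)
        filter_h_row(tmp[y], src + (size_t)y * w, w, fh);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int32_t sum = 0;
            for (int k = 0; k < TAPS; k++)
                sum += fv[k] * tmp[clampi(y + k - TAPS / 2, 0, h - 1)][x];
            dst[(size_t)y * w + x] = sum;
        }
}

/* Interleaved style: a ring buffer holds only the TAPS most recent rows. */
static void filter_2d_ring(int32_t *dst, const uint8_t *src, int w, int h,
                           const int16_t *fh, const int16_t *fv)
{
    int32_t ring[TAPS][MAX_W];
    int filled = 0; /* number of source rows horizontally filtered so far */

    for (int y = 0; y < h; y++) {
        /* Filter just enough new rows to cover the vertical window of row y. */
        const int need = clampi(y + TAPS / 2, 0, h - 1);
        while (filled <= need) {
            filter_h_row(ring[filled % TAPS], src + (size_t)filled * w, w, fh);
            filled++;
        }
        for (int x = 0; x < w; x++) {
            int32_t sum = 0;
            for (int k = 0; k < TAPS; k++) {
                const int row = clampi(y + k - TAPS / 2, 0, h - 1);
                sum += fv[k] * ring[row % TAPS][x]; /* row index wraps into the ring */
            }
            dst[(size_t)y * w + x] = sum;
        }
    }
}

/* Runs both variants on a small deterministic image; returns 1 if they agree. */
static int ring_matches_two_pass(void)
{
    enum { W = 16, H = 12 };
    static const int16_t fh[TAPS] = { 1, 2, 3, 2, 1 };
    static const int16_t fv[TAPS] = { -1, 3, 5, 3, -1 };
    uint8_t src[W * H];
    int32_t a[W * H], b[W * H];
    for (int i = 0; i < W * H; i++)
        src[i] = (uint8_t)(i * 37 + 11);
    filter_2d_two_pass(a, src, W, H, fh, fv);
    filter_2d_ring(b, src, W, H, fh, fv);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

The ring variant touches `TAPS * MAX_W` intermediate values instead of `MAX_H * MAX_W`, so the working set stays small enough to remain in cache between the two passes.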
|
|
Combine horizontal and vertical filter pointers into a single parameter
when calling the wiener DSP function.
Eliminate the +128 filter coefficient handling where possible.
|
|
Avoids some pointer chasing and simplifies the DSP code, at the cost
of making the initialization a little bit more complicated.
Also reduces memory usage by a small amount due to properly sizing
the buffers instead of always allocating enough space for 4:4:4.
|
|
|
|
When compiling in release mode, instead of just deleting assertions,
use them to give hints to the compiler. This allows for slightly
better code generation in some cases.
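One way to turn assertions into compiler hints is sketched below. The macro name `hint_assert` and the function `sum_block` are hypothetical and chosen for illustration; the sketch relies on the GCC/Clang builtin `__builtin_unreachable()` and the MSVC intrinsic `__assume()`, and falls back to a no-op on other compilers.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef NDEBUG
  /* Release mode: tell the optimizer the condition always holds. */
  #if defined(__GNUC__) || defined(__clang__)
    #define hint_assert(x) do { if (!(x)) __builtin_unreachable(); } while (0)
  #elif defined(_MSC_VER)
    #define hint_assert(x) __assume(x)
  #else
    #define hint_assert(x) ((void)0)
  #endif
#else
  /* Debug mode: check the condition and abort with a message on failure. */
  #define hint_assert(x) do {                                    \
      if (!(x)) {                                                \
          fprintf(stderr, "Assertion failed: %s (%s:%d)\n",      \
                  #x, __FILE__, __LINE__);                       \
          abort();                                               \
      }                                                          \
  } while (0)
#endif

/* Hypothetical helper: sums n values, where n is a positive multiple of 4. */
static int sum_block(const int *p, int n)
{
    /* In release builds the compiler may use this to skip the n <= 0 case
     * and to drop scalar remainder handling when vectorizing the loop. */
    hint_assert(n > 0 && (n & 3) == 0);
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
```

With this approach a violated assertion in a release build is undefined behavior rather than a silent no-op, so the conditions have to be true invariants and must not be used to validate untrusted input.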
|
|
clang-8:
cdef_filter_4x4_8bpc_c: 436.6
cdef_filter_4x4_8bpc_vsx: 101.1
cdef_filter_4x8_8bpc_c: 827.7
cdef_filter_4x8_8bpc_vsx: 183.5
cdef_filter_8x8_8bpc_c: 1510.2
cdef_filter_8x8_8bpc_vsx: 289.1
gcc-9:
cdef_filter_4x4_8bpc_c: 403.2
cdef_filter_4x4_8bpc_vsx: 105.6
cdef_filter_4x8_8bpc_c: 825.5
cdef_filter_4x8_8bpc_vsx: 192.2
cdef_filter_8x8_8bpc_c: 1586.3
cdef_filter_8x8_8bpc_vsx: 295.0
|
|
|
|
Limited to PowerPC64 LE for now.
|