Age | Commit message | Author |
|
Fixes part of ticket #8771
Signed-off-by: James Almer <jamrial@gmail.com>
(cherry picked from commit 2c844c98285ca03d9cc44db920da645cf0376c40)
|
|
Use this in vf_spp.c, where the get_pixels operation is done on
unaligned source addresses.
Hook up the x86 (mmx and sse) versions of get_pixels to this
function pointer, as those implementations seem to support unaligned
use.
This fixes fate-filter-spp on armv7.
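A minimal sketch of the idea, not the actual patch: the context carries a second function pointer for callers that cannot guarantee alignment. The struct and init names (PixblockSketch, pixblock_sketch_init) are hypothetical and simplified; only get_pixels comes from the message above. A byte-wise C version has no alignment requirement, so it can back both pointers, while a platform-specific version that assumes aligned sources would be hooked up to the first pointer only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a pixel-block DSP context. */
typedef struct PixblockSketch {
    /* Implementations behind this pointer may assume an aligned source. */
    void (*get_pixels)(int16_t *block, const uint8_t *pixels, ptrdiff_t stride);
    /* Implementations behind this pointer must accept any source address. */
    void (*get_pixels_unaligned)(int16_t *block, const uint8_t *pixels,
                                 ptrdiff_t stride);
} PixblockSketch;

/* Byte-wise C version: copies an 8x8 block of pixels into 16-bit
 * coefficients and works for any source address. */
static void get_pixels_c(int16_t *block, const uint8_t *pixels, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            block[y * 8 + x] = pixels[x];
        pixels += stride;
    }
}

/* Both pointers default to the C version. An init function for a given
 * architecture would override get_pixels_unaligned only with code that is
 * known to tolerate unaligned sources. */
static void pixblock_sketch_init(PixblockSketch *c)
{
    c->get_pixels           = get_pixels_c;
    c->get_pixels_unaligned = get_pixels_c;
}
```

A caller such as a filter that reads from arbitrary offsets would then call get_pixels_unaligned instead of get_pixels.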
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
Fix overflow for coeff -32768 in function ADD_RES_SSE_16_32_8 (SSE2/AVX/AVX2)
with no performance drop.
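One way such an overflow can arise, sketched in scalar C. This assumes the affected code split the residual into a positive part and a negated part and then used saturating byte add and subtract; the message does not say so, and the helper names below are made up for illustration. Negating -32768 in 16 bits wraps back to -32768, so both parts pack to zero and the destination pixel is left unchanged instead of being clipped to 0.

```c
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Reference behaviour: add in wider precision, then clip to 8 bits. */
static uint8_t add_res_ref(uint8_t dst, int16_t res)
{
    return clip_u8(dst + res);
}

/* 16-bit negation with wraparound, like a packed 16-bit subtract from zero. */
static int16_t neg16_wrap(int16_t v)
{
    uint16_t u = (uint16_t)(0u - (uint16_t)v);
    return u >= 0x8000u ? (int16_t)((int)u - 0x10000) : (int16_t)u;
}

/* Assumed (hypothetical) split scheme: pack res and -res to unsigned bytes
 * with saturation, then do a saturating add followed by a saturating
 * subtract. */
static uint8_t add_res_split(uint8_t dst, int16_t res)
{
    uint8_t pos = clip_u8(res);             /* 0 when res is negative */
    uint8_t neg = clip_u8(neg16_wrap(res)); /* 0 when res is positive */
    int sum = dst + pos;
    if (sum > 255)
        sum = 255;                          /* saturating byte add */
    int out = sum - neg;
    if (out < 0)
        out = 0;                            /* saturating byte subtract */
    return (uint8_t)out;
}
```

With dst = 100 and res = -32768, add_res_ref() clips to 0 while add_res_split() leaves the pixel at 100; for every other 16-bit residual the two agree.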
./checkasm --test=hevc_add_res --bench
Mainline:
- hevc_add_res.add_residual [OK]
hevc_add_res_32x32_8_sse2: 127.5
hevc_add_res_32x32_8_avx: 127.0
hevc_add_res_32x32_8_avx2: 86.5
Add overflow test case:
- hevc_add_res.add_residual [FAILED]
After:
- hevc_add_res.add_residual [OK]
hevc_add_res_32x32_8_sse2: 126.8
hevc_add_res_32x32_8_avx: 128.3
hevc_add_res_32x32_8_avx2: 86.8
Signed-off-by: Xu Guangxin <guangxin.xu@intel.com>
Signed-off-by: Linjie Fu <linjie.fu@intel.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
|
|
Fix overflow for coeff -32768 in function ADD_RES_SSE_8_8 with
no performance drop.
./checkasm --test=hevc_add_res --bench
Mainline:
- hevc_add_res.add_residual [OK]
hevc_add_res_8x8_8_sse2: 15.5
Add overflow test case:
- hevc_add_res.add_residual [FAILED]
After:
- hevc_add_res.add_residual [OK]
hevc_add_res_8x8_8_sse2: 15.5
Signed-off-by: Xu Guangxin <guangxin.xu@intel.com>
Signed-off-by: Linjie Fu <linjie.fu@intel.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
|
|
Fix overflow for coeff -32768 in function ADD_RES_MMX_4_8 with no
performance drop.
./checkasm --test=hevc_add_res --bench
Mainline:
- hevc_add_res.add_residual [OK]
hevc_add_res_4x4_8_mmxext: 15.5
Add overflow test case:
- hevc_add_res.add_residual [FAILED]
After:
- hevc_add_res.add_residual [OK]
hevc_add_res_4x4_8_mmxext: 15.0
Signed-off-by: Xu Guangxin <guangxin.xu@intel.com>
Signed-off-by: Linjie Fu <linjie.fu@intel.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
|
|
Found-by: james
|
|
Fixes: Segfault (not reproducible with asm, which made this hard to debug)
Fixes: decoding errors
Fixes: 19854/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_DIRAC_fuzzer-5729372837511168
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
|
|
VP4 applies a loop filter during motion compensation, so the block offset
will often be unaligned. This produces a bus error on some platforms, namely
ARMv7 NEON.
This patch adds an unaligned version of the loop filter function pointer
to VP3DSPContext.
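For illustration only, a generic C sketch of what an alignment-safe access looks like; this is not the code from the patch, and both helper names are hypothetical. Reading through memcpy lets the compiler emit a load that is valid for any address, whereas a load that assumes 8-byte alignment may fault on platforms that enforce it.

```c
#include <stdint.h>
#include <string.h>

/* Read 8 bytes from any address without assuming alignment. */
static uint64_t load64_unaligned(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Returns nonzero if p is a multiple of 8 bytes. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7) == 0;
}
```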
Reported-by: Mike Melanson <mike@multimedia.cx>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
|
|
It's not using ymm registers, so limiting it to CPUs with fast AVX
is not necessary.
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
Fixes checkasm on systems like win64.
Reviewed-by: Lynne
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
Prevents pointless register saving on win64 for the sse3 and avx
versions of the function.
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
Assembly failed when using yasm rather than nasm.
|
|
Replaced VSHUFPS with VPBLENDD to relieve port 5 bottleneck
AVX2 is 1.4x faster than AVX
|
|
Has a slight speedup.
Can't be carried over to aarch64, since it has no shufps-like instruction.
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
58893 decicycles in deemphasis_c, 130548 runs, 524 skips
9475 decicycles in deemphasis_fma3, 130686 runs, 386 skips -> 6.21x speedup
24866 decicycles in postfilter_c, 65386 runs, 150 skips
5268 decicycles in postfilter_fma3, 65505 runs, 31 skips -> 4.72x speedup
Total decoder speedup: ~14%
Deemphasis SIMD based on the following unrolling:
    const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
    float state = coeff;
    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1*state;
        y[1] = x[1] + c2*state + c1*x[0];
        y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
        y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
        state = y[3];
        y += 4;
        x += 4;
    }
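The unrolling above follows from expanding the one-tap recurrence y[n] = x[n] + c*y[n-1] four times, so that each output in a group depends only on the incoming state and the inputs, not on earlier outputs of the same group. A small C sketch comparing the two forms; EMPH_COEFF_SKETCH is a placeholder value chosen for illustration, not necessarily the codec's CELT_EMPH_COEFF, and the function names are hypothetical.

```c
#include <assert.h>

/* Placeholder coefficient for illustration only. */
#define EMPH_COEFF_SKETCH 0.85f

/* Scalar recurrence: y[n] = x[n] + c * y[n-1]. Returns the final state. */
static float deemphasis_scalar(float *y, const float *x, float state, int len)
{
    for (int i = 0; i < len; i++) {
        state = x[i] + EMPH_COEFF_SKETCH * state;
        y[i]  = state;
    }
    return state;
}

/* Four-sample unrolling as in the message; len must be a multiple of 4. */
static float deemphasis_unrolled(float *y, const float *x, float state, int len)
{
    const float c1 = EMPH_COEFF_SKETCH, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;

    assert(len % 4 == 0);
    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1*state;
        y[1] = x[1] + c2*state + c1*x[0];
        y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
        y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
        state = y[3];
        y += 4;
        x += 4;
    }
    return state;
}
```

The two functions agree up to floating-point rounding, since the unrolled form multiplies by precomputed powers of the coefficient instead of applying it step by step.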
|
|
The entire function was defined away before.
|
|
It's only used in the encoder and in CELT's PVQ.
|
|
bits per raw sample
based on patch by Kieran Kunhya
|
|
Removes an unneeded copy and does the 5-point permute in-place.
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
|
|
Saves 1 gpr and 2 instructions and simplifies the macros a bit.
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
|
|
This profile supports video with more than 8 bits per sample and has a
higher-quality DCT.
|
|
This was originally based on libsbc, and was fully integrated into ffmpeg.
Rough speed test:
C version: speed= 592x
MMX version: speed= 785x
|
|
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
|
|
asm code by Henrik Gramner
|
|
AVX-512 support has been introduced, and even if no functions currently
use zmm registers (able to load as much as 64 bytes of consecutive data
per instruction), they will be added eventually.
Reviewed-by: Rostislav Pehlivanov <atomnuker@gmail.com>
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
ff_add_left_pred_int16_unaligned_ssse3
SSSE3_FAST is the proper check for it.
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
ff_add_left_pred_unaligned_avx2
Fixes valgrind
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
in order to add avx2 version
|
|
The commit b7c16a3f2c4921f613319938b8ee0e3d6fa83e8d ("x86: fft: Port to
cpuflags") breaks the opus decoder in ffmpeg when compiling for 3dnow. The
output is audible, but there's a lot of noise.
The reason for the breakage is that the commit unintentionally changed the
INTERL macro so that it is empty when compiling for 3dnow. This patch
fixes it.
Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
Speed seems to be similar, but this simplifies the code.
|
|
Remove the broadcast instructions as well now that they are wide
enough.
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
better func separator
and add comment for the restore rgb planes10 declaration
|
|
jpeg2000_ict_float_c: 2296.0
jpeg2000_ict_float_sse: 628.0
jpeg2000_ict_float_avx: 317.0
jpeg2000_ict_float_fma3: 262.0
Signed-off-by: James Almer <jamrial@gmail.com>
|