Age | Commit message (Collapse) | Author |
|
* commit '732510636e597585a79be7d111c88b3f7e174fe7':
aarch64: Remove a dot from a label
Merged-by: James Almer <jamrial@gmail.com>
|
|
This fixes building with armasm64 (when run through gas-preprocessor).
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
autocorrelate_c: 644.0
autocorrelate_neon: 420.0
hf_apply_noise_0_c: 1688.5
hf_apply_noise_0_neon: 1498.6
hf_apply_noise_1_c: 1691.2
hf_apply_noise_1_neon: 1500.6
hf_apply_noise_2_c: 1688.1
hf_apply_noise_2_neon: 1500.3
hf_apply_noise_3_c: 1696.6
hf_apply_noise_3_neon: 1502.2
hf_g_filt_c: 2117.8
hf_g_filt_neon: 1218.7
hf_gen_c: 4573.4
hf_gen_neon: 2461.0
neg_odd_64_c: 72.0
neg_odd_64_neon: 64.7
qmf_deint_bfly_c: 1107.6
qmf_deint_bfly_neon: 291.6
qmf_deint_neg_c: 210.4
qmf_deint_neg_neon: 107.4
qmf_post_shuffle_c: 163.0
qmf_post_shuffle_neon: 107.7
qmf_pre_shuffle_c: 120.5
qmf_pre_shuffle_neon: 110.7
sum64x5_c: 1361.6
sum64x5_neon: 435.4
sum_square_c: 1686.4
sum_square_neon: 787.2
|
|
|
|
☭ tests/checkasm/checkasm --bench --test=aacpsdsp
checkasm: using random seed 3318985180
MMX implied by specified flags
MMX implied by specified flags
NEON:
- aacpsdsp.add_squares [OK]
- aacpsdsp.mul_pair_single [OK]
- aacpsdsp.hybrid_analysis [OK]
- aacpsdsp.stereo_interpolate [OK]
checkasm: all 5 tests passed
nop: 10.0
ps_add_squares_c: 63221.2
ps_add_squares_neon: 22311.7
ps_hybrid_analysis_c: 2466.6
ps_hybrid_analysis_neon: 1521.9
ps_mul_pair_single_c: 68592.0
ps_mul_pair_single_neon: 17426.6
ps_stereo_interpolate_c: 72344.3
ps_stereo_interpolate_neon: 72308.8
ps_stereo_interpolate_ipdopd_c: 117415.2
ps_stereo_interpolate_ipdopd_neon: 113386.3
|
|
Properly use the b.eq form instead of the nonstandard form (which
both gas and newer clang accept though), and expand the register
lists that used a range (which the Xcode 6.2 clang, based on clang
3.5 svn, didn't support).
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
Properly use the b.eq/b.ge forms instead of the nonstandard forms
(which both gas and newer clang accept though), and expand the
register list that used a range (which the Xcode 6.2 clang, based
on clang 3.5 svn, didn't support).
This is cherrypicked from libav commit
a970f9de865c84ed5360dd0398baee7d48d04620.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
Properly use the b.eq/b.ge forms instead of the nonstandard forms
(which both gas and newer clang accept though), and expand the
register list that used a range (which the Xcode 6.2 clang, based
on clang 3.5 svn, didn't support).
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
Fixes regression introduced by 5d0b8b1ae307951310c7d9a8fa282fbca9b997cd.
|
|
Separates macro arguments with commas and passes .4H/.8H as macro
arguments instead of 4H/8H (the later form being interpreted as an
hexadecimal value).
Fixes ticket #6324.
Suggested-by: Martin Storsjö <martin@martin.st>
|
|
* commit '2425d7329fdccfa9954faba748f3865151354f0c':
arm64: replace 'bic' with immediate with 'and' with inverted immediate
Merged-by: Clément Bœsch <u@pkh.me>
|
|
* commit '72a19f4013ec2c7f8581416f8ad4bf81df163fb6':
mpegaudiodsp: aarch64: Adjust function prototype after 2caa93b813adc5dbb7771dfe615da826a2947d18
Merged-by: James Almer <jamrial@gmail.com>
|
|
|
|
The advantage here is that the internal software decoder interface is
not exposed to the DSP functions or the hardware accelerations.
|
|
This is following Libav layout to ease merges.
|
|
* commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b':
aarch64: Add missing sign extension in ff_h264_idct8_add_neon
Merged-by: Clément Bœsch <u@pkh.me>
|
|
* commit '2caa93b813adc5dbb7771dfe615da826a2947d18':
mpegaudiodsp: Change type of array stride parameters to ptrdiff_t
Merged-by: James Almer <jamrial@gmail.com>
|
|
* commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c':
h264chroma: Change type of stride parameters to ptrdiff_t
Merged-by: James Almer <jamrial@gmail.com>
|
|
* commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428':
idct: Change type of array stride parameters to ptrdiff_t
Merged-by: James Almer <jamrial@gmail.com>
|
|
* commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0':
lavc: add clobber tests for the new encoding/decoding API
The merge only re-order what we already have.
Merged-by: Clément Bœsch <u@pkh.me>
|
|
This work is sponsored by, and copyright, Google.
This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
This gives a pretty substantial speedup for the smaller subpartitions.
The code size increases from 21512 bytes to 31400 bytes.
The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.
Before:
vp9_inv_dct_dct_16x16_sub1_add_10_neon: 284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1902.7
vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1903.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2201.1
vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2510.0
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2821.3
vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1011.6
vp9_inv_dct_dct_32x32_sub2_add_10_neon: 9716.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9704.9
vp9_inv_dct_dct_32x32_sub8_add_10_neon: 10641.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon: 11555.7
vp9_inv_dct_dct_32x32_sub16_add_10_neon: 12499.8
vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13403.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14335.8
vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15253.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16179.5
After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon: 282.8
vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1142.4
vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1139.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon: 1772.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2515.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2823.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon: 6944.4
vp9_inv_dct_dct_32x32_sub4_add_10_neon: 6944.2
vp9_inv_dct_dct_32x32_sub8_add_10_neon: 7609.8
vp9_inv_dct_dct_32x32_sub12_add_10_neon: 9953.4
vp9_inv_dct_dct_32x32_sub16_add_10_neon: 10770.1
vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13418.8
vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14330.7
vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15257.1
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16190.6
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
pass2 function
This allows reusing the macro for a separate implementation of the
pass2 function.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This work is sponsored by, and copyright, Google.
This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from
26288 to 21512 bytes.
This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.
Before:
vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1887.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2801.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9691.4
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16154.9
After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1899.5
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2827.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9714.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16175.9
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This makes the code a bit more readable.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
Align the second/third operands as they usually are.
Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.
This is cherrypicked from libav commit
7995ebfad12002033c73feed422a1cfc62081e8f.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.
This is cherrypicked from libav commit
3a0d5e206d24d41d87a25ba16a79b2ea04c39d4c.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
Align the second/third operands as they usually are.
Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This matches the order they are in the 16 bpp version.
There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.
This makes the 8 bpp version match the 16 bpp version better.
This is cherrypicked from libav commit
b8f66c0838b4c645227f23a35b4d54373da4c60a.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.
This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
This is cherrypicked from libav commit
09eb88a12e008d10a3f7a6be75d18ad98b368e68.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.
After this, we still can skip pushing d12-d15.
Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
This is cherrypicked from libav commit
65aa002d54433154a6924dc13e498bec98451ad0.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This is one cycle faster in total, and three instructions fewer.
Before:
vp9_loop_filter_mix2_v_44_16_neon: 123.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 122.2
This is cherrypicked from libav commit
3bf9c48320f25f3d5557485b0202f22ae60748b0.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
The theoretical maximum value of E is 193, so we can just
saturate the addition to 255.
Before: Cortex A7 A8 A9 A53 A53/AArch64
vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7
vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7
vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7
vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0
After:
vp9_loop_filter_v_4_8_neon: 136.0 125.7 112.6 84.0 83.0
vp9_loop_filter_v_8_8_neon: 234.0 195.5 171.5 136.0 133.7
vp9_loop_filter_v_16_8_neon: 490.0 417.5 377.7 289.0 271.0
vp9_loop_filter_v_16_16_neon: 951.2 814.7 732.3 571.0 446.7
This is cherrypicked from libav commit
c582cb8537367721bb399a5d01b652c20142b756.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This is cherrypicked from libav commit
07b5136c481d394992c7e951967df0cfbb346c0b.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.
This is cherrypicked from libav commit
b0806088d3b27044145b20421da8d39089ae0c6a.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
Previously we first calculated hev, and then negated it.
Since we were able to schedule the negation in the middle
of another calculation, we don't see any gain in all cases.
Before: Cortex A7 A8 A9 A53 A53/AArch64
vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7
vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7
vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7
vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0
After:
vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7
vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7
vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7
vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0
This is cherrypicked from libav commit
e1f9de86f454861b69b199ad801adc2ec6c3b220.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This work is sponsored by, and copyright, Google.
Before: Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2
vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3
This is cherrypicked from libav commit
3fcf788fbbccc4130868e7abe58a88990290f7c1.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
No measured speedup on a Cortex A53, but other cores might benefit.
This is cherrypicked from libav commit
388e0d2515bc6bbc9d0c9af1d230bd16cf945fe7.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
Fold the field lengths into the macro.
This makes the macro invocations much more readable, when the
lines are shorter.
This also makes it easier to use only half the registers within
the macro.
This is cherrypicked from libav commit
5e0c2158fbc774f87d3ce4b7b950ba4d42c4a7b8.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This is cherrypicked from libav commit
0c0b87f12d48d4e7f0d3d13f9345e828a3a5ea32.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This is cherrypicked from libav commit
8476eb0d3ab1f7a52317b23346646389c08fb57a.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This is cherrypicked from libav commit
3dd7827258ddaa2e51085d0c677d6f3b1be3572f.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.
Use a single-lane load where we don't need the semantics of ld1r.
This is cherrypicked from libav commit
ed8d293306e12c9b79022d37d39f48825ce7f2fa.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
function
This is cherrypicked from libav commit
4da4b2b87f08a1331650c7e36eb7d4029a160776.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
This work is sponsored by, and copyright, Google.
This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
This gives a pretty substantial speedup for the smaller subpartitions.
The code size increases from 14740 bytes to 24292 bytes.
The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.
Before:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon: 1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon: 6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon: 6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon: 7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon: 7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon: 8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon: 8118.7
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon: 640.8
vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 842.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon: 5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon: 7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon: 7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon: 8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon: 8127.2
This is cherrypicked from libav commit
a63da4511d0fee66695ff4afd264ba1dbf1e812d.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
function
This allows reusing the macro for a separate implementation of the
pass2 function.
This is cherrypicked from libav commit
79d332ebbde8c0a3e9da094dcfd10abd33ba7378.
Signed-off-by: Martin Storsjö <martin@martin.st>
|