Age | Commit message | Author |
|
The uv argument is normally in a GPR, but in checkasm it's forcibly
loaded from the stack.
|
fguv_32x32xn_8bpc_420_csfl0_c: 8945.4
fguv_32x32xn_8bpc_420_csfl0_avx2: 1001.6
fguv_32x32xn_8bpc_420_csfl1_c: 6363.4
fguv_32x32xn_8bpc_420_csfl1_avx2: 1299.5
|
fgy_32x32xn_8bpc_c: 16181.8
fgy_32x32xn_8bpc_avx2: 3231.4
gen_grain_y_ar0_8bpc_c: 108857.6
gen_grain_y_ar0_8bpc_avx2: 22826.7
gen_grain_y_ar1_8bpc_c: 168239.8
gen_grain_y_ar1_8bpc_avx2: 72117.2
gen_grain_y_ar2_8bpc_c: 266165.9
gen_grain_y_ar2_8bpc_avx2: 126281.8
gen_grain_y_ar3_8bpc_c: 448139.4
gen_grain_y_ar3_8bpc_avx2: 137047.1
|
Symbols starting with two underscores are reserved for
the compiler/standard library implementation.
Also remove the two trailing underscores for consistency
and symmetry.
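
For illustration only, an include guard following this rule might look like
the sketch below. The guard name and the constant are hypothetical and not
taken from the actual tree; the point is only that the macro has neither
leading nor trailing double underscores.

```c
/* Hypothetical header guard: no leading "__" (reserved for the
 * implementation) and no trailing "__" either, for symmetry. */
#ifndef EXAMPLE_SRC_FILM_GRAIN_H
#define EXAMPLE_SRC_FILM_GRAIN_H

/* Hypothetical constant, present only so the header has some content. */
enum { EXAMPLE_GRAIN_BLOCK_SIZE = 32 };

#endif /* EXAMPLE_SRC_FILM_GRAIN_H */
```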
|
This uses a slightly adapted version of my GPU-based algorithm. The
major difference from the algorithm suggested by the spec (and implemented
in libaom) is that instead of using a line buffer to hold the previous
row's film grain blocks, we compute each row/block fully independently.
This opens the door to exploiting parallelism in the future, since we
don't have any left->right or top->bottom dependency except for the PRNG
state (which we could pre-compute for a massively parallel / GPU
implementation).
That being said, it's probably somewhat slower than using a line buffer
for the serial / single-CPU case, although most likely not by much
(since the areas with the most redundant work get progressively smaller,
down to a single 2x2 square in the worst case).
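
The PRNG pre-computation idea can be sketched roughly as follows. This is a
toy model, not the code from this commit: the LFSR taps, the block count, the
fixed number of draws per block, and all function names are assumptions made
for illustration. A cheap serial pre-pass records the PRNG state each block
starts from; after that, blocks can be filled in any order (or in parallel)
and still match the serial result.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS      8   /* hypothetical number of blocks */
#define DRAWS_PER_BLOCK 16  /* hypothetical random draws per block */

/* Toy 16-bit LFSR step; the tap positions are an assumption. */
static uint16_t lfsr_next(uint16_t *const state) {
    const uint16_t r = *state;
    const uint16_t bit = (uint16_t)((r ^ (r >> 1) ^ (r >> 3) ^ (r >> 12)) & 1u);
    *state = (uint16_t)((r >> 1) | (bit << 15));
    return *state;
}

/* Serial pre-pass: record the state every block starts from. */
static void precompute_states(const uint16_t seed, uint16_t states[NUM_BLOCKS]) {
    uint16_t s = seed;
    for (size_t b = 0; b < NUM_BLOCKS; b++) {
        states[b] = s;
        for (int i = 0; i < DRAWS_PER_BLOCK; i++)
            lfsr_next(&s);
    }
}

/* Fill one block using only its own recorded start state. */
static void fill_block(const uint16_t start, uint16_t out[DRAWS_PER_BLOCK]) {
    uint16_t s = start;
    for (int i = 0; i < DRAWS_PER_BLOCK; i++)
        out[i] = lfsr_next(&s);
}

/* Returns 1 if filling the blocks in reverse order from the recorded
 * states reproduces the plain serial output, 0 otherwise. */
static int order_independent(const uint16_t seed) {
    uint16_t ref[NUM_BLOCKS][DRAWS_PER_BLOCK];
    uint16_t out[NUM_BLOCKS][DRAWS_PER_BLOCK];
    uint16_t states[NUM_BLOCKS];

    /* Serial reference: one PRNG stream walked block after block. */
    uint16_t s = seed;
    for (size_t b = 0; b < NUM_BLOCKS; b++)
        for (int i = 0; i < DRAWS_PER_BLOCK; i++)
            ref[b][i] = lfsr_next(&s);

    /* Independent version: blocks visited last to first. */
    precompute_states(seed, states);
    for (size_t b = NUM_BLOCKS; b-- > 0;)
        fill_block(states[b], out[b]);

    return memcmp(ref, out, sizeof(ref)) == 0;
}
```

Calling `order_independent()` with any seed compares the two orderings. In
this toy the pre-pass walks the whole stream, so it only pays off when the
per-block work is much heavier than drawing the random numbers.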
|