The completion of the first frame to decode while an async reset
request on that same frame is pending will render that request stale.
Processing such a stale request is likely to result in a hang.
One reason this happens is the skip condition at the beginning of
reset_task_cur().
=> Consume the async request before that check.
Another reason is several threads producing async reset requests in
parallel: an async request for the first frame could cascade through the
other threads (other frames) during completion of that frame, so it is
not caught by the last synchronous reset_task_cur() after signaling the
main thread and before releasing the lock.
=> To solve this we need to add protections at the racy locations: after
we increment first, before returning from reset_task_cur_async(), and
after consuming the async request.
---
'-fvisibility=hidden' only applies to definitions, not declarations,
so the compiler has to be conservative about how references to global
data symbols are performed.
Explicitly specifying the visibility allows for better code generation.
---
The 32-bit width parameter was used directly as a pointer offset, but
the upper half of the 64-bit register is undefined. Fix it by replacing
'cmp' with 'sub', which writes its result and thereby explicitly zeroes
those bits.
---
This fixes conformance with the argon test samples, in particular
with these samples:
profile0_core/streams/test10100_579_8614.obu
profile0_core/streams/test10218_6914.obu
This gives a pretty notable slowdown to these transforms - some
examples:
Before: Cortex A53 A72 A73 Apple M1
inv_txfm_add_8x8_dct_dct_1_10bpc_neon: 365.7 290.2 299.8 0.3
inv_txfm_add_16x16_dct_dct_2_10bpc_neon: 1865.2 1384.1 1457.5 2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon: 33976.3 26817.0 24864.2 40.4
After:
inv_txfm_add_8x8_dct_dct_1_10bpc_neon: 397.7 322.2 335.1 0.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon: 2121.9 1336.7 1664.6 2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon: 38569.4 27622.6 28176.0 51.0
Thus, for the transforms alone, it makes them around 10-13% slower
(the Apple M1 measurements are too noisy to be conclusive here).
Measured on actual full decoding, it makes decoding of 10 bpc
Chimera around maybe 1% slower on an Apple M1 - close to measurement
noise anyway.
---
Using smaller immediates also results in a small code size reduction in
some cases, so apply those changes to the (10bpc-only) SSE code as well.
---
Certain clips were incorrectly performed on negated values, which
caused things to be off-by-one in both directions. Correct this by
negating such values prior to clipping instead of afterwards.
---
Section 7.9.2 returns 0 "If RefMiRows[ srcIdx ] is not equal to MiRows,
RefMiCols[ srcIdx ] is not equal to MiCols".
dav1d was comparing pixel width/height rather than block width/height,
so change the comparison to conform with the spec.
---
Individual OBMC lapped predictions have a max width of 64 pixels for the
top lap and a max height of 64 pixels for the left laps.
This is 7.11.3.9 (Overlapped motion compensation process), step 4:
step4 = Clip3( 2, 16, Num_4x4_Blocks_Wide[ candSz ] )
dav1d wasn't clipping this as needed, which means that with scaled MC,
the interpolation of the 2nd half of a 128-pixel block was incorrect,
since mx/my for subpel filter selection need to be reset at the 64-pixel
boundary.
---
This is the parameter combination:
num_y_points == 0 && num_cb_points == 0 && num_cr_points == 0 &&
chroma_scaling_from_luma == 1 && clip_to_restricted_range == 1
Film grain application has two effects: adding noise, and optionally
clipping to the video range.
For luma, the spec skips film grain application if there's no noise
(num_y_points == 0), but for chroma, it's only skipped if there's no
chroma noise *and* chroma_scaling_from_luma is false.
This means it's possible for there to be no noise (num_*_points = 0), but
if clip_to_restricted_range is true then chroma pixels can be clipped to
video range, if chroma_scaling_from_luma is true. Luma pixels, however,
aren't clipped to video range unless there's noise to apply.
dav1d currently skips applying film grain entirely if there is no noise,
regardless of the secondary clipping.
---
The syntax of itu_t_t35_payload_bytes is not defined in the AV1
specification, but it does state that decoders should ignore the
entire OBU if they do not understand it.
---
In section 5.11.34, txSz is always set to TX_4X4 if Lossless is true.
The chroma deblock filter size calculation needs to use this overridden
txSz when lossless is enabled.
---
The spec divides err by two, rounding towards zero, instead of using
>> 1, which rounds towards negative infinity.
---
It's possible to encode a large coefficient that becomes 0 after
the clipping in dequant (Abs( dq ) & 0xFFFFFF), e.g. 0x1000000.
After that &0xFFFFFF, coeffs are saturated to the range
[-(1 << (bitdepth+7)), 1 << (bitdepth+7)).
dav1d implements this saturation via umin(dq - sign, cf_max), then
applies the sign afterwards via xor. However, for dq = 0 and sign = 1,
this step evaluates to umin(UINT_MAX, cf_max) == cf_max instead of the
expected 0.
So instead, do the unsigned saturation as umin(dq, cf_max + sign),
then apply the sign via (sign ? -dq : dq).
On arm this is the same number of instructions, since cneg exists and is
used.
On x86 this requires an additional instruction, but this isn't a
latency-critical path.
---
In 8-bit adst, it's possible for the final Round2(x[0], 12) to exceed
the 16-bit signed range.
Specifically, in 7.13.2.6 (Inverse ADST4 process), the precision
requirement is:
"It is a requirement of bitstream conformance that all values stored in the
s and x arrays by this process are representable by a signed integer using
r + 12 bits of precision."
For 8 bits, r is 16 for both row and column, so x[] can be 28-bit signed.
For values in [134215680, 134217727] (within 2047 of the maximum 28-bit
value), the final Round2(x[0], 12) evaluates to 32768, exceeding the
16-bit signed range.
So switch to using sqrshrn, which saturates to 16-bit signed.
This is a continuation of commit b53ff29d80a21180e5ad9bbe39a02541151f4f53:
arm: itx: Do clipping in all narrowing downshifts
---
Store the used size instead of the allocated size.
The used size can be smaller than the allocated size, which results in
a wrong computation of the linear progress from the frame_progress
bitfield.
---
The width parameter is used directly as a pointer offset, so ensure
that it has an appropriately sized data type.
This has been done previously for luma, but chroma was overlooked.
---
We don't have a separate 8-bit AVX-512 5-tap Wiener filter, so the 7-tap
function is used for chroma as well. In some esoteric edge cases, chroma
dst pointers may only have a 32-byte alignment despite having a width
larger than 32, so use an unaligned store as a workaround.
---
The copy_lpf_progress bitfield might not be fully cleared when the frame
size goes down.
Credit to OSS-Fuzz.
---
Fixes a regression since commit 3d3c51a07cc3dd1e3687da40fdb6fbb857cbced1.
---
The code size increase of inlining every call to certain functions
isn't a worthwhile trade-off, and most compilers actually end up
overriding those particular inlining hints anyway.
In some cases it's also better to split the function into separate
luma and chroma functions.
---
In commit 0aca76c, sequences of pand/pandn/por were replaced by
pblendvb, but one instruction (which now acts as a no-op) was
accidentally left in.
---
The NEON loop filter's innermost asm function can return to a different
location than the address that called it. This messes up the return
stack predictor, causing returns to be mispredicted.
Rework the function to always return to the address that called it, and
instead return the information needed for the caller to short-circuit
storing pixels.
---
When compiling with asm enabled there's no point in compiling
C versions of DSP functions that have asm implementations using
instruction sets that the compiler can unconditionally use.
E.g. when compiling with -mssse3 we can remove the C version
of all functions with SSSE3 implementations.
This is accomplished using the compiler's dead code elimination
functionality.
Can be configured using the new 'trim_dsp' meson option, which
by default is enabled when compiling in release mode.
---
On Intel CPUs certain AVX-512 shuffle instructions incorrectly
flag the upper halves of YMM registers as in use when writing
to XMM registers, which may cause AVX/SSE state transitions.
This behavior is not documented and only occurs on physical
hardware, not when using the Intel SDE, so as far as I can tell
it appears to be a hardware bug.
Work around the issue by using EVEX-only registers. This avoids
the problem at the cost of a slightly larger code size.
---
Increasing a reference counter only requires atomicity, but not
ordering or synchronization.
---
Checking whether the Dav1dRef pointer is non-NULL and zeroing it is
already performed in dav1d_ref_dec(); there is no need to do it twice.
Also reorder code to enable tail call elimination.
---
Avoids the function call overhead in non-LTO builds.
Also reorder code in dav1d_ref_dec() to enable tail call elimination.
---
inv_txfm_add_32x8_dct_dct_0_12bpc_c: 286.7
inv_txfm_add_32x8_dct_dct_0_12bpc_avx2: 20.1
inv_txfm_add_32x8_dct_dct_1_12bpc_c: 7832.7
inv_txfm_add_32x8_dct_dct_1_12bpc_avx2: 710.6
inv_txfm_add_32x8_dct_dct_2_12bpc_c: 7838.1
inv_txfm_add_32x8_dct_dct_2_12bpc_avx2: 711.6
inv_txfm_add_32x8_dct_dct_3_12bpc_c: 7818.3
inv_txfm_add_32x8_dct_dct_3_12bpc_avx2: 710.9
inv_txfm_add_32x8_dct_dct_4_12bpc_c: 7820.6
inv_txfm_add_32x8_dct_dct_4_12bpc_avx2: 710.5
inv_txfm_add_32x8_identity_identity_0_12bpc_c: 1526.6
inv_txfm_add_32x8_identity_identity_0_12bpc_avx2: 19.3
inv_txfm_add_32x8_identity_identity_1_12bpc_c: 1519.4
inv_txfm_add_32x8_identity_identity_1_12bpc_avx2: 19.9
inv_txfm_add_32x8_identity_identity_2_12bpc_c: 1519.9
inv_txfm_add_32x8_identity_identity_2_12bpc_avx2: 43.6
inv_txfm_add_32x8_identity_identity_3_12bpc_c: 1519.4
inv_txfm_add_32x8_identity_identity_3_12bpc_avx2: 67.8
inv_txfm_add_32x8_identity_identity_4_12bpc_c: 1523.2
inv_txfm_add_32x8_identity_identity_4_12bpc_avx2: 91.6
---
inv_txfm_add_8x32_dct_dct_0_12bpc_c: 334.6
inv_txfm_add_8x32_dct_dct_0_12bpc_avx2: 66.0
inv_txfm_add_8x32_dct_dct_1_12bpc_c: 7929.7
inv_txfm_add_8x32_dct_dct_1_12bpc_avx2: 489.3
inv_txfm_add_8x32_dct_dct_2_12bpc_c: 7925.8
inv_txfm_add_8x32_dct_dct_2_12bpc_avx2: 547.1
inv_txfm_add_8x32_dct_dct_3_12bpc_c: 7928.9
inv_txfm_add_8x32_dct_dct_3_12bpc_avx2: 647.8
inv_txfm_add_8x32_dct_dct_4_12bpc_c: 7916.1
inv_txfm_add_8x32_dct_dct_4_12bpc_avx2: 701.0
inv_txfm_add_8x32_identity_identity_0_12bpc_c: 2413.1
inv_txfm_add_8x32_identity_identity_0_12bpc_avx2: 28.6
inv_txfm_add_8x32_identity_identity_1_12bpc_c: 2415.2
inv_txfm_add_8x32_identity_identity_1_12bpc_avx2: 28.6
inv_txfm_add_8x32_identity_identity_2_12bpc_c: 2413.7
inv_txfm_add_8x32_identity_identity_2_12bpc_avx2: 55.1
inv_txfm_add_8x32_identity_identity_3_12bpc_c: 2415.4
inv_txfm_add_8x32_identity_identity_3_12bpc_avx2: 85.3
inv_txfm_add_8x32_identity_identity_4_12bpc_c: 2401.8
inv_txfm_add_8x32_identity_identity_4_12bpc_avx2: 116.8
---
From section 6.8.2 in the AV1 spec:
"It is a requirement of bitstream conformance that when show_existing_frame is
used to show a previous frame with RefFrameType[ frame_to_show_map_idx ] equal
to KEY_FRAME, that the frame is output via the show_existing_frame mechanism at
most once."
---
From section 6.8.2 in the AV1 spec:
"It is a requirement of bitstream conformance that when show_existing_frame
is used to show a previous frame, that the value of showable_frame for the
previous frame was equal to 1."
---
From section 6.8.2 in the AV1 spec:
"If frame_type is equal to INTRA_ONLY_FRAME, it is a requirement of bitstream
conformance that refresh_frame_flags is not equal to 0xff."
Make this a soft requirement by checking whether strict standard
compliance is enabled.
---
There's an assert on n_fc == 1 at the beginning of the function. There cannot
be a second pass used here.
Signed-off-by: Steve Lhomme <robux4@videolabs.io>
---
Trigger the new sequence header event flag on the next visible picture
in display order.
If the first picture in coding order after a new sequence header is parsed is
not visible, the first picture output by dav1d after the fact (which is coded
after the aforementioned invisible picture) would not trigger the new seq
header event flag as expected, despite being the first containing a reference
to a new sequence header.
Assuming the invisible picture is ever output, the result of this change will
be two pictures signaling a new sequence header was seen despite there being
only one new sequence header.
---
Set f->n_tile_data to 0 after the dav1d_decode_frame_exit() call in
dav1d_decode_frame(). dav1d_decode_frame_exit() unrefs every element in
use in the f->tile array, so it is good to set f->n_tile_data to 0 to
indicate that no elements are in use.
We are already doing this after all other dav1d_decode_frame_exit()
calls.
NOTE: It is tempting to have dav1d_decode_frame_exit() itself set
f->n_tile_data to 0. I did not do that in this merge request, because
the following is a common pattern:
dav1d_decode_frame_exit(f, error);
f->n_tile_data = 0;
pthread_cond_signal(&f->task_thread.cond);
corresponding to the waiting code:
while (f->n_tile_data > 0)
pthread_cond_wait(&f->task_thread.cond,
&c->task_thread.lock);
I wonder if f->n_tile_data is set to 0 outside dav1d_decode_frame_exit()
to make clear the association of f->n_tile_data with the condition
variable f->task_thread.cond.
---
Split out common parts into separate functions. This reduces the
overall binary size by more than 5 KiB.