Age | Commit message (Collapse) | Author |
|
|
|
This gave a 1.1x speedup, however also leads to very long compile times
that make it seems like Blender has stopped working.
This can be brought back in the future behind an option that users can
explicitly enabled.
Fix T100102
Ref D14923, D14763, T92212
|
|
|
|
|
|
* Store compact ray differentials in ShaderData and compute full differentials
on demand. This reduces register pressure on the GPU.
* Remove BSDF differential code that was effectively doing nothing as the
differential orientation was discarded when making it compact.
This gives a 1-5% speedup with RTX A6000 + OptiX in our benchmarks, with the
bigger speedups in simpler scenes.
Renders appear to be identical except for the Both displacement option that
does both displacement and bump.
Differential Revision: https://developer.blender.org/D15677
|
|
|
|
|
|
|
|
|
|
|
|
Detect cases where a ray-intersection would miss the current triangle, which if
the intersection is strictly watertight, implies that a neighboring triangle would
incorrectly be hit instead.
When that is detected, apply a ray-offset. The idea being that we only want to
introduce potential error from ray offsets if we really need to.
This work for BVH2 and Embree, as we are able to match the ray-interesction
bit-for-bit, though doing so for Embree requires ugly hacks. Tiny differences
like fused-multiply-add or dot product intrinstics in matrix inversion and ray
intersection needed to be matched exactly, so this is fragile.
Unfortunately we're not able to do the same for OptiX or MetalRT, since those
implementations are unknown (and possibly impossible to match as hardware
instructions). Still artifacts are much reduced, though not eliminated.
Ref T97259
Differential Revision: https://developer.blender.org/D15559
|
|
Helps improve ray-tracing precision. This is a bit complicated as it requires
different implementation depending on the CPU architecture.
|
|
These replace float3 and packed_float3 in various places in the kernel where a
spectral color representation will be used in the future. That representation
will require more than 3 channels and conversion to from/RGB. The kernel code
was refactored to remove the assumption that Spectrum and RGB colors are the
same thing.
There are no functional changes, Spectrum is still a float3 and the conversion
functions are no-ops.
Differential Revision: https://developer.blender.org/D15535
|
|
Now all the same ones are available on CPU and GPU, which was previously not
possible due to lack of operator overloadng in OpenCL. Print functions are
no-ops on some GPUs.
Ref D15535
|
|
|
|
|
|
|
|
zebin format is critical for the compatibility of AoT graphics binaries
across driver versions. It was previously disabled on Linux due to
runtime issues that are now fixed in
https://github.com/intel/compute-runtime/releases/tag/22.31.23852.
The minimum supported driver version isn't bumped to this one yet as
current codebase with current IGC compiler does actually run fine on
earlier drivers and is not running into these issues anymore.
|
|
|
|
This was by design, but maybe not so useful in practice. It's always possible to
set alpha to 1 in compositing if needed.
|
|
Emrbee
It should consistently use the Cycles pirmitive ID for self intersection detection,
not the one from the OptiX or Embree acceleration structure.
Differential Revision: https://developer.blender.org/D15632
|
|
|
|
After recent changes to change barycentric coordinate convention.
|
|
Assume that all faces using the smae material form a closed mesh, so that
joining meshes gives the same result as separate meshes.
It does mean that using different materials on different sides of one
closed mesh do not work, but the meaning of that is poorly defined anyway
if there is a volume interior.
|
|
|
|
Intel® Arc™ GPUs
Recently, performance with oneAPI have regressed due some recent
changes in Blender itself. This commit's changes is resolving this
and also improve compilation time for oneAPI backend first
execution (or Blender compilation time in case of AoT).
Regression have appeared after 5152c7c152e and not related to the
changes itself, but increase of kernels complexity introduced with
it. Changes in this commit is marking some Blender functions as
noinlined for oneAPI backend, which helps GPU compiler to deal with
this complexity without any negative side-effects on performance.
|
|
* OneAPI: remove separate float3 definition
* OneAPI: disable operator[] to match other GPUs
* OneAPI: make int3 compact to match other GPUs
* Use #pragma once
* Add __KERNEL_NATIVE_VECTOR_TYPES__ to simplify checks
* Remove unused vector3
|
|
Simplifies intersection code a little and slightly improves precision regarding
self intersection.
The parametric texture coordinate in shader nodes is still the same as before
for compatibility.
|
|
|
|
Runtime switching between Embree and BVH2 got lost.
|
|
The number of Execution Units and resident "threads" (simd width * threads
per EUs) are now exposed and used to select the number of states using
a simplified heuristic.
|
|
|
|
float8 is a reserved type in Metal, but is not implemented. So rename to
float8_t for now.
Also move back intersection handlers to kernel.metal, they can't be in the
class that encapsulates the other Metal kernel functions.
|
|
This was tested in some places to check if code was being compiled for the
CPU, however this is only defined in the kernel. Checking __KERNEL_GPU__
always works.
|
|
This patch adds required math functions for float8 to make it possible
using float8 instead of float3 for color data.
Differential Revision: https://developer.blender.org/D15525
|
|
Having the OptiX/MetalRT/Embree/MetalRT implementations all in one file with
many #ifdefs became too confusing. Instead split it up per device, and also
move it together with device specific hit/filter/intersect functions and
associated data types.
|
|
No need anymore to have a difference between CPU/GPU, all distances
remain in world space.
|
|
All our intersections functions now work with unnormalized ray direction,
which means we no longer need to transform ray distance between world and
object space, they can all remain in world space.
There doesn't seem to be any real performance difference one way or the
other, but it does simplify the code.
|
|
This helps with debugging, and gives a slightly closer match between CPU
and CUDA/HIP/Metal renders when it comes to ray tracing precision.
|
|
|
|
For transparency, volume and light intersection rays, adjust these distances
rather than the ray start position. This way we increment the start distance
by the smallest possible float increment to avoid self intersections, and be
sure it works as the distance compared to be will be exactly the same as
before, due to the ray start position and direction remaining the same.
Fix T98764, T96537, hair ray tracing precision issues.
Differential Revision: https://developer.blender.org/D15455
|
|
|
|
This was added for Metal, but also gives good results with CUDA and OptiX.
Also enable it for future Apple GPUs instead of only M1 and M2, since this has
been shown to help across multiple GPUs so the better bet seems to enable
rather than disable it.
Also moves some of the logic outside of the Metal device code, and always
enables the code in the kernel since other devices don't do dynamic compile.
Time per sample with OptiX + RTX A6000:
new old
barbershop_interior 0.0730s 0.0727s
bmw27 0.0047s 0.0053s
classroom 0.0428s 0.0464s
fishy_cat 0.0102s 0.0108s
junkshop 0.0366s 0.0395s
koro 0.0567s 0.0578s
monster 0.0206s 0.0223s
pabellon 0.0158s 0.0174s
sponza 0.0088s 0.0100s
spring 0.1267s 0.1280s
victor 0.0524s 0.0531s
wdas_cloud 0.0817s 0.0816s
Ref D15331, T87836
|
|
The Metal backend now compiles and caches a second set of kernels which are
optimized for scene contents, enabled for Apple Silicon.
The implementation supports doing this both for intersection and shading
kernels. However this is currently only enabled for intersection kernels that
are quick to compile, and already give a good speedup. Enabling this for
shading kernels would be faster still, however this also causes a long wait
times and would need a good user interface to control this.
M1 Max samples per minute (macOS 13.0):
PSO_GENERIC PSO_SPECIALIZED_INTERSECT PSO_SPECIALIZED_SHADE
barbershop_interior 83.4 89.5 93.7
bmw27 1486.1 1671.0 1825.8
classroom 175.2 196.8 206.3
fishy_cat 674.2 704.3 719.3
junkshop 205.4 212.0 257.7
koro 310.1 336.1 342.8
monster 376.7 418.6 424.1
pabellon 273.5 325.4 339.8
sponza 830.6 929.6 1142.4
victor 86.7 96.4 96.3
wdas_cloud 111.8 112.7 183.1
Code contributed by Jason Fielder, Morteza Mostajabodaveh and Michael Jones
Differential Revision: https://developer.blender.org/D14645
|
|
To be used for specialization in Metal, to automatically leave out unused nodes
from the kernel.
Ref D14645
|
|
To be used for specialization on Metal in a following commit, turning these
members into compile time constants.
Ref D14645
|
|
|
|
When the solve is successful, the light sample needs to be updated since the
effective shading point is now on the last refractive interface. Spread was
not taken into account, creating false caustics.
Differential Revision: https://developer.blender.org/D15449
|
|
|
|
This patch partitions the active indices into chunks prior to sorting by material in order to tradeoff some material coherence for better locality. On Apple Silicon GPUs (particularly higher end M1-family GPUs), we observe overall render time speedups of up to 15%. The partitioning is implemented by repeating the range of `shader_sort_key` for each partition, and encoding a "locator" key which distributes the indices into sorted chunks.
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D15331
|