Age | Commit message (Collapse) | Author |
|
Simplifies intersection code a little and slightly improves precision regarding
self intersection.
The parametric texture coordinate in shader nodes is still the same as before
for compatibility.
|
|
|
|
Runtime switching between Embree and BVH2 got lost.
|
|
The number of Execution Units and resident "threads" (simd width * threads
per EUs) are now exposed and used to select the number of states using
a simplified heuristic.
|
|
|
|
float8 is a reserved type in Metal, but is not implemented. So rename to
float8_t for now.
Also move back intersection handlers to kernel.metal, they can't be in the
class that encapsulates the other Metal kernel functions.
|
|
This was tested in some places to check if code was being compiled for the
CPU, however this is only defined in the kernel. Checking __KERNEL_GPU__
always works.
|
|
This patch adds required math functions for float8 to make it possible
using float8 instead of float3 for color data.
Differential Revision: https://developer.blender.org/D15525
|
|
Having the OptiX/MetalRT/Embree/MetalRT implementations all in one file with
many #ifdefs became too confusing. Instead split it up per device, and also
move it together with device specific hit/filter/intersect functions and
associated data types.
|
|
No need anymore to have a difference between CPU/GPU, all distances
remain in world space.
|
|
All our intersections functions now work with unnormalized ray direction,
which means we no longer need to transform ray distance between world and
object space, they can all remain in world space.
There doesn't seem to be any real performance difference one way or the
other, but it does simplify the code.
|
|
This helps with debugging, and gives a slightly closer match between CPU
and CUDA/HIP/Metal renders when it comes to ray tracing precision.
|
|
|
|
For transparency, volume and light intersection rays, adjust these distances
rather than the ray start position. This way we increment the start distance
by the smallest possible float increment to avoid self intersections, and be
sure it works as the distance compared to be will be exactly the same as
before, due to the ray start position and direction remaining the same.
Fix T98764, T96537, hair ray tracing precision issues.
Differential Revision: https://developer.blender.org/D15455
|
|
|
|
This was added for Metal, but also gives good results with CUDA and OptiX.
Also enable it for future Apple GPUs instead of only M1 and M2, since this has
been shown to help across multiple GPUs so the better bet seems to enable
rather than disable it.
Also moves some of the logic outside of the Metal device code, and always
enables the code in the kernel since other devices don't do dynamic compile.
Time per sample with OptiX + RTX A6000:
new old
barbershop_interior 0.0730s 0.0727s
bmw27 0.0047s 0.0053s
classroom 0.0428s 0.0464s
fishy_cat 0.0102s 0.0108s
junkshop 0.0366s 0.0395s
koro 0.0567s 0.0578s
monster 0.0206s 0.0223s
pabellon 0.0158s 0.0174s
sponza 0.0088s 0.0100s
spring 0.1267s 0.1280s
victor 0.0524s 0.0531s
wdas_cloud 0.0817s 0.0816s
Ref D15331, T87836
|
|
The Metal backend now compiles and caches a second set of kernels which are
optimized for scene contents, enabled for Apple Silicon.
The implementation supports doing this both for intersection and shading
kernels. However this is currently only enabled for intersection kernels that
are quick to compile, and already give a good speedup. Enabling this for
shading kernels would be faster still, however this also causes a long wait
times and would need a good user interface to control this.
M1 Max samples per minute (macOS 13.0):
PSO_GENERIC PSO_SPECIALIZED_INTERSECT PSO_SPECIALIZED_SHADE
barbershop_interior 83.4 89.5 93.7
bmw27 1486.1 1671.0 1825.8
classroom 175.2 196.8 206.3
fishy_cat 674.2 704.3 719.3
junkshop 205.4 212.0 257.7
koro 310.1 336.1 342.8
monster 376.7 418.6 424.1
pabellon 273.5 325.4 339.8
sponza 830.6 929.6 1142.4
victor 86.7 96.4 96.3
wdas_cloud 111.8 112.7 183.1
Code contributed by Jason Fielder, Morteza Mostajabodaveh and Michael Jones
Differential Revision: https://developer.blender.org/D14645
|
|
To be used for specialization in Metal, to automatically leave out unused nodes
from the kernel.
Ref D14645
|
|
To be used for specialization on Metal in a following commit, turning these
members into compile time constants.
Ref D14645
|
|
|
|
When the solve is successful, the light sample needs to be updated since the
effective shading point is now on the last refractive interface. Spread was
not taken into account, creating false caustics.
Differential Revision: https://developer.blender.org/D15449
|
|
|
|
This patch partitions the active indices into chunks prior to sorting by material in order to tradeoff some material coherence for better locality. On Apple Silicon GPUs (particularly higher end M1-family GPUs), we observe overall render time speedups of up to 15%. The partitioning is implemented by repeating the range of `shader_sort_key` for each partition, and encoding a "locator" key which distributes the indices into sorted chunks.
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D15331
|
|
The current specific CentOS7 workaround we have for AoT, which is to
disable __FAST_MATH__ by using -fhonor-nans, now also fixes the
compilation issue for JIT as well since at least driver 23570.
|
|
with a very high min-driver version requirement, placeholder until JIT
CentOS runtime compilation issue gets fixed in a defined version.
min-driver version check can be worked around by setting
CYCLES_ONEAPI_ALL_DEVICES environment variable.
|
|
We used it only to access device id for explicitly allowing Arc GPUs.
It made the backend require ze_loader.dll which could be problematic if
we end up using direct linking.
I've replaced filtering based on PCI device id by using other HW properties
instead (EUs, threads per EU), that are now available through Level-Zero.
|
|
|
|
Initially oneAPI implementation have waited after each memory
operation, even if there was no need for this. Now, the implementation
will wait only if it is really necessary - it have improved
performance noticeble for some scenes and a bit for the rest of them.
|
|
Identical Intel GPUs ended up with the same id.
Added PCI BDF to the id to make it unique.
|
|
|
|
|
|
This patch adds a new Cycles device with similar functionality to the
existing GPU devices. Kernel compilation and runtime interaction happen
via oneAPI DPC++ compiler and SYCL API.
This implementation is primarly focusing on Intel® Arc™ GPUs and other
future Intel GPUs. The first supported drivers are 101.1660 on Windows
and 22.10.22597 on Linux.
The necessary tools for compilation are:
- A SYCL compiler such as oneAPI DPC++ compiler or
https://github.com/intel/llvm
- Intel® oneAPI Level Zero which is used for low level device queries:
https://github.com/oneapi-src/level-zero
- To optionally generate prebuilt graphics binaries: Intel® Graphics
Compiler All are included in Linux precompiled libraries on svn:
https://svn.blender.org/svnroot/bf-blender/trunk/lib The same goes for
Windows precompiled binaries but for the graphics compiler, available
as "Intel® Graphics Offline Compiler for OpenCL™ Code" from
https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html,
for which path can be set as OCLOC_INSTALL_DIR.
Being based on the open SYCL standard, this implementation could also be
extended to run on other compatible non-Intel hardware in the future.
Reviewed By: sergey, brecht
Differential Revision: https://developer.blender.org/D15254
Co-authored-by: Nikita Sirgienko <nikita.sirgienko@intel.com>
Co-authored-by: Stefan Werner <stefan.werner@intel.com>
|
|
This could result in wrong skipping of SVM nodes in the graph. Now make the
logic consistent with the clamping in the OSL implementation and constant
folding.
Thanks to Christophe Hery for finding the problem and providing the fix.
|
|
Enables Vega and Vega II GPUs as well as Vega APU, using changes in HIP code
to support 64-bit waves and a new HIP SDK version.
Tested with Radeon WX9100, Radeon VII GPUs and Ryzen 7 PRO 5850U with Radeon
Graphics APU.
Ref T96740, T91571
Differential Revision: https://developer.blender.org/D15242
|
|
This change helps decrease Intel GPU binaries compile time by 5-10
minutes without impacting other backends.
Reviewed By: sergey, brecht
Differential Revision: http://developer.blender.org/D15273
|
|
This patch unifies the names of math functions for different data types and uses
overloading instead. The goal is to make it possible to swap out all the float3
variables containing RGB data with something else, with as few as possible
changes to the code. It's a requirement for future spectral rendering patches.
Differential Revision: https://developer.blender.org/D15276
|
|
* Rename "texture" to "data array". This has not used textures for a long time,
there are just global memory arrays now. (On old CUDA GPUs there was a cache
for textures but not global memory, so we used to put all data in textures.)
* For CUDA and HIP, put globals in KernelParams struct like other devices.
* Drop __ prefix for data array names, no possibility for naming conflict now that
these are in a struct.
|
|
And use them more consistently than before.
|
|
|
|
This patch adds some useful debugging & profiling env vars to the Metal backend:
- `CYCLES_METAL_PROFILING`: output a per-kernel timing report at the end of the render
- `CYCLES_METAL_DEBUG`: enable per-dispatch tracing (very verbose)
- `CYCLES_DEBUG_METAL_CAPTURE_KERNEL`: enable programatic .gputrace capture for a specified kernel index
Here's an example of the timing report with `CYCLES_METAL_PROFILING` enabled:
```
---------------------------------------------------------------------------------------------------
Kernel name Total threads Dispatches Avg. T/D Time Time%
---------------------------------------------------------------------------------------------------
integrator_init_from_camera 657,407,232 161 4,083,274 0.24s 0.51%
integrator_intersect_closest 1,629,288,440 681 2,392,494 15.18s 32.12%
integrator_intersect_shadow 751,652,291 470 1,599,260 5.80s 12.28%
integrator_shade_background 304,612,074 263 1,158,220 1.16s 2.45%
integrator_shade_surface 1,159,764,041 676 1,715,627 20.57s 43.52%
integrator_shade_shadow 598,885,847 418 1,432,741 1.27s 2.69%
integrator_queued_paths_array 2,969,650,130 805 3,689,006 0.35s 0.74%
integrator_queued_shadow_paths_array 593,936,619 379 1,567,115 0.14s 0.29%
integrator_terminated_paths_array 22,205,417 155 143,260 0.05s 0.10%
integrator_sorted_paths_array 2,517,140,043 676 3,723,579 1.65s 3.50%
integrator_compact_paths_array 648,912,748 155 4,186,533 0.03s 0.07%
integrator_compact_states 20,872,687 155 134,662 0.14s 0.29%
integrator_terminated_shadow_paths_array 374,100,675 438 854,111 0.16s 0.33%
integrator_compact_shadow_paths_array 503,768,657 438 1,150,156 0.05s 0.10%
integrator_compact_shadow_states 37,664,941 202 186,460 0.23s 0.50%
integrator_reset 25,165,824 6 4,194,304 0.06s 0.12%
film_convert_combined_half_rgba 3,110,400 6 518,400 0.00s 0.01%
prefix_sum 676 676 1 0.19s 0.40%
---------------------------------------------------------------------------------------------------
6,760 47.27s 100.00%
---------------------------------------------------------------------------------------------------
```
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D15044
|
|
|
|
Must include the AOV writing feature in background shader evaluation.
Differential Revision: https://developer.blender.org/D15114
|
|
|
|
Move MNEE to own kernel, separate from shader ray-tracing. This does introduce
the limitation that a shader can't use both MNEE and AO/bevel, but that seems
like the better trade-off for now.
We can experiment with bigger kernel organization changes later.
Differential Revision: https://developer.blender.org/D15070
|
|
|
|
|
|
Regenerate blackbody RGB curve fit to not clamp values, and extend down to
800K since it does now change below 965K.
Note that as before, blackbody is only defined in the range 800K to 12000K
and has a fixed value outside of that. But within that range there should
be no more unnecessary gamut clamping.
|
|
This patch makes it possible to change the precision with which to
store volume data in the NanoVDB data structure (as float, half, or
using variable bit quantization) via the previously unused precision
field in the volume data block.
It makes it possible to further reduce memory usage during
rendering, at a slight cost to the visual detail of a volume.
Differential Revision: https://developer.blender.org/D10023
|
|
|
|
Found those missing casts while looking into a crash report made in
the Blender Chat. Was unable to reproduce the crash, but the casts
should totally be there to avoid integer overflow.
|