An investigation of rendering capabilities and results in KeyShot using the 64 core AMD Threadripper 3990X and the 72 RT core NVIDIA Quadro RTX 6000.
At the Luxion office, we have a new workstation equipped with an AMD Threadripper 3990X CPU and an NVIDIA Quadro RTX 6000 GPU. They are both priced at around $3500 and each currently represents the best performance available from a CPU or a GPU in a workstation.
The AMD Threadripper 3990X CPU is based on the amazing Zen architecture that AMD introduced in 2017. The 3990X is currently the fastest workstation CPU (ignoring the server-based AMD EPYC processors). It has 64 cores and allows for 128 simultaneous threads. It has a total of 292 MB of on-chip cache, and a memory bandwidth of 95 GB/s. It uses a state-of-the-art 7nm process and has a power consumption around 280 W.
The NVIDIA Quadro RTX 6000 GPU is based on the revolutionary Turing architecture that NVIDIA introduced in 2018. It comes with 72 RT cores dedicated to ray tracing, 4608 CUDA cores dedicated to shading and general computation, and 576 Tensor cores for deep learning and denoising. It has 24 GB of GDDR6 memory with a bandwidth of 672 GB/s. It uses a 12nm process and has a power consumption of around 295 W.
Luxion demonstrated interactive ray tracing technology to the public in March 2006. The ray tracing code at the time was running on the AMD Opteron architecture. Since then, we have refined our ray tracing code to take full advantage of the latest CPU developments. In 2010, we demonstrated KeyShot on a 40 core / 80 thread Intel quad-socket Westmere-based workstation including interactive ray tracing of more than 1 billion unique polygons. Internally, Luxion worked on GPU ray tracing back in 2011, but after thorough analysis, we came to the conclusion that the limited memory and performance at the time did not make it competitive with our CPU renderer.
In 2018, this all changed when NVIDIA introduced the RTX architecture with dedicated ray tracing hardware. At Luxion, we decided this would be the time to support GPU rendering, and KeyShot 9, released November 2019, added full support for GPU rendering using RTX and OptiX 7. We have kept CPU rendering separate, providing users an option to use the CPU, as in all previous KeyShot versions, or the new KeyShot 9 GPU rendering.
GPU rendering uses slightly different algorithms because GPUs perform best with uniform parallel workloads. As a result, the GPU algorithms converge more slowly to a noise-free image than the CPU algorithms. However, the large number of compute threads on the GPU allows for much higher throughput, and the recent addition of fast denoising algorithms has further bridged the gap between brute-force GPU algorithms and more sophisticated CPU algorithms.
KeyShot Benchmark scene by Magnus Skogsfjord
Over the years, we have received many questions on which CPU would provide the best performance and, with the introduction of GPU rendering, we are getting even more questions.
In KeyShot, the famous camera scene was used to test performance for many years. However, it's quite simple and doesn't really show the benefit of very fast hardware. With KeyShot 9.3, we introduced the new KeyShot Benchmark tool, available with the free KeyShot Viewer, that enables CPU and/or GPU benchmarking. Our benchmark test uses the beautiful microphone product scene created by Magnus Skogsfjord. To provide a proper benchmark, we have calibrated the output quality to match on both CPU and GPU, so the GPU is tracing more rays to get the final image. As the base performance, we use an 8 core / 16 thread i7-6900K CPU running at 3.2 GHz, which is calibrated to a value of 1.0.
KeyShot benchmark results on the new workstation:
AMD Threadripper 3990X: 11.83
NVIDIA Quadro RTX 6000: 34.73
The output from the KeyShot Viewer Benchmark
For the product scene on our workstation, these results show the GPU is roughly three times faster than the CPU. For both the CPU and the GPU, KeyShot was able to maintain a workload above 98%, which means that KeyShot is fully utilizing the parallel aspects of the hardware.
The AMD Threadripper 3990X delivers the fastest workstation CPU performance for KeyShot to date. It is roughly twice as fast as the AMD Threadripper 2990WX processor (32 cores / 64 threads) and almost 12 times faster than the Intel i7-6900K processor (8 cores / 16 threads).
Likewise, the NVIDIA Quadro RTX 6000 delivers the fastest single GPU performance for KeyShot to date. The new RTX cards are approximately six times faster than the previous generation NVIDIA GPUs based on the Pascal architecture. This shows the benefit of the new RT cores added in the Turing architecture, which enabled the GPU to push the ray tracing performance for a real product scene beyond the best CPU available.
The performance roughly matches what we have seen with other product scenes. An initial analysis indicates that memory bandwidth is the limiting factor on both the CPU and the GPU for a number of the product scenes we analyzed.
While running the benchmark we also noted the power consumption reported by the UPS connected to the workstation.
AMD Threadripper 3990X: 530 W
NVIDIA Quadro RTX 6000: 450 W
The values are fairly close, but the GPU run drew less power. This was a bit surprising to us, as we had expected the brute-force nature of the GPU algorithms and the 12nm process to result in higher power consumption. The values speak to the efficiency of modern GPU architectures.
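The benchmark scores and the power readings can be combined into a rough efficiency figure. The following is a minimal Python sketch, not part of KeyShot, that divides each score by the corresponding UPS reading. It assumes the UPS values are whole-system power during each run, so the result describes the workstation rather than either chip in isolation. The names RESULTS and score_per_kilowatt are made up for illustration.

```python
# Benchmark scores and UPS power readings quoted in this article.
# Assumption: each UPS reading is whole-system power during that run.
RESULTS = {
    "AMD Threadripper 3990X": {"score": 11.83, "watts": 530.0},
    "NVIDIA Quadro RTX 6000": {"score": 34.73, "watts": 450.0},
}


def score_per_kilowatt(score: float, watts: float) -> float:
    """Benchmark score delivered per kilowatt drawn at the wall."""
    return score / (watts / 1000.0)


cpu = RESULTS["AMD Threadripper 3990X"]
gpu = RESULTS["NVIDIA Quadro RTX 6000"]

cpu_efficiency = score_per_kilowatt(cpu["score"], cpu["watts"])
gpu_efficiency = score_per_kilowatt(gpu["score"], gpu["watts"])

# Raw speed ratio and efficiency ratio of the GPU run over the CPU run.
speedup = gpu["score"] / cpu["score"]
efficiency_ratio = gpu_efficiency / cpu_efficiency

print(f"GPU / CPU speed:       {speedup:.2f}x")
print(f"CPU score per kW:      {cpu_efficiency:.1f}")
print(f"GPU score per kW:      {gpu_efficiency:.1f}")
print(f"GPU / CPU efficiency:  {efficiency_ratio:.2f}x")
```

With the figures quoted here, the GPU run comes out roughly 2.9 times faster and about 3.5 times more efficient per watt measured at the UPS.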
KeyShot GPU Rendering
KeyShot runs very fast using the new NVIDIA RTX cards. Once the data is uploaded and shaders are compiled, the workflow is very smooth and fast. One of the challenges with GPUs is running out of memory. Very complex scenes with a lot of geometry and textures may not fit on the GPU, which leaves the CPU as the only choice. It is possible to swap textures from CPU memory to the GPU, but this comes at a performance penalty.
The GPU can handle pretty complex scenes though. With two RTX 5000 cards using NVIDIA NVLink for a combined 32 GB of memory, we have been able to ray trace scenes containing 1.37 billion unique triangles. However, sharing geometry over NVLink does come at a fairly significant performance hit. For complex scenes, the Quadro RTX 6000 with 24 GB or the Quadro RTX 8000 with 48 GB offers quite a bit of room for geometry and textures, and these cards may still use NVLink offering up to 96 GB of shared GPU memory.
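To get a feel for how tight the memory budget is in that 1.37 billion triangle scene, here is a back-of-the-envelope Python sketch. It assumes the combined 32 GB is decimal gigabytes and that all of it is available for geometry, which overstates the real budget since textures, acceleration structures, and frame buffers also need room. The comparison value is a naive unindexed triangle of three vertices with three 32-bit floats each. None of this reflects how KeyShot actually stores geometry, which the article does not describe.

```python
# Assumption: 32 GB means 32 * 10^9 bytes, all of it usable for geometry.
COMBINED_MEMORY_BYTES = 32e9
UNIQUE_TRIANGLES = 1.37e9

# Naive storage: 3 vertices * 3 coordinates * 4 bytes (32-bit float).
NAIVE_TRIANGLE_BYTES = 3 * 3 * 4


def bytes_per_triangle(memory_bytes: float, triangle_count: float) -> float:
    """Upper bound on the average memory available per triangle."""
    return memory_bytes / triangle_count


budget = bytes_per_triangle(COMBINED_MEMORY_BYTES, UNIQUE_TRIANGLES)

print(f"Budget per triangle: {budget:.1f} bytes")
print(f"Naive unindexed triangle: {NAIVE_TRIANGLE_BYTES} bytes")
```

Under these assumptions the budget is about 23 bytes per triangle, less than the 36 bytes of the naive layout, which gives a sense of how compactly a scene of this size has to be represented.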
Scene with 1.37 billion triangles rendered on two RTX 5000 using NVLink. Image by Dries Vervoort.
The new Turing architecture also comes with a very fast AI denoiser that takes only a few tens of milliseconds to denoise a frame. This is a huge benefit for interactive workflows compared to the CPU, where a state-of-the-art denoiser based on deep learning takes a few seconds.
Another significant benefit of GPU rendering is the ease with which performance may be scaled by simply adding more GPUs to the workstation. Most desktop workstations support multiple GPUs, and we have found that performance scales almost linearly with each additional GPU.
KeyShot CPU Rendering
With the high performance obtained with the RTX architecture, one might ask whether there is still a need for CPU rendering. The answer depends on the workflow. For most product scenes the GPU does deliver blazing performance, but for highly complex scenes with a lot of geometry and textures, the CPU with access to more memory becomes competitive. While it might be possible to render such complex scenes on the GPU, it is easier to manage the data on the CPU, and the overhead of moving data between the GPU and main memory will likely mean the CPU is the better choice even from a performance point of view.
In addition, the CPU outperforms the GPU in scenes with highly divergent shading behavior. An example is the foam head by Esben Oxholm. It uses a heterogeneous scattering medium modulated by a 3D procedural texture to achieve a complex foam appearance. On the GPU, the scattering medium in combination with the location-dependent procedural texturing results in divergent behavior, which slows the GPU quite significantly. As a consequence, the 3990X is three times faster at rendering this scene than the RTX 6000.
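The cost of divergence can be illustrated with a toy model. NVIDIA GPUs schedule threads in groups of 32 (warps) that execute in lockstep, so when threads in one warp take different code paths, the paths are run one after another while the other threads wait. The Python sketch below is a deliberately simplified model of that effect, not a description of KeyShot's or OptiX's scheduling. The path names and costs in PATH_COST are hypothetical.

```python
WARP_SIZE = 32  # threads that execute in lockstep on NVIDIA GPUs

# Hypothetical cost (arbitrary time units) of each shading path.
PATH_COST = {"surface": 10, "scatter": 30}


def warp_time(paths, path_cost):
    """Time for one warp: each distinct path taken is executed in turn."""
    return sum(path_cost[p] for p in set(paths))


def utilization(paths, path_cost):
    """Fraction of the warp's thread-time spent on useful work."""
    useful = sum(path_cost[p] for p in paths)
    return useful / (len(paths) * warp_time(paths, path_cost))


# All 32 threads shade the same kind of hit.
coherent = ["surface"] * WARP_SIZE
# Half the threads hit a surface, half end up in the scattering path.
divergent = ["surface"] * 16 + ["scatter"] * 16

print(f"Coherent warp utilization:  {utilization(coherent, PATH_COST):.2f}")
print(f"Divergent warp utilization: {utilization(divergent, PATH_COST):.2f}")
```

In this model the coherent warp keeps every thread busy, while the evenly split warp does useful work only half the time. A CPU core running one ray at a time does not pay this particular penalty.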
Foam created using heterogeneous scattering media and a modulated procedural texture. Image by Esben Oxholm.
Another area where the CPU has advantages is accuracy. KeyShot uses double-precision (64-bit) floating-point for some of the critical parts of the ray tracing core to ensure highly accurate handling of geometry. The RTX architecture relies on single-precision (32-bit) floating-point, which does limit the accuracy in large scenes, and can lead to gaps or inaccurate shading.
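The difference in precision is easy to quantify. The short Python sketch below uses only the standard library (math.ulp requires Python 3.9 or later) to compare the spacing between adjacent representable values in 32-bit and 64-bit floating-point at a given coordinate. The choice of millimeters as scene unit and a coordinate of 100000 (100 m from the origin) is an assumption for illustration, and the helper names are made up.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_spacing(x: float) -> float:
    """Gap from x (as float32) to the next larger float32, for x > 0."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    next_up = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return next_up - to_float32(x)


# Assumption: scene units are millimeters, point is 100 m from the origin.
coordinate = 100000.0

print(f"float32 spacing: {float32_spacing(coordinate)} mm")
print(f"float64 spacing: {math.ulp(coordinate)} mm")

# A 0.001 mm offset is lost entirely when stored as float32.
print(to_float32(coordinate + 0.001) == to_float32(coordinate))
```

At this distance, single precision can only resolve steps of about 0.0078 mm, while double precision resolves steps of roughly 1.5e-11 mm. Detail finer than the single-precision step collapses onto the same coordinate, which is the kind of error that can show up as gaps or shading artifacts.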
For the ultimate accuracy, KeyShot on the CPU does have one extra trick: direct ray tracing of NURBS. Ray tracing NURBS means that the geometry is always going to be smooth. Direct NURBS ray tracing is beneficial when working with large models that contain small parts. When these models are converted to triangles, the small parts often use fewer triangles and consequently look faceted up close. In contrast, NURBS models look smooth at all distances. NURBS rendering is slower than rendering triangles, but it allows the user to work on a relatively coarse triangle model during setup and then switch to accurate NURBS ray tracing for high-resolution final frame rendering without having to worry about visibly faceted geometry.
Curved surfaces can appear faceted when rendered up close (left), while ray traced surfaces using the original NURBS data look smooth at any distance (right).
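Why small parts look faceted can be seen from simple geometry. When a circular cross-section is approximated by straight segments, the kink between adjacent facets depends only on the number of segments, not on the radius. The Python sketch below computes that angle and the maximum gap between the true circle and the segments. The segment counts are hypothetical and do not reflect how KeyShot tessellates a model.

```python
import math


def facet_angle_degrees(segments: int) -> float:
    """Change in direction between two adjacent facets of a circle."""
    return 360.0 / segments


def chord_error(radius: float, segments: int) -> float:
    """Largest gap between the true circle and its straight segments."""
    return radius * (1.0 - math.cos(math.pi / segments))


# Hypothetical tessellation: a large housing and a small screw head.
for name, radius_mm, segments in [("housing", 200.0, 96), ("screw", 2.0, 12)]:
    print(
        f"{name}: {facet_angle_degrees(segments):.2f} degree kinks, "
        f"{chord_error(radius_mm, segments):.4f} mm max deviation"
    )
```

The screw deviates less in absolute terms, yet its 30 degree kinks are far more visible in a close-up than the 3.75 degree kinks on the housing. Ray tracing the NURBS surface directly avoids the approximation altogether.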
Both the AMD Threadripper 3990X and the NVIDIA Quadro RTX 6000 are fantastic for rendering in KeyShot. The ideal workstation should have both!
The Threadripper 3990X is very fast at setting up the scene, processing the geometry, and finally rendering it. It provides a very smooth interactive workflow and allows for direct NURBS ray tracing and scene complexity only limited by the available memory. The 64 cores / 128 threads push the performance to almost 12 times faster than an 8 core / 16 thread Intel i7 CPU. The AMD Threadripper 3990X is the fastest (and I would add best) CPU you can buy for rendering today.
Likewise, the Quadro RTX 6000 card is blazingly fast at rendering. The new RT cores elevate the performance to almost 35 times faster than the 8 core / 16 thread Intel i7 and roughly 3 times faster than the 3990X CPU. Combined with denoising, the interactive workflow on the Quadro RTX 6000 is fantastic, giving almost instant final results, and for offline rendering of animations it is a godsend, cranking out frames faster than anything we have seen before. For the fastest possible rendering performance, one or more Quadro RTX 6000 cards in your workstation are highly recommended.