
    So I went and read the paper. Here’s a “short” summary for those of you who aren’t well-versed in path tracing jargon.

    Perhaps I’m misreading the situation here, but to me this manuscript seems like a response to the recent GPU marketing offense in the production rendering space. The context here is that traditionally all production 3D rendering for films has been done on CPUs, but lately products such as Redshift and technologies such as NVIDIA’s RTX extensions have made it clear that there are big wins to be made in GPU-based ray tracing.

    So it’s in Intel’s interest to demonstrate how well CPU-based systems can cope with very large 3D scenes.

    Back to the paper. The point is that Disney’s Moana island scene (a production asset from the 2016 film of the same name) is absolutely huge. Their target is to render it at “interactive framerates”, which usually means getting more than 1 FPS. To get the density of the geometry across, they make the following comparison:

    the Moana Island contains more geometry instances (over 100 M) than most scenes used in ray tracing research contain in final polygons

    Here “instancing” refers to a memory usage optimization where you place the same model multiple times in the scene but store its geometry only once. For example, a hundred leaves in a bush can be made from just a handful of models.
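    The idea can be sketched in a few lines of Python (my own illustration, not the paper’s data structures): the vertex data exists once, and each instance is just a reference plus a transform.

```python
# Minimal sketch of instancing: vertex data is stored once, and each
# instance only holds a reference to the shared mesh plus a transform.
leaf_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# A hundred "leaves" in a bush: each instance stores just a translation.
bush = [{"mesh": leaf_vertices, "translation": (float(i), 0.0, 0.0)}
        for i in range(100)]

# Every instance points at the same geometry object in memory,
# so geometry memory stays constant no matter how many instances exist.
assert all(inst["mesh"] is leaf_vertices for inst in bush)
print(len(bush), "instances,", 1, "copy of the vertex data")
```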

    The large scale can make it difficult to even load the scene into memory. They mention that the original PBRT plain text format scene took 65 minutes to load, but after converting the geometry to a binary format it went down to 15 seconds. The total preprocessing time is still 6 minutes (peak memory use is 104 GiB), which isn’t too bad, since it also includes constructing the bounding volume hierarchy (BVH).
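    The reason binary loading is so much faster is that fixed-size records can be decoded directly, while text must be tokenized and parsed float by float. A toy Python sketch of the two representations (my illustration, not the paper’s converter):

```python
import struct

vertices = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]

# Text form: every coordinate must be tokenized and parsed individually.
text_blob = "\n".join(" ".join(str(c) for c in v) for v in vertices)
parsed = [tuple(float(c) for c in line.split())
          for line in text_blob.splitlines()]

# Binary form: one fixed-layout blob, decoded in a single unpack call.
n = len(vertices) * 3
binary_blob = struct.pack(f"{n}f", *(c for v in vertices for c in v))
floats = struct.unpack(f"{n}f", binary_blob)
decoded = [tuple(floats[i:i + 3]) for i in range(0, n, 3)]

assert parsed == vertices
assert decoded == vertices
```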

    The BVH is a tree that is used to speed up ray tracing of a scene. The idea is that you first check if your ray intersects a rough bounding volume before going in and doing intersection tests for individual polygons. This way it’s fast to drill down to the part of the scene that probably contains the intersected surface. However, in the Moana scene many instanced objects overlap and contain “long, thin” polygons that inflate the bounding volumes (see Figure 4 in the PDF.) This results in a bad top-level BVH that makes ray traversal slower. They didn’t propose any solution to this problem.
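    The cheap rejection test a BVH runs at each node is typically a ray vs axis-aligned bounding box “slab” test. A minimal Python sketch of that test (not the paper’s Embree code):

```python
def ray_hits_aabb(origin, inv_dir, box_min, box_max):
    """Slab test: the ray hits the box iff the per-axis entry/exit
    intervals overlap. inv_dir holds the reciprocals of the ray
    direction components (use a huge value for near-zero components)."""
    t_near, t_far = float("-inf"), float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0, t1 = (lo - o) * inv, (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near <= t_far and t_far >= 0.0

# A ray along +x from the origin hits a unit box centered at (5, 0, 0)
assert ray_hits_aabb((0, 0, 0), (1.0, 1e9, 1e9),
                     (4.5, -0.5, -0.5), (5.5, 0.5, 0.5))
# ...but misses the same box moved up to y = 10.
assert not ray_hits_aabb((0, 0, 0), (1.0, 1e9, 1e9),
                         (4.5, 9.5, -0.5), (5.5, 10.5, 0.5))
```

    A long, thin diagonal polygon gets a large axis-aligned box around mostly empty space, so this test returns “hit” far too often and traversal wastes time descending into nodes that contain nothing along the ray.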

    It’s also pretty funny that in the original JSON data (the scene is shipped in both JSON and PBRT formats) the models are stored as quads, but in the PBRT version the quads are split into triangles. This effectively doubles the memory use, so they had to merge each pair of successive triangles back to a quad during loading.
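    If the exporter always splits a quad (v0, v1, v2, v3) into the triangles (v0, v1, v2) and (v0, v2, v3), the merge is just pattern matching on the shared edge. A hedged sketch assuming that fixed splitting pattern (the paper doesn’t spell out the exact scheme):

```python
def merge_triangle_pair(tri_a, tri_b):
    """Recover a quad from two triangles, assuming the export split
    (v0, v1, v2, v3) into (v0, v1, v2) and (v0, v2, v3).
    Returns None if the pair doesn't match that pattern."""
    v0, v1, v2 = tri_a
    b0, b2, v3 = tri_b
    if (b0, b2) != (v0, v2):
        return None  # not two halves of the same quad
    return (v0, v1, v2, v3)

# Vertex indices 10..13 form a quad that was split into two triangles.
assert merge_triangle_pair((10, 11, 12), (10, 12, 13)) == (10, 11, 12, 13)
assert merge_triangle_pair((10, 11, 12), (9, 12, 13)) is None
```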

    To render the scene they have a nine node distributed system:

    Our benchmarks are run using nine Intel Skylake Xeon nodes on the Texas Advanced Computing Center’s Stampede2 system, with one head node and eight worker nodes. Each node has two Intel Xeon Platinum 8160 processors and 192GB of DDR4 RAM.

    With two 24-core processors per node and SMT, that makes 9 × 2 × 24 × 2 = 864 independent hardware threads. As far as I understand, the execution path from the top down is the following:

    1. Copy the whole dataset from the single head node to the eight worker nodes.

    2. Split the whole frame (1536 x 644 pixels) into tiles, and assign tiles to worker nodes. This is done using OSPray’s MPI support.

    3. Each worker apparently splits its tile into even smaller tiles to parallelize across threads. Implemented with Intel’s Threading Building Blocks (TBB) library.

    4. Each thread traces a “packet” of rays using Embree. This is where the BVH is required.

      • A packet here means a bundle of rays that share the same SIMD register with each ray in its own SIMD lane. See this paper.
    5. After the rays hit geometry, their results are shaded and textured by SIMD code compiled with ISPC.

    6. The tiles are transferred back to the head node. Pixel colors of overlapping tiles are averaged together.

    7. The full picture is smoothed with Intel’s Open Image Denoise library (see Figure 8).
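    Steps 2–3 above boil down to decomposing the frame into a grid of tiles, with partial tiles at the right and bottom edges. A minimal sketch of that decomposition (the 64-pixel tile size is my assumption, not a figure from the paper):

```python
import math

# Split the 1536 x 644 frame into fixed-size tiles for distribution.
WIDTH, HEIGHT, TILE = 1536, 644, 64

# Each tile is (x, y, width, height); edge tiles are clipped to the frame.
tiles = [(x, y, min(TILE, WIDTH - x), min(TILE, HEIGHT - y))
         for y in range(0, HEIGHT, TILE)
         for x in range(0, WIDTH, TILE)]

# Every pixel is covered exactly once by this decomposition.
assert sum(w * h for _, _, w, h in tiles) == WIDTH * HEIGHT
print(len(tiles), "tiles")
```

    With these numbers that yields 24 × 11 = 264 tiles, which then get assigned round-robin (or by some scheduler) to the eight workers’ threads.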

    Phew, so now we have a frame, but how fast did it render? Under 500 ms it seems, so I suppose they hit their goal. It’s interesting to note that the denoising pass is surprisingly expensive (emphasis mine):

    For each camera position, we measured the average ray tracing time to be: 207ms for Shot (36.95 Mray/s), 252ms for Beach (35.29 Mray/s), 183ms for Palms (39.38 Mray/s), and 339ms for Dunes (36.32 Mray/s). The image denoising cost depends only on the number of pixels being processed, and takes on average 130ms per-frame across all the benchmarked views.
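    As a back-of-the-envelope check (my arithmetic, not from the paper), multiplying frame time by ray throughput gives rays per frame, and dividing by the 1536 × 644 pixel count gives roughly 7 to 12 rays per pixel depending on the view:

```python
# Quoted frame times (s) and throughputs (rays/s) from the paper.
width, height = 1536, 644
shots = {"Shot": (0.207, 36.95e6), "Beach": (0.252, 35.29e6),
         "Palms": (0.183, 39.38e6), "Dunes": (0.339, 36.32e6)}

# rays/frame = time * throughput; rays/pixel = rays/frame / pixel count.
rays_per_pixel = {name: seconds * rays_per_s / (width * height)
                  for name, (seconds, rays_per_s) in shots.items()}

for name, rpp in rays_per_pixel.items():
    print(f"{name}: {rpp:.1f} rays/pixel")
```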

    These kinds of “systems papers” are often interesting because their problems are orthogonal to the usual course of research: the major engineering issue can be just loading that huge JSON blob, not the light transport algorithm.