ACM Reference Format
Pantaleoni, J., Fascione, L., Hill, M., Aila, T. 2010. PantaRay: Fast Ray-traced Occlusion Caching of Massive
Scenes. ACM Trans. Graph. 29, 4, Article 37 (July 2010), 10 pages. DOI = 10.1145/1778765.1778774
http://doi.acm.org/10.1145/1778765.1778774.
PantaRay: Fast Ray-traced Occlusion Caching of Massive Scenes
Jacopo Pantaleoni∗
NVIDIA Research
Luca Fascione†
Weta Digital
Martin Hill†
Weta Digital
Timo Aila∗
NVIDIA Research
Figure 1: The geometric complexity of scenes rendered in the movie Avatar often exceeds a billion polygons and varies widely: distant
rocks and vegetation are tessellated to a level of meters and centimeters, while the faces of even distant characters are modeled to over
40,000 polygons from forehead to chin. The spatial resolution of occlusion caches precomputed by our system also spans several orders
of magnitude.
Abstract
We describe the architecture of a novel system for precomputing
sparse directional occlusion caches. These caches are used for ac-
celerating a fast cinematic lighting pipeline that works in the spher-
ical harmonics domain. The system was used as a primary light-
ing technology in the movie Avatar, and is able to efficiently han-
dle massive scenes of unprecedented complexity through the use of
a flexible, stream-based geometry processing architecture, a novel
out-of-core algorithm for creating efficient ray tracing acceleration
structures, and a novel out-of-core GPU ray tracing algorithm for
the computation of directional occlusion and spherical integrals at
arbitrary points.
CR Categories: I.3.2 [Graphics Systems (C.2.1, C.2.4, C.3)]:
Stand-alone systems; I.3.7 [Three-Dimensional Graphics and
Realism]: Color, shading, shadowing, and texture—Ray tracing;
Keywords: global illumination, precomputed radiance transfer,
caching, out of core
∗e-mail:{jpantaleoni,taila}@nvidia.com
†e-mail:{lukes,martinh}@wetafx.co.nz
1 Introduction
The movie Avatar featured unprecedented geometric complexity
(Figure 1), with production shots containing anywhere from ten
million to over one billion polygons.
To make the rendering of such complex scenes manageable while
satisfying the need to provide fast lighting iterations for lighting
artists and the director, modern relighting methods based on spheri-
cal harmonics (SH) [Ramamoorthi and Hanrahan 2001] and image-
based lighting [Debevec 1998] were used. These methods can
speed up the lighting iterations significantly, but unfortunately re-
quire an extremely compute- and resource-intensive precomputation
of directional occlusion information. Directional occlusion encodes
the visibility term used for lighting modulation as a function of di-
rection, and is typically computed using ray tracing.
We describe PantaRay1, a system designed to make this precompu-
tation practical by leveraging the development of modern ray trac-
ing algorithms for massively parallel GPU architectures [Aila and
Laine 2009] and combining them with new out-of-core and level of
detail rendering techniques.
The PantaRay engine is an out-of-core, massively parallel ray tracer
designed to handle scenes that are roughly an order of magni-
tude bigger than available system memory, and that require baking
spherical harmonics-encoded directional occlusion (SH occlusion)
and indirect lighting information for billions of points with highly
varying spatial density.
Our key contributions are the introduction of a flexible, stream-
based geometry processing architecture, a novel out-of-core algo-
rithm for constructing efficient ray tracing acceleration structures,
and a novel out-of-core GPU ray tracing algorithm for the compu-
tation of directional occlusion and spherical integrals. These are
1A twist on the Greek aphorism panta rei, i.e., "everything flows."
ACM Transactions on Graphics, Vol. 29, No. 4, Article 37, Publication date: July 2010.
[Figure 2 diagram: scene geometry flows through PRMan tessellation into micropolygons; PantaRay's vislocal pass augments these into the vislocal cache, alongside other PRMan caching passes feeding other caches; PRMan's final render consumes the caches to produce the beauty image.]
Figure 2: A visual representation of the rendering pipeline used for the movie Avatar showing the various passes, the data flow among
them, and the role played by our system.
combined into a new precomputation system designed to efficiently
handle very high levels of geometric complexity.
Our system has been integrated into the production pipeline of Weta
Digital and is showcased in the movie Avatar, but the algorithmic
contributions and design decisions discussed in this paper could be
usefully applied in other domains, such as large-scale scientific vi-
sualization, which would benefit from rich lighting of extremely
complex geometric datasets.
2 Related Work
Much research has addressed the topic of massive model rendering
and visualization. Here we compare our system to some of the most
relevant work.
There is a vast amount of literature on the topic of direct visual-
ization of massive triangle meshes. Most such methods, includ-
ing [Borgeat et al. 2005] and [Cignoni et al. 2004], subdivide the
models into cells or patches and create multiple or progressive LOD
representations of those elements through mesh simplification. As
the goal of our system is not direct visualization but rather the com-
putation of low-frequency directional occlusion information, these
accurate simplification methods are not needed and we resort to
much cruder representations. Moreover, as we target ray tracing,
our out-of-core spatial index construction had the additional re-
quirement of targeting high ray tracing efficiency, employing parti-
tioning and subdivision methods based on the surface area heuristic
(SAH) [Havran 2000].
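As a concrete illustration of the SAH cost that drives such partitioning decisions, here is a minimal sketch in Python (the traversal and intersection cost constants are illustrative defaults, not values from the paper):

```python
def aabb_surface_area(lo, hi):
    """Surface area of an axis-aligned box given its min/max corners."""
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def sah_cost(parent, left, right, n_left, n_right,
             c_trav=1.0, c_isect=1.0):
    """SAH cost of splitting `parent` into `left` and `right` (each a
    (min, max) corner pair): the probability that a ray traversing the
    parent also hits a child is proportional to the child's surface area."""
    sa_p = aabb_surface_area(*parent)
    sa_l = aabb_surface_area(*left)
    sa_r = aabb_surface_area(*right)
    return (c_trav
            + (sa_l / sa_p) * n_left * c_isect
            + (sa_r / sa_p) * n_right * c_isect)
```

A builder evaluates this cost over candidate split planes, keeps the minimum, and compares it against the flat cost of making a leaf.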
Wald et al. [2005] and Yoon et al. [2006] introduced two sys-
tems based on level of detail (LOD) for ray tracing large triangle
meshes. Unlike our approach, their systems relied on OS-level
memory mapping functionality and targeted moderately parallel
systems such as commodity multi-CPU systems, performing LOD
selection in each thread independently. This strategy would not be
portable to modern massively parallel GPU architectures. More-
over, no special effort was taken to speed up the out-of-core con-
struction of the acceleration structure, which in the case of [Wald
et al. 2005] took up to a day for a model containing 350M triangles.
Crassin et al. [2009] and Gobbetti et al. [2008] introduced two sys-
tems to render large volumetric datasets. These systems perform
direct visualization of geometry represented as voxel grids, rather
than computing complex visibility queries. Like our system, both
approaches decompose computation into a CPU-based LOD selec-
tion phase and a GPU-based rendering phase. Their systems per-
form these steps to visualize the entire model from a single point of
view at each frame, while we do it to compute directional occlusion
from large batches of nearby points at the same time.
Christensen et al. [2003] presented a ray tracing system using ray
differentials to perform LOD selection for high order surfaces. The
described system is able to efficiently handle very large tessellations
of the base meshes, but does not provide a level of detail scheme to
handle base meshes which do not fit in main memory. This was
essential for our approach, which needed to handle base meshes
with hundreds of millions or billions of control polygons.
Budge et al. [2009] presented an out-of-core data management layer
for path tracing on heterogeneous architectures. The system builds
on a dataflow network of kernel queues and a rendering-agnostic
task scheduler that prioritizes the execution of kernels based on data
availability, queue size and other criteria. The path tracer exploits
this generic framework by using a two-level acceleration structure,
where each second level out-of-core hierarchy is bound to a distinct
processing queue, extending the work of [Pharr et al. 1997]. The
resulting algorithm shows good scalability and thus satisfies one of
our main requirements. Unlike their work, we focus on developing
highly efficient special-purpose algorithms for the computation of
directional occlusion, minimizing I/O through careful LOD selec-
tion, and on the problem of efficient construction of high quality
out-of-core acceleration structures.
Ragan-Kelley et al. [2007] introduced Lightspeed as an interac-
tive lighting preview system that can greatly accelerate relighting
with local light sources and shadow maps in the presence of pro-
grammable shaders. Unlike their work, we focus on the efficient
computation of complex visibility for fast image based lighting in
massive scenes.
3 System Overview
Lighting of the movie Avatar was performed with a spherical har-
monics lighting pipeline based on the work of Ramamoorthi and
Hanrahan [2001], in which light transport is decomposed into a
multiple product integral:
L_o(x, \omega_o) = \int_{\Omega^+} L_i(x, \omega)\, \rho(x, \omega, \omega_o)\, V(x, \omega)\, \langle \omega, \hat{n} \rangle \, d\omega \qquad (1)

where Lo is the exitant radiance, x is the point of interest, ωo is the
outgoing direction, Ω+ is the hemisphere above x, Li is incident ra-
diance, ω is the incident direction, ρ is the BRDF, V is the visibility
function, n̂ is the normalized surface normal, and ⟨·, ·⟩ denotes the
scalar product operator.
In this framework, directional visibility is precomputed at sparse
locations in the scene and stored in a spherical harmonics basis.
Building on the work of Kautz et al. [2002], Ng et al. [2004], and
Snyder [2006], this directional visibility can then be reused over
many lighting cycles by performing a simple dot-product with the
less expensive terms of the equation, which are computed at render
time. Our system was built to efficiently perform this precomputa-
tion on massive scenes of unprecedented complexity.
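The relighting dot product described above can be illustrated with a minimal numerical sketch in Python, restricted to the first two SH bands and using Monte Carlo projection (all function names are ours, purely illustrative):

```python
import math
import random

def sh_basis(d):
    """First four real spherical harmonics at unit direction d = (x, y, z)."""
    x, y, z = d
    return (0.282095,        # Y_0^0
            0.488603 * y,    # Y_1^-1
            0.488603 * z,    # Y_1^0
            0.488603 * x)    # Y_1^1

def uniform_sphere(rng):
    """Uniformly distributed unit direction."""
    z = 1.0 - 2.0 * rng.random()
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)

def project(f, n=50_000, seed=1):
    """Monte Carlo projection of a spherical function onto the SH basis:
    c_i = integral of f * Y_i over the sphere (uniform pdf = 1 / 4pi)."""
    rng = random.Random(seed)
    coeffs = [0.0] * 4
    for _ in range(n):
        d = uniform_sphere(rng)
        fv = f(d)
        for i, y in enumerate(sh_basis(d)):
            coeffs[i] += fv * y
    scale = 4.0 * math.pi / n
    return [c * scale for c in coeffs]

def relight(light_coeffs, transfer_coeffs):
    """Integral of light * transfer over the sphere, as a dot product
    of their SH coefficient vectors."""
    return sum(a * b for a, b in zip(light_coeffs, transfer_coeffs))
```

The point of the precomputation is that the transfer coefficients (visibility times cosine, per shading point) are cached once, so each lighting iteration only recomputes the light coefficients and the cheap dot product.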
The overall pipeline is divided into several computation passes
as depicted in Figure 2. During preparation, the scene geometry
is tessellated and divided into microgrids according to a camera-
based metric, using a custom point cloud output driver in Photo-
Realistic RenderMan (PRMan). We store these microgrids on disk
in a stream representation which allows vertices to be associated
Figure 3: Zooming into scene 6 shows the various levels of tessellation.
with arbitrary user data, much like the primitive variable mecha-
nism in PRMan [Upstill 1990] or the vertex attribute machinery in
OpenGL [Segal and Akeley 1999]. In order to include occluding
geometries not directly visible to the camera, assets outside of the
viewing frustum are also tessellated, either using a relatively large
overscan or according to a world-based metric. Figure 3 shows an
example of the various tessellation densities encountered in a typical
production scene.
The vislocal pass invokes our PantaRay engine to augment the mi-
crogrid stream with directional occlusion data encoded in the spher-
ical harmonics basis and other precomputed quantities such as area
light visibility, blurred reflections and occasionally one-bounce in-
direct lighting. All these properties are generated by programmable
shaders using the ray tracing capabilities of our engine.
In the end the result of the PantaRay precomputation is used in
PRMan to render the final images in what is called the beauty pass.
In this pass, the lighting, BRDF and visibility fields are composed
at render time at a very low cost, to the point where the lighting
iterations can happen inside the beauty pipeline at final quality.
While the vislocal datasets can be reused for many lighting itera-
tions, which greatly offsets their computation cost, computing vis-
local remains an extremely resource-intensive process, and is a nat-
ural point to start looking for optimizations.
To illustrate the targeted complexity, the movie Avatar required
baking scenes with tens of thousands of different plants modeled as
subdivision surfaces at a resolution of 100K to 1M control polygons
each, and hundreds of characters modeled at a resolution of 1-2M
control polygons. Since occlusion is a global effect, out-of-camera
objects must be kept during the computation. Similarly, translu-
cence and subsurface scattering require processing geometry that is
not directly visible from the camera. Rather than tracing full reso-
lution models, lower resolution proxies could have been developed
and used for far away assets. While our pipeline used stochastic
simplification to reduce the complexity of vegetation before ras-
terization [Cook et al. 2007], we did not explore the possibility of
performing any additional simplification to the ray tracing assets
before they entered our system: we chose instead to construct a
fully automated system capable of directly handling the raw model
complexity rather than create a semi-automatic pipeline for proxy
generation.
The highly variable spatial resolution of the PantaRay output pre-
sented another challenge: many shots in these scenes required a
spatially varying baking resolution ranging from a few points per
meter on distant geometry such as terrains, to several points per
millimeter, for example to accurately represent the lighting on and
under the characters’ fingernails.
The speed and memory limitations of existing general purpose ray
tracing technology, and the reduced flexibility and programmability
in other special purpose baking tools, such as ptfilter [Christensen
2008], did not scale to these production needs. In practice, our goal
was to raise the tractability limit of shots in the movie Avatar by
roughly 2 orders of magnitude in terms of both speed and scene
size while keeping a reasonable degree of programmability.
4 Architecture
Handling the necessary complexity inside a flexible ray tracing sys-
tem requires efficient out-of-core and streaming techniques. To
support the use of such methods throughout the entire software
pipeline, we designed the system around the concept of microgrid
streams, which are opaque sources of microgrids (that is, micropoly-
gon grids as in [Cook et al. 1987]). Microgrid streams can be
read into main memory and eventually rewound, or restarted from
the beginning. Such streams can represent either geometry stored
on disk or procedural geometry. Each microgrid is essentially a
small indexed mesh with up to 256 vertices forming micropolygons,
where each micropolygon can have one, two, three or four vertices
(to represent points, lines, triangles and quads). Vertices are repre-
sented by their position, a normal, a radius and any attached user
data. We decided to disallow any form of random access for two
reasons: first, geometry files are typically compressed to save disk
space and potentially achieve higher I/O bandwidth; second, input
streams could be procedurally generated, and the procedural gener-
ation function might not allow for individual primitive generation
(as for example in some L-systems).
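A hypothetical sketch of this stream abstraction in Python (the interface and names are our own illustration, not PantaRay's actual API):

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

@dataclass
class Microgrid:
    """A small indexed mesh: up to 256 vertices, micropolygons of
    1-4 vertex indices (points, lines, triangles, quads), plus a
    normal, a radius, and optional user data per vertex."""
    positions: List[Tuple[float, float, float]]
    normals:   List[Tuple[float, float, float]]
    radii:     List[float]
    polygons:  List[Tuple[int, ...]]          # 1..4 indices each
    user_data: Dict[str, list] = field(default_factory=dict)

    def __post_init__(self):
        assert len(self.positions) <= 256
        assert all(1 <= len(p) <= 4 for p in self.polygons)

class MicrogridStream:
    """Sequential-only source of microgrids: it can be iterated and
    rewound, but never accessed at random, so it can wrap compressed
    files and procedural generators equally well."""
    def __init__(self, generator_fn):
        self._generator_fn = generator_fn   # re-invoked on each pass
    def __iter__(self) -> Iterator[Microgrid]:
        return self._generator_fn()
    def rewind(self) -> "MicrogridStream":
        return self
```

Because the only operations are "iterate" and "rewind", the same interface covers both a compressed file reader and an L-system that can only emit its primitives in generation order.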
The input to PantaRay is an XML scene description, containing
a list of shaders, a list of geometries and their associated binding
relationships.
A geometry is a microgrid stream, which can specify both an oc-
cluder and a collection of bake sets. A bake set represents the
central PantaRay unit of work, and specifies that the input stream
should be cloned to a corresponding output stream and further dec-
orated with a given list of shader output attributes. Geometries can
further be instanced through a user-defined transformation, poten-
tially specifying a procedural displacement shader.
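A hypothetical example of such a scene description, parsed with the Python standard library; the element and attribute names are our invention, as the actual PantaRay schema is not specified in the paper:

```python
import xml.etree.ElementTree as ET

# Illustrative scene file only: the real PantaRay schema is unpublished.
SCENE = """
<scene>
  <shader name="sh_occlusion" output="dirocc_sh"/>
  <geometry stream="forest.grids" occluder="true">
    <bakeset shaders="sh_occlusion"/>
  </geometry>
  <geometry stream="character.grids" occluder="true"
            displacement="bark_displace">
    <bakeset shaders="sh_occlusion"/>
  </geometry>
</scene>
"""

def parse_scene(text):
    """Extract the shader list and the geometry/bake-set bindings."""
    root = ET.fromstring(text)
    shaders = [s.get("name") for s in root.findall("shader")]
    geoms = []
    for g in root.findall("geometry"):
        geoms.append({
            "stream": g.get("stream"),
            "occluder": g.get("occluder") == "true",
            "bakesets": [b.get("shaders") for b in g.findall("bakeset")],
        })
    return shaders, geoms
```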
Shaders are programmable units responsible for computing some
required information at the vertices of each microgrid in a bake set.
The first task that PantaRay performs after parsing the scene file is
building an out of core acceleration structure (AS) for the input oc-
cluder geometry. After the AS is built, PantaRay processes the bake
sets and begins shader execution. The following sections describe
these processes in detail.
4.1 Acceleration Structure Generation
The main bottleneck in building an out-of-core acceleration struc-
ture can easily be I/O speed, as typical bounding volume hierar-
chies (BVH) or k-d tree building strategies require touching all the
objects multiple times. Even taking into account the performance
of state-of-the-art storage technologies, the system had to assume
that tens of thousands of concurrent processes would be using the
same storage, requiring all non-local I/O to be modeled as a high
latency, high bandwidth device.
Hence we developed a general purpose stream-based builder which
tried to minimize the number of times the stream is rewound.
The first component of this builder is a streaming bucketing pass
designed to handle hundreds of millions of microgrids. The buck-
eter uses a simple binning approach: it constructs a regular 3D grid
by first streaming the geometry once to count how many microgrids
Figure 4: Out-of-core spatial index construction. Microgrids stream from disk into a regular grid of buckets (a). Buckets are coalesced
and split into chunks (b) of up to 64KB. A BVH inside and among chunks (c) is broken into bricks (d) of up to 256 nodes. Each brick is
contiguous on disk.
fall in each bucket, and then streaming it a second time to populate
those buckets on local disk.
The first streaming pass reserves the correct amount of disk space
for each bucket and creates an index, but also keeps statistics about
the number of microgrids, micropolygons, vertices and byte size for
each of them.
The second pass of the algorithm loops through each microgrid to
find out all the buckets in which the microgrid falls, and records the
microgrid-bucket pairs into an in-memory cache with a few million
entries. Once the cache is full, the pairs are sorted by bucket index
and written to disk into their corresponding slots, requiring at most
one seek per bucket per cache flush.
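The two streaming passes can be sketched as follows in Python, with in-memory lists standing in for the on-disk bucket slots (`buckets_of` is an assumed callback yielding the bucket indices a microgrid overlaps):

```python
from collections import defaultdict

def bucket_stream(stream_fn, buckets_of, cache_capacity=4_000_000):
    """Two-pass streaming bucketer.
    Pass 1: count microgrids per bucket (used to reserve disk space
    and build an index).  Pass 2: accumulate (bucket, microgrid)
    pairs in a bounded cache, sort each full cache by bucket index,
    and flush, so every flush writes each bucket's slot in one
    contiguous run."""
    # Pass 1: statistics.
    counts = defaultdict(int)
    for mg in stream_fn():
        for b in buckets_of(mg):
            counts[b] += 1

    # "Disk": one slot per bucket, pre-sized from the counts.
    slots = {b: [] for b in counts}

    # Pass 2: populate through a sorted cache.
    cache = []
    def flush():
        cache.sort(key=lambda pair: pair[0])   # group by bucket index
        for b, mg in cache:
            slots[b].append(mg)
        cache.clear()

    for mg in stream_fn():
        for b in buckets_of(mg):
            cache.append((b, mg))
            if len(cache) >= cache_capacity:
                flush()
    flush()
    return counts, slots
```

Note that `stream_fn` is invoked once per pass, mirroring the rewind-only access pattern of microgrid streams.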
The purpose of this bucketing pass is to create manageable units
of work which could fit in memory. However, the resulting uniform
grid is very coarse and often imbalanced, which makes it unsuitable
for direct ray tracing. With extremely large scenes it frequently hap-
pens that a large portion of the buckets are empty or very sparsely
populated, while a few remain too densely populated.
For these reasons, after the bucketing is done, we perform a chunk-
ing pass, whose purpose is to build a second disk-based spatial in-
dex with more uniform distribution of geometry, aggregating low-
complexity buckets and splitting high-complexity ones until each oc-
cupies roughly 64KB of memory. We consider an implicit k-d tree
over the uniform grid of buckets. First, we perform a bottom-up
propagation of statistics from the leaves to the parents, so that for
each node it is possible to compute a rough estimate of the aggre-
ga