
Moving Smoke Collision Off the CPU for VR
Unity's particle Collision module is a per-frame, per-particle CPU tax. On Quest 3 that tax is unaffordable. Here's the alternative I built.
The CPU budget on a standalone VR headset is small and not negotiable. A Quest 3 running at 72Hz gives you roughly 13.9 milliseconds per frame, and a good portion of that is already spent before your gameplay code runs a single line. Tracking, runtime, rendering setup, physics tick, animation, audio. Anything you add on top of that has to justify itself in microseconds. Mixed-reality scenarios are worse, because passthrough composition takes another slice off the top.
This is the context in which Unity's particle Collision module becomes a problem. It is not that it produces bad-looking results. It is that it does its work on the CPU, every frame, per particle, and that bill is the wrong bill to be paying on the target hardware. This is a writeup of the system I built to move that work off the CPU entirely, why the architecture is what it is, and the trade-offs that came with it.
The CPU bill that comes with default particle collision
A smoke grenade in my training scenarios spawns several hundred particles over its lifetime, each living for a few seconds, with overlapping waves keeping the count high while the smoke is dense. With the Collision module enabled, every one of those particles is queried against the physics scene on the CPU on every frame it is alive.
The cost breakdown looks roughly like this on Quest 3:
| Work | Frequency | Cost driver |
|---|---|---|
| Particle integration | Every frame | Number of live particles |
| Collision query per particle | Every frame | Number of live particles times collider density |
| Result resolution | Every frame | Number of collisions detected |
| Physics scene access | Every frame | Synchronisation overhead |
The collision query is the line that hurts. A dense smoke cloud with three or four overlapping particle systems pushes the live count into the high hundreds, and the per-particle physics queries land squarely in the part of the frame budget that I cannot afford to spend. Frame time spikes when the smoke is dense and dips when it thins. The headset compositor papers over the worst of it with reprojection, but a moving particle effect is exactly where reprojection artefacts are most visible.
There is also a quality problem on top of the cost problem, which is worth naming because it informs the design. Particle collision is a discrete event: a particle either hits a wall or it does not. A smoke billboard whose centre has not quite reached the wall but whose quad pokes through draws into the wall, because the collision system has no opinion about anything except the particle's transform. The right answer for smoke is volumetric containment, not per-particle bouncing. That answer naturally lives in the shader, not in the physics scene.
So I want two things at once: cheaper, and better. The architecture below gets both by moving the per-frame work to the GPU and reducing the CPU's contribution to a one-time scan.
The split: who does what, and how often
The system has two halves. The dividing line is the one that matters most for performance: what runs once, and what runs every frame.
| Side | Work | Runs | Scales with |
|---|---|---|---|
| CPU | Scan environment, classify colliders, upload to material | Once at spawn | Number of nearby colliders (small) |
| GPU | Ray cast from smoke origin to fragment, test against uploaded colliders | Every frame, per fragment | Smoke screen coverage times collider count |
The asymmetry is the whole point. The CPU does an expensive thing one time. The GPU does a cheap thing many times, in parallel, on a piece of silicon that was going to be running anyway. The data crossing the boundary is small, a few hundred floats, sent once at spawn and never touched again.
What this buys me, compared to the default Collision module:
| Metric | Default particle Collision | Hybrid CPU-GPU |
|---|---|---|
| CPU cost per frame | Scales with live particle count | Zero after spawn |
| GPU cost per frame | Negligible | Scales with smoke screen coverage |
| Frame time stability | Spikes with smoke density | Flat |
| Visual quality at boundaries | Per-particle clipping visible | Soft volumetric cull |
The GPU column is not zero, and I will get to its limits, but on a fragment-shaded effect that was already running, the GPU is the natural place for this work to live.
CPU side: the one-time scan
The detector lives in the FX namespace on the smoke grenade prefab and runs from Start. Its job is to produce the smallest possible list of relevant colliders, classify them into shader-friendly shapes, and push that list to the smoke material. After this, the CPU is done.
Hemisphere ray distribution
A bounding sphere overlap query would catch everything but would also pull in colliders below ground, which never block ground-detonated smoke. The detector uses a Fibonacci hemisphere distribution instead. Points on a sphere arranged by the golden angle give near-uniform coverage with no clumping, and clamping to the upper hemisphere halves the ray count without losing relevant geometry.
```csharp
private List<Vector3> GenerateHemisphereDirections(int rayCount)
{
    List<Vector3> directions = new List<Vector3>();
    float goldenAngle = Mathf.PI * (3f - Mathf.Sqrt(5f));
    for (int i = 0; i < rayCount; i++)
    {
        float y = 1f - (float)i / (rayCount - 1f); // y in [0, 1]: upper hemisphere only
        float radius = Mathf.Sqrt(1f - y * y);
        float theta = goldenAngle * i;
        float x = Mathf.Cos(theta) * radius;
        float z = Mathf.Sin(theta) * radius;
        directions.Add(new Vector3(x, y, z)); // already unit length by construction
    }
    return directions;
}
```

Six rays is the default and is enough for typical interior geometry. Increasing it costs more raycasts at spawn but does not affect the per-frame cost at all, which makes it cheap insurance against missing a wall in an awkwardly shaped room.
Filtering: layer first, material as fallback
Two filtering modes are exposed, with a clear preference noted in the inspector tooltip.
| Mode | How it works | Cost |
|---|---|---|
| Layer | Physics.Raycast itself filters by LayerMask, no further checks | Roughly 3x faster |
| Material | All hits are returned, then a renderer material name check runs in C# | Slower, more flexible |
Layer mode is the fast path because Unity's physics engine excludes non-matching layers before the raycast even runs. Material mode exists for cases where the layer assignment is wrong or unavailable, but it is a fallback. In a properly authored scene, layer filtering is what you use.
Collider classification
Every relevant collider gets converted to one of two shapes the shader understands.
| Source collider | Stored as | Notes |
|---|---|---|
| BoxCollider | Oriented box (centre, size, quaternion) | Direct mapping |
| SphereCollider | Sphere (centre, radius) | Radius scaled by max transform axis |
| MeshCollider | Oriented box from mesh.bounds | Approximation, sufficient for most static geometry |
The mesh case is the interesting one. A full triangle mesh on the GPU would be expensive and would also need a completely different shader path. Treating it as the oriented bounding box of its local-space mesh bounds is a heavy approximation, but it works in practice because the geometry the smoke needs to be stopped by (walls, pillars, crates, vehicles) is overwhelmingly box-shaped at the scale the smoke cares about. The OBB approximation buys me a single uniform shader path for every collider type.
```csharp
Bounds localBounds = meshCollider.sharedMesh.bounds;
Vector3 worldCenter = meshCollider.transform.TransformPoint(localBounds.center);
Vector3 worldSize = Vector3.Scale(localBounds.size, meshCollider.transform.lossyScale);
Quaternion worldRotation = meshCollider.transform.rotation;
```

Note the mesh.bounds-then-transform pattern. Going through meshCollider.bounds would give the axis-aligned world-space bounds of the rotated mesh, which is the wrong thing. I want the oriented bounding box, which means starting from the mesh's local bounds and applying the transform's rotation separately.
Data marshalling
Quest 3 does not love dynamic shader arrays. The fix is fixed-size pre-allocated buffers, both on the C# side and the HLSL side.
```csharp
private const int MAX_COLLIDERS = 64;
private Vector4[] boxCentersArray = new Vector4[MAX_COLLIDERS];
private Vector4[] boxSizesArray = new Vector4[MAX_COLLIDERS];
private Vector4[] boxRotationsArray = new Vector4[MAX_COLLIDERS];
```

Sixty-four is a comfortable ceiling that the system has never come close to filling in real scenes. The allocation cost is negligible and Android shader compilers prefer the certainty over a dynamic length.
Three arrays per box rather than one packed structure. Shader Graph custom functions interact with material property arrays by name, and one float4 array per logical channel is the cleanest mapping. Centre uses three components and pads the fourth with zero. Size does the same. Rotation uses all four because it is a quaternion.
GPU side: where the per-frame work actually goes
The shader function gets called from a custom function node in Shader Graph and returns a single float between 0 and 1, which the graph multiplies into the particle's alpha. Zero means cull this fragment, one means draw fully, anything in between is a soft falloff at a near-miss boundary.
The function signature is short:
```hlsl
void CalculateSmokeCollision_float(
    float3 WorldPos,
    float3 Origin,
    float Radius,
    float BoxCount,
    float SphereCount,
    float Softness,
    out float Result)
```

Inputs come from the graph: WorldPos from a position node, Origin and Radius from material properties (set by the CPU script at spawn and never touched again), Softness from a slider, and the counts from the same one-shot upload.
Early out on radius
The first thing the function does is check whether the fragment is even inside the smoke's bounding sphere. If not, the smoke material is not drawing there anyway, but the early return saves the work of doing two collider loops for a fragment that will not contribute.
```hlsl
float3 rayDir = WorldPos - Origin;
float rayLength = length(rayDir);
if (rayLength > Radius)
{
    Result = 1.0;
    return;
}
rayDir = rayDir / rayLength;
```

Boxes: ray-AABB in local space
The standard slab method for ray-AABB intersection works on axis-aligned boxes. To use it on an oriented box, transform the ray into the box's local space, do the AABB test there, and you are done. The rotation quaternion is what makes that transform cheap.
```hlsl
float3 localOrigin = InverseRotateByQuaternion(
    Origin - _SmokeCollision_BoxCenters[i].xyz,
    _SmokeCollision_BoxRotations[i]);
float3 localDir = InverseRotateByQuaternion(rayDir, _SmokeCollision_BoxRotations[i]);

float3 halfSize = _SmokeCollision_BoxSizes[i].xyz * 0.5;
float3 invDir = 1.0 / (localDir + 1e-6);
float3 t1 = (-halfSize - localOrigin) * invDir;
float3 t2 = ( halfSize - localOrigin) * invDir;
float tNear = max(max(min(t1.x, t2.x), min(t1.y, t2.y)), min(t1.z, t2.z));
float tFar  = min(min(max(t1.x, t2.x), max(t1.y, t2.y)), max(t1.z, t2.z));

if (tNear <= tFar && tFar >= 0.0 && max(tNear, 0.0) < closestHit)
{
    closestHit = max(tNear, 0.0);
    anyHit = true;
}
```

The + 1e-6 on invDir is a guard against rays exactly parallel to a box face producing infinity. Small enough not to bias the result, large enough to keep the division finite.
The quaternion rotation helper avoids matrix construction entirely. Two cross products, a scalar multiply, an add. It runs faster than the equivalent matrix transform on a mobile GPU and avoids a precision issue that gets its own section below.
```hlsl
float3 RotateByQuaternion(float3 v, float4 q)
{
    float3 u = q.xyz;
    float s = q.w;
    return v + 2.0 * s * cross(u, v) + 2.0 * cross(u, cross(u, v));
}

float3 InverseRotateByQuaternion(float3 v, float4 q)
{
    return RotateByQuaternion(v, float4(-q.xyz, q.w));
}
```

The inverse of a unit quaternion is its conjugate, which is just negating the vector part. Cheap.
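The v + 2s(u×v) + 2(u×(u×v)) identity is easy to get wrong by a sign, so here is a quick Python cross-check of the same two helpers against a known rotation (my own transcription, not project code; quaternions are (x, y, z, w)):

```python
import math

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def rotate(v, q):
    """Rotate v by unit quaternion q, same identity as the shader helper."""
    u, s = q[:3], q[3]
    uv = cross(u, v)
    uuv = cross(u, uv)
    return tuple(v[i] + 2.0 * s * uv[i] + 2.0 * uuv[i] for i in range(3))

def inverse_rotate(v, q):
    """Inverse rotation via the conjugate: negate the vector part."""
    return rotate(v, (-q[0], -q[1], -q[2], q[3]))

# 90 degrees about +y: (0, sin45, 0, cos45). Rotating +x gives -z.
q = (0.0, math.sin(math.pi / 4), 0.0, math.cos(math.pi / 4))
rx = rotate((1.0, 0.0, 0.0), q)
assert all(abs(a - b) < 1e-9 for a, b in zip(rx, (0.0, 0.0, -1.0)))
# inverse_rotate undoes rotate exactly.
back = inverse_rotate(rx, q)
assert all(abs(a - b) < 1e-9 for a, b in zip(back, (1.0, 0.0, 0.0)))
```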
Spheres: ray-sphere quadratic
Sphere intersection is the classic quadratic. Branch-light, no local-space transform needed because spheres are rotationally symmetric.
```hlsl
float3 oc = Origin - _SmokeCollision_SphereCenters[j].xyz;
float b = 2.0 * dot(oc, rayDir);
float c = dot(oc, oc) - _SmokeCollision_SphereRadii[j] * _SmokeCollision_SphereRadii[j];
float disc = b * b - 4.0 * c;
if (disc >= 0.0)
{
    float t = (-b - sqrt(disc)) * 0.5;
    if (t < 0.0) t = (-b + sqrt(disc)) * 0.5;
    if (t >= 0.0 && t < closestHit)
    {
        closestHit = t;
        anyHit = true;
    }
}
```

The two roots represent ray entry and exit. I prefer the entry point but fall back to the exit point in case the smoke origin is inside the sphere, which can happen if the grenade lands inside a curved alcove that has been authored with a sphere collider.
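The entry-then-exit fallback is the part worth testing. A Python transcription of the same quadratic (my own sketch; `a = 1` because the ray direction is unit length) shows both cases:

```python
import math

def ray_sphere(origin, direction, center, radius):
    """Ray-sphere quadratic with a = 1: nearest non-negative root, else None.
    Prefers the entry point, falls back to the exit point if origin is inside."""
    oc = tuple(o - c for o, c in zip(origin, center))
    b = 2.0 * sum(o * d for o, d in zip(oc, direction))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return None
    t = (-b - math.sqrt(disc)) * 0.5           # entry point
    if t < 0.0:
        t = (-b + math.sqrt(disc)) * 0.5       # exit point when origin is inside
    return t if t >= 0.0 else None

# Origin outside: unit sphere at x = 5, hit from the origin at the entry, t = 4.
assert abs(ray_sphere((0, 0, 0), (1, 0, 0), (5, 0, 0), 1.0) - 4.0) < 1e-9
# Origin at the sphere centre: falls back to the exit point, t = radius = 1.
assert abs(ray_sphere((5, 0, 0), (1, 0, 0), (5, 0, 0), 1.0) - 1.0) < 1e-9
```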
Soft edges via signed distance
A hard cull at a collider surface gives sharp visible cuts where the smoke meets a wall. Smoke does not cut sharply, it feathers. To get a soft edge, the shader also tracks the closest distance from the ray to each collider, and if no hit occurred, fades the result based on that distance.
For boxes, the closest-point computation uses a standard box SDF after transforming the closest point on the ray into local space:
```hlsl
float tClosest = clamp(-dot(localOrigin, localDir), 0.0, rayLength);
float3 localPoint = localOrigin + localDir * tClosest;
float dist = sdBox(localPoint, halfSize);
if (dist > 0.0)
    minDist = min(minDist, dist);
```

For spheres, it is the distance from the closest point on the ray to the sphere centre, minus the radius. At the very end, if no hit happened, the result is the saturated ratio of minimum distance to softness:
```hlsl
if (anyHit && closestHit < rayLength)
    Result = 0.0;
else
    Result = saturate(minDist / Softness);
```

This gives smooth falloff where the smoke approaches but does not penetrate a wall, which reads as a soft volumetric boundary rather than a hard particle clip.
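One piece the listings above use but do not show is sdBox. I am assuming it is the standard box signed-distance function (positive outside, negative inside); a Python version of that formulation behaves the way the soft-edge pass needs:

```python
import math

def sd_box(p, half_size):
    """Signed distance from point p to an axis-aligned box of the given half-size.
    Positive outside, negative inside -- the standard SDF formulation."""
    q = tuple(abs(pi) - h for pi, h in zip(p, half_size))
    outside = math.sqrt(sum(max(qi, 0.0) ** 2 for qi in q))   # distance when outside
    inside = min(max(q[0], max(q[1], q[2])), 0.0)             # negative depth inside
    return outside + inside

# A point 2 units past the +x face of a unit cube is distance 2 away.
assert abs(sd_box((3.0, 0.0, 0.0), (1.0, 1.0, 1.0)) - 2.0) < 1e-9
# The centre of the cube is 1 unit inside the nearest face.
assert abs(sd_box((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)) + 1.0) < 1e-9
```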
The matrix-versus-quaternion incident
The script has a comment at the top:
```csharp
// Matrix4x4 failed with a strange bug, to be corrected later if time permits,
// switching back to inv quat math for now
```

This is worth a paragraph because it cost real time and the lesson generalises.
The original plan was to send a Matrix4x4 per collider as a world-to-local transform, and have the shader multiply the ray endpoints by it. Textbook approach. It worked on PC. On Quest 3, it produced subtly wrong results, the kind where the cull boundary was offset from the visible collider by a few centimetres at certain angles and zero centimetres at others. After enough investigation to rule out coordinate space and major-order issues, I switched to sending the rotation as a quaternion and reconstructing the inverse transform on the shader side. The problem disappeared.
The suspicion is precision. A 4x4 matrix marshalled as four float4s and reassembled in HLSL goes through more arithmetic than a quaternion-and-vector pair, and on a mobile GPU running at fp16 in places I did not anticipate, that arithmetic accumulated. The quaternion path uses fewer floats, fewer multiplies, and stays inside fp32 territory. I never went back to find exactly which step lost precision. The comment is there as a reminder that the answer is not "matrices are wrong", it is "I know this version works and I have other things to ship".
Trade-offs and what it cannot do
The system is fast and predictable but not free of compromises. The honest list:
| Limitation | Consequence |
|---|---|
| One-shot scan at spawn | Doors that open after the smoke arrives are not noticed |
| OBB approximation for mesh colliders | Concave geometry (an L-shaped wall as one mesh) culls inside its concavity |
| Per-fragment ray cast | GPU cost scales with screen-space smoke coverage |
| Sixty-four collider ceiling | Generous in practice, but not infinite |
| No dynamic moving colliders | The CPU pass would need to re-run and re-upload |
Most of these are acceptable for the use case. Training scenarios are authored, not procedural, and level designers can place box colliders explicitly where they want the smoke to stop. The static-world assumption holds for the duration of a typical smoke effect, which is a few seconds. Moving doors and vehicles are rare enough as smoke obstacles that supporting them is not worth the cost of restoring the per-frame CPU bill I just got rid of.
The OBB approximation is the one that gets noticed occasionally. A long L-shaped corridor authored as a single mesh collider produces visible cull inside the corner. The fix is to split the mesh collider into two boxes at authoring time, which is a documentation problem rather than a code problem.
What this generalises to
The pattern, one-shot CPU scan plus per-fragment GPU evaluation against a small uploaded list, is not specific to smoke. It applies to any volumetric effect that needs to respect world geometry on a CPU-constrained target: gas clouds, fog volumes, magic-style energy fields, fire that should not lick through walls. The shape of the data on the GPU is the same. Only the falloff curve and the alpha blending change.
It also is not specific to URP. The HLSL function uses no URP-specific includes, which keeps it portable to Built-in or HDRP with only the shader scaffolding around it changing. The C# script uses only UnityEngine types.
What makes the architecture work, fundamentally, is the recognition that on a CPU-limited platform the most valuable design move is to identify per-frame CPU work that does not need to be per-frame, or does not need to be on the CPU, and to move it. Default particle collision fails both of those tests for a smoke cloud in a mostly-static training scenario. The work it does is genuinely useful, but it is being done on the wrong processor at the wrong frequency. Doing it once on the CPU and forever on the GPU is the correct shape for this hardware, and the visual improvement at the boundary is a bonus that fell out of putting the work in the right place.