Cross-Chunk Block Caching for Few-Step Autoregressive World Models
Generate the next chunk of a 7-camera driving world without paying full DiT compute. X-Cache reuses block residuals across consecutive chunks instead of across denoising steps — surviving few-step distillation that erases every other caching trick.
Autoregressive video diffusion is the natural backbone for an interactive driving world: it streams chunk by chunk, conditions on a live policy's actions, and never waits for a full clip. To meet real-time budgets, models like X-World are distilled to four denoising steps.
That distillation collapses the redundancy that every prior cache relies on. TeaCache, DeepCache, ΔDiT, BWCache — all reuse block outputs across denoising steps. With S = 4, each step contributes substantial, non-redundant structural updates. There is nothing left to skip on that axis.
Worse, the per-chunk action stream is intentionally non-smooth at chunk boundaries (braking, steering, lane-change), and closed-loop generation forbids look-ahead, ruling out trajectory extrapolation and block-cascading parallelism. Every familiar acceleration tool fails here.
Driving scenes evolve smoothly relative to the generation rate. Two consecutive chunks describe almost the same physical world. The block input at position (t, b) at chunk n is therefore highly similar to the same (t, b) at chunk n−1 — even when there are only four denoising steps inside.
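Cross-step caching (same chunk, adjacent steps): reuse block outputs between adjacent denoising steps. Erased by few-step distillation.
Cross-chunk caching (adjacent chunks, same step): reuse block residuals between adjacent generation chunks. The redundancy comes from physical scene continuity, so it survives any step count S.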
Per (denoising step t, block b) slot. The cache key is the block position, not the timestep, so reuse always happens at matching positions.
The fingerprint subsamples each block input on the 3D (F, H, W) latent grid (32 tokens) and appends a global-mean channel and a flattened action vector. It is closed-form and nearly free to compute.
Cosine similarity (global direction) AND maximum token deviation (local outlier). A block is skipped only if both pass — anomalies in any view group veto reuse.
An EMA of each cell's own cosine history sets that cell's threshold; a hard floor τ_floor = 0.97 stops the threshold from loosening any further and guarantees a quality margin.
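One way the per-cell threshold could be maintained is sketched below. The text specifies only the EMA and the 0.97 floor; the decay, the margin subtracted from the EMA, the initial value, and the class name `CellThreshold` are assumptions made for illustration.

```python
TAU_FLOOR = 0.97  # hard floor stated in the text


class CellThreshold:
    """Adaptive cosine threshold for a single (step t, block b) cell."""

    def __init__(self, decay=0.9, margin=0.01, init=1.0):
        self.decay = decay    # hypothetical EMA decay factor
        self.margin = margin  # hypothetical slack allowed below the EMA
        self.ema = init       # running mean of observed cosine similarities

    def update(self, cosine):
        """Fold the latest observed cosine similarity into the EMA."""
        self.ema = self.decay * self.ema + (1.0 - self.decay) * cosine

    def value(self):
        """Current threshold: follows the EMA but never drops below the floor."""
        return max(TAU_FLOOR, self.ema - self.margin)
```

With this shape, a cell whose similarity history degrades cannot talk itself into an ever looser threshold: once the EMA minus the margin falls under 0.97, the floor takes over.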
From noise to a 7-camera chunk, in four passes.
Each chunk starts from Gaussian noise; the rolling KV cache supplies the long-horizon history. The current action vector enters every block via adaLN-Zero.
Per block input, sample 32 tokens on the 3D latent grid (proportional to Fg : Hg : Wg). Append the global mean and the action vector. Cost: a single index-and-flatten.
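A minimal NumPy sketch of this fingerprint, assuming the block input is available as an array of shape (F, H, W, C). The fixed 2 × 4 × 4 sample layout (32 tokens), the C-dimensional global mean, and the function name `fingerprint` are illustrative assumptions; the source states only that sampling is proportional to Fg : Hg : Wg.

```python
import numpy as np


def fingerprint(x, action, samples=(2, 4, 4)):
    """Build a compact fingerprint of one block input.

    x:       block input on the latent grid, shape (F, H, W, C) (assumed layout)
    action:  current action vector, any shape
    samples: tokens taken along (F, H, W); 2 * 4 * 4 = 32 (assumed split)
    """
    F, H, W, C = x.shape
    nf, nh, nw = samples
    # Evenly spaced indices along each grid axis.
    fi = np.linspace(0, F - 1, nf).round().astype(int)
    hi = np.linspace(0, H - 1, nh).round().astype(int)
    wi = np.linspace(0, W - 1, nw).round().astype(int)
    sampled = x[np.ix_(fi, hi, wi)]              # shape (nf, nh, nw, C)
    global_mean = x.reshape(-1, C).mean(axis=0)  # one value per channel
    # Index-and-flatten, then append the global mean and the action.
    return np.concatenate([sampled.reshape(-1), global_mean, np.ravel(action)])
```

The only work here is indexing, one mean, and a concatenation, which is why the fingerprint adds little on top of the block it guards.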
Compare current fingerprint to the cached fingerprint at (t, b). Skip iff cosine ≥ τ_cos(t, b) AND max-deviation < τ_dev. Otherwise compute the block and refresh the cache.
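The dual gate can be sketched as follows, treating the sampled fingerprint tokens as an array of shape (32, C). Measuring deviation as the largest per-token L2 distance is an assumption, as are the function name `gate` and any particular value of `tau_dev`; the source gives only the two conditions and the AND between them.

```python
import numpy as np


def gate(tokens_now, tokens_cached, tau_cos, tau_dev):
    """Return True only if both the global and the local check pass."""
    a = tokens_now.ravel()
    b = tokens_cached.ravel()
    # Global direction: cosine similarity over the whole fingerprint.
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    # Local outlier: largest change of any single sampled token (assumed metric).
    max_dev = float(np.linalg.norm(tokens_now - tokens_cached, axis=1).max())
    return cosine >= tau_cos and max_dev < tau_dev
```

The second check matters because a single changed token barely moves the cosine of a long vector. A localized change in one camera view can therefore pass the cosine test and still be vetoed by the deviation test.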
After the last denoising step, X-World writes clean keys and values into the persistent KV cache. On that pass we force-compute every block, refusing to let approximation contaminate the autoregressive context.
The pass that writes clean K/V into the rolling cache is force-computed in full. Skipping here is the only knob that materially breaks quality (PSNR drops from 53.4 dB to 21.5 dB).
Block 0 always recomputes, so the fresh action and conditioning cascade into every downstream block's fingerprint and naturally reset stale chains.
Off by default — the cosine gate already vetoes step 0 reuse where it would matter. Available as an extra margin under heavy distribution shift.
τ_floor = 0.97 caps how loose the per-cell EMA threshold can drift. Inactive on the current workload, but armed for tails.
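The protections above amount to a few extra conditions in the per-block loop. The sketch below is one possible arrangement, not the authors' implementation: `run_blocks`, the `gate_passes` callback, and the choice to refresh the residual cache on forced passes are all hypothetical.

```python
import numpy as np


def run_blocks(x, blocks, cache, t, is_kv_update, gate_passes):
    """Run one pass over the DiT blocks with cross-chunk residual reuse.

    cache:       dict mapping (t, b) -> residual stored on an earlier chunk
    gate_passes: callable (key, x) -> bool, standing in for the fingerprint gate
    """
    reused = []
    for b, block in enumerate(blocks):
        key = (t, b)
        # Block 0 (anchor) and the KV-update pass are never skipped.
        forced = is_kv_update or b == 0
        if not forced and key in cache and gate_passes(key, x):
            x = x + cache[key]    # reuse the cached residual
            reused.append(b)
        else:
            out = block(x)        # full compute
            cache[key] = out - x  # refresh the residual for the next chunk
            x = out
    return x, reused
```

Because block 0 always runs on the fresh action and conditioning, its output feeds the inputs of every later block, so a stale residual downstream is checked against up-to-date inputs rather than against another cached value.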
Three scenarios, two ways to look at the same comparison. Sweep the seam to view baseline (left) ↔ X-Cache (right) side by side, or open the heatmap to see exactly where (and how often) the cache deviates — concentrated on lane edges and far-field foliage, almost nothing on solid road or sky.
Seven urban clips with the highest static texture density in the test split. The cosine gate runs near-saturated; skip rate stays at 71.4% and the residual sits on lane edges and far-field foliage.
Drag the seam · scrub the timeline · ← / → to step one frame · space to play. Top-right strip is the live DiT block ribbon — gold = anchor, blue = recomputed, green = reused, indigo = KV-update protected.
Three clips. Long depth of field, rapid forward motion, sparse foreground. The gate skips 71.6% of blocks — and decoded 7-cam PSNR climbs to 54.7 dB because most pixels are sky/asphalt that absorb latent perturbations cleanly.
Three clips where the ego vehicle executes a sharp heading change. Adjacent chunks are the most different in the dataset — yet the cross-chunk fingerprint still survives, with skip rate 71.3% and no chunk-boundary drift visible in the per-frame PSNR trace.
| Scenario / camera | PSNR ↑ (dB) | SSIM ↑ | LPIPS ↓ | Skip rate | DiT time | Speedup |
|---|---|---|---|---|---|---|
| Urban street · n=7 | | | | 71.4 % | 1.392 s | 2.7× |
| F-C | 53.83 | 0.9988 | 3.6e-4 | | | |
| F-W | 50.27 | 0.9987 | 4.3e-4 | | | |
| S-FL | 49.49 | 0.9985 | 5.1e-4 | | | |
| S-FR | 48.69 | 0.9984 | 5.2e-4 | | | |
| S-RL | 48.59 | 0.9985 | 4.8e-4 | | | |
| S-RR | 48.07 | 0.9985 | 5.2e-4 | | | |
| Rear | 51.77 | 0.9986 | 4.7e-4 | | | |
| 7-cam | 51.37 | 0.9990 | 3.3e-4 | | | |
| Highway · n=3 | | | | 71.6 % | 1.365 s | 2.7× |
| F-C | 54.87 | 0.9989 | 2.6e-4 | | | |
| F-W | 54.38 | 0.9988 | 2.3e-4 | | | |
| S-FL | 53.08 | 0.9987 | 2.8e-4 | | | |
| S-FR | 52.20 | 0.9987 | 2.9e-4 | | | |
| S-RL | 52.48 | 0.9987 | 2.5e-4 | | | |
| S-RR | 51.90 | 0.9986 | 3.0e-4 | | | |
| Rear | 53.42 | 0.9987 | 3.2e-4 | | | |
| 7-cam | 54.67 | 0.9991 | 1.9e-4 | | | |
| U-turn · n=3 | | | | 71.3 % | 1.364 s | 2.7× |
| F-C | 54.60 | 0.9987 | 4.3e-4 | | | |
| F-W | 51.79 | 0.9987 | 3.6e-4 | | | |
| S-FL | 49.29 | 0.9985 | 4.6e-4 | | | |
| S-FR | 49.18 | 0.9985 | 4.7e-4 | | | |
| S-RL | 48.87 | 0.9985 | 4.0e-4 | | | |
| S-RR | 48.82 | 0.9984 | 4.9e-4 | | | |
| Rear | 52.51 | 0.9986 | 4.2e-4 | | | |
| 7-cam | 52.04 | 0.9990 | 3.1e-4 | | | |
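
As a sanity check on the last three columns, the skip rate gives a simple upper bound on the achievable speedup. The helper below is a back-of-the-envelope sketch under two assumptions not stated in the source: every block costs the same, and a skipped block costs nothing. The function name `ideal_speedup` is hypothetical.

```python
def ideal_speedup(skip_rate):
    """Upper bound on speedup if skipped blocks were free and all blocks cost the same."""
    return 1.0 / (1.0 - skip_rate)


# Skip rates reported for the three scenarios.
for name, skip in [("Urban street", 0.714), ("Highway", 0.716), ("U-turn", 0.713)]:
    print(f"{name}: ideal bound {ideal_speedup(skip):.2f}x, reported 2.7x")
```

Under those assumptions a skip rate near 71% bounds the speedup at roughly 3.5×. The reported 2.7× sits below that bound, as expected once fingerprinting, gating, and the force-computed passes are counted.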