Technical Report · 2026

X-Cache v1.0

Cross-Chunk Block Caching for Few-Step Autoregressive World Models

Generate the next chunk of a 7-camera driving world without paying full DiT compute. X-Cache reuses block residuals across consecutive chunks instead of across denoising steps — surviving few-step distillation that erases every other caching trick.

  • DiT speedup: 2.7×
  • Block skip rate: 71%
  • 7-cam PSNR: ≥ 51 dB
  • Training cost: 0
01 · The problem

Real-time world simulation breaks every existing cache.

Autoregressive video diffusion is the natural backbone for an interactive driving world: it streams chunk by chunk, conditions on a live policy's actions, and never waits for a full clip. To meet real-time budgets, models like X-World are distilled to four denoising steps.

That distillation collapses the redundancy that every prior cache relies on. TeaCache, DeepCache, ΔDiT and BWCache all reuse block outputs across denoising steps. With only S = 4 denoising steps, each step contributes substantial, non-redundant structural updates. There is nothing left to skip on that axis.

Worse, the per-chunk action stream is intentionally non-smooth at chunk boundaries (braking, steering, lane-change), and closed-loop generation forbids look-ahead, ruling out trajectory extrapolation and block-cascading parallelism. Every familiar acceleration tool fails here.

02 · The insight

Cache along a different axis.

Driving scenes evolve smoothly relative to the generation rate. Two consecutive chunks describe almost the same physical world. The block input at position (t, b), that is, denoising step t and block b, in chunk n is therefore highly similar to the input at the same (t, b) in chunk n−1, even though each chunk has only four denoising steps.

Cross-step (others)

↦ same chunk · adjacent step

Reuse block outputs between adjacent denoising steps. Erased by few-step distillation.

Cross-chunk (X-Cache)

↧ same step · adjacent chunk

Reuse block residuals between adjacent generation chunks. Comes from physical scene continuity, survives any S.

Block residuals from chunk n are written into a per-(t, b) cache. At chunk n+1, the gate decides per block whether to recompute or reuse.

Block-level residual cache

Per (denoising step t, block b) slot. The cache key is the block position, not the timestep, so reuse always happens at matching positions.
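
A minimal sketch of such a cache, assuming a plain dictionary with one slot per (t, b). The names `CacheEntry`, `ResidualCache`, `store` and `lookup` are illustrative and do not come from the report.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

import numpy as np

Key = Tuple[int, int]  # (denoising step t, block index b)


@dataclass
class CacheEntry:
    residual: np.ndarray     # block output minus block input, from an earlier chunk
    fingerprint: np.ndarray  # fingerprint of the block input that produced it


@dataclass
class ResidualCache:
    """Hypothetical per-(t, b) store with one slot per block position."""
    slots: Dict[Key, CacheEntry] = field(default_factory=dict)

    def store(self, t: int, b: int, residual: np.ndarray, fingerprint: np.ndarray) -> None:
        # Overwrite the slot: only the most recent computed result is kept.
        self.slots[(t, b)] = CacheEntry(residual.copy(), fingerprint.copy())

    def lookup(self, t: int, b: int) -> Optional[CacheEntry]:
        # Returns None when nothing has been cached yet (e.g. the first chunk).
        return self.slots.get((t, b))


cache = ResidualCache()
cache.store(t=2, b=5, residual=np.ones(4), fingerprint=np.zeros(3))
entry = cache.lookup(2, 5)
missing = cache.lookup(0, 0)
```

Because the key is the position (t, b), a lookup at chunk n+1 can only ever return the residual produced at the same step and block.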

Structure & action-aware fingerprint

Subsamples block inputs on the 3D (F, H, W) latent grid (32 tokens), plus a global-mean channel and a flattened action vector. Closed-form, ~free to compute.

Dual-metric gate

Cosine similarity (global direction) AND maximum token deviation (local outlier). A block is skipped only if both pass — anomalies in any view group veto reuse.
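
A sketch of how the two metrics could combine, assuming the fingerprint is laid out as rows of tokens. The per-token relative L2 change used for the deviation, the function name `dual_gate` and the `tau_dev` values in the example are assumptions; the report does not spell out the exact deviation formula.

```python
import numpy as np


def dual_gate(curr: np.ndarray, prev: np.ndarray, tau_cos: float, tau_dev: float) -> bool:
    """Return True if the block may be skipped.

    curr, prev: (num_tokens, channels) fingerprint tokens for the same (t, b)
    in the current chunk and in the cache.
    """
    eps = 1e-8
    a, b = curr.ravel(), prev.ravel()
    # Global direction: cosine similarity over the whole fingerprint.
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
    # Local outlier: worst per-token relative change (assumed definition).
    token_dev = np.linalg.norm(curr - prev, axis=1) / (np.linalg.norm(prev, axis=1) + eps)
    max_dev = float(token_dev.max())
    # Both tests must pass; either one alone can veto reuse.
    return cosine >= tau_cos and max_dev < tau_dev


rng = np.random.default_rng(0)
prev = rng.normal(size=(32, 16))
same = prev.copy()
outlier = prev.copy()
outlier[3] *= -1.0  # a single token flips sign; the other 31 are unchanged

skip_same = dual_gate(same, prev, tau_cos=0.97, tau_dev=0.5)
skip_outlier = dual_gate(outlier, prev, tau_cos=0.90, tau_dev=0.5)
```

In the second call the flipped token has a relative deviation of 2, so the deviation test vetoes reuse regardless of what the cosine reports.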

Per-(t, b) adaptive threshold

An EMA of each cell's own cosine history sets the threshold from below; a hard floor τ_floor = 0.97 guarantees a quality margin.
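
One way to read this rule, as a sketch only: keep an EMA per cell, place the threshold a small margin below it, and never let it fall under the floor. The floor value 0.97 is from the text; the `decay` and `margin` values and the class name `AdaptiveThreshold` are assumptions.

```python
from typing import Dict, Tuple


class AdaptiveThreshold:
    """Hypothetical per-(t, b) cosine threshold with a hard floor."""

    def __init__(self, tau_floor: float = 0.97, decay: float = 0.9, margin: float = 0.01):
        self.tau_floor = tau_floor  # hard floor, from the text
        self.decay = decay          # assumed EMA decay
        self.margin = margin        # assumed slack below the EMA
        self.ema: Dict[Tuple[int, int], float] = {}

    def update(self, t: int, b: int, cosine: float) -> None:
        key = (t, b)
        prev = self.ema.get(key, cosine)  # seed the EMA with the first observation
        self.ema[key] = self.decay * prev + (1.0 - self.decay) * cosine

    def threshold(self, t: int, b: int) -> float:
        ema = self.ema.get((t, b))
        if ema is None:
            return self.tau_floor  # no history yet: fall back to the floor
        return max(self.tau_floor, ema - self.margin)


thr = AdaptiveThreshold()
for c in (0.999, 0.998, 0.999):
    thr.update(1, 3, c)
tight = thr.threshold(1, 3)    # follows the cell's own history, above the floor

for c in (0.90, 0.90, 0.90):
    thr.update(0, 0, c)
floored = thr.threshold(0, 0)  # loose history, so the floor takes over
```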

03 · The pipeline

What happens inside one chunk.

From noise to a 7-camera chunk, in four passes.

01

Sample noise & init context

Each chunk starts from Gaussian noise; the rolling KV cache supplies the long-horizon history. The current action vector enters every block via adaLN-Zero.

02

Compute fingerprints

Per block input, sample 32 tokens on the 3D latent grid (proportional to F_g : H_g : W_g). Append the global mean and the action vector. Cost: a single index-and-flatten.
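
The fingerprint step can be sketched as below, assuming the block input is available as an (F, H, W, C) array. The per-axis split `counts=(2, 4, 4)` is an assumed allocation that merely hits the 32-token budget; the report's proportional rule may choose differently, and the function name is illustrative.

```python
import numpy as np


def fingerprint(x: np.ndarray, action: np.ndarray, counts=(2, 4, 4)) -> np.ndarray:
    """x: block input of shape (F, H, W, C); action: 1-D action vector.

    counts: tokens sampled along (F, H, W). The default 2 * 4 * 4 = 32 is an
    assumption chosen only to match the 32-token budget.
    """
    # Evenly spaced indices along each grid axis.
    idx = [
        np.linspace(0, n - 1, k).round().astype(int)
        for n, k in zip(x.shape[:3], counts)
    ]
    sampled = x[np.ix_(*idx)]             # (kf, kh, kw, C)
    global_mean = x.mean(axis=(0, 1, 2))  # (C,)
    # Index, flatten, and append the global mean and the action vector.
    return np.concatenate([sampled.ravel(), global_mean, action.ravel()])


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 28, 8))  # toy latent grid with 8 channels
action = np.array([0.1, -0.3, 0.0])
fp = fingerprint(x, action)
```

With 32 sampled tokens of 8 channels, the 8-channel mean and a 3-element action, the toy fingerprint has 32 × 8 + 8 + 3 = 267 entries.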

03

Gate decision

Compare current fingerprint to the cached fingerprint at (t, b). Skip iff cosine ≥ τ_cos(t, b) AND max-deviation < τ_dev. Otherwise compute the block and refresh the cache.
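
The skip-or-compute decision for a single block might look like the following sketch. The fingerprint and gate are passed in as callables, so nothing here fixes their definitions; `run_block`, `toy_block`, `mean_fp` and `cos_gate` are hypothetical names, and the toy block is a stand-in for a real DiT block.

```python
from typing import Callable, Dict, Tuple

import numpy as np

Key = Tuple[int, int]                 # (denoising step t, block index b)
Slot = Tuple[np.ndarray, np.ndarray]  # (cached fingerprint, cached residual)


def run_block(
    x: np.ndarray,
    key: Key,
    block: Callable[[np.ndarray], np.ndarray],
    fingerprint: Callable[[np.ndarray], np.ndarray],
    may_skip: Callable[[np.ndarray, np.ndarray], bool],
    cache: Dict[Key, Slot],
) -> Tuple[np.ndarray, bool]:
    """Apply one block at position key; returns (output, skipped)."""
    fp = fingerprint(x)
    slot = cache.get(key)
    if slot is not None and may_skip(fp, slot[0]):
        # Reuse: add the cached residual to the current input.
        return x + slot[1], True
    # Recompute, then refresh both the fingerprint and the residual.
    out = block(x)
    cache[key] = (fp, out - x)
    return out, False


def toy_block(x: np.ndarray) -> np.ndarray:
    return x + 0.5 * np.tanh(x)  # stand-in for a DiT block


def mean_fp(x: np.ndarray) -> np.ndarray:
    return x.mean(axis=0)  # toy fingerprint: per-channel mean


def cos_gate(a: np.ndarray, b: np.ndarray) -> bool:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))) >= 0.97


cache: Dict[Key, Slot] = {}
x0 = np.random.default_rng(0).normal(size=(64, 8)) + 1.0
_, skipped_first = run_block(x0, (0, 1), toy_block, mean_fp, cos_gate, cache)
y, skipped_second = run_block(x0 + 1e-3, (0, 1), toy_block, mean_fp, cos_gate, cache)
```

Note that the cache is refreshed only on recompute, so a run of skips keeps comparing against the last fingerprint that was actually computed.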

04

KV update — protected

After the last denoising step, X-World writes a clean key/value into the persistent cache. On that pass we force-compute every block, refusing to let approximation contaminate the autoregressive context.

04 · Safety mechanisms

Four guardrails that keep approximation contained.

KV-update protection

The pass that writes clean K/V into the rolling cache is force-computed in full. Skipping here is the only knob that materially breaks quality (PSNR drops from 53.4 dB to 21.5 dB).

Anchor block (F_n = 1)

Block 0 always recomputes, so the fresh action and conditioning cascade into every downstream block's fingerprint and naturally reset stale chains.

Step-0 protection (optional)

Off by default — the cosine gate already vetoes step 0 reuse where it would matter. Available as an extra margin under heavy distribution shift.

Adaptive threshold floor

τ_floor = 0.97 caps how loose the per-cell EMA threshold can drift. Inactive on the current workload, but armed for tails.
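
The first three guardrails reduce to a predicate that is checked before the gate is consulted; the floor acts inside the gate's threshold instead. A sketch, with `Guards` and `must_recompute` as hypothetical names and the defaults taken from the descriptions above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guards:
    anchor_block: int = 0        # block 0 always recomputes
    protect_step0: bool = False  # optional step-0 protection, off by default


def must_recompute(t: int, b: int, is_kv_update_pass: bool, guards: Guards = Guards()) -> bool:
    """True if block b at denoising step t is exempt from caching."""
    if is_kv_update_pass:         # clean K/V write: every block is computed
        return True
    if b == guards.anchor_block:  # anchor block resets stale chains
        return True
    if guards.protect_step0 and t == 0:
        return True
    return False


anchor_forced = must_recompute(t=1, b=0, is_kv_update_pass=False)
kv_forced = must_recompute(t=1, b=3, is_kv_update_pass=True)
step0_default = must_recompute(t=0, b=3, is_kv_update_pass=False)
step0_strict = must_recompute(t=0, b=3, is_kv_update_pass=False,
                              guards=Guards(protect_step0=True))
```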

05 · Lossless · in motion

Drag the curtain. There is no quality drop.

Three scenarios, two ways to look at the same comparison. Sweep the seam to view baseline (left) ↔ X-Cache (right) side by side, or open the heatmap to see exactly where (and how often) the cache deviates — concentrated on lane edges and far-field foliage, almost nothing on solid road or sky.

Urban

Dense traffic, pedestrians, storefronts.

Seven urban clips with the highest static texture density in the test split. The cosine gate runs near-saturated; skip rate stays at 71.4% and the residual sits on lane edges and far-field foliage.

  • 7 clips
  • 264 frames each
  • PSNR 51.4 dB
  • skip 71.4%
  • speedup 2.7×

Drag the seam · scrub the timeline · ← / → to step one frame · space to play. Top-right strip is the live DiT block ribbon — gold = anchor, blue = recomputed, green = reused, indigo = KV-update protected.

Highway

Elevated express ring & ordinary motorway.

Three clips. Long depth of field, rapid forward motion, sparse foreground. The gate skips 71.6% of blocks — and decoded 7-cam PSNR climbs to 54.7 dB because most pixels are sky/asphalt that absorb latent perturbations cleanly.

  • 3 clips
  • PSNR 54.7 dB
  • skip 71.6%
  • speedup 2.7×

U-turn

Maximum cross-chunk motion in the split.

Three clips where the ego vehicle executes a sharp heading change. Adjacent chunks are the most different in the dataset — yet the cross-chunk fingerprint still survives, with skip rate 71.3% and no chunk-boundary drift visible in the per-frame PSNR trace.

  • 3 clips
  • PSNR 52.0 dB
  • skip 71.3%
  • speedup 2.7×

06 · Numbers

Same seed, same conditioning, 2.7× faster — and the pixels prove it.

Per-scenario summary against the no-cache baseline. Per-camera rows are F-C (front centre), F-W (front wide), S-FL/FR/RL/RR (sides) and Rear; the 7-cam row is the stitched 7-view image used in the qualitative figures. Skip, DiT and Speed are reported once per scenario.
Scenario / camera     PSNR ↑ (dB)   SSIM ↑   LPIPS ↓   Skip     DiT       Speed
Urban street · n=7                                     71.4 %   1.392 s   2.7×
  F-C                 53.83         0.9988   3.6e-4
  F-W                 50.27         0.9987   4.3e-4
  S-FL                49.49         0.9985   5.1e-4
  S-FR                48.69         0.9984   5.2e-4
  S-RL                48.59         0.9985   4.8e-4
  S-RR                48.07         0.9985   5.2e-4
  Rear                51.77         0.9986   4.7e-4
  7-cam               51.37         0.9990   3.3e-4
Highway · n=3                                          71.6 %   1.365 s   2.7×
  F-C                 54.87         0.9989   2.6e-4
  F-W                 54.38         0.9988   2.3e-4
  S-FL                53.08         0.9987   2.8e-4
  S-FR                52.20         0.9987   2.9e-4
  S-RL                52.48         0.9987   2.5e-4
  S-RR                51.90         0.9986   3.0e-4
  Rear                53.42         0.9987   3.2e-4
  7-cam               54.67         0.9991   1.9e-4
U-turn · n=3                                           71.3 %   1.364 s   2.7×
  F-C                 54.60         0.9987   4.3e-4
  F-W                 51.79         0.9987   3.6e-4
  S-FL                49.29         0.9985   4.6e-4
  S-FR                49.18         0.9985   4.7e-4
  S-RL                48.87         0.9985   4.0e-4
  S-RR                48.82         0.9984   4.9e-4
  Rear                52.51         0.9986   4.2e-4
  7-cam               52.04         0.9990   3.1e-4
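
Two quick sanity checks help when reading these numbers. Both rest on assumptions that the report does not state: that all blocks cost the same and skipped blocks cost nothing (for the speedup bound), and that PSNR is computed on an 8-bit scale with peak 255 (for the error magnitude). The function names are illustrative.

```python
def ideal_speedup(skip_rate: float) -> float:
    """Upper bound if skipped blocks are free and all blocks cost the same."""
    return 1.0 / (1.0 - skip_rate)


def rmse_from_psnr(psnr_db: float, peak: float = 255.0) -> float:
    """Invert PSNR = 20 * log10(peak / RMSE)."""
    return peak / (10.0 ** (psnr_db / 20.0))


urban_bound = ideal_speedup(0.714)  # about 3.5x, versus the 2.7x measured
urban_rmse = rmse_from_psnr(51.37)  # about 0.69 grey levels out of 255
```

Under these assumptions, a 71.4% skip rate could give at most about 3.5×, so the measured 2.7× suggests some cost is not removed by skipping, plausibly the always-computed blocks and the gate itself. That is a reading of the table, not a breakdown reported here. Similarly, a 7-cam PSNR of 51.37 dB would correspond to a root-mean-square error below one 8-bit grey level.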