AI & ML Paradigm Shift

Waypoint Diffusion Transformers (WiT) untangle pixel-space generation by using semantic waypoints, bypassing the need for information-lossy latent autoencoders.

March 17, 2026

Original Paper

WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

Hainuo Wang, Mingjia Li, Xiaojie Guo

arXiv · 2603.15132

The Takeaway

Most modern image generators rely on VAE-based latents, which lose detail; WiT shows that pixel-space diffusion is viable and efficient when generation trajectories are semantically guided. This allows high-fidelity image synthesis without the reconstruction bottlenecks of standard architectures.

From the abstract

While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT).
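To make the "trajectory conflict" concrete, here is a minimal 1-D sketch of the standard flow-matching setup (linear interpolation between noise and data, velocity regression). The setup is the textbook rectified-flow objective, not code from the paper; it shows how two straight noise-to-data paths that cross hand the model contradictory velocity targets at the same input:

```python
import numpy as np

# Flow matching in 1-D "pixel" space:
# interpolant x_t = (1 - t) * x0 + t * x1, regression target v = x1 - x0.
# Pair A: noise x0 = -1.0 -> data x1 = +1.0  (velocity +2)
# Pair B: noise x0 = +1.0 -> data x1 = -1.0  (velocity -2)
pairs = [(-1.0, 1.0), (1.0, -1.0)]

t = 0.5  # both straight-line paths pass through x_t = 0 at t = 0.5
points, targets = [], []
for x0, x1 in pairs:
    xt = (1 - t) * x0 + t * x1
    points.append(xt)
    targets.append(x1 - x0)

# Identical input (x_t, t) = (0.0, 0.5), but opposite targets:
print(points)            # [0.0, 0.0]
print(targets)           # [2.0, -2.0]

# An MSE-trained velocity field v(x_t, t) can only predict the mean of the
# conflicting targets at the crossing -- here 0 -- bending trajectories away
# from both data points. This is the sub-optimality the abstract describes;
# the paper's remedy (semantic waypoints that disambiguate the path) is not
# reproduced here.
print(np.mean(targets))  # 0.0
```

In latent diffusion this conflict is softened because the autoencoder's semantic latent space separates such paths; WiT instead keeps pixel space and resolves the ambiguity with waypoint guidance.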