The first training-free framework for high-fidelity appearance transfer specifically designed for Diffusion Transformers (DiTs).
March 31, 2026
Original Paper
A training-free framework for high-fidelity appearance transfer via diffusion transformers
arXiv · 2603.26767
The Takeaway
DiTs are replacing U-Nets as the standard architecture for generative models (e.g., Sora), but they are notoriously difficult to control for reference-based editing. This method enables precise material and texture transfer at 1024px resolution without expensive fine-tuning or LoRAs.
From the abstract
Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity in…
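The excerpt cuts off before the method details, but the general idea of training-free, attention-based appearance injection can be sketched. Below is a minimal, hypothetical PyTorch example of one common mechanism in this family: letting the target image's queries attend to keys and values cached from a reference image's denoising pass, blended with ordinary self-attention so the target's structure is not overwritten. This is an illustrative sketch under stated assumptions, not the paper's actual implementation; the function name, tensor layout, and `blend` parameter are all assumptions.

```python
import torch
import torch.nn.functional as F

def appearance_injected_attention(
    q_target: torch.Tensor,  # (B, heads, N, d) queries from the structure/target image
    k_target: torch.Tensor,  # (B, heads, N, d) keys from the target image
    v_target: torch.Tensor,  # (B, heads, N, d) values from the target image
    k_ref: torch.Tensor,     # (B, heads, N, d) keys cached from the reference pass
    v_ref: torch.Tensor,     # (B, heads, N, d) values cached from the reference pass
    blend: float = 0.8,      # hypothetical knob: how strongly to pull reference appearance
) -> torch.Tensor:
    """Hybrid attention: target queries attend to both reference and target KV.

    Sketch of a generic training-free appearance-transfer step, not the
    paper's method.
    """
    # Attending to the reference carries over its appearance statistics...
    ref_out = F.scaled_dot_product_attention(q_target, k_ref, v_ref)
    # ...while ordinary self-attention preserves the target's scene layout.
    tgt_out = F.scaled_dot_product_attention(q_target, k_target, v_target)
    # A linear blend avoids the structural collapse that a naive full KV swap
    # can cause in a globally-attending DiT.
    return blend * ref_out + (1.0 - blend) * tgt_out
```

In this family of techniques, the blend strength and the choice of which blocks and timesteps receive injection control the structure/appearance trade-off; a full swap in every layer of a globally-attending DiT tends to produce exactly the structural disruption the abstract warns about.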