Two denoising steps are enough to push diffusion‑based lip sync into real time, overturning the long‑standing belief that dozens of iterations are mandatory. Lip Forcing makes this possible by reshaping the conditioning pipeline rather than by scaling model size.
Before this work, diffusion‑based lip synchronization relied on 50+ bidirectional denoising steps and full‑sequence attention, which pushed inference latency into the range of several seconds and ruled out interactive use. Prior systems such as the 14B teacher model, which uses full‑sequence bidirectional attention, were not designed for real‑time streaming and run significantly slower than the proposed students.
The payoff is concrete: “The 1.3B student crosses into real-time streaming at 31 FPS, 17.6× faster than its same‑scale bidirectional model.” [1] The authors report that the two‑step student maintains comparable reference fidelity while improving synchronization and speed, so the fidelity‑sync trade‑off that accompanies the speedup is modest.
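To put the quoted numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes that the 17.6× figure is a ratio of frame rates, which the quote suggests but does not spell out; the function names are illustrative and do not come from the paper.

```python
def implied_baseline_fps(student_fps: float, speedup: float) -> float:
    """Throughput of the slower model, assuming speedup = student_fps / baseline_fps."""
    return student_fps / speedup


def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time available per frame at a given frame rate."""
    return 1000.0 / fps


student_fps = 31.0   # quoted in [1]
speedup = 17.6       # quoted in [1]

baseline_fps = implied_baseline_fps(student_fps, speedup)

print(f"implied bidirectional baseline: {baseline_fps:.2f} FPS")
print(f"student budget per frame:  {frame_budget_ms(student_fps):.1f} ms")
print(f"baseline budget per frame: {frame_budget_ms(baseline_fps):.1f} ms")
```

Under that assumption the same-scale bidirectional model runs at roughly 1.8 FPS, or more than half a second per frame, against about 32 ms per frame for the student.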
The speedup comes directly from a two‑step inference schedule that discards classifier‑free guidance at test time. “At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization.” [1] The schedule is paired with Sync‑Window DMD (distribution matching distillation) and a SyncNet‑based reward, which together keep the audio‑visual alignment tight despite the drastic step reduction.
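The following minimal sketch illustrates why such a schedule is cheap. It is not the paper's implementation: `ToyDenoiser`, `two_step_chunk`, the timesteps `(1.0, 0.5)` and the re-noising rule are all assumptions borrowed from common few-step samplers, and the autoregressive context carried between chunks is omitted for brevity. The cost comparison further assumes a 50-step baseline that runs CFG with two network evaluations per step.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]


def forward_passes(steps: int, use_cfg: bool) -> int:
    """Network evaluations per chunk; CFG needs a conditional and an unconditional pass."""
    return steps * (2 if use_cfg else 1)


class ToyDenoiser:
    """Hypothetical stand-in for the student network; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector, t: float, audio: Sequence[float]) -> Vector:
        self.calls += 1
        target = sum(audio) / len(audio)  # toy "clean" prediction driven by the audio
        return [target for _ in x]


def two_step_chunk(
    denoiser: ToyDenoiser,
    audio: Sequence[float],
    timesteps: Sequence[float] = (1.0, 0.5),  # assumed noise levels, not from the paper
    dim: int = 4,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Generate one chunk with one conditional pass per timestep and no CFG."""
    if rng is None:
        rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    x0 = x
    for i, t in enumerate(timesteps):
        x0 = denoiser(x, t, audio)  # predict the clean chunk
        if i + 1 < len(timesteps):
            # Re-noise the prediction to the next (lower) noise level.
            t_next = timesteps[i + 1]
            x = [(1.0 - t_next) * c + t_next * rng.gauss(0.0, 1.0) for c in x0]
    return x0


denoiser = ToyDenoiser()
rng = random.Random(0)
audio_stream = [[0.2, 0.4], [0.1, 0.3], [0.5, 0.7]]  # three toy audio chunks
frames = [two_step_chunk(denoiser, chunk, rng=rng) for chunk in audio_stream]

print("denoiser calls for 3 chunks:", denoiser.calls)
print("passes per chunk, 50 steps with CFG:", forward_passes(50, use_cfg=True))
print("passes per chunk, 2 steps without CFG:", forward_passes(2, use_cfg=False))
```

In this toy setting each chunk costs two network evaluations instead of one hundred. That is the source of the step-count savings; the quoted 17.6× figure is an end-to-end measurement and need not match this ratio.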
The result is limited to autoregressive lip‑sync chunks and hinges on a large teacher model that must first be distilled. That pipeline may be costly to train and has only been verified on speaking faces. Moreover, the paper itself notes a fidelity‑sync trade‑off that could surface in more diverse video domains, suggesting that the two‑step recipe might need adaptation for tasks with higher texture complexity or longer temporal horizons.
If diffusion can be collapsed to two passes for lip sync, every video‑to‑video (V2V) benchmark that still reports 50‑step runtimes should be revisited with a Lip Forcing‑style student. Real‑time interactive avatars, live‑streaming filters, and on‑device speech‑driven animation could replace heavy bidirectional backbones with lightweight two‑step students while keeping visual quality comparable, provided the trade‑offs reported for speaking faces carry over.
Will the next wave of video diffusion models drop the step count to one, making truly instantaneous generation a default?













