SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation

Bringing real-world video properties back to generative dynamics

Xu Zhang¹, Yu Lu¹, Ruijie Quan¹, Zhaozheng Chen², Bohan Wang², Yi Yang¹,*
¹ReLER, College of Artificial Intelligence, Zhejiang University, ²Huawei Central Research Institute (*Corresponding Author)
Baseline (Wan2.2)
Ours (SpecLoR-Zero)
Ours (SpecLoR-Adapter)

SpecLoR mitigates trajectory drift and motion inconsistencies in latent Flow Matching. It recovers realistic, coherent dynamics by rectifying spatiotemporal amplitudes in the frequency domain.

Abstract

Flow Matching has enabled robust text-to-video generation via latent ODE sampling. However, errors from velocity approximation and numerical discretization inevitably accumulate, causing sampling trajectories to drift. Consequently, generated videos often suffer from severe spatiotemporal inconsistencies, such as physically implausible motion. Directly correcting these drifted noisy latents is challenging: timestep-dependent noise obscures reliable structural cues, and spatial-domain interventions risk disrupting fragile local geometry while incurring heavy computational costs. To address this, we propose Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference method that bypasses the noise via lookahead prediction and circumvents spatiotemporal entanglement by shifting corrections to the frequency domain, where universal statistical priors of natural videos are readily available. First, during early sampling stages, SpecLoR looks ahead to estimate the clean latent $z_{t,0}$ and computes its 3D spatiotemporal spectrum. Next, SpecLoR rectifies only the amplitude to match the statistical prior, leaving the phase intact. Finally, the corrected state is re-noised to resume ODE integration. Experiments on Wan2.2 demonstrate that SpecLoR significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks with minimal computational overhead (4 additional NFEs in a 40-step schedule).

How does SpecLoR work?

SpecLoR operates entirely at inference time, during sampling. The four stages below describe its mechanism.

Stage 1: Lookahead Projection

Diagnosing trajectory drift directly in the intermediate noisy latent $z_t$ is impractical because timestep-dependent noise dominates the signal. To bypass this, we use the currently predicted vector field $v_\theta(z_t, t)$ to project the state into a noise-free space via a single Euler step:

$$z_{t,0} = z_t - t \cdot v_\theta(z_t, t)$$

This lookahead prediction $z_{t,0}$ acts as a diagnostic window, revealing early structural ambiguities before they solidify into physical artifacts.
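The lookahead step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function name `lookahead_clean_latent`, the toy `(T, H, W)` latent shape, and the closed-form `exact_velocity` (a stand-in for the learned $v_\theta$, valid only for the linear path $z_t = (1 - t) \cdot z_0 + t \cdot \epsilon$) are all assumptions made for the example.

```python
import numpy as np

def lookahead_clean_latent(z_t, t, velocity):
    """Single Euler step from time t to time 0: z_{t,0} = z_t - t * v(z_t, t)."""
    return z_t - t * velocity(z_t, t)

# Toy setup (assumption): a linear interpolation path between data and noise.
rng = np.random.default_rng(0)
z_0 = rng.standard_normal((4, 8, 8))   # hypothetical clean latent, shape (T, H, W)
eps = rng.standard_normal(z_0.shape)   # Gaussian noise
t = 0.9                                # an early, high-noise timestep
z_t = (1.0 - t) * z_0 + t * eps

# Stand-in for v_theta: the exact velocity of the linear path is eps - z_0.
exact_velocity = lambda z, t: eps - z_0

z_look = lookahead_clean_latent(z_t, t, exact_velocity)
max_err = float(np.max(np.abs(z_look - z_0)))  # tiny: the exact field recovers z_0
```

With a learned $v_\theta$ the recovery is only approximate, which is why the lookahead serves as a diagnostic estimate rather than a final sample.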

Stage 2: Frequency Decoupling

Directly modifying the spatial representation of $z_{t,0}$ is challenging because global motion structures and fine geometric details are tightly coupled. We overcome this by shifting intervention to the frequency domain via a 3D Fast Fourier Transform (FFT) across $(T, H, W)$.

This transform explicitly decouples the macroscopic energy (amplitude $\mathcal{A}$), which is susceptible to trajectory drift yet safe to rectify, from the fragile, geometry-encoding phase $\mathcal{P}$.
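To make the decomposition concrete, the sketch below splits a latent into amplitude and phase with a 3D FFT over the last three axes and then reassembles it. The helper names and the toy tensor shape are assumptions for illustration; only standard `numpy.fft` calls are used.

```python
import numpy as np

FFT_AXES = (-3, -2, -1)  # (T, H, W); any leading axes (e.g. channels) are untouched

def split_amplitude_phase(z):
    """Return amplitude |F(z)| and phase angle(F(z)) of the 3D spectrum."""
    spectrum = np.fft.fftn(z, axes=FFT_AXES)
    return np.abs(spectrum), np.angle(spectrum)

def merge_amplitude_phase(amplitude, phase):
    """Rebuild the latent from amplitude and phase via the inverse 3D FFT."""
    spectrum = amplitude * np.exp(1j * phase)
    # The input was real, so any imaginary part left here is numerical noise.
    return np.fft.ifftn(spectrum, axes=FFT_AXES).real

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8, 8))      # hypothetical (T, H, W) latent
amp, phase = split_amplitude_phase(z)
z_back = merge_amplitude_phase(amp, phase)
roundtrip_err = float(np.max(np.abs(z_back - z)))
```

Because the round trip is lossless up to floating-point error, any change in the output after rectification comes from the amplitude edit alone.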

Stage 3: Amplitude Rectification

We rectify only the corrupted amplitude spectrum and leave the delicate phase unchanged. We propose two variants:

  • SpecLoR-Zero: Uses a universal, zero-cost global prior. Real-world videos exhibit a $1/f^\alpha$ power-law decay across frequency $f$, so we bound the amplitude within this natural variance envelope: $$\mathcal{A}_{rect} = \mathcal{A}_{curr} + \lambda \cdot (\mathcal{A}_{tgt} - \mathcal{A}_{curr})$$ Here $\mathcal{A}_{curr}$ is the current amplitude of the lookahead spectrum, $\mathcal{A}_{tgt}$ is the target amplitude given by the prior, and $\lambda$ sets the rectification strength.
  • SpecLoR-Adapter: Uses a lightweight, context-aware DiT to predict an instance-specific residual amplitude map $\Delta\mathcal{A}$, trained with a log-magnitude MSE loss.
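The SpecLoR-Zero update can be illustrated with a short NumPy sketch. The interpolation formula is taken from the text above; everything else is an assumption made for the example, not the authors' construction: the target is built as a radial $1/f^\alpha$ profile, rescaled to the mean non-DC amplitude of the input, with the DC term left untouched, and `alpha=1.0` and `lam=0.5` are arbitrary illustrative values.

```python
import numpy as np

def power_law_target(amp, alpha=1.0):
    """Hypothetical 1/f^alpha target amplitude for a (T, H, W) spectrum."""
    T, H, W = amp.shape
    ft = np.fft.fftfreq(T)[:, None, None]
    fh = np.fft.fftfreq(H)[None, :, None]
    fw = np.fft.fftfreq(W)[None, None, :]
    radius = np.sqrt(ft**2 + fh**2 + fw**2)   # radial frequency per bin
    non_dc = radius > 0
    profile = np.zeros_like(radius)
    profile[non_dc] = radius[non_dc] ** (-alpha)
    # Assumption: match the mean non-DC amplitude of the current spectrum.
    scale = amp[non_dc].mean() / profile[non_dc].mean()
    target = scale * profile
    target[~non_dc] = amp[~non_dc]            # assumption: keep the DC term as is
    return target

def rectify_amplitude(amp_curr, amp_tgt, lam):
    """A_rect = A_curr + lam * (A_tgt - A_curr)."""
    return amp_curr + lam * (amp_tgt - amp_curr)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8, 8))            # hypothetical lookahead latent
amp_curr = np.abs(np.fft.fftn(z))
amp_tgt = power_law_target(amp_curr, alpha=1.0)
amp_rect = rectify_amplitude(amp_curr, amp_tgt, lam=0.5)
```

Setting `lam=0` leaves the spectrum untouched and `lam=1` replaces it with the target, so $\lambda$ acts as a dial between the model's own spectrum and the prior.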

Stage 4: Re-Noising & Inference

We recombine the rectified amplitude $\mathcal{A}_{rect}$ with the preserved phase $\mathcal{P}$ and apply a 3D inverse FFT to obtain a rectified clean anchor $\hat{z}_{t,0}$.

Finally, we integrate this corrected state back into the ODE solver by re-noising $\hat{z}_{t,0}$ to the current timestep $t$ using noise $\epsilon$:

$$\hat{z}_t = (1 - t) \cdot \hat{z}_{t,0} + t \cdot \epsilon$$

This provides a structurally sound anchor for the subsequent flow matching trajectory.
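As a quick consistency check (a sketch under an idealized assumption, not a claim about the trained model): suppose the velocity field were exact for the linear path, i.e. $v(\hat{z}_t, t) = \epsilon - \hat{z}_{t,0}$. Applying the Stage 1 lookahead to the re-noised state then returns the rectified anchor:

$$\hat{z}_t - t \cdot (\epsilon - \hat{z}_{t,0}) = (1 - t) \cdot \hat{z}_{t,0} + t \cdot \epsilon - t \cdot \epsilon + t \cdot \hat{z}_{t,0} = \hat{z}_{t,0}$$

The re-noising rule and the lookahead rule are therefore inverse operations on the same interpolation path. In practice $v_\theta$ is only an approximation, so the identity holds approximately.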

SpecLoR Inference Pipeline

The noisy latent is projected to a clean lookahead state, transformed via 3D FFT, rectified in the amplitude domain, and re-noised back to resume ODE integration.
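The pipeline can be sketched end to end as a toy Euler sampler. This is an illustrative sketch, not the released code, and it rests on several assumptions: rectification is applied at the first `num_rectified` steps, fresh Gaussian noise is drawn for re-noising, the velocity is re-evaluated at the re-noised state (which is how this sketch arrives at one extra function evaluation per rectified step), and `toy_velocity` and the identity `rectify` are placeholders for $v_\theta$ and the amplitude prior.

```python
import numpy as np

AXES = (-3, -2, -1)  # FFT over (T, H, W)

def speclor_step(z_t, t, velocity, rectify, rng):
    """Lookahead -> 3D FFT -> amplitude rectification -> inverse FFT -> re-noise."""
    z_look = z_t - t * velocity(z_t, t)             # Stage 1: z_{t,0}
    spectrum = np.fft.fftn(z_look, axes=AXES)       # Stage 2
    amp, phase = np.abs(spectrum), np.angle(spectrum)
    amp_rect = rectify(amp)                         # Stage 3 (phase untouched)
    z_hat_0 = np.fft.ifftn(amp_rect * np.exp(1j * phase), axes=AXES).real
    eps = rng.standard_normal(z_t.shape)            # assumption: fresh noise
    return (1.0 - t) * z_hat_0 + t * eps            # Stage 4: re-noise to time t

def sample(z_init, velocity, rectify, num_steps=40, num_rectified=4, seed=0):
    """Euler integration from t=1 to t=0 with early-step rectification."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    z, nfe = z_init, 0
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        if i < num_rectified:                       # early sampling stages only
            z = speclor_step(z, t, velocity, rectify, rng)
            nfe += 1                                # extra evaluation for lookahead
        z = z + (t_next - t) * velocity(z, t)       # regular Euler step
        nfe += 1
    return z, nfe

# Placeholders (assumptions): velocity of a path whose data point is the origin,
# and a rectifier that leaves the amplitude unchanged.
toy_velocity = lambda z, t: z / t
identity = lambda amp: amp

z_init = np.random.default_rng(1).standard_normal((4, 8, 8))
z_final, nfe = sample(z_init, toy_velocity, identity)  # nfe is 44 with these defaults
```

Under these assumptions the count of 44 evaluations matches the stated overhead of 4 additional NFEs in a 40-step schedule, though the exact schedule used in the paper may differ.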

Additional Results

Baseline (Wan2.2)
Ours (SpecLoR-Zero)
Ours (SpecLoR-Adapter)

BibTeX

@article{zhang2026speclor,
  author    = {Xu Zhang and Yu Lu and Ruijie Quan and Zhaozheng Chen and Bohan Wang and Yi Yang},
  title     = {SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation},
  journal   = {arXiv preprint arXiv:2606.11969},
  year      = {2026},
}