This week’s AI tip is about: Enhancing the performance and scalability of image and video generation tasks
Today we’ll look at a recent application of the transformer architecture. Diffusion Transformers (DiT) are a class of diffusion models that use a transformer backbone to improve the performance and scalability of image and video generation.
Diffusion Models Overview
Diffusion models are generative models that create data samples from random noise by simulating a Markov chain that transitions from a simple prior distribution to the data distribution. They operate using a two-stage process:
- Forward diffusion: The original image is gradually corrupted in small stochastic steps by adding noise drawn from a Gaussian distribution, until it is essentially indistinguishable from pure noise.
- Reverse diffusion: The model learns to reconstruct the original image, or generate a new one, from the noised representation by removing the noise in a step-by-step denoising process.
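The forward process can be sampled in closed form at any timestep. Here is a minimal NumPy sketch; the function name, the linear beta schedule, and the toy 8x8 "image" are illustrative choices, not part of any specific library:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating every step.

    x0    : clean data array
    t     : integer timestep (0-indexed)
    betas : per-step noise schedule, shape (T,)
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]      # cumulative signal-retention factor
    eps = rng.standard_normal(x0.shape)    # Gaussian noise
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

# Toy usage: a linear schedule over 1000 steps on an 8x8 "image".
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = rng.standard_normal((8, 8))
xt, eps = forward_diffuse(x0, 999, betas, rng)  # near pure noise at the last step
```

At small t the sample stays close to x0; at large t it approaches pure Gaussian noise, which is exactly the signal the denoising network is trained to remove.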
Diffusion Transformer Architecture
DiT replaces the commonly used U-Net backbone in diffusion models with a transformer that operates on latent patches.
The input spatial representation is converted into a sequence of tokens using a "patchify" layer that linearly embeds each patch. The number of tokens is determined by the patch size – smaller patches lead to more tokens and higher computational cost (Gflops).
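The patchify step itself is simple reshaping followed by a learned linear projection. The sketch below shows only the reshaping half in NumPy (the function name and the 4-channel 32x32 latent are illustrative assumptions); note how halving the patch size quadruples the token count:

```python
import numpy as np

def patchify(latent, patch_size):
    """Split a (C, H, W) latent into a sequence of flattened patches.

    Returns shape (num_tokens, C * patch_size**2); in a real DiT a
    learned linear embedding would then project each row to the model dim.
    """
    c, h, w = latent.shape
    p = patch_size
    assert h % p == 0 and w % p == 0
    # (C, H/p, p, W/p, p) -> (H/p, W/p, C, p, p) -> (tokens, features)
    patches = latent.reshape(c, h // p, p, w // p, p)
    patches = patches.transpose(1, 3, 0, 2, 4)
    return patches.reshape((h // p) * (w // p), c * p * p)

latent = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)
tokens = patchify(latent, 2)   # (32/2)*(32/2) = 256 tokens of 4*2*2 = 16 features
```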
The transformer blocks in DiT are similar to standard Vision Transformer (ViT) blocks, with a few modifications to inject conditional inputs like diffusion timesteps or class labels:
- Adaptive Layer Normalization (AdaLN) layers modulate activations prior to residual connections. This enables the model to adapt its behavior at different denoising steps.
- Cross-Attention layers let the image tokens attend to embeddings of the conditioning inputs (e.g. class labels, or text-encoder outputs in text-to-image systems), providing an alternative way to inject conditioning information.
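The AdaLN mechanism above can be sketched in a few lines of NumPy. This is a simplified illustration, not the DiT reference code: `adaln`, the weight names, and the toy shapes are all assumptions, and the zero-initialized head mimics the spirit of the paper's adaLN-Zero variant, where each block starts out close to the identity:

```python
import numpy as np

def adaln(x, cond, w, b):
    """Adaptive LayerNorm: normalize x, then scale and shift it with
    parameters regressed from the conditioning vector (timestep/class).

    x    : (tokens, dim) activations
    cond : (cond_dim,) conditioning embedding
    w, b : hypothetical linear head producing gamma and beta,
           shapes (cond_dim, 2*dim) and (2*dim,)
    """
    # Plain LayerNorm over the channel dimension, no learned affine terms.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + 1e-6)
    # Regress per-channel scale (gamma) and shift (beta) from the condition.
    gamma, beta = np.split(cond @ w + b, 2)
    return x_norm * (1.0 + gamma) + beta

x = np.random.default_rng(0).standard_normal((4, 8))  # 4 tokens, dim 8
cond = np.ones(3)                                     # toy conditioning vector
w = np.zeros((3, 16)); b = np.zeros(16)               # zero-init head
out = adaln(x, cond, w, b)                            # reduces to plain LayerNorm
```

Because gamma and beta depend on the conditioning, the same weights produce different modulation at different denoising timesteps.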
DiT models demonstrate impressive scalability, with higher Gflops consistently leading to lower Fréchet Inception Distance (FID) scores, which indicate better image quality. Scaling can be achieved by increasing the transformer depth/width or the number of input tokens.
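A quick back-of-the-envelope calculation shows why the patch size dominates compute. Self-attention scales roughly as O(tokens² · dim) (constant factors and the MLP omitted), so each halving of the patch size quadruples the tokens and multiplies the attention cost by about 16. The helper names below are illustrative:

```python
def num_tokens(latent_side, patch):
    """Tokens produced by patchifying a square latent."""
    return (latent_side // patch) ** 2

def attn_cost(tokens, dim):
    """Rough self-attention cost: O(tokens^2 * dim), constants dropped."""
    return tokens ** 2 * dim

# Token counts for a 32x32 latent at patch sizes 8, 4, 2.
t8, t4, t2 = (num_tokens(32, p) for p in (8, 4, 2))  # 16, 64, 256 tokens
ratio = attn_cost(t2, 64) // attn_cost(t8, 64)       # 16x per halving, 256x total
```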
Diffusion Transformers have shown state-of-the-art performance on various tasks:
- Image Generation: Large DiT models like DiT-XL/2 outperform prior diffusion models on class-conditional ImageNet benchmarks, and text-to-image models like Stable Diffusion 3 build on DiT for high-quality image synthesis.
- Text-to-Video Generation: DiT has been extended to video generation in models like OpenAI's Sora. Novel techniques like motion-free guidance are used to enhance the visual quality of generated videos.
- Creative Applications: The flexibility of the DiT architecture enables experimentation with different image resolutions, aspect ratios and durations, thereby opening up possibilities for creative use cases.
In summary, Diffusion Transformers represent an important advancement in generative modeling by bringing the power of transformers to diffusion models. Their scalability, architectural flexibility, and impressive results across various domains make them a promising direction for future research and applications.