Stable Diffusion 3: Cutting Edge AI Image Generation


Introduction

The integration of artificial intelligence with creativity has spurred significant advancements in image generation, with diffusion models playing a pivotal role in this arena. These models are adept at creating data from noise and have been instrumental in generating new data points that align with the distribution of training data, particularly when modeling high-dimensional perceptual data like images. Although diffusion models have been successful, research has focused on enhancing their efficiency and sampling times, since the choice of forward path is crucial to both sampling quality and speed.

One particular forward path, Rectified Flow, offers promising theoretical properties but has yet to gain widespread adoption. Despite demonstrated advantages in smaller experiments, its efficacy in larger-scale models remains uncertain. To address this gap, a novel re-weighting of noise scales in rectified flow models is introduced, aiming to improve their performance. Additionally, a new architecture is proposed for text-to-image synthesis, facilitating bidirectional information flow between image and text tokens within the network, thus enhancing scalability and performance.

Key contributions of this work, led by the Stability AI team (Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach), include a systematic study on diffusion model formulations, the development of improved noise samplers for rectified flow models, and the introduction of a scalable architecture for text-to-image synthesis. Results demonstrate predictable scaling trends and strong correlations between lower validation loss and improved text-to-image performance. Public availability of results, code, and model weights underscores the transparency and reproducibility of the study.

Simulation-Free Training of Flows

Generative models often rely on a mapping between data samples and noise distributions, described by an ordinary differential equation (ODE) that evolves over time. Previous methods attempted to directly solve these equations, but this approach proved computationally intensive. An alternative, more efficient method involves regressing a vector field that generates a probability path between the data and noise distributions. This method, known as Rectified Flow, offers computational advantages and improved performance.

Training rectified flow models for high-resolution image synthesis starts from a forward process whose marginals are consistent with both the data and noise distributions. By introducing conditional vector fields, it becomes possible to construct marginal vector fields that generate the desired probability paths. These vector fields are regressed with Flow Matching objectives, and Conditional Flow Matching provides a tractable alternative to the intractable marginal objective. Reparameterizing the loss further makes it possible to introduce time-dependent weighting, which facilitates optimization while preserving the same optimal solution.
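To make the objective concrete, here is a minimal PyTorch-style sketch of the conditional flow matching loss for a rectified flow, assuming a straight interpolation path z_t = (1 − t)·x0 + t·ε and a network that predicts the velocity ε − x0; function and argument names are illustrative, not the authors' implementation.

```python
import torch

def rectified_flow_loss(model, x0, t=None):
    """Conditional flow matching loss for a rectified flow (illustrative sketch).

    x0: batch of data samples, shape (B, C, H, W).
    t:  optional batch of timesteps in [0, 1]; sampled uniformly if None.
    """
    b = x0.shape[0]
    if t is None:
        t = torch.rand(b, device=x0.device)   # uniform timestep sampling
    eps = torch.randn_like(x0)                # noise sample
    t_ = t.view(b, 1, 1, 1)
    z_t = (1.0 - t_) * x0 + t_ * eps          # straight-line path between data and noise
    target = eps - x0                         # velocity of the conditional path
    pred = model(z_t, t)                      # network regresses the velocity field
    return torch.mean((pred - target) ** 2)
```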

Flow Trajectories

In the exploration of Flow Trajectories, different variants of generative models are considered. One key model class is Rectified Flows (RFs), which define the forward process as straight paths between the data distribution and a standard normal distribution. Essentially, RFs create a trajectory from the initial data point to a noise distribution over time. This allows for a smooth transition between the data and noise distributions, facilitating the generation of new data points.

Another approach is EDM (Karras et al., 2022), whose forward process adds noise to the data at scales given by the exponential of a normally distributed variable, so that the log signal-to-noise ratio of the training samples follows a normal distribution. Concentrating training on these noise levels allows EDM to model the relationship between the data and noise distributions effectively, leading to the generation of realistic data points.

Additionally, LDM-Linear is discussed: the variance-preserving diffusion schedule used by Latent Diffusion Models (LDM). It defines diffusion coefficients for discrete timesteps that determine how the signal decays and noise accumulates along the trajectory, allowing the evolution of the data distribution to be modeled accurately. This approach offers a flexible framework for generating data points that adhere to the desired distribution.

These models represent different strategies for modeling flow trajectories in generative modeling. Each approach has its own strengths and weaknesses, but collectively they contribute to the advancement of generative modeling techniques, enabling the creation of realistic and diverse datasets.
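As a compact reference (using standard notation, with ε ~ N(0, I); the noise schedules σ(t) and ᾱ_t are those of the respective papers), the three forward processes can be summarized as:

```latex
% Rectified Flow: straight path between data and noise
z_t = (1 - t)\,x_0 + t\,\varepsilon
% EDM: noise added at scales whose logarithm is normally distributed
z_t = x_0 + \sigma(t)\,\varepsilon, \qquad \log\sigma \sim \mathcal{N}(P_{\mathrm{mean}}, P_{\mathrm{std}}^2)
% LDM-Linear: variance-preserving schedule with discrete timesteps
z_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\varepsilon
```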

Text-to-Image Architecture

The text-to-image architecture described in this section leverages pretrained models to obtain suitable representations for both text and images. Following the setup of Latent Diffusion Models (LDM), the architecture operates in the latent space of a pretrained autoencoder for training text-to-image models. Text conditioning is encoded using frozen pretrained text models, similar to previous approaches in the field.

The architecture builds upon the DiT (Diffusion Transformer) architecture, which was originally designed for class-conditional image generation. In this multimodal diffusion backbone, embeddings of the timestep and text condition are used as inputs to the modulation mechanism. However, because pooled text representations are coarse-grained, additional information from the sequence representation of the text is required. To address this, a sequence consisting of embeddings of both text and image inputs is constructed. Positional encodings are added, and 2 × 2 patches of the latent pixel representation are flattened to create a patch encoding sequence, which is then concatenated with the text sequence. Modulated attention and MLPs are then applied to the concatenated sequence.

Given the conceptual differences between text and image embeddings, separate sets of weights are used for each modality. While two independent transformers are employed for each modality, the sequences of both modalities are combined for the attention operation. This allows both representations to operate in their respective spaces while considering information from the other modality. The size of the model is parameterized in terms of depth, with the hidden size and the number of attention heads determined accordingly for scaling experiments.
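The sketch below illustrates the core idea of such a block in PyTorch: separate projection weights per modality feeding a single attention operation over the concatenated sequence. It is a simplified illustration under stated assumptions; the timestep/text modulation, layer norms, and MLPs of the full architecture are omitted, and all names and dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionBlock(nn.Module):
    """Two token streams (text, image) with separate weights but a shared attention operation."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv_txt = nn.Linear(dim, 3 * dim)   # text-specific projections
        self.qkv_img = nn.Linear(dim, 3 * dim)   # image-specific projections
        self.proj_txt = nn.Linear(dim, dim)
        self.proj_img = nn.Linear(dim, dim)

    def _split_heads(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, txt, img):
        n_txt = txt.shape[1]
        # Project each modality with its own weights.
        q_t, k_t, v_t = self.qkv_txt(txt).chunk(3, dim=-1)
        q_i, k_i, v_i = self.qkv_img(img).chunk(3, dim=-1)
        # Concatenate along the sequence axis so both modalities attend to each other.
        q = self._split_heads(torch.cat([q_t, q_i], dim=1))
        k = self._split_heads(torch.cat([k_t, k_i], dim=1))
        v = self._split_heads(torch.cat([v_t, v_i], dim=1))
        out = F.scaled_dot_product_attention(q, k, v)
        b, _, n, _ = out.shape
        out = out.transpose(1, 2).reshape(b, n, -1)
        # Route tokens back to modality-specific output projections (residuals kept simple).
        txt = txt + self.proj_txt(out[:, :n_txt])
        img = img + self.proj_img(out[:, n_txt:])
        return txt, img
```

In the full model, a stack of such blocks processes the text sequence and the flattened 2 × 2 latent patches jointly, as described above.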

Experiments

5.1. Improving Rectified Flows

The objective of this experiment is to determine the most efficient approach for simulation-free training of normalizing flows, as described in Equation 1. To facilitate fair comparisons across different approaches, several factors are controlled, including the optimization algorithm, model architecture, dataset, and samplers. Since the losses of different approaches are not directly comparable and may not reflect the quality of output samples, evaluation metrics that allow for a meaningful comparison are needed.

Models are trained on two datasets: ImageNet and CC12M. Evaluation is conducted on both the training and Exponential Moving Average (EMA) weights of the models during training. Various metrics are used for evaluation, including validation losses, CLIP scores, and Frechet Inception Distance (FID). CLIP scores are calculated based on the Contrastive Language-Image Pretraining (CLIP) model, while FID is calculated on CLIP features. These metrics are assessed under different sampler settings, including different guidance scales and sampling steps.

The evaluation is performed on the COCO-2014 validation split. Detailed information regarding training and sampling hyperparameters can be found in Appendix B.3 of the study.

5.1.1 Results

The researchers train 61 different formulations on the two datasets, comprising the following variants:

  • ϵ- and v-prediction loss with linear (eps/linear, v/linear) and cosine (eps/cos, v/cos) schedules.
  • RF loss with πmode(t; s) (rf/mode(s)) with 7 values of s chosen uniformly between -1 and 1.75, plus s = 1.0, and s = 0, which corresponds to uniform timestep sampling (rf/mode).
  • RF loss with πln(t; m, s) (rf/lognorm(m, s)) with 30 values for (m, s) in a grid with m uniform between -1 and 1, and s uniform between 0.2 and 2.2.
  • RF loss with πCosMap(t) (rf/cosmap).
  • EDM (edm(Pm, Ps)) with 15 values for Pm chosen uniformly between -1.2 and 1.2 and Ps uniformly between 0.6 and 1.8.
  • EDM with schedules matching the log-SNR weighting of rf (edm/rf) and v/cos (edm/cos).

Each run selects the step with minimal validation loss when evaluated with EMA weights, collecting CLIP scores and FID with 6 sampler settings with and without EMA weights. Variants are ranked using a non-dominated sorting algorithm based on CLIP and FID scores, averaged over 24 different control settings.

The results show that rf/lognorm(0.00, 1.00) consistently achieves a good rank, outperforming formulations with uniform timestep sampling. Some variants perform well in specific settings but worse in others, indicating a trade-off between performance metrics and sampling steps. Rectified flow formulations generally perform well across metrics and datasets, with rf/lognorm(0.00, 1.00) achieving good performance across various settings.
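For illustration, here is a minimal sketch of the logit-normal timestep sampler behind rf/lognorm(m, s), under the assumption that it amounts to drawing u ~ N(m, s²) and mapping it to (0, 1) with the logistic function; names are illustrative.

```python
import torch

def sample_lognorm_timesteps(batch_size: int, m: float = 0.0, s: float = 1.0) -> torch.Tensor:
    """Logit-normal timestep sampling, concentrating training on intermediate noise levels.

    m shifts the distribution toward data (m < 0) or noise (m > 0); s controls its width.
    rf/lognorm(0.00, 1.00) corresponds to m = 0.0, s = 1.0.
    """
    u = torch.randn(batch_size) * s + m   # u ~ N(m, s^2)
    return torch.sigmoid(u)               # map to (0, 1) via the logistic function
```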

For more detailed information, refer to Appendix B.3 of the paper.

5.2. Improving Modality Specific Representations

After identifying a formulation in the previous section that enables rectified flow models to surpass established diffusion formulations like LDM-Linear and EDM, their focus shifts to applying this formulation to high-resolution text-to-image synthesis.

The success of their algorithm relies not only on the training formulation but also on the parameterization through a neural network and the quality of the image and text representations. In the upcoming sections, the researchers detail the enhancements made to all these components before proceeding to scale their final method in Section 5.3.

Figure 3. Rectified flows exhibit superior sample efficiency, particularly when sampling fewer steps. Only rf/lognorm(0.00, 1.00) remains competitive with eps/linear when sampling 25 or more steps.

Table 3. Reconstruction performance for different numbers of latent channels d; the downsampling factor is f = 8 for all models.

5.2.1. Improved Autoencoders

Latent diffusion models achieve high efficiency by operating in the latent space of a pretrained autoencoder (Rombach et al., 2022), which maps an input RGB image X ∈ R^(H×W×3) into a lower-dimensional latent x = E(X) ∈ R^(h×w×d). The reconstruction quality of this autoencoder provides an upper bound on the achievable image quality after latent diffusion training.

Similar to Dai et al. (2023), increasing the number of latent channels d significantly boosts reconstruction performance, as shown in Table 3. Intuitively, predicting latents with higher d is a more difficult task, and thus models with increased capacity should perform better for larger d, ultimately achieving higher image quality. This hypothesis is confirmed in Figure 10, where it can be observed that the d = 16 autoencoder exhibits better scaling performance in terms of sample FID. Therefore, for the remainder of this paper, the researchers choose d = 16.
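For intuition, a tiny sketch of the latent shapes implied by a downsampling factor of f = 8 and d = 16 channels (an illustrative helper, not taken from the paper's code):

```python
def latent_shape(height: int, width: int, f: int = 8, d: int = 16) -> tuple[int, int, int]:
    """Latent dimensions (h, w, d) the autoencoder produces for an RGB image of size height x width."""
    return (height // f, width // f, d)

# A 1024x1024 RGB input maps to a 128x128x16 latent, which the diffusion model operates on.
assert latent_shape(1024, 1024) == (128, 128, 16)
```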

5.2.2 Improved Captions

The utilization of synthetically generated captions alongside original ones significantly enhances the performance of text-to-image models, particularly at scale. Traditional human-generated captions from large-scale image datasets tend to overly focus on image subjects, often omitting essential details about the scene or displayed text. Leveraging an off-the-shelf vision-language model like CogVLM, synthetic annotations are generated to complement the original captions. To evaluate the impact of training on this mixed caption dataset, two d = 15 MM-DiT models are trained for 250k steps: one solely on original captions and the other on a 50/50 mix of original and synthetic captions. Evaluation using the GenEval benchmark reveals that models trained with the mixed caption dataset consistently outperform those trained solely on original captions, leading to the adoption of the 50/50 synthetic/original caption mix for subsequent work.
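A minimal sketch of the 50/50 caption mix, assuming each training example stores both its original caption and a CogVLM-generated one; only the mixing ratio is taken from the description above.

```python
import random

def pick_caption(original: str, synthetic: str, p_synthetic: float = 0.5) -> str:
    """Choose the human-written or the synthetic caption for one training example."""
    return synthetic if random.random() < p_synthetic else original
```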

5.2.3 Improved Text-to-Image Backbones

In this section, the researchers assess the performance of existing transformer-based diffusion backbones in comparison to their novel multimodal transformer-based diffusion backbone, MM-DiT, introduced in Section 4. MM-DiT is tailored to handle diverse domains, such as text and image tokens, employing separate sets of trainable model weights. Following the experimental setup from Section 5.1, the researchers evaluate text-to-image performance on CC12M across DiT, CrossDiT (a variant of DiT with cross-attending to text tokens instead of sequence-wise concatenation), and MM-DiT. For MM-DiT, models with two sets of weights and three sets of weights are compared, with the latter handling CLIP and T5 tokens separately (as discussed in Section 4). DiT, with concatenation of text and image tokens, represents a special case of MM-DiT with shared weights for all modalities.

The UViT architecture, a hybrid of UNets and transformer variants, is also considered for comparison. Analyzing the convergence behavior of these architectures in Figure 4, the researchers observe that vanilla DiT performs below UViT, while the CrossDiT variant achieves superior performance over UViT, albeit with UViT demonstrating faster initial learning. However, their MM-DiT variant significantly outperforms both the CrossDiT and vanilla variants. Although using three parameter sets yields only marginal gains at the expense of increased parameter count and VRAM usage, the researchers opt for the two-parameter set option for the remainder of this study.

5.3. Training at Scale

5.3.1. Data Preprocessing

Pre-Training Mitigations

Data filtering significantly influences the capabilities of generative models. Hence, before scaling up, the researchers filter their data for explicit content using NSFW-detection models, remove low-rated images based on aesthetics, and deduplicate perceptual and semantic duplicates from the training data.

Precomputing Image and Text Embeddings

To ensure efficiency during training, the researchers precompute the output of pretrained, frozen networks (autoencoder latents and text encoder representations) for the entire dataset.

5.3.2. Finetuning on High Resolutions

QK-Normalization

Initially, their models are pretrained on low-resolution images (256x256 pixels) and then finetuned on higher resolutions with mixed aspect ratios. The researchers observe instability during mixed precision training at high resolutions, which can be mitigated by normalizing Q and K before the attention operation, as proposed by Dehghani et al. (2023). This approach prevents attention entropy from growing uncontrollably, enabling stable training even with bf16-mixed precision.
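Below is a minimal sketch of QK-normalization in a self-attention layer, assuming a LayerNorm over each attention head's query and key features (other normalizations such as RMSNorm are possible choices); module and dimension names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with the query and key tensors normalized before the dot product."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.LayerNorm(self.head_dim)   # keeps attention logits bounded
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, _ = x.shape
        q, k, v = self.qkv(x).view(b, n, 3, self.n_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k = self.q_norm(q), self.k_norm(k)       # QK-normalization stabilizes mixed-precision training
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))
```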

Positional Encodings for Varying Aspect Ratios

After training on a fixed 256x256 resolution, the goal is to increase the resolution and enable inference with flexible aspect ratios. Due to the use of 2D positional frequency embeddings, adjustments are necessary based on the resolution. In a multi-aspect ratio setting, direct interpolation of embeddings as in (Dosovitskiy et al., 2020) would not reflect side lengths accurately. Instead, a combination of extended and interpolated position grids is utilized, followed by frequency embedding.

For a target resolution of S^2 pixels, bucketed sampling is employed to ensure each batch consists of images with a homogeneous size H x W, where H * W ≈ S^2. This leads to maximum values for width (W_max) and height (H_max), from which corresponding sizes in latent space (h_max = H_max/16, w_max = W_max/16, s = S/16) are derived. Based on these values, a vertical position grid is constructed, and a similar process is followed for horizontal positions. The resulting positional 2D grid is then center-cropped before embedding.
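A rough sketch of the cropping step under simplifying assumptions: a single centered integer position grid is sized for the largest bucket, and the grid for each bucket's latent size is center-cropped from it before frequency embedding. The paper additionally combines extended and interpolated positions so that the central region matches the base-resolution grid; that interpolation is omitted here, and all names are placeholders.

```python
import torch

def centered_grid(h_max: int, w_max: int) -> torch.Tensor:
    """(h_max, w_max, 2) grid of vertical/horizontal positions centered on the largest bucket."""
    ys = torch.arange(h_max, dtype=torch.float32) - (h_max - 1) / 2.0
    xs = torch.arange(w_max, dtype=torch.float32) - (w_max - 1) / 2.0
    return torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)

def positions_for_bucket(h: int, w: int, h_max: int, w_max: int) -> torch.Tensor:
    """Center-crop the full position grid to the latent size (h, w) of the current bucket."""
    grid = centered_grid(h_max, w_max)
    top, left = (h_max - h) // 2, (w_max - w) // 2
    return grid[top:top + h, left:left + w]   # these coordinates are then frequency-embedded
```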

Resolution-dependent Shifting of Timestep Schedules

Higher resolutions require more noise to disrupt their signal, impacting the timestep schedules. By considering a "constant" image where every pixel has the same value, the forward process produces observations of a random variable Y = (1 - t)c + tη. The standard error of the mean for Y decreases with increasing resolution. To map a timestep tn at resolution n to a timestep tm at resolution m resulting in the same uncertainty degree, a shifting function is derived. A shift value of α = 3.0 is used during both training and sampling at resolution 1024x1024.
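A minimal sketch of this resolution-dependent shift, assuming the logistic mapping t_m = α·t_n / (1 + (α − 1)·t_n) with α = sqrt(m/n), where n and m are the pixel counts of the base and target resolutions; names are illustrative.

```python
import math

def shift_timestep(t: float, alpha: float = 3.0) -> float:
    """Map a timestep t from the base resolution to a higher resolution.

    alpha > 1 pushes sampling toward higher noise levels; a value of 3.0 is used
    when training and sampling at 1024x1024.
    """
    return alpha * t / (1.0 + (alpha - 1.0) * t)

def alpha_for_resolutions(n_pixels_base: int, n_pixels_target: int) -> float:
    """Shift value derived from the ratio of pixel counts (alpha = sqrt(m / n))."""
    return math.sqrt(n_pixels_target / n_pixels_base)
```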

5.3.3. Results

In this section, the researchers analyze the performance of their MM-DiT model at scale for both images and videos.

Image Scaling Study

The researchers conduct a large scaling study for image training, varying the number of parameters and training steps. Models are trained for 500k steps at 256x256 resolution using pre-encoded data, with a batch size of 4096. Validation losses on the COCO dataset are reported every 50k steps. The researchers observe a smooth decrease in validation loss with increasing model size and training steps, indicating improved performance.

Video Scaling Study

A preliminary scaling study of MM-DiT on videos is conducted, starting from pretrained image weights and employing 2x temporal patching. Models are trained for 140k steps with a batch size of 512 on videos comprising 16 frames with 256x256 pixels. Validation losses on the Kinetics dataset are reported every 5k steps. Similar to image training, the researchers observe a smooth decrease in validation loss with increasing model size and training steps.

Model Performance

Validation loss correlates highly with comprehensive evaluation metrics (CompBench, GenEval) and human preference, serving as a simple measure of model performance. Larger models trained for longer periods exhibit improved sample quality. Their largest model outperforms current state-of-the-art models in prompt comprehension and human preference evaluation on various benchmarks.

Flexible Text Encoders

The main motivation for using multiple text encoders is to boost overall model performance. However, the researchers demonstrate that this choice also increases the flexibility of their MM-DiT model. By using an arbitrary subset of all three text encoders, the researchers can trade off model performance for improved memory efficiency, which is particularly relevant for models with large numbers of parameters.

The researchers find that using only the two CLIP-based text encoders for text prompts and replacing the T5 embeddings with zeros results in limited performance drops. This allows for significant memory savings, since the 4.7B-parameter T5-XXL encoder alone requires a substantial amount of VRAM. The researchers observe that removing T5 has minimal effect on aesthetic quality ratings and only a small impact on prompt adherence, while its contribution to generating written text is more significant.

These observations are supported by human preference evaluation results, where models without T5 achieve comparable performance in terms of aesthetic quality and prompt adherence but show a slight decrease in text generation capabilities.
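A minimal sketch of how dropping T5 at inference could look, assuming the model consumes a concatenation of CLIP and T5 token embeddings that have already been projected to a common width; names and shapes are placeholders, not the released implementation.

```python
import torch

def build_text_context(clip_tokens: torch.Tensor,
                       t5_tokens: torch.Tensor | None,
                       t5_len: int, dim: int) -> torch.Tensor:
    """Concatenate CLIP and T5 token embeddings; substitute zeros when T5 is not loaded."""
    if t5_tokens is None:
        # Zeros keep the sequence layout the model was trained with while avoiding
        # the memory cost of running the 4.7B-parameter T5-XXL encoder.
        t5_tokens = torch.zeros(clip_tokens.shape[0], t5_len, dim,
                                device=clip_tokens.device, dtype=clip_tokens.dtype)
    return torch.cat([clip_tokens, t5_tokens], dim=1)
```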

Conclusion

In this study, the researchers conducted a comprehensive analysis of scaling rectified flow models for text-to-image synthesis. The researchers introduced a novel timestep sampling method for rectified flow training, which enhances previous diffusion training formulations for latent diffusion models while preserving the favorable characteristics of rectified flows in short-step sampling scenarios. Additionally, the researchers presented the advantages of their transformer-based MM-DiT architecture, designed to address the multi-modal nature of the text-to-image task.

Through a scaling study of their proposed approach up to a model size of 8B parameters and 5 × 10^22 training FLOPs, the researchers demonstrated that improvements in validation loss correlate with existing text-to-image benchmarks and human preference evaluations. These enhancements in generative modeling and scalable multimodal architectures enabled performance competitive with state-of-the-art proprietary models. Importantly, the researchers observed no signs of saturation in the scaling trend, suggesting the potential for further improvements in model performance in the future.

You can find the full research paper, with more detail, here.