Introducing Stable Diffusion XL (SDXL): the future of AI-driven art
The rise of artificial intelligence in creative arenas has led to transformative changes, with Stable Diffusion emerging as a key example. This text-to-image deep learning model, introduced in 2022, has significantly reshaped the AI-generated art scene.
Developed collaboratively by the CompVis Group at Ludwig Maximilian University of Munich and Runway, and supported by a compute donation from Stability AI, Stable Diffusion and Stable Diffusion XL (SDXL) are remarkable for their capability to produce detailed images based on textual prompts, while also being openly accessible.
Transforming text into art: the capabilities of Stable Diffusion XL (SDXL)
Stable Diffusion is a highly advanced text-to-image model. When provided with a text prompt, it employs deep learning techniques to generate a corresponding AI-generated image.
This model adeptly interprets and visualizes the given text description, producing images that closely align with the prompt's content and intent.
What are diffusion models?
Diffusion models are a type of generative model used in machine learning. They work by gradually introducing random noise into an image or dataset and then learning to reverse this process. This methodology allows the model to generate new data that mimics the characteristics of the training data.
Training diffusion models
The training of a diffusion model can be divided into two parts:
Forward Process: The model starts with an image and incrementally adds Gaussian noise across several steps, until the image is completely transformed into random noise.
Reverse Process: In the reverse diffusion process, the model iteratively removes noise to reconstruct the image or data. It uses a learned understanding of how noise was added to reverse the process accurately.
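The forward process described above can be sketched numerically. Below is a minimal NumPy illustration using a toy 1-D "image" and a hypothetical DDPM-style linear noise schedule (the schedule values are illustrative assumptions, not the ones used by Stable Diffusion):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 1-D signal standing in for pixel values.
x0 = np.linspace(-1.0, 1.0, 8)

# Hypothetical linear noise schedule over T steps (DDPM-style).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def forward_diffuse(x0, t):
    """Jump directly to step t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# Early step: the image still dominates; late step: almost pure noise.
x_early = forward_diffuse(x0, 10)
x_late = forward_diffuse(x0, T - 1)
print(np.sqrt(alphas_bar[10]), np.sqrt(alphas_bar[T - 1]))
```

The reverse process trains a network to predict `eps` from `x_t` and `t`, so that the noising above can be undone step by step.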
Addressing the speed challenge in diffusion models
The diffusion process, when operating in image space, faces inherent challenges due to its slow and computationally intensive nature. For instance, a 1024x1024 image with three color channels (red, green, and blue) involves a staggering 3,145,728 dimensions!
Processing such a high volume of data requires considerable computational resources, surpassing the capabilities of standard GPUs.
This computational demand results in very slow processing, making the model impractical on average laptops. The slowdown is compounded with large image sizes and many diffusion steps, because the model must repeatedly feed full-sized images through the U-Net to produce the final output.
Stable Diffusion XL (SDXL) has been developed specifically to address these challenges.
How does Stable Diffusion XL (SDXL) work?
Stable Diffusion XL (SDXL), like the original Stable Diffusion, is built on the "Latent Diffusion Model" (LDM) architecture, which revolutionizes the image generation process by operating in the latent space.
The latent space
This method significantly reduces the computational complexity by compressing images into a latent space that is 48 times smaller than the high-dimensional image space. This efficiency leads to faster processing speeds, setting it apart from traditional diffusion models.
This efficiency comes from the fact that the latent representation captures the essential features of the data while filtering out unnecessary details.
Now, combining the diffusion models and the latent space:
Latent Diffusion Models work by applying the diffusion process not directly to the full-resolution data (like a high-resolution image) but rather to a latent representation of that data. This approach offers several advantages:
Efficiency: Since the latent space representation is typically more compact than the full-resolution data, the diffusion process in LDMs can be more computationally efficient.
Quality: Despite the reduction in direct data dimensionality, LDMs can still generate high-quality results. This is because the latent space captures the core features and structures necessary for realistic generation.
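The "48 times smaller" figure quoted above follows directly from the tensor shapes mentioned in this article (a 1024x1024x3 image versus a 128x128x4 latent):

```python
# Pixel space: a 1024x1024 RGB image.
pixel_dims = 1024 * 1024 * 3      # 3,145,728 values per image

# Latent space: the VAE compresses this to a 128x128 grid with 4 channels.
latent_dims = 128 * 128 * 4       # 65,536 values

print(pixel_dims // latent_dims)  # 48
```

Every diffusion step therefore touches 48 times fewer values than it would in pixel space.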
Architecture and mechanism
At its core, Stable Diffusion XL (SDXL) is a latent diffusion model (LDM). It uses a diffusion model (DM) technique that involves training with the objective of removing Gaussian noise from training images. This process can be thought of as a sequence of denoising autoencoders. The model consists of three main parts:
Text Encoder (CLIP): CLIP, a transformers-based model, serves as the text encoder. It takes the input prompt text and converts it into token embeddings. Each word in the text is represented by these embeddings. CLIP is unique because it's trained on a dataset of images and their captions, combining an image encoder with a text encoder. This allows the model to understand and encode the semantics of the text in a way that's conducive to image generation.
U-Net: Following the text encoding, a U-Net model takes over. This model is crucial for the diffusion process. It receives the token embeddings from CLIP, along with an array of noisy inputs. Through a series of iterative steps, the U-Net processes the input latent tensor and produces a new latent space tensor. This new tensor better represents the input text and is less noisy. The U-Net's iterative processing effectively 'cleans up' the image, step by step, bringing it closer to what the text description depicts.
Auto Encoder-Decoder: This final stage, involving a Variational Autoencoder (VAE), transforms the denoised latent output back into detailed images. The VAE comprises an encoder that compresses a 1024x1024x3 image into a smaller latent representation (e.g., 128x128x4) for training. The decoder then reconstructs actual images from these refined latents during inference, ensuring the final visuals closely match the text prompt.
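Putting the three parts together, the generation pipeline can be sketched with dummy tensors. The tensor shapes follow the text above; the three functions are placeholders standing in for the real networks, not actual implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt: str) -> np.ndarray:
    """Stand-in for CLIP: one embedding vector per token."""
    tokens = prompt.split()
    return rng.standard_normal((len(tokens), 768))

def unet_denoise(latent: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the U-Net: predicts and subtracts a bit of noise."""
    predicted_noise = 0.1 * latent  # placeholder prediction
    return latent - predicted_noise

def vae_decode(latent: np.ndarray) -> np.ndarray:
    """Stand-in for the VAE decoder: latent -> full-resolution image."""
    return np.zeros((1024, 1024, 3))

# 1) Encode the prompt into token embeddings.
emb = text_encoder("an astronaut riding a horse")

# 2) Start from pure noise in latent space and denoise iteratively.
latent = rng.standard_normal((128, 128, 4))
for _ in range(50):
    latent = unet_denoise(latent, emb)

# 3) Decode the final latent into the output image.
image = vae_decode(latent)
print(image.shape)  # (1024, 1024, 3)
```

The key point is that the expensive iterative loop in step 2 runs entirely on the small 128x128x4 latent; the full 1024x1024x3 image only appears once, at decode time.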
Differences between stable diffusion models
SD 1.4 & 1.5
Text encoder: OpenAI's CLIP ViT-L/14
Training data: LAION-5B dataset
Strengths: beginner friendly; 1.4 is more artistic, 1.5 is stronger on portraits
Limitations: needs long prompts, lower resolution

SD 2.0 & 2.1
Text encoder: LAION's OpenCLIP-ViT/H, which interprets prompts differently and requires more effort on the negative prompt
Training data: LAION-5B dataset filtered with the LAION-NSFW classifier
Strengths: shorter prompts, richer colors
Limitations: aggressive censoring, medium resolution

SD XL 1.0
Text encoders: OpenCLIP-ViT/G and CLIP-ViT/L for better inference on prompts
Strengths: shorter prompts, high resolution
Limitations: resource intensive, GPU required
The Stable Diffusion v2 models struggle with style control and generating celebrity likenesses, most likely due to differences in training data. Stability AI hasn't explicitly excluded such content, but its effect appears more limited in v2. A likely explanation lies in the text encoder: v1's CLIP was trained on OpenAI's proprietary dataset, which may contain a more extensive collection of art and celebrity images than the openly available data behind OpenCLIP. As a result, users have long preferred the fine-tuned v1 models over v2.
Yet, with the launch of Stable Diffusion XL (SDXL), offering enhanced features and higher resolution, there's a noticeable shift in user preference towards this advanced model.
Example generated with SDXL:
Prompt: 'Model in trendy streetwear, City street with neon signs and pedestrians, Cinematic, Close up shot, Mirrorless, 35mm lens, f/1.8 aperture, ISO 400, slight color grading'
Negative prompt: 'low resolution, ugly, deformed'
Inference steps: 50
Stable Diffusion XL (SDXL) model: improving image quality with the refiner
The Stable Diffusion XL (SDXL) model effectively comprises two distinct models working in tandem:
1. Initially, the base model is deployed to establish the overall composition of the image.
2. Following this, an optional refiner model can be applied, which is responsible for adding more intricate details to the image.
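The hand-off between base and refiner is commonly expressed as a fraction of the denoising schedule: the base model handles the first portion of the steps, and the refiner finishes the rest. A small sketch of how 50 steps might be split (the 80/20 split here is a hypothetical choice for illustration, not a fixed rule):

```python
def split_steps(total_steps: int, handoff: float):
    """Split the denoising schedule at the given fraction: the base model
    handles steps [0, cut), the refiner finishes steps [cut, total)."""
    cut = int(total_steps * handoff)
    base_steps = list(range(cut))
    refiner_steps = list(range(cut, total_steps))
    return base_steps, refiner_steps

base, refiner = split_steps(50, handoff=0.8)
print(len(base), len(refiner))  # 40 10
```

Because the refiner specializes in the final, low-noise steps, it only needs to see the latent after the base model has already established the overall composition.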
SDXL Turbo: A Real-Time Text-to-Image Generation Model
Following the success of the SDXL model, SDXL Turbo emerges as the next evolution in text-to-image generation. Released in late November 2023, it builds on the foundations set by its predecessor, SDXL, introducing substantial improvements in generation speed.
Key Features and Technology of SDXL Turbo
Adversarial Diffusion Distillation (ADD): At the core of SDXL Turbo is the novel ADD technology, which allows for high-quality image generation in real-time. This method combines aspects of adversarial training and score distillation, enabling the model to synthesize image outputs in a single step while maintaining high sampling fidelity.
Efficiency and Speed: SDXL Turbo dramatically reduces the steps required for image synthesis. Traditional multi-step models require 50 to 100 steps, but SDXL Turbo can generate images in just 1-4 steps. This efficiency makes it significantly faster than previous models, capable of generating a 512x512 image in about 207ms on an A100 GPU.
Real-Time Applications: The model's real-time generation capability makes it suitable for dynamic environments like video games, virtual reality, and instant content creation for social media or marketing.
Trade-offs and Limitations
Lower Resolution: A notable drawback is its lower resolution output compared to the original SDXL model. SDXL Turbo is currently limited to producing images at a resolution of 512×512 pixels.
Text Rendering Challenges: The model faces difficulties in rendering clear, legible text, falling short of the performance level of SDXL and other similar models.
Facial Rendering: There is an ongoing challenge in accurately generating faces and people.
Lack of Photorealism: SDXL Turbo generally does not achieve a completely photorealistic rendering.
Despite these limitations, SDXL Turbo is incredibly promising, particularly in terms of its performance.
Get started with Ikomia API
Using the Ikomia API, you can effortlessly create images with Stable Diffusion in just a few lines of code.
To get started, you need to install the API in a virtual environment.
pip install ikomia
Run Stable Diffusion XL (SDXL) with a few lines of code
You can also directly load the notebook we have prepared.
Note: This workflow uses 13GB GPU on Google Colab (T4).
from ikomia.dataprocess.workflow import Workflow
from ikomia.utils import ik
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
stable_diff = wf.add_task(ik.infer_hf_stable_diffusion(
    prompt='Super Mario style jumping, vibrant, cute, cartoony, fantasy, playful, reminiscent of Super Mario series',
    negative_prompt='low resolution, ugly, deformed',
))

# Run your workflow
wf.run()

# Display the image
display(stable_diff.get_output(0).get_image())
The algorithm accepts the following parameters:

- model_name (str) - default 'stabilityai/stable-diffusion-2-base': Name of the Stable Diffusion model. Other models available:
  - stabilityai/sdxl-turbo: requires Torch >= 1.13; by default it will not work on Python < 3.10
- prompt (str): Input prompt to guide the image generation.
- negative_prompt (str, optional): The prompt not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- num_inference_steps (int) - default '50': Number of denoising steps (minimum: 1; maximum: 500).
- width (int) - default '512': Output width. If not divisible by 8, it will be automatically modified to a multiple of 8.
- height (int) - default '512': Output height. If not divisible by 8, it will be automatically modified to a multiple of 8.
- seed (int) - default '-1': Seed value. '-1' generates a random number between 0 and 191965535.
- use_refiner (bool) - default 'False': Further process the output of the base model (xl-base-1.0 only) with a refinement model specialized for the final denoising steps.
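The width/height behavior described above (rounding to a multiple of 8) can be mimicked with a small helper. This is a sketch of the idea; the API's exact rounding rule may differ:

```python
def snap_to_multiple_of_8(value: int) -> int:
    """Round a dimension down to the nearest multiple of 8 (minimum 8)."""
    return max(8, (value // 8) * 8)

print(snap_to_multiple_of_8(512))  # 512 (already divisible by 8)
print(snap_to_multiple_of_8(515))  # 512
```

Dimensions divisible by 8 are required because the VAE downsamples the image by a factor of 8 when mapping it into latent space.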
Create your workflow using Stable Diffusion inpainting
In this article, we've explored image creation with Stable Diffusion.
Beyond just generating images, Stable Diffusion models also excel in inpainting, which allows for specific areas within an image to be altered based on text prompts.
The Ikomia API enhances this process by integrating diverse algorithms from various frameworks. For example, you can segment a portion of an image using the Segment Anything Model and then seamlessly replace it using Stable Diffusion's inpainting, all guided by your text input.
A key advantage of the Ikomia API is its ability to connect algorithms from different sources (YOLO, Hugging Face, OpenMMLab, …), while eliminating the need for complex dependency installations.
For a detailed guide on using the API, refer to the Ikomia documentation. Additionally, the Ikomia HUB offers a selection of cutting-edge algorithms, and Ikomia STUDIO provides a user-friendly interface, maintaining all the functionalities of the API.