Mastering Art with Stable Diffusion XL (SDXL): A Complete Guide

Allan Kouidri

Introducing Stable Diffusion XL (SDXL): the future of AI-driven art

The rise of artificial intelligence in creative arenas has led to transformative changes, with Stable Diffusion emerging as a key example. This text-to-image deep learning model, introduced in 2022, has significantly reshaped the AI-generated art scene.

Developed collaboratively by the CompVis Group at Ludwig Maximilian University of Munich and Runway, and supported by a compute donation from Stability AI, Stable Diffusion and Stable Diffusion XL (SDXL) are remarkable for their capability to produce detailed images based on textual prompts, while also being openly accessible.

Transforming text into art: the capabilities of Stable Diffusion XL (SDXL)

Stable Diffusion is a highly advanced text-to-image model. When provided with a text prompt, it employs deep learning techniques to generate a corresponding AI-generated image.

This model adeptly interprets and visualizes the given text description, producing images that closely align with the prompt's content and intent.

What are diffusion models?

Diffusion models are a type of generative model used in machine learning. They work by gradually introducing random noise into an image or dataset and then learning to reverse this process. This methodology allows the model to generate new data that mimics the characteristics of the training data.

Training diffusion models

The training of a diffusion model can be divided into two parts:

  • Forward Process: The model starts with an image and incrementally adds Gaussian noise across several steps, until the image is completely transformed into random noise.
  • Reverse Process: In the reverse diffusion process, the model iteratively removes noise to reconstruct the image or data. It uses a learned understanding of how noise was added to reverse the process accurately.
Overview of the Diffusion model [1]
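The forward process above can be sketched in a few lines of NumPy. The closed-form jump to step t (using the cumulative product of 1 - beta) is standard for diffusion models; the noise schedule values here are purely illustrative:

```python
import numpy as np

def forward_diffusion(x0, t, betas):
    """Sample x_t directly from x_0: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.prod(alphas[: t + 1])       # cumulative fraction of signal retained
    eps = np.random.randn(*x0.shape)           # fresh Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

betas = np.linspace(1e-4, 0.02, 1000)          # illustrative linear noise schedule
x0 = np.random.rand(8, 8)                      # toy "image"
x_early = forward_diffusion(x0, 10, betas)     # still mostly signal
x_late = forward_diffusion(x0, 999, betas)     # essentially pure Gaussian noise
```

At the last step the retained signal fraction is near zero, which is exactly the "completely transformed into random noise" state the reverse process learns to undo.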

Addressing the speed challenge in diffusion models

The diffusion process, when operating in image space, faces inherent challenges due to its slow and computationally intensive nature. For instance, a 1024x1024 image with three color channels (red, green, and blue) involves a staggering 3,145,728 dimensions! 

Processing such a high volume of data requires considerable computational resources, surpassing the capabilities of standard GPUs. 

This computational demand results in a very slow processing speed, rendering the model impractical for average laptops. The diffusion model, particularly when dealing with large image sizes and numerous diffusing steps, is significantly slowed down as it repeatedly feeds full-sized images into the U-Net to achieve the final output. 

Stable Diffusion XL (SDXL) has been developed specifically to address these challenges.

How does Stable Diffusion XL (SDXL) work?

Stable Diffusion XL (SDXL) is built on the "Latent Diffusion Model" (LDM) architecture, which revolutionizes the image generation process by operating in the latent space rather than directly in pixel space.

The latent space

This method significantly reduces the computational complexity by compressing images into a latent space that is 48 times smaller than the high-dimensional image space. This efficiency leads to faster processing speeds, setting it apart from traditional diffusion models.

This efficiency comes from the fact that the latent representation captures the essential features of the data while filtering out unnecessary details.
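The "48 times smaller" figure follows directly from the tensor shapes: a 1024x1024 RGB image versus a 128x128x4 latent (the shapes used by the SDXL VAE described below):

```python
image_dims = 1024 * 1024 * 3     # pixel space: height x width x RGB channels
latent_dims = 128 * 128 * 4      # latent space: 8x downsampled, 4 channels
print(image_dims / latent_dims)  # → 48.0, i.e. a 48x smaller representation
```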

Now, combining the diffusion models and the latent space:

Latent Diffusion Models work by applying the diffusion process not directly to the full-resolution data (like a high-resolution image) but rather to a latent representation of that data. This approach offers several advantages:

  • Efficiency: Since the latent space representation is typically more compact than the full-resolution data, the diffusion process in LDMs can be more computationally efficient.
  • Quality: Despite the reduction in direct data dimensionality, LDMs can still generate high-quality results. This is because the latent space captures the core features and structures necessary for realistic generation.

Architecture and mechanism

At its core, Stable Diffusion XL (SDXL) is a latent diffusion model (LDM). It uses a diffusion model (DM) technique that involves training with the objective of removing Gaussian noise from training images. This process can be thought of as a sequence of denoising autoencoders. The model consists of three main parts:

  • Text Encoder (CLIP): CLIP, a transformers-based model, serves as the text encoder. It takes the input prompt text and converts it into token embeddings. Each word in the text is represented by these embeddings. CLIP is unique because it's trained on a dataset of images and their captions, combining an image encoder with a text encoder. This allows the model to understand and encode the semantics of the text in a way that's conducive to image generation.

  • U-Net: Following the text encoding, a U-Net model takes over. This model is crucial for the diffusion process. It receives the token embeddings from CLIP, along with an array of noisy inputs. Through a series of iterative steps, the U-Net processes the input latent tensor and produces a new latent space tensor. This new tensor better represents the input text and is less noisy. The U-Net's iterative processing effectively 'cleans up' the image, step by step, bringing it closer to what the text description depicts.

  • Auto Encoder-Decoder: This final stage, involving a Variational Autoencoder (VAE), transforms the denoised latent output back into detailed images. The VAE comprises an encoder that compresses a 1024x1024x3 image into a smaller latent representation (e.g., 128x128x4) for training. The decoder then reconstructs actual images from these refined latents during inference, ensuring the final visuals closely match the text prompt.

Overview of Stable Diffusion XL (SDXL) architecture
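To make the three-stage flow concrete, here is a schematic sketch in which the text encoder, U-Net, and decoder are hypothetical stand-ins (random and zero arrays), not real networks; only the tensor shapes and the iterative denoising loop mirror the real pipeline:

```python
import numpy as np

def clip_encode(prompt):
    """Stand-in for the CLIP text encoder: prompt -> token embeddings (77 x 768)."""
    return np.random.randn(77, 768)

def unet_denoise_step(latent, text_embeddings):
    """Stand-in for one U-Net denoising step conditioned on the text."""
    return latent * 0.9                        # pretend to remove some noise

def vae_decode(latent):
    """Stand-in for the VAE decoder: latent -> full-resolution image."""
    return np.zeros((1024, 1024, 3))

text = clip_encode("an astronaut riding a horse")
latent = np.random.randn(128, 128, 4)          # start from pure Gaussian noise
for _ in range(50):                            # iterative denoising steps
    latent = unet_denoise_step(latent, text)
image = vae_decode(latent)                     # final 1024x1024 RGB image
```

The key point the sketch illustrates: the expensive iterative loop runs entirely on the small 128x128x4 latent, and the full-resolution image is only produced once, at the very end.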

Differences between stable diffusion models

| Model | Release date | Resolution | Parameters | Prompts | Training data | Strengths | Weaknesses |
|---|---|---|---|---|---|---|---|
| SD 1.4 & 1.5 | Mid-2022 | 512x512 | 860 million | OpenAI's CLIP ViT-L/14 | LAION-5B dataset | Beginner friendly; 1.4 is more artistic, 1.5 is stronger on portraits | Needs long prompts, lower resolution |
| SD 2.0 & 2.1 | Late 2022 | 768x768 | 860 million | LAION's OpenCLIP-ViT/H; requires more effort on the negative prompt | LAION-5B dataset with LAION-NSFW classifier | Shorter prompts, richer colors | Aggressive censoring, medium resolution |
| SD XL 1.0 | July 2023 | 1024x1024 | 3.5 billion | OpenCLIP-ViT/G and CLIP-ViT/L for better prompt inference | n/a | Shorter prompts, high resolution | Resource intensive, GPU required |

The Stable Diffusion v2 models struggle with style control and with generating celebrity likenesses, likely due to differences in training data.

Stability AI hasn't explicitly excluded such content, but its influence appears weaker in v2. A likely explanation is that OpenAI's proprietary CLIP training set (used for v1) contained a more extensive collection of art and celebrity images than the openly available data used for v2. As a result, users have preferred the fine-tuned v1 models over v2.

Yet, with the launch of Stable Diffusion XL (SDXL), offering enhanced features and higher resolution, there's a noticeable shift in user preference towards this advanced model.

  • Prompt: 'Model in trendy streetwear, City street with neon signs and pedestrians, Cinematic, Close up shot, Mirrorless, 35mm lens, f/1.8 aperture, ISO 400, slight color grading'
  • Negative prompt: 'low resolution, ugly, deformed'
  • Guidance: 7.5
  • Inference steps: 50
  • Seed: 54428

Stable Diffusion XL (SDXL) model: improving image quality with the refiner

The Stable Diffusion XL (SDXL) model effectively comprises two distinct models working in tandem:

      1. Initially, the base model is deployed to establish the overall composition of the image. 

      2. Following this, an optional refiner model can be applied, which is responsible for adding more intricate details to the image.

The SDXL pipeline consists of a base model and a refiner model [2]

SDXL Turbo: A Real-Time Text-to-Image Generation Model

Following the success of the SDXL model, SDXL Turbo emerges as the next evolution in the field of text-to-image generation technology. Released in late November 2023, SDXL Turbo builds upon the foundations set by its predecessor, SDXL, introducing notable speed improvements.

Key Features and Technology of SDXL Turbo

  • Adversarial Diffusion Distillation (ADD): At the core of SDXL Turbo is the novel ADD technology, which allows for high-quality image generation in real-time. This method combines aspects of adversarial training and score distillation, enabling the model to synthesize image outputs in a single step while maintaining high sampling fidelity.

  • Efficiency and Speed: SDXL Turbo dramatically reduces the steps required for image synthesis. Traditional multi-step models require 50 to 100 steps, but SDXL Turbo can generate images in just 1-4 steps. This efficiency makes it significantly faster than previous models, capable of generating a 512x512 image in about 207ms on an A100 GPU.

  •  Real-Time Applications: The model's real-time generation capability makes it suitable for dynamic environments like video games, virtual reality, and instant content creation for social media or marketing.
Example of images generated with SDXL Turbo


Trade-offs and Limitations

  •  Lower Resolution: A notable drawback is its lower resolution output compared to the original SDXL model. SDXL Turbo is currently limited to producing images at a resolution of 512×512 pixels.

  •  Text Rendering Challenges: The model faces difficulties in rendering clear, legible text, falling short of the performance level of SDXL and other similar models.

  •  Facial Rendering: There is an ongoing challenge in accurately generating faces and people.

  •  Lack of Photorealism: SDXL Turbo generally does not achieve a completely photorealistic rendering.


Despite these limitations, SDXL Turbo is incredibly promising, particularly in terms of its performance.

Get started with Ikomia API 

Using the Ikomia API, you can effortlessly create images with Stable Diffusion in just a few lines of code.

To get started, you need to install the API in a virtual environment [3].

pip install ikomia

Run Stable Diffusion XL (SDXL) with a few lines of code

You can also directly load the notebook we have prepared. 

Note: This workflow uses about 13 GB of GPU memory on Google Colab (T4).

from ikomia.dataprocess.workflow import Workflow
from ikomia.utils import ik
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add the Stable Diffusion algorithm (select the SDXL base model;
# the default is 'stabilityai/stable-diffusion-2-base')
stable_diff = wf.add_task(ik.infer_hf_stable_diffusion(
    model_name='stabilityai/stable-diffusion-xl-base-1.0',
    prompt='Super Mario style jumping, vibrant, cute, cartoony, fantasy, playful, reminiscent of Super Mario series',
    negative_prompt='low resolution, ugly, deformed',
))

# Run your workflow
wf.run()

# Display the image
display(stable_diff.get_output(0).get_image())

  • model_name (str) - default 'stabilityai/stable-diffusion-2-base': Name of the stable diffusion model. Other models available:

              - CompVis/stable-diffusion-v1-4

              - runwayml/stable-diffusion-v1-5

              - stabilityai/stable-diffusion-2-base

              - stabilityai/stable-diffusion-2

              - stabilityai/stable-diffusion-2-1-base

              - stabilityai/stable-diffusion-2-1

              - stabilityai/stable-diffusion-xl-base-1.0

              - stabilityai/sdxl-turbo: requires Torch >= 1.13, by default it will not work on Python < 3.10

  • prompt (str): Input prompt to guide the image generation. 
  • negative_prompt (str, optional): The prompt describing what should not appear in the image. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
  • num_inference_steps (int) - default '50': Number of denoising steps (minimum: 1; maximum: 500).
  • guidance_scale (float) - default '7.5': Scale for classifier-free guidance (minimum: 1; maximum: 20).
  • width (int) - default '512': Output width. If not divisible by 8 it will be automatically modified to a multiple of 8.
  • height (int) - default '512': Output height. If not divisible by 8 it will be automatically modified to a multiple of 8.
  • seed (int) - default '-1': Seed value. '-1' generates a random number between 0 and 191965535.
  • use_refiner (bool) - default 'False': Further process the output of the base model (xl-base-1.0 only) with a refinement model specialized for the final denoising steps.
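The guidance_scale parameter controls classifier-free guidance, where the model's noise prediction with the prompt is contrasted against its prediction without it. The standard formula (general to diffusion models, not Ikomia-specific) can be sketched with toy arrays:

```python
import numpy as np

def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale=7.5):
    """Blend unconditional and conditional predictions, amplifying the prompt direction."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

noise_uncond = np.zeros(4)   # toy unconditional noise prediction
noise_cond = np.ones(4)      # toy prompt-conditioned noise prediction
guided = classifier_free_guidance(noise_uncond, noise_cond, 7.5)
```

With guidance_scale = 1 the formula reduces to the conditional prediction alone, which is consistent with the note above that the negative prompt is ignored when guidance_scale is below 1.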

Create your workflow using Stable Diffusion inpainting

In this article, we've explored image creation with Stable Diffusion. 

Beyond just generating images, Stable Diffusion models also excel in inpainting, which allows for specific areas within an image to be altered based on text prompts. 

The Ikomia API enhances this process by integrating diverse algorithms from various frameworks. For example, you can segment a portion of an image using the Segment Anything Model and then seamlessly replace it using Stable Diffusion's inpainting, all guided by your text input.

Explore Inpainting with SAM and Stable Diffusion→

A key advantage of the Ikomia API is its ability to connect algorithms from different sources (YOLO, Hugging Face, OpenMMLab, …), while eliminating the need for complex dependency installations.

For a detailed guide on using the API, refer to the Ikomia documentation. Additionally, the Ikomia HUB offers a selection of cutting-edge algorithms, and Ikomia STUDIO provides a user-friendly interface, maintaining all the functionalities of the API.


[1] Stable diffusion clearly explained

[2] How does stable diffusion work?

[3] How to create a virtual environment

