Introducing Stable Diffusion XL (SDXL): the future of AI-driven art
The rise of artificial intelligence in creative arenas has led to transformative changes, with Stable Diffusion emerging as a key example. This text-to-image deep learning model, introduced in 2022, has significantly reshaped the AI-generated art scene.
Developed collaboratively by the CompVis Group at Ludwig Maximilian University of Munich and Runway, and supported by a compute donation from Stability AI, Stable Diffusion and Stable Diffusion XL (SDXL) are remarkable for their capability to produce detailed images based on textual prompts, while also being openly accessible.
Transforming text into art: the capabilities of Stable Diffusion XL (SDXL)
Stable Diffusion is a highly advanced text-to-image model. When provided with a text prompt, it employs deep learning techniques to generate a corresponding AI-generated image.
This model adeptly interprets and visualizes the given text description, producing images that closely align with the prompt's content and intent.
What are diffusion models?
Diffusion models are a type of generative model used in machine learning. They work by gradually introducing random noise into an image or dataset and then learning to reverse this process. This methodology allows the model to generate new data that mimics the characteristics of the training data.
Training diffusion models
The training of a diffusion model can be divided into two parts:
Forward Process: The model starts with an image and incrementally adds Gaussian noise across several steps, until the image is completely transformed into random noise.
Reverse Process: In the reverse diffusion process, the model iteratively removes noise to reconstruct the image or data. It uses a learned understanding of how noise was added to reverse the process accurately.
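The forward process described above can be sketched numerically. Below is a minimal NumPy illustration using a toy 1-D "image" and a hypothetical DDPM-style linear noise schedule (the schedule values are illustrative assumptions, not the ones used by Stable Diffusion):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 1-D signal standing in for pixel values.
x0 = np.linspace(-1.0, 1.0, 8)

# Hypothetical linear noise schedule over T steps (DDPM-style).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def forward_diffuse(x0, t):
    """Jump directly to step t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# Early step: the image still dominates; late step: almost pure noise.
x_early = forward_diffuse(x0, 10)
x_late = forward_diffuse(x0, T - 1)
print(np.sqrt(alphas_bar[10]), np.sqrt(alphas_bar[T - 1]))
```

The reverse process trains a network to predict `eps` from `x_t` and `t`, so that the noising above can be undone step by step.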
Addressing the speed challenge in diffusion models
The diffusion process, when operating in image space, faces inherent challenges due to its slow and computationally intensive nature. For instance, a 1024x1024 image with three color channels (red, green, and blue) involves a staggering 3,145,728 dimensions!
Processing such a high volume of data requires considerable computational resources, surpassing the capabilities of standard GPUs.
This computational demand results in very slow processing, making the model impractical on average laptops. The slowdown is compounded with large image sizes and many diffusion steps, because the model must repeatedly feed full-sized images through the U-Net to produce the final output.
Stable Diffusion XL (SDXL) has been developed specifically to address these challenges.
How does Stable Diffusion XL (SDXL) work?
Stable Diffusion XL (SDXL), like the original Stable Diffusion, is built on the "Latent Diffusion Model" (LDM) architecture, which revolutionizes the image generation process by operating in the latent space.
The latent space
This method significantly reduces the computational complexity by compressing images into a latent space that is 48 times smaller than the high-dimensional image space. This efficiency leads to faster processing speeds, setting it apart from traditional diffusion models.
This efficiency comes from the fact that the latent representation captures the essential features of the data while filtering out unnecessary details.
Now, combining the diffusion models and the latent space:
Latent Diffusion Models work by applying the diffusion process not directly to the full-resolution data (like a high-resolution image) but rather to a latent representation of that data. This approach offers several advantages:
Efficiency: Since the latent space representation is typically more compact than the full-resolution data, the diffusion process in LDMs can be more computationally efficient.
Quality: Despite the reduction in direct data dimensionality, LDMs can still generate high-quality results. This is because the latent space captures the core features and structures necessary for realistic generation.
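The "48 times smaller" figure quoted above follows directly from the tensor shapes mentioned in this article (a 1024x1024x3 image versus a 128x128x4 latent):

```python
# Pixel space: a 1024x1024 RGB image.
pixel_dims = 1024 * 1024 * 3      # 3,145,728 values per image

# Latent space: the VAE compresses this to a 128x128 grid with 4 channels.
latent_dims = 128 * 128 * 4       # 65,536 values

print(pixel_dims // latent_dims)  # 48
```

Every diffusion step therefore touches 48 times fewer values than it would in pixel space.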
Architecture and mechanism
At its core, Stable Diffusion XL (SDXL) is a latent diffusion model (LDM). It uses a diffusion model (DM) technique that involves training with the objective of removing Gaussian noise from training images. This process can be thought of as a sequence of denoising autoencoders. The model consists of three main parts:
Text Encoder (CLIP): CLIP, a transformers-based model, serves as the text encoder. It takes the input prompt text and converts it into token embeddings. Each word in the text is represented by these embeddings. CLIP is unique because it's trained on a dataset of images and their captions, combining an image encoder with a text encoder. This allows the model to understand and encode the semantics of the text in a way that's conducive to image generation.
U-Net: Following the text encoding, a U-Net model takes over. This model is crucial for the diffusion process. It receives the token embeddings from CLIP, along with an array of noisy inputs. Through a series of iterative steps, the U-Net processes the input latent tensor and produces a new latent space tensor. This new tensor better represents the input text and is less noisy. The U-Net's iterative processing effectively 'cleans up' the image, step by step, bringing it closer to what the text description depicts.
Auto Encoder-Decoder: This final stage, involving a Variational Autoencoder (VAE), transforms the denoised latent output back into detailed images. The VAE comprises an encoder that compresses a 1024x1024x3 image into a smaller latent representation (e.g., 128x128x4) for training. The decoder then reconstructs actual images from these refined latents during inference, ensuring the final visuals closely match the text prompt.
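Putting the three parts together, the generation pipeline can be sketched with dummy tensors. The tensor shapes follow the text above; the three functions are placeholders standing in for the real networks, not actual implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt: str) -> np.ndarray:
    """Stand-in for CLIP: one embedding vector per token."""
    tokens = prompt.split()
    return rng.standard_normal((len(tokens), 768))

def unet_denoise(latent: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the U-Net: predicts and subtracts a bit of noise."""
    predicted_noise = 0.1 * latent  # placeholder prediction
    return latent - predicted_noise

def vae_decode(latent: np.ndarray) -> np.ndarray:
    """Stand-in for the VAE decoder: latent -> full-resolution image."""
    return np.zeros((1024, 1024, 3))

# 1) Encode the prompt into token embeddings.
emb = text_encoder("an astronaut riding a horse")

# 2) Start from pure noise in latent space and denoise iteratively.
latent = rng.standard_normal((128, 128, 4))
for _ in range(50):
    latent = unet_denoise(latent, emb)

# 3) Decode the final latent into the output image.
image = vae_decode(latent)
print(image.shape)  # (1024, 1024, 3)
```

The key point is that the expensive iterative loop in step 2 runs entirely on the small 128x128x4 latent; the full 1024x1024x3 image only appears once, at decode time.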
Differences between stable diffusion models
SD 1.4 & 1.5
Text encoder: OpenAI's CLIP ViT-L/14
Training data: LAION-5B dataset
Strengths: beginner friendly; 1.4 is more artistic, 1.5 is stronger on portraits
Limitations: needs long prompts, lower resolution

SD 2.0 & 2.1
Text encoder: LAION's OpenCLIP-ViT/H, which interprets prompts differently and requires more effort on the negative prompt
Training data: LAION-5B dataset filtered with the LAION-NSFW classifier
Strengths: shorter prompts, richer colors
Limitations: aggressive censoring, medium resolution

SD XL 1.0
Text encoders: OpenCLIP-ViT/G and CLIP-ViT/L for better inference on prompts
Strengths: shorter prompts, high resolution
Limitations: resource intensive, GPU required
The Stable Diffusion v2 models struggle with style control and generating celebrity likenesses, most likely due to differences in training data. Stability AI hasn't explicitly excluded such content, but its effect appears more limited in v2. A likely explanation lies in the text encoder: v1's CLIP was trained on OpenAI's proprietary dataset, which may contain a more extensive collection of art and celebrity images than the openly available data behind OpenCLIP. As a result, users have long preferred the fine-tuned v1 models over v2.
Yet, with the launch of Stable Diffusion XL (SDXL), offering enhanced features and higher resolution, there's a noticeable shift in user preference towards this advanced model.
Example generated with SDXL:
Prompt: 'Model in trendy streetwear, City street with neon signs and pedestrians, Cinematic, Close up shot, Mirrorless, 35mm lens, f/1.8 aperture, ISO 400, slight color grading'
Negative prompt: 'low resolution, ugly, deformed'
Inference steps: 50
Stable Diffusion XL (SDXL) model: improving image quality with the refiner
The Stable Diffusion XL (SDXL) model effectively comprises two distinct models working in tandem:
1. Initially, the base model is deployed to establish the overall composition of the image.
2. Following this, an optional refiner model can be applied, which is responsible for adding more intricate details to the image.
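The hand-off between base and refiner is commonly expressed as a fraction of the denoising schedule: the base model handles the first portion of the steps, and the refiner finishes the rest. A small sketch of how 50 steps might be split (the 80/20 split here is a hypothetical choice for illustration, not a fixed rule):

```python
def split_steps(total_steps: int, handoff: float):
    """Split the denoising schedule at the given fraction: the base model
    handles steps [0, cut), the refiner finishes steps [cut, total)."""
    cut = int(total_steps * handoff)
    base_steps = list(range(cut))
    refiner_steps = list(range(cut, total_steps))
    return base_steps, refiner_steps

base, refiner = split_steps(50, handoff=0.8)
print(len(base), len(refiner))  # 40 10
```

Because the refiner specializes in the final, low-noise steps, it only needs to see the latent after the base model has already established the overall composition.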
SDXL Turbo: A Real-Time Text-to-Image Generation Model
Following the success of the SDXL model, SDXL Turbo emerges as the next evolution in text-to-image generation. Released in late November 2023, it builds on the foundations set by its predecessor, SDXL, introducing substantial improvements in generation speed.
Key Features and Technology of SDXL Turbo
Adversarial Diffusion Distillation (ADD): At the core of SDXL Turbo is the novel ADD technology, which allows for high-quality image generation in real-time. This method combines aspects of adversarial training and score distillation, enabling the model to synthesize image outputs in a single step while maintaining high sampling fidelity.
Efficiency and Speed: SDXL Turbo dramatically reduces the steps required for image synthesis. Traditional multi-step models require 50 to 100 steps, but SDXL Turbo can generate images in just 1-4 steps. This efficiency makes it significantly faster than previous models, capable of generating a 512x512 image in about 207ms on an A100 GPU.
Real-Time Applications: The model's real-time generation capability makes it suitable for dynamic environments like video games, virtual reality, and instant content creation for social media or marketing.
Trade-offs and Limitations
Lower Resolution: A notable drawback is its lower resolution output compared to the original SDXL model. SDXL Turbo is currently limited to producing images at a resolution of 512×512 pixels.
Text Rendering Challenges: The model faces difficulties in rendering clear, legible text, falling short of the performance level of SDXL and other similar models.
Facial Rendering: There is an ongoing challenge in accurately generating faces and people.
Lack of Photorealism: SDXL Turbo generally does not achieve a completely photorealistic rendering.
Despite these limitations, SDXL Turbo is incredibly promising, particularly in terms of its performance.
Get started with Ikomia API
Using the Ikomia API, you can effortlessly create images with Stable Diffusion in just a few lines of code.
To get started, you need to install the API in a virtual environment.
pip install ikomia
Run Stable Diffusion XL (SDXL) with a few lines of code
You can also directly load the notebook we have prepared.
Note: This workflow uses 13GB GPU on Google Colab (T4).
from ikomia.dataprocess.workflow import Workflow
from ikomia.utils import ik
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
stable_diff = wf.add_task(ik.infer_hf_stable_diffusion(
    prompt='Super Mario style jumping, vibrant, cute, cartoony, fantasy, playful, reminiscent of Super Mario series',
    negative_prompt='low resolution, ugly, deformed',
))

# Run your workflow
wf.run()

# Display the image
display(stable_diff.get_output(0).get_image())
The algorithm accepts the following parameters:

- model_name (str) - default 'stabilityai/stable-diffusion-2-base': Name of the Stable Diffusion model. Other models available:
  - stabilityai/sdxl-turbo: requires Torch >= 1.13; by default it will not work on Python < 3.10
- prompt (str): Input prompt to guide the image generation.
- negative_prompt (str, optional): The prompt not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- num_inference_steps (int) - default '50': Number of denoising steps (minimum: 1; maximum: 500).
- width (int) - default '512': Output width. If not divisible by 8, it will be automatically modified to a multiple of 8.
- height (int) - default '512': Output height. If not divisible by 8, it will be automatically modified to a multiple of 8.
- seed (int) - default '-1': Seed value. '-1' generates a random number between 0 and 191965535.
- use_refiner (bool) - default 'False': Further process the output of the base model (xl-base-1.0 only) with a refinement model specialized for the final denoising steps.
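The width/height behavior described above (rounding to a multiple of 8) can be mimicked with a small helper. This is a sketch of the idea; the API's exact rounding rule may differ:

```python
def snap_to_multiple_of_8(value: int) -> int:
    """Round a dimension down to the nearest multiple of 8 (minimum 8)."""
    return max(8, (value // 8) * 8)

print(snap_to_multiple_of_8(512))  # 512 (already divisible by 8)
print(snap_to_multiple_of_8(515))  # 512
```

Dimensions divisible by 8 are required because the VAE downsamples the image by a factor of 8 when mapping it into latent space.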
Create your workflow using Stable Diffusion inpainting
In this article, we've explored image creation with Stable Diffusion.
Beyond just generating images, Stable Diffusion models also excel in inpainting, which allows for specific areas within an image to be altered based on text prompts.
The Ikomia API enhances this process by integrating diverse algorithms from various frameworks. For example, you can segment a portion of an image using the Segment Anything Model and then seamlessly replace it using Stable Diffusion's inpainting, all guided by your text input.
A key advantage of the Ikomia API is its ability to connect algorithms from different sources (YOLO, Hugging Face, OpenMMLab, …), while eliminating the need for complex dependency installations.
For a detailed guide on using the API, refer to the Ikomia documentation. Additionally, the Ikomia HUB offers a selection of cutting-edge algorithms, and Ikomia STUDIO provides a user-friendly interface, maintaining all the functionalities of the API.