azula.plugins.sd

Stable Diffusion (SD) plugin.

This plugin depends on diffusers and transformers. To use it, install the dependencies (together with accelerate) in your environment

pip install diffusers transformers accelerate

before importing the plugin.

from azula.plugins import sd
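The following is a minimal sketch for checking that the packages from the install command are available before importing the plugin. The helper name `missing_dependencies` is hypothetical and not part of azula; it only relies on the standard library.

```python
import importlib.util

# Packages named in the install command above.
REQUIRED = ("diffusers", "transformers", "accelerate")


def missing_dependencies(required=REQUIRED):
    """Returns the names in `required` that cannot be found in this environment.

    Hypothetical helper, not part of azula.
    """
    return [name for name in required if importlib.util.find_spec(name) is None]


missing = missing_dependencies()

if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("The plugin's dependencies are available.")
```

The check only looks the packages up; it does not import them, so it stays cheap even when they are installed.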

References

High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2021)

Classes

AutoEncoder

Creates an auto-encoder wrapper.

TextEncoder

Creates a text encoder.

StableDenoiser

Creates a stable denoiser.

Functions

load_model

Loads a pre-trained stable latent denoiser.

Descriptions

class azula.plugins.sd.AutoEncoder(vae, scale=1.0)

Creates an auto-encoder wrapper.

encode(x)

Encodes images to latents.

Parameters:

x (Tensor) – A batch of images \(x\), with shape \((B, 3, H, W)\). Pixel values are expected to range between 0 and 1.

Returns:

A batch of latents \(z \sim q(Z \mid x)\), with shape \((B, 4, H / 8, W / 8)\).

Return type:

Tensor

decode(z)

Decodes latents to images.

Parameters:

z (Tensor) – A batch of latents \(z\), with shape \((B, 4, H / 8, W / 8)\).

Returns:

A batch of images \(x = D(z)\), with shape \((B, 3, H, W)\).

Return type:

Tensor
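The shape bookkeeping of encode and decode can be sketched with two small helpers. The names `latent_shape` and `image_shape` are hypothetical and not part of azula. The sketch assumes that \(H\) and \(W\) are multiples of 8, which the shapes above suggest but do not state.

```python
def latent_shape(image_shape):
    """Maps an image shape (B, 3, H, W) to the latent shape (B, 4, H / 8, W / 8)."""
    B, C, H, W = image_shape

    if C != 3:
        raise ValueError(f"expected 3 image channels, got {C}")

    # Assumption of this sketch: spatial sizes are multiples of 8.
    if H % 8 != 0 or W % 8 != 0:
        raise ValueError(f"H and W should be multiples of 8, got {(H, W)}")

    return (B, 4, H // 8, W // 8)


def image_shape(latent_shape):
    """Maps a latent shape (B, 4, h, w) to the image shape (B, 3, 8 * h, 8 * w)."""
    B, C, h, w = latent_shape

    if C != 4:
        raise ValueError(f"expected 4 latent channels, got {C}")

    return (B, 3, 8 * h, 8 * w)
```

For instance, a batch of two 512 x 768 images maps to latents of shape (2, 4, 64, 96), and back.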

class azula.plugins.sd.TextEncoder(clip, tokenizer)

Creates a text encoder.

forward(prompt)

Parameters:

prompt (str | Sequence[str]) – A text prompt or list of text prompts.

Returns:

The CLIP encoded prompt(s).

Return type:

dict[str, Tensor]
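Because the encoder returns a dictionary of tensors, one plausible way to condition a denoiser is to unpack that dictionary as keyword arguments. The sketch below assumes that the dictionary keys match the keyword arguments of the denoiser (for example `prompt_embeds`); this is an assumption, not something this page confirms. The function name `denoise_with_prompt` is hypothetical.

```python
def denoise_with_prompt(denoiser, textencoder, z_t, t, prompt):
    """Encodes `prompt` and forwards it to `denoiser` as keyword arguments.

    Hypothetical helper. Assumes the keys returned by `textencoder`
    (e.g. "prompt_embeds") are valid keyword arguments of `denoiser`.
    """
    cond = textencoder(prompt)  # dict[str, Tensor]
    return denoiser(z_t, t, **cond)
```

With the objects returned by load_model, the call would look like `denoise_with_prompt(denoiser, textencoder, z_t, t, "a prompt")`, where `z_t` and `t` follow the shapes documented for StableDenoiser.forward.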

class azula.plugins.sd.StableDenoiser(backbone, sigmas, schedule=None, prediction='epsilon')

Creates a stable denoiser.

Parameters:
  • backbone (Module) – A time conditional network.

  • sigmas (Tensor) – The discrete noise schedule used during training.

  • schedule (Schedule) – A noise schedule. If None, azula.noise.VPSchedule is used.

  • prediction (str) – The backbone prediction type. Defaults to 'epsilon'.
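To illustrate what an 'epsilon' prediction type usually means, here is a minimal sketch of the standard conversion from a noise estimate to a clean-latent estimate. It assumes the common perturbation \(z_t = \alpha_t z + \sigma_t \epsilon\); the function name `denoised_from_epsilon` is hypothetical, and the sketch is not claimed to be how StableDenoiser is implemented.

```python
def denoised_from_epsilon(z_t, eps, alpha_t, sigma_t):
    """Recovers a clean-latent estimate from a noise estimate.

    Assumption of this sketch: z_t = alpha_t * z + sigma_t * eps.
    """
    return (z_t - sigma_t * eps) / alpha_t


# Round trip on scalars: perturb a known latent, then invert with the true noise.
z, eps = 1.5, -0.5
alpha_t, sigma_t = 0.8, 0.6

z_t = alpha_t * z + sigma_t * eps
z_hat = denoised_from_epsilon(z_t, eps, alpha_t, sigma_t)
```

The same expression applies elementwise to tensors, with `eps` replaced by the backbone output.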

forward(z_t, t, prompt_embeds, **kwargs)

Parameters:
  • z_t (Tensor) – A noisy tensor \(z_t\), with shape \((B, C, H, W)\).

  • t (Tensor) – The time \(t\), with shape \(()\) or \((B)\).

  • prompt_embeds (Tensor) – The CLIP-encoded text prompt \(y\), with shape \((B, L, D)\).

  • kwargs – Optional keyword arguments.

Returns:

The Gaussian \(\mathcal{N}(Z \mid \mu_\phi(z_t \mid y), \Sigma_\phi(z_t \mid y))\).

Return type:

DiracPosterior

azula.plugins.sd.load_model(name, **kwargs)

Loads a pre-trained stable latent denoiser.

Parameters:
  • name (str) – The pre-trained model name.

  • kwargs – Keyword arguments passed to diffusers.StableDiffusionPipeline.from_pretrained.

Returns:

A pre-trained latent denoiser and the corresponding auto-encoder and text encoder.

Return type:

tuple[Denoiser, AutoEncoder, TextEncoder]
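Since load_model returns a tuple of three components, it can be convenient to name them. The sketch below wraps the call in a hypothetical helper, `load_components`, that returns a named tuple. The `loader` argument is also an assumption of this sketch; it exists only so that the helper can be exercised without downloading a model.

```python
from typing import Any, Callable, NamedTuple, Optional


class SDComponents(NamedTuple):
    """Named view of the tuple returned by load_model (hypothetical)."""

    denoiser: Any
    autoencoder: Any
    textencoder: Any


def load_components(
    name: str,
    loader: Optional[Callable[..., tuple]] = None,
    **kwargs,
) -> SDComponents:
    """Loads a pre-trained model and names its three components.

    If `loader` is None, azula.plugins.sd.load_model is used, which
    requires diffusers and transformers to be installed.
    """
    if loader is None:
        from azula.plugins import sd  # imported lazily on purpose

        loader = sd.load_model

    denoiser, autoencoder, textencoder = loader(name, **kwargs)

    return SDComponents(denoiser, autoencoder, textencoder)
```

The keyword arguments are forwarded unchanged, so anything accepted by diffusers.StableDiffusionPipeline.from_pretrained can be passed through `load_components`.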