azula.plugins.sana

Sana plugin.

This plugin depends on diffusers and transformers. To use it, install the dependencies in your environment

pip install diffusers transformers accelerate

before importing the plugin.

from azula.plugins import sana
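Before importing the plugin, it can help to confirm that the optional packages are actually visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name `missing_dependencies` is hypothetical and not part of azula.

```python
import importlib.util

# Top-level module names of the packages installed by the pip command above.
REQUIRED = ("diffusers", "transformers", "accelerate")


def missing_dependencies(names=REQUIRED):
    """Return the names in `names` that cannot be found in the current environment."""
    # find_spec returns None when a top-level module cannot be located.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All Sana plugin dependencies were found.")
```

If the list is empty, `from azula.plugins import sana` should not fail because of a missing dependency.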

References

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers (Xie et al., 2024)

Classes

AutoEncoder

Creates an auto-encoder wrapper.

TextEncoder

Creates a text encoder.

SanaDenoiser

Creates a Sana denoiser.

Functions

load_model

Loads a pre-trained Sana latent denoiser.

Descriptions

class azula.plugins.sana.AutoEncoder(ae, scale=1.0)

Creates an auto-encoder wrapper.

encode(x)

Encodes images to latents.

Parameters:

x (Tensor) – A batch of images \(x\), with shape \((B, 3, H, W)\). Pixel values are expected to range between -1 and 1.

Returns:

A batch of latents \(z \sim E(x)\), with shape \((B, 32, H / 32, W / 32)\).

Return type:

Tensor
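Since encode expects pixel values between -1 and 1, images stored in another range need rescaling first. The sketch below assumes 8-bit images with values in [0, 255]; that input range and the helper names are assumptions for illustration, not part of the plugin. The arithmetic works on plain numbers and should also apply element-wise to tensors.

```python
def to_model_range(x):
    """Map pixel values from [0, 255] to [-1, 1] (assumed 8-bit input range)."""
    return x / 127.5 - 1.0


def to_pixel_range(x):
    """Map values from [-1, 1] back to [0, 255]."""
    return (x + 1.0) * 127.5
```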

decode(z)

Decodes latents to images.

Parameters:

z (Tensor) – A batch of latents \(z\), with shape \((B, 32, H / 32, W / 32)\).

Returns:

A batch of images \(x = D(z)\), with shape \((B, 3, H, W)\).

Return type:

Tensor
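The shapes documented for encode and decode imply a fixed relation between image and latent sizes. The following sketch spells out that arithmetic on shape tuples; the helper names are hypothetical, and the requirement that \(H\) and \(W\) be multiples of 32 is an assumption made so that the divisions are exact.

```python
def latent_shape_of(image_shape, channels=32, factor=32):
    """Return the latent shape (B, 32, H / 32, W / 32) for images of shape (B, 3, H, W)."""
    B, C, H, W = image_shape
    if C != 3:
        raise ValueError("expected images with 3 channels")
    if H % factor or W % factor:
        # Assumption: spatial sizes are multiples of the downsampling factor.
        raise ValueError("H and W are assumed to be multiples of 32")
    return (B, channels, H // factor, W // factor)


def image_shape_of(latent_shape, factor=32):
    """Return the image shape (B, 3, H, W) for latents of shape (B, 32, H / 32, W / 32)."""
    B, _, h, w = latent_shape
    return (B, 3, h * factor, w * factor)
```

For instance, a batch of four 1024 x 1024 images corresponds to latents of shape (4, 32, 32, 32).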

class azula.plugins.sana.TextEncoder(gemma, tokenizer, max_length=300)

Creates a text encoder.

forward(prompt, instructions=["Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:", '- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.', '- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.', 'Here are examples of how to transform or refine prompts:', '- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.', '- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.', 'Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:', 'User Prompt: '])
Parameters:
  • prompt (str | Sequence[str]) – A text prompt or list of text prompts.

  • instructions (Sequence[str]) – A set of human instructions to prepend to each prompt.

Returns:

The Gemma-encoded prompt(s).

Return type:

dict[str, Tensor]
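To make "prepend to each prompt" concrete, here is a small sketch of how a list of instructions could be combined with one or several prompts before tokenization. The joining rule (instructions separated by newlines, then concatenated directly with the prompt) is an assumption for illustration and is not confirmed by this page; `prepend_instructions` is a hypothetical helper, not a plugin function.

```python
def prepend_instructions(prompt, instructions):
    """Prepend the joined instructions to a prompt or to each prompt in a list."""
    # Accept a single string as well as a sequence of strings.
    prompts = [prompt] if isinstance(prompt, str) else list(prompt)
    # Assumption: instructions are joined with newlines to form one prefix.
    prefix = "\n".join(instructions)
    return [prefix + p for p in prompts]
```

With the default instructions, whose last entry is 'User Prompt: ', each prompt would then directly follow that label.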

class azula.plugins.sana.SanaDenoiser(backbone, schedule=None)

Creates a Sana denoiser.

forward(z_t, t, prompt_embeds, prompt_mask, **kwargs)
Parameters:
  • z_t (Tensor) – A noisy tensor \(z_t\), with shape \((B, C, H, W)\).

  • t (Tensor) – The time \(t\), with shape \(()\) or \((B)\).

  • prompt_embeds (Tensor) – The Gemma-encoded text prompt \(y\), with shape \((B, L, D)\).

  • prompt_mask (Tensor) – The text attention mask, with shape \((B, L)\).

  • kwargs – Optional keyword arguments.

Returns:

The Gaussian \(\mathcal{N}(Z \mid \mu_\phi(z_t \mid y), \Sigma_\phi(z_t \mid y))\).

Return type:

DiracPosterior
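The shape contract of forward can be checked before calling the denoiser. The sketch below operates on shape tuples (for example `tuple(z_t.shape)`) and only encodes the constraints listed above; `check_forward_shapes` is a hypothetical helper, not part of the plugin.

```python
def check_forward_shapes(z_t, t, prompt_embeds, prompt_mask):
    """Validate shape tuples against the documented (B, C, H, W), () or (B), (B, L, D), (B, L)."""
    z_t, t = tuple(z_t), tuple(t)
    prompt_embeds, prompt_mask = tuple(prompt_embeds), tuple(prompt_mask)
    if len(z_t) != 4:
        raise ValueError("z_t must have shape (B, C, H, W)")
    B = z_t[0]
    if t not in ((), (B,)):
        raise ValueError("t must have shape () or (B)")
    if len(prompt_embeds) != 3 or prompt_embeds[0] != B:
        raise ValueError("prompt_embeds must have shape (B, L, D)")
    if prompt_mask != prompt_embeds[:2]:
        raise ValueError("prompt_mask must have shape (B, L)")
    return True
```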

azula.plugins.sana.load_model(name, **kwargs)

Loads a pre-trained Sana latent denoiser.

Parameters:
  • name (str) – The pre-trained model name.

  • kwargs – Keyword arguments passed to diffusers.SanaPipeline.from_pretrained.

Returns:

A pre-trained latent denoiser and the corresponding auto-encoder and text encoder.

Return type:

tuple[Denoiser, AutoEncoder, TextEncoder]
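The three returned objects are meant to be used together. The following is a hedged sketch of a single denoiser call, not a full sampling loop. It assumes that the keys of the dictionary returned by the text encoder match the prompt_embeds and prompt_mask arguments of SanaDenoiser.forward, and that t = 1 denotes the noisiest time; neither is stated on this page. The function name and the default image size are hypothetical, and the model name is left to the caller because valid names are not listed here.

```python
def load_and_denoise(name, prompt, height=1024, width=1024):
    """Load a Sana model and evaluate the denoiser once on random latents."""
    # Imports are local so that this sketch can be read without the dependencies installed.
    import torch
    from azula.plugins import sana

    denoiser, autoencoder, textencoder = sana.load_model(name)

    # dict[str, Tensor]; assumed to hold prompt_embeds and prompt_mask.
    cond = textencoder(prompt)

    # Random latents with the documented shape (B, 32, H / 32, W / 32).
    z_t = torch.randn(1, 32, height // 32, width // 32)
    t = torch.tensor(1.0)  # assumption: t = 1 is the noisiest time

    with torch.no_grad():
        posterior = denoiser(z_t, t, **cond)

    # The auto-encoder is returned so that denoised latents can later be decoded to images.
    return posterior, autoencoder
```

In practice the denoiser would be wrapped in a sampler and the final latents passed to `autoencoder.decode` to obtain images with values between -1 and 1.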