azula.nn.vit
Vision Transformer (ViT) building blocks.
References
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2021)
Classes
ViT – Creates a modulated ViT-like module.
Descriptions
- class azula.nn.vit.ViT(in_channels, out_channels, cond_channels=0, mod_features=0, hid_channels=1024, hid_blocks=3, spatial=2, patch_size=1, unpatch_size=None, **kwargs)
Creates a modulated ViT-like module.
- Parameters:
in_channels (int) – The number of input channels \(C_i\).
out_channels (int) – The number of output channels \(C_o\).
cond_channels (int) – The number of condition channels \(C_c\).
mod_features (int) – The number of modulating features \(D\).
hid_channels (int) – The number of hidden token channels.
hid_blocks (int) – The number of hidden transformer blocks.
spatial (int) – The number of spatial dimensions \(N\).
patch_size (int | Sequence[int]) – The patch size or shape.
unpatch_size (int | Sequence[int] | None) – The unpatch size or shape.
kwargs – Keyword arguments passed to ViTBlock.
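The sketch below illustrates the patch arithmetic behind a size-or-shape parameter such as patch_size, following the patching scheme of Dosovitskiy et al. (2021). It is not azula's implementation: the helpers expand_size and token_layout are hypothetical names introduced here, and the assumptions that an int is broadcast to all \(N\) spatial dimensions and that each spatial length must be divisible by its patch size are made for illustration only.

```python
from math import prod
from typing import Sequence, Tuple, Union


def expand_size(size: Union[int, Sequence[int]], spatial: int) -> Tuple[int, ...]:
    """Return one patch length per spatial dimension.

    Hypothetical helper: an int is assumed to apply to every dimension,
    while a sequence must already have `spatial` entries.
    """
    if isinstance(size, int):
        return (size,) * spatial
    size = tuple(size)
    if len(size) != spatial:
        raise ValueError(f"expected {spatial} entries, got {len(size)}")
    return size


def token_layout(
    channels: int,
    shape: Sequence[int],
    patch_size: Union[int, Sequence[int]],
) -> Tuple[int, int]:
    """Return (number of tokens, values per token) for non-overlapping patches.

    Hypothetical helper: each patch is flattened together with its channels,
    before any projection to the hidden token channels.
    """
    patch = expand_size(patch_size, len(shape))
    if any(length % p != 0 for length, p in zip(shape, patch)):
        raise ValueError("each spatial length must be divisible by its patch size")
    # Number of patches along each spatial dimension.
    grid = tuple(length // p for length, p in zip(shape, patch))
    return prod(grid), channels * prod(patch)


# A 3-channel 224x224 image with 16x16 patches, as in the referenced paper.
tokens, values = token_layout(3, (224, 224), 16)

# A per-dimension patch shape on a non-square input.
tokens_rect, values_rect = token_layout(3, (32, 64), (4, 8))
```

With a patch size of 1, every spatial position becomes its own token, so the token count equals the number of positions and grows quickly with resolution.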