+
+🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](https://huggingface.co/docs/diffusers/conceptual/philosophy#usability-over-performance), [simple over easy](https://huggingface.co/docs/diffusers/conceptual/philosophy#simple-over-easy), and [customizability over abstractions](https://huggingface.co/docs/diffusers/conceptual/philosophy#tweakable-contributorfriendly-over-abstraction).
+
+🤗 Diffusers offers three core components:
+
+- State-of-the-art [diffusion pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview) that can be run in inference with just a few lines of code.
+- Interchangeable noise [schedulers](https://huggingface.co/docs/diffusers/api/schedulers/overview) for different diffusion speeds and output quality.
+- Pretrained [models](https://huggingface.co/docs/diffusers/api/models/overview) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems.
+
+## Installation
+
+We recommend installing 🤗 Diffusers in a virtual environment from PyPI or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/#installation), please refer to their official documentation.
+
+### PyTorch
+
+With `pip` (official package):
+
+```bash
+pip install --upgrade diffusers[torch]
+```
+
+With `conda` (maintained by the community):
+
+```sh
+conda install -c conda-forge diffusers
+```
+
+### Flax
+
+With `pip` (official package):
+
+```bash
+pip install --upgrade diffusers[flax]
+```
+
+### Apple Silicon (M1/M2) support
+
+Please refer to the [How to use Stable Diffusion in Apple Silicon](https://huggingface.co/docs/diffusers/optimization/mps) guide.
+
+## Quickstart
+
+Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 19000+ checkpoints):
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
+pipeline.to("cuda")
+pipeline("An image of a squirrel in Picasso style").images[0]
+```
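+
+Schedulers are interchangeable, as noted in the core components above. As a minimal sketch of that idea, continuing from the pipeline just created, you could swap in the faster `DPMSolverMultistepScheduler`:
+
+```python
+from diffusers import DPMSolverMultistepScheduler
+
+# rebuild a different scheduler from the current scheduler's configuration
+pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
+# fewer denoising steps are usually sufficient with this solver
+pipeline("An image of a squirrel in Picasso style", num_inference_steps=25).images[0]
+```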
+
+You can also dig into the models and schedulers toolbox to build your own diffusion system:
+
+```python
+from diffusers import DDPMScheduler, UNet2DModel
+from PIL import Image
+import torch
+
+scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
+model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")
+scheduler.set_timesteps(50)
+
+sample_size = model.config.sample_size
+noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")
+input = noise
+
+for t in scheduler.timesteps:
+    # predict the noise residual for the current timestep
+    with torch.no_grad():
+        noisy_residual = model(input, t).sample
+    # compute the previous (less noisy) sample: x_t -> x_t-1
+    prev_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
+    input = prev_noisy_sample
+
+image = (input / 2 + 0.5).clamp(0, 1)
+image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
+image = Image.fromarray((image * 255).round().astype("uint8"))
+image
+```
+
+Check out the [Quickstart](https://huggingface.co/docs/diffusers/quicktour) to launch your diffusion journey today!
+
+## How to navigate the documentation
+
+| **Documentation** | **What can I learn?** |
+|---------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [Tutorial](https://huggingface.co/docs/diffusers/tutorials/tutorial_overview) | A basic crash course for learning how to use the library's most important features like using models and schedulers to build your own diffusion system, and training your own diffusion model. |
+| [Loading](https://huggingface.co/docs/diffusers/using-diffusers/loading_overview) | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers. |
+| [Pipelines for inference](https://huggingface.co/docs/diffusers/using-diffusers/pipeline_overview) | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library. |
+| [Optimization](https://huggingface.co/docs/diffusers/optimization/opt_overview) | Guides for how to optimize your diffusion model to run faster and consume less memory. |
+| [Training](https://huggingface.co/docs/diffusers/training/overview) | Guides for how to train a diffusion model for different tasks with different training techniques. |
+
+## Contribution
+
+We ❤️ contributions from the open-source community!
+If you want to contribute to this library, please check out our [Contribution guide](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md).
+You can look out for [issues](https://github.com/huggingface/diffusers/issues) you'd like to tackle to contribute to the library.
+- See [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) for general opportunities to contribute
+- See [New model/pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) to contribute exciting new diffusion models / diffusion pipelines
+- See [New scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22)
+
+Also, say 👋 in our public Discord channel. We discuss the hottest trends about diffusion models, help each other with contributions and personal projects, or just hang out ☕.
+
+
+## Popular Tasks & Pipelines
+
+
+
+## Popular libraries using 🧨 Diffusers
+
+- https://github.com/microsoft/TaskMatrix
+- https://github.com/invoke-ai/InvokeAI
+- https://github.com/apple/ml-stable-diffusion
+- https://github.com/Sanster/lama-cleaner
+- https://github.com/IDEA-Research/Grounded-Segment-Anything
+- https://github.com/ashawkey/stable-dreamfusion
+- https://github.com/deep-floyd/IF
+- https://github.com/bentoml/BentoML
+- https://github.com/bmaltais/kohya_ss
+- +8000 other amazing GitHub repositories 💪
+
+Thank you for using us ❤️.
+
+## Credits
+
+This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations which have helped us in our development and without which the API could not have been as polished today:
+
+- @CompVis' latent diffusion models library, available [here](https://github.com/CompVis/latent-diffusion)
+- @hojonathanho's original DDPM implementation, available [here](https://github.com/hojonathanho/diffusion), as well as the extremely useful translation into PyTorch by @pesser, available [here](https://github.com/pesser/pytorch_diffusion)
+- @ermongroup's DDIM implementation, available [here](https://github.com/ermongroup/ddim)
+- @yang-song's Score-VE and Score-VP implementations, available [here](https://github.com/yang-song/score_sde_pytorch)
+
+We also want to thank @heejkoo for the very helpful overview of papers, code and resources on diffusion models, available [here](https://github.com/heejkoo/Awesome-Diffusion-Models) as well as @crowsonkb and @rromb for useful discussions and insights.
+
+## Citation
+
+```bibtex
+@misc{von-platen-etal-2022-diffusers,
+ author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
+ title = {Diffusers: State-of-the-art diffusion models},
+ year = {2022},
+ publisher = {GitHub},
+ journal = {GitHub repository},
+ howpublished = {\url{https://github.com/huggingface/diffusers}}
+}
+```
diff --git a/docker/diffusers-flax-cpu/Dockerfile b/docker/diffusers-flax-cpu/Dockerfile
new file mode 100644
index 0000000..36d036e
--- /dev/null
+++ b/docker/diffusers-flax-cpu/Dockerfile
@@ -0,0 +1,44 @@
+FROM ubuntu:20.04
+LABEL maintainer="Hugging Face"
+LABEL repository="diffusers"
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt update && \
+ apt install -y bash \
+ build-essential \
+ git \
+ git-lfs \
+ curl \
+ ca-certificates \
+ libsndfile1-dev \
+ python3.8 \
+ python3-pip \
+ python3.8-venv && \
+ rm -rf /var/lib/apt/lists
+
+# make sure to use venv
+RUN python3 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
+# follow the instructions here: https://cloud.google.com/tpu/docs/run-in-container#train_a_jax_model_in_a_docker_container
+RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
+ python3 -m uv pip install --upgrade --no-cache-dir \
+ clu \
+ "jax[cpu]>=0.2.16,!=0.3.2" \
+ "flax>=0.4.1" \
+ "jaxlib>=0.1.65" && \
+ python3 -m uv pip install --no-cache-dir \
+ accelerate \
+ datasets \
+ hf-doc-builder \
+ huggingface-hub \
+ Jinja2 \
+ librosa \
+ numpy \
+ scipy \
+ tensorboard \
+ transformers
+
+CMD ["/bin/bash"]
\ No newline at end of file
diff --git a/docker/diffusers-flax-tpu/Dockerfile b/docker/diffusers-flax-tpu/Dockerfile
new file mode 100644
index 0000000..78d5f97
--- /dev/null
+++ b/docker/diffusers-flax-tpu/Dockerfile
@@ -0,0 +1,46 @@
+FROM ubuntu:20.04
+LABEL maintainer="Hugging Face"
+LABEL repository="diffusers"
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt update && \
+ apt install -y bash \
+ build-essential \
+ git \
+ git-lfs \
+ curl \
+ ca-certificates \
+ libsndfile1-dev \
+ python3.8 \
+ python3-pip \
+ python3.8-venv && \
+ rm -rf /var/lib/apt/lists
+
+# make sure to use venv
+RUN python3 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
+# follow the instructions here: https://cloud.google.com/tpu/docs/run-in-container#train_a_jax_model_in_a_docker_container
+RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
+ python3 -m pip install --no-cache-dir \
+ "jax[tpu]>=0.2.16,!=0.3.2" \
+ -f https://storage.googleapis.com/jax-releases/libtpu_releases.html && \
+ python3 -m uv pip install --upgrade --no-cache-dir \
+ clu \
+ "flax>=0.4.1" \
+ "jaxlib>=0.1.65" && \
+ python3 -m uv pip install --no-cache-dir \
+ accelerate \
+ datasets \
+ hf-doc-builder \
+ huggingface-hub \
+ Jinja2 \
+ librosa \
+ numpy \
+ scipy \
+ tensorboard \
+ transformers
+
+CMD ["/bin/bash"]
\ No newline at end of file
diff --git a/docker/diffusers-onnxruntime-cpu/Dockerfile b/docker/diffusers-onnxruntime-cpu/Dockerfile
new file mode 100644
index 0000000..0d032d9
--- /dev/null
+++ b/docker/diffusers-onnxruntime-cpu/Dockerfile
@@ -0,0 +1,44 @@
+FROM ubuntu:20.04
+LABEL maintainer="Hugging Face"
+LABEL repository="diffusers"
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt update && \
+ apt install -y bash \
+ build-essential \
+ git \
+ git-lfs \
+ curl \
+ ca-certificates \
+ libsndfile1-dev \
+ python3.8 \
+ python3-pip \
+ python3.8-venv && \
+ rm -rf /var/lib/apt/lists
+
+# make sure to use venv
+RUN python3 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
+RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
+ python3 -m uv pip install --no-cache-dir \
+ torch==2.1.2 \
+ torchvision==0.16.2 \
+ torchaudio==2.1.2 \
+ onnxruntime \
+ --extra-index-url https://download.pytorch.org/whl/cpu && \
+ python3 -m uv pip install --no-cache-dir \
+ accelerate \
+ datasets \
+ hf-doc-builder \
+ huggingface-hub \
+ Jinja2 \
+ librosa \
+ numpy \
+ scipy \
+ tensorboard \
+ transformers
+
+CMD ["/bin/bash"]
\ No newline at end of file
diff --git a/docker/diffusers-onnxruntime-cuda/Dockerfile b/docker/diffusers-onnxruntime-cuda/Dockerfile
new file mode 100644
index 0000000..34e611d
--- /dev/null
+++ b/docker/diffusers-onnxruntime-cuda/Dockerfile
@@ -0,0 +1,44 @@
+FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
+LABEL maintainer="Hugging Face"
+LABEL repository="diffusers"
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt update && \
+ apt install -y bash \
+ build-essential \
+ git \
+ git-lfs \
+ curl \
+ ca-certificates \
+ libsndfile1-dev \
+ python3.8 \
+ python3-pip \
+ python3.8-venv && \
+ rm -rf /var/lib/apt/lists
+
+# make sure to use venv
+RUN python3 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
+RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
+ python3 -m uv pip install --no-cache-dir \
+ torch \
+ torchvision \
+ torchaudio \
+ "onnxruntime-gpu>=1.13.1" \
+ --extra-index-url https://download.pytorch.org/whl/cu117 && \
+ python3 -m uv pip install --no-cache-dir \
+ accelerate \
+ datasets \
+ hf-doc-builder \
+ huggingface-hub \
+ Jinja2 \
+ librosa \
+ numpy \
+ scipy \
+ tensorboard \
+ transformers
+
+CMD ["/bin/bash"]
\ No newline at end of file
diff --git a/docker/diffusers-pytorch-compile-cuda/Dockerfile b/docker/diffusers-pytorch-compile-cuda/Dockerfile
new file mode 100644
index 0000000..b4f507f
--- /dev/null
+++ b/docker/diffusers-pytorch-compile-cuda/Dockerfile
@@ -0,0 +1,45 @@
+FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
+LABEL maintainer="Hugging Face"
+LABEL repository="diffusers"
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt update && \
+ apt install -y bash \
+ build-essential \
+ git \
+ git-lfs \
+ curl \
+ ca-certificates \
+ libsndfile1-dev \
+ libgl1 \
+ python3.9 \
+ python3.9-dev \
+ python3-pip \
+ python3.9-venv && \
+ rm -rf /var/lib/apt/lists
+
+# make sure to use venv
+RUN python3.9 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
+RUN python3.9 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
+ python3.9 -m uv pip install --no-cache-dir \
+ torch \
+ torchvision \
+ torchaudio \
+ invisible_watermark && \
+ python3.9 -m pip install --no-cache-dir \
+ accelerate \
+ datasets \
+ hf-doc-builder \
+ huggingface-hub \
+ Jinja2 \
+ librosa \
+ numpy \
+ scipy \
+ tensorboard \
+ transformers
+
+CMD ["/bin/bash"]
diff --git a/docker/diffusers-pytorch-cpu/Dockerfile b/docker/diffusers-pytorch-cpu/Dockerfile
new file mode 100644
index 0000000..288559b
--- /dev/null
+++ b/docker/diffusers-pytorch-cpu/Dockerfile
@@ -0,0 +1,45 @@
+FROM ubuntu:20.04
+LABEL maintainer="Hugging Face"
+LABEL repository="diffusers"
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt update && \
+ apt install -y bash \
+ build-essential \
+ git \
+ git-lfs \
+ curl \
+ ca-certificates \
+ libsndfile1-dev \
+ python3.8 \
+ python3-pip \
+ libgl1 \
+ python3.8-venv && \
+ rm -rf /var/lib/apt/lists
+
+# make sure to use venv
+RUN python3 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
+RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
+ python3 -m uv pip install --no-cache-dir \
+ torch \
+ torchvision \
+ torchaudio \
+ invisible_watermark \
+ --extra-index-url https://download.pytorch.org/whl/cpu && \
+ python3 -m uv pip install --no-cache-dir \
+ accelerate \
+ datasets \
+ hf-doc-builder \
+ huggingface-hub \
+ Jinja2 \
+ librosa \
+ numpy \
+ scipy \
+ tensorboard \
+ transformers
+
+CMD ["/bin/bash"]
diff --git a/docker/diffusers-pytorch-cuda/Dockerfile b/docker/diffusers-pytorch-cuda/Dockerfile
new file mode 100644
index 0000000..8e0e56b
--- /dev/null
+++ b/docker/diffusers-pytorch-cuda/Dockerfile
@@ -0,0 +1,45 @@
+FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
+LABEL maintainer="Hugging Face"
+LABEL repository="diffusers"
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt update && \
+ apt install -y bash \
+ build-essential \
+ git \
+ git-lfs \
+ curl \
+ ca-certificates \
+ libsndfile1-dev \
+ libgl1 \
+ python3.8 \
+ python3-pip \
+ python3.8-venv && \
+ rm -rf /var/lib/apt/lists
+
+# make sure to use venv
+RUN python3 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
+RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
+ python3 -m uv pip install --no-cache-dir \
+ torch \
+ torchvision \
+ torchaudio \
+ invisible_watermark && \
+ python3 -m pip install --no-cache-dir \
+ accelerate \
+ datasets \
+ hf-doc-builder \
+ huggingface-hub \
+ Jinja2 \
+ librosa \
+ numpy \
+ scipy \
+ tensorboard \
+ transformers \
+ pytorch-lightning
+
+CMD ["/bin/bash"]
diff --git a/docker/diffusers-pytorch-xformers-cuda/Dockerfile b/docker/diffusers-pytorch-xformers-cuda/Dockerfile
new file mode 100644
index 0000000..90ac0d1
--- /dev/null
+++ b/docker/diffusers-pytorch-xformers-cuda/Dockerfile
@@ -0,0 +1,45 @@
+FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
+LABEL maintainer="Hugging Face"
+LABEL repository="diffusers"
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt update && \
+ apt install -y bash \
+ build-essential \
+ git \
+ git-lfs \
+ curl \
+ ca-certificates \
+ libsndfile1-dev \
+ libgl1 \
+ python3.8 \
+ python3-pip \
+ python3.8-venv && \
+ rm -rf /var/lib/apt/lists
+
+# make sure to use venv
+RUN python3 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
+RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
+ python3 -m pip install --no-cache-dir \
+ torch \
+ torchvision \
+ torchaudio \
+ invisible_watermark && \
+ python3 -m uv pip install --no-cache-dir \
+ accelerate \
+ datasets \
+ hf-doc-builder \
+ huggingface-hub \
+ Jinja2 \
+ librosa \
+ numpy \
+ scipy \
+ tensorboard \
+ transformers \
+ xformers
+
+CMD ["/bin/bash"]
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..e7aa8c4
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,268 @@
+
+
+# Generating the documentation
+
+To generate the documentation, you first have to build it. Several packages are necessary to build the docs;
+you can install them with the following command, at the root of the code repository:
+
+```bash
+pip install -e ".[docs]"
+```
+
+Then you need to install our open source documentation builder tool:
+
+```bash
+pip install git+https://github.com/huggingface/doc-builder
+```
+
+---
+**NOTE**
+
+You only need to generate the documentation to inspect it locally (if you're planning changes and want to
+check how they look before committing for instance). You don't have to commit the built documentation.
+
+---
+
+## Previewing the documentation
+
+To preview the docs, first install the `watchdog` module with:
+
+```bash
+pip install watchdog
+```
+
+Then run the following command:
+
+```bash
+doc-builder preview {package_name} {path_to_docs}
+```
+
+For example:
+
+```bash
+doc-builder preview diffusers docs/source/en
+```
+
+The docs will be viewable at [http://localhost:3000](http://localhost:3000). You can also preview the docs once you have opened a PR. You will see a bot add a comment to a link where the documentation with your changes lives.
+
+---
+**NOTE**
+
+The `preview` command only works with existing doc files. When you add a completely new file, you need to update `_toctree.yml` and restart the `preview` command (`ctrl-c` to stop it and call `doc-builder preview ...` again).
+
+---
+
+## Adding a new element to the navigation bar
+
+Accepted files are Markdown (.md).
+
+Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
+the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml) file.
+
+## Renaming section headers and moving sections
+
+It helps to keep the old links working when renaming the section header and/or moving sections from one document to another. This is because the old links are likely to be used in Issues, Forums, and social media, and it would make for a far better user experience if users reading them months later could still easily navigate to the originally intended information.
+
+Therefore, we simply keep a little map of moved sections at the end of the document where the original section was. The key is to preserve the original anchor.
+
+So if you renamed a section from: "Section A" to "Section B", then you can add at the end of the file:
+
+```md
+Sections that were moved:
+
+[ <a href="#section-b">Section A</a><a id="section-a"></a> ]
+```
+and of course, if you moved it to another file, then:
+
+```md
+Sections that were moved:
+
+[ <a href="../new-file#section-b">Section A</a><a id="section-a"></a> ]
+```
+
+Use the relative style to link to the new file so that the versioned docs continue to work.
+
+For an example of a rich moved section set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md).
+
+
+## Writing Documentation - Specification
+
+The `huggingface/diffusers` documentation follows the
+[Google documentation](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) style for docstrings,
+although we can write them directly in Markdown.
+
+### Adding a new tutorial
+
+Adding a new tutorial or section is done in two steps:
+
+- Add a new Markdown (.md) file under `docs/source/`.
+- Link that file in `docs/source/<languageCode>/_toctree.yml` on the correct toc-tree.
+
+Make sure to put your new file under the proper section. It's unlikely to go in the first section (*Get Started*), so
+depending on the intended targets (beginners, more advanced users, or researchers) it should go in sections two, three, or four.
+
+### Adding a new pipeline/scheduler
+
+When adding a new pipeline:
+
+- Create a file `xxx.md` under `docs/source/<languageCode>/api/pipelines` (don't hesitate to copy an existing file as template).
+- Link that file in (*Diffusers Summary*) section in `docs/source/api/pipelines/overview.md`, along with the link to the paper, and a colab notebook (if available).
+- Write a short overview of the diffusion model:
+ - Overview with paper & authors
+ - Paper abstract
+ - Tips and tricks and how to use it best
+ - Possibly an end-to-end example of how to use it
+- Add all the pipeline classes that should be linked in the diffusion model. These classes should be added using our Markdown syntax. By default as follows:
+
+```
+[[autodoc]] XXXPipeline
+ - all
+ - __call__
+```
+
+This will include every public method of the pipeline that is documented, as well as the `__call__` method that is not documented by default. If you just want to add additional methods that are not documented, you can put the list of all methods to add in a list that contains `all`.
+
+```
+[[autodoc]] XXXPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+```
+
+You can follow the same process to create a new scheduler under the `docs/source/<languageCode>/api/schedulers` folder.
+
+### Writing source documentation
+
+Values that should be put in `code` should either be surrounded by backticks: \`like so\`. Note that argument names
+and objects like True, None, or any strings should usually be put in `code`.
+
+When mentioning a class, function, or method, it is recommended to use our syntax for internal links so that our tool
+adds a link to its documentation with this syntax: \[\`XXXClass\`\] or \[\`function\`\]. This requires the class or
+function to be in the main package.
+
+If you want to create a link to some internal class or function, you need to
+provide its path. For instance: \[\`pipelines.ImagePipelineOutput\`\]. This will be converted into a link with
+`pipelines.ImagePipelineOutput` in the description. To get rid of the path and only keep the name of the object you are
+linking to in the description, add a ~: \[\`~pipelines.ImagePipelineOutput\`\] will generate a link with `ImagePipelineOutput` in the description.
+
+The same works for methods so you can either use \[\`XXXClass.method\`\] or \[\`~XXXClass.method\`\].
+
+#### Defining arguments in a method
+
+Arguments should be defined with the `Args:` (or `Arguments:` or `Parameters:`) prefix, followed by a line return and
+an indentation. The argument should be followed by its type, with its shape if it is a tensor, a colon, and its
+description:
+
+```
+ Args:
+ n_layers (`int`): The number of layers of the model.
+```
+
+If the description is too long to fit in one line, another indentation is necessary before writing the description
+after the argument.
+
+Here's an example showcasing everything so far:
+
+```
+ Args:
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+ Indices of input sequence tokens in the vocabulary.
+
+ Indices can be obtained using [`AlbertTokenizer`]. See [`~PreTrainedTokenizer.encode`] and
+ [`~PreTrainedTokenizer.__call__`] for details.
+
+ [What are input IDs?](../glossary#input-ids)
+```
+
+For optional arguments or arguments with defaults we follow the following syntax: imagine we have a function with the
+following signature:
+
+```py
+def my_function(x: str=None, a: float=3.14):
+```
+
+then its documentation should look like this:
+
+```
+ Args:
+ x (`str`, *optional*):
+ This argument controls ...
+ a (`float`, *optional*, defaults to `3.14`):
+ This argument is used to ...
+```
+
+Note that we always omit the "defaults to \`None\`" when None is the default for any argument. Also note that even
+if the first line describing your argument type and its default gets long, you can't break it on several lines. You can
+however write as many lines as you want in the indented description (see the example above with `input_ids`).
+
+#### Writing a multi-line code block
+
+Multi-line code blocks can be useful for displaying examples. They are done between two lines of three backticks as usual in Markdown:
+
+
+````
+```
+# first line of code
+# second line
+# etc
+```
+````
+
+#### Writing a return block
+
+The return block should be introduced with the `Returns:` prefix, followed by a line return and an indentation.
+The first line should be the type of the return, followed by a line return. No need to indent further for the elements
+building the return.
+
+Here's an example of a single value return:
+
+```
+ Returns:
+ `List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token.
+```
+
+Here's an example of a tuple return, comprising several objects:
+
+```
+ Returns:
+ `tuple(torch.FloatTensor)` comprising various elements depending on the configuration ([`BertConfig`]) and inputs:
+ - **loss** (*optional*, returned when `masked_lm_labels` is provided) `torch.FloatTensor` of shape `(1,)` --
+ Total loss is the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
+ - **prediction_scores** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) --
+ Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+```
+
+#### Adding an image
+
+Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos, and other non-text files. We prefer to leverage a hf.co hosted `dataset` like
+the ones hosted on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing) in which to place these files and reference
+them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images).
+If you are an external contributor, feel free to add the images to your PR and ask a Hugging Face member to migrate your images
+to this dataset.
+
+## Styling the docstring
+
+We have an automatic script running with the `make style` command that will make sure that:
+- the docstrings fully take advantage of the line width
+- all code examples are formatted using black, like the code of the Transformers library
+
+This script may have some weird failures if you made a syntax mistake or if you uncover a bug. Therefore, it's
+recommended to commit your changes before running `make style`, so you can revert the changes done by that script
+easily.
diff --git a/docs/TRANSLATING.md b/docs/TRANSLATING.md
new file mode 100644
index 0000000..f88bec8
--- /dev/null
+++ b/docs/TRANSLATING.md
@@ -0,0 +1,69 @@
+
+
+### Translating the Diffusers documentation into your language
+
+As part of our mission to democratize machine learning, we'd love to make the Diffusers library available in many more languages! Follow the steps below if you want to help translate the documentation into your language 🙏.
+
+**🗞️ Open an issue**
+
+To get started, navigate to the [Issues](https://github.com/huggingface/diffusers/issues) page of this repo and check if anyone else has opened an issue for your language. If not, open a new issue by selecting the "🌐 Translating a New Language?" template from the "New issue" button.
+
+Once an issue exists, post a comment to indicate which chapters you'd like to work on, and we'll add your name to the list.
+
+
+**🍴 Fork the repository**
+
+First, you'll need to [fork the Diffusers repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo). You can do this by clicking on the **Fork** button on the top-right corner of this repo's page.
+
+Once you've forked the repo, you'll want to get the files on your local machine for editing. You can do that by cloning the fork with Git as follows:
+
+```bash
+git clone https://github.com/<YOUR-USERNAME>/diffusers.git
+```
+
+**📋 Copy-paste the English version with a new language code**
+
+The documentation files are in one leading directory:
+
+- [`docs/source`](https://github.com/huggingface/diffusers/tree/main/docs/source): All the documentation materials are organized here by language.
+
+You'll only need to copy the files in the [`docs/source/en`](https://github.com/huggingface/diffusers/tree/main/docs/source/en) directory, so first navigate to your fork of the repo and run the following:
+
+```bash
+cd ~/path/to/diffusers/docs
+cp -r source/en source/<LANG-ID>
+```
+
+Here, `<LANG-ID>` should be one of the ISO 639-1 or ISO 639-2 language codes -- see [here](https://www.loc.gov/standards/iso639-2/php/code_list.php) for a handy table.
+
+**✍️ Start translating**
+
+The fun part comes - translating the text!
+
+The first thing we recommend is translating the part of the `_toctree.yml` file that corresponds to your doc chapter. This file is used to render the table of contents on the website.
+
+> 🙋 If the `_toctree.yml` file doesn't yet exist for your language, you can create one by copy-pasting from the English version and deleting the sections unrelated to your chapter. Just make sure it exists in the `docs/source/<LANG-ID>/` directory!
+
+The fields you should add are `local` (with the name of the file containing the translation; e.g. `autoclass_tutorial`), and `title` (with the title of the doc in your language; e.g. `Load pretrained instances with an AutoClass`) -- as a reference, here is the `_toctree.yml` for [English](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml):
+
+```yaml
+- sections:
+ - local: pipeline_tutorial # Do not change this! Use the same name for your .md file
+ title: Pipelines for inference # Translate this!
+ ...
+ title: Tutorials # Translate this!
+```
+
+Once you have translated the `_toctree.yml` file, you can start translating the [MDX](https://mdxjs.com/) files associated with your docs chapter.
+
+> 🙋 If you'd like others to help you with the translation, you should [open an issue](https://github.com/huggingface/diffusers/issues) and tag @patrickvonplaten.
diff --git a/docs/source/_config.py b/docs/source/_config.py
new file mode 100644
index 0000000..3d0d73d
--- /dev/null
+++ b/docs/source/_config.py
@@ -0,0 +1,9 @@
+# docstyle-ignore
+INSTALL_CONTENT = """
+# Diffusers installation
+! pip install diffusers transformers datasets accelerate
+# To install from source instead of the last release, comment the command above and uncomment the following one.
+# ! pip install git+https://github.com/huggingface/diffusers.git
+"""
+
+notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
new file mode 100644
index 0000000..ba94de5
--- /dev/null
+++ b/docs/source/en/_toctree.yml
@@ -0,0 +1,446 @@
+- sections:
+ - local: index
+ title: 🧨 Diffusers
+ - local: quicktour
+ title: Quicktour
+ - local: stable_diffusion
+ title: Effective and efficient diffusion
+ - local: installation
+ title: Installation
+ title: Get started
+- sections:
+ - local: tutorials/tutorial_overview
+ title: Overview
+ - local: using-diffusers/write_own_pipeline
+ title: Understanding pipelines, models and schedulers
+ - local: tutorials/autopipeline
+ title: AutoPipeline
+ - local: tutorials/basic_training
+ title: Train a diffusion model
+ - local: tutorials/using_peft_for_inference
+ title: Load LoRAs for inference
+ - local: tutorials/fast_diffusion
+ title: Accelerate inference of text-to-image diffusion models
+ title: Tutorials
+- sections:
+ - sections:
+ - local: using-diffusers/loading_overview
+ title: Overview
+ - local: using-diffusers/loading
+ title: Load pipelines, models, and schedulers
+ - local: using-diffusers/schedulers
+ title: Load and compare different schedulers
+ - local: using-diffusers/custom_pipeline_overview
+ title: Load community pipelines and components
+ - local: using-diffusers/using_safetensors
+ title: Load safetensors
+ - local: using-diffusers/other-formats
+ title: Load different Stable Diffusion formats
+ - local: using-diffusers/loading_adapters
+ title: Load adapters
+ - local: using-diffusers/push_to_hub
+ title: Push files to the Hub
+ title: Loading & Hub
+ - sections:
+ - local: using-diffusers/pipeline_overview
+ title: Overview
+ - local: using-diffusers/unconditional_image_generation
+ title: Unconditional image generation
+ - local: using-diffusers/conditional_image_generation
+ title: Text-to-image
+ - local: using-diffusers/img2img
+ title: Image-to-image
+ - local: using-diffusers/inpaint
+ title: Inpainting
+ - local: using-diffusers/text-img2vid
+ title: Text or image-to-video
+ - local: using-diffusers/depth2img
+ title: Depth-to-image
+ title: Tasks
+ - sections:
+ - local: using-diffusers/textual_inversion_inference
+ title: Textual inversion
+ - local: using-diffusers/ip_adapter
+ title: IP-Adapter
+ - local: using-diffusers/merge_loras
+ title: Merge LoRAs
+ - local: training/distributed_inference
+ title: Distributed inference with multiple GPUs
+ - local: using-diffusers/reusing_seeds
+ title: Improve image quality with deterministic generation
+ - local: using-diffusers/control_brightness
+ title: Control image brightness
+ - local: using-diffusers/weighted_prompts
+ title: Prompt weighting
+ - local: using-diffusers/freeu
+ title: Improve generation quality with FreeU
+ title: Techniques
+ - sections:
+ - local: using-diffusers/pipeline_overview
+ title: Overview
+ - local: using-diffusers/sdxl
+ title: Stable Diffusion XL
+ - local: using-diffusers/sdxl_turbo
+ title: SDXL Turbo
+ - local: using-diffusers/kandinsky
+ title: Kandinsky
+ - local: using-diffusers/controlnet
+ title: ControlNet
+ - local: using-diffusers/shap-e
+ title: Shap-E
+ - local: using-diffusers/diffedit
+ title: DiffEdit
+ - local: using-diffusers/distilled_sd
+ title: Distilled Stable Diffusion inference
+ - local: using-diffusers/callback
+ title: Pipeline callbacks
+ - local: using-diffusers/reproducibility
+ title: Create reproducible pipelines
+ - local: using-diffusers/custom_pipeline_examples
+ title: Community pipelines
+ - local: using-diffusers/contribute_pipeline
+ title: Contribute a community pipeline
+ - local: using-diffusers/inference_with_lcm_lora
+ title: Latent Consistency Model-LoRA
+ - local: using-diffusers/inference_with_lcm
+ title: Latent Consistency Model
+ - local: using-diffusers/svd
+ title: Stable Video Diffusion
+ title: Specific pipeline examples
+ - sections:
+ - local: training/overview
+ title: Overview
+ - local: training/create_dataset
+ title: Create a dataset for training
+ - local: training/adapt_a_model
+ title: Adapt a model to a new task
+ - sections:
+ - local: training/unconditional_training
+ title: Unconditional image generation
+ - local: training/text2image
+ title: Text-to-image
+ - local: training/sdxl
+ title: Stable Diffusion XL
+ - local: training/kandinsky
+ title: Kandinsky 2.2
+ - local: training/wuerstchen
+ title: Wuerstchen
+ - local: training/controlnet
+ title: ControlNet
+ - local: training/t2i_adapters
+ title: T2I-Adapters
+ - local: training/instructpix2pix
+ title: InstructPix2Pix
+ title: Models
+ - sections:
+ - local: training/text_inversion
+ title: Textual Inversion
+ - local: training/dreambooth
+ title: DreamBooth
+ - local: training/lora
+ title: LoRA
+ - local: training/custom_diffusion
+ title: Custom Diffusion
+ - local: training/lcm_distill
+ title: Latent Consistency Distillation
+ - local: training/ddpo
+ title: Reinforcement learning training with DDPO
+ title: Methods
+ title: Training
+ - sections:
+ - local: using-diffusers/other-modalities
+ title: Other Modalities
+ title: Taking Diffusers Beyond Images
+ title: Using Diffusers
+- sections:
+ - local: optimization/opt_overview
+ title: Overview
+ - sections:
+ - local: optimization/fp16
+ title: Speed up inference
+ - local: optimization/memory
+ title: Reduce memory usage
+ - local: optimization/torch2.0
+ title: PyTorch 2.0
+ - local: optimization/xformers
+ title: xFormers
+ - local: optimization/tome
+ title: Token merging
+ - local: optimization/deepcache
+ title: DeepCache
+ title: General optimizations
+ - sections:
+ - local: using-diffusers/stable_diffusion_jax_how_to
+ title: JAX/Flax
+ - local: optimization/onnx
+ title: ONNX
+ - local: optimization/open_vino
+ title: OpenVINO
+ - local: optimization/coreml
+ title: Core ML
+ title: Optimized model types
+ - sections:
+ - local: optimization/mps
+ title: Metal Performance Shaders (MPS)
+ - local: optimization/habana
+ title: Habana Gaudi
+ title: Optimized hardware
+ title: Optimization
+- sections:
+ - local: conceptual/philosophy
+ title: Philosophy
+ - local: using-diffusers/controlling_generation
+ title: Controlled generation
+ - local: conceptual/contribution
+ title: How to contribute?
+ - local: conceptual/ethical_guidelines
+ title: Diffusers' Ethical Guidelines
+ - local: conceptual/evaluation
+ title: Evaluating Diffusion Models
+ title: Conceptual Guides
+- sections:
+ - sections:
+ - local: api/configuration
+ title: Configuration
+ - local: api/logging
+ title: Logging
+ - local: api/outputs
+ title: Outputs
+ title: Main Classes
+ - sections:
+ - local: api/loaders/ip_adapter
+ title: IP-Adapter
+ - local: api/loaders/lora
+ title: LoRA
+ - local: api/loaders/single_file
+ title: Single files
+ - local: api/loaders/textual_inversion
+ title: Textual Inversion
+ - local: api/loaders/unet
+ title: UNet
+ - local: api/loaders/peft
+ title: PEFT
+ title: Loaders
+ - sections:
+ - local: api/models/overview
+ title: Overview
+ - local: api/models/unet
+ title: UNet1DModel
+ - local: api/models/unet2d
+ title: UNet2DModel
+ - local: api/models/unet2d-cond
+ title: UNet2DConditionModel
+ - local: api/models/unet3d-cond
+ title: UNet3DConditionModel
+ - local: api/models/unet-motion
+ title: UNetMotionModel
+ - local: api/models/uvit2d
+ title: UViT2DModel
+ - local: api/models/vq
+ title: VQModel
+ - local: api/models/autoencoderkl
+ title: AutoencoderKL
+ - local: api/models/asymmetricautoencoderkl
+ title: AsymmetricAutoencoderKL
+ - local: api/models/autoencoder_tiny
+ title: Tiny AutoEncoder
+ - local: api/models/consistency_decoder_vae
+ title: ConsistencyDecoderVAE
+ - local: api/models/transformer2d
+ title: Transformer2D
+ - local: api/models/transformer_temporal
+ title: Transformer Temporal
+ - local: api/models/prior_transformer
+ title: Prior Transformer
+ - local: api/models/controlnet
+ title: ControlNet
+ title: Models
+ - sections:
+ - local: api/pipelines/overview
+ title: Overview
+ - local: api/pipelines/amused
+ title: aMUSEd
+ - local: api/pipelines/animatediff
+ title: AnimateDiff
+ - local: api/pipelines/attend_and_excite
+ title: Attend-and-Excite
+ - local: api/pipelines/audioldm
+ title: AudioLDM
+ - local: api/pipelines/audioldm2
+ title: AudioLDM 2
+ - local: api/pipelines/auto_pipeline
+ title: AutoPipeline
+ - local: api/pipelines/blip_diffusion
+ title: BLIP-Diffusion
+ - local: api/pipelines/consistency_models
+ title: Consistency Models
+ - local: api/pipelines/controlnet
+ title: ControlNet
+ - local: api/pipelines/controlnet_sdxl
+ title: ControlNet with Stable Diffusion XL
+ - local: api/pipelines/dance_diffusion
+ title: Dance Diffusion
+ - local: api/pipelines/ddim
+ title: DDIM
+ - local: api/pipelines/ddpm
+ title: DDPM
+ - local: api/pipelines/deepfloyd_if
+ title: DeepFloyd IF
+ - local: api/pipelines/diffedit
+ title: DiffEdit
+ - local: api/pipelines/dit
+ title: DiT
+ - local: api/pipelines/i2vgenxl
+ title: I2VGen-XL
+ - local: api/pipelines/pix2pix
+ title: InstructPix2Pix
+ - local: api/pipelines/kandinsky
+ title: Kandinsky 2.1
+ - local: api/pipelines/kandinsky_v22
+ title: Kandinsky 2.2
+ - local: api/pipelines/kandinsky3
+ title: Kandinsky 3
+ - local: api/pipelines/latent_consistency_models
+ title: Latent Consistency Models
+ - local: api/pipelines/latent_diffusion
+ title: Latent Diffusion
+ - local: api/pipelines/panorama
+ title: MultiDiffusion
+ - local: api/pipelines/musicldm
+ title: MusicLDM
+ - local: api/pipelines/paint_by_example
+ title: Paint by Example
+ - local: api/pipelines/pia
+ title: Personalized Image Animator (PIA)
+ - local: api/pipelines/pixart
+ title: PixArt-α
+ - local: api/pipelines/self_attention_guidance
+ title: Self-Attention Guidance
+ - local: api/pipelines/semantic_stable_diffusion
+ title: Semantic Guidance
+ - local: api/pipelines/shap_e
+ title: Shap-E
+ - local: api/pipelines/stable_cascade
+ title: Stable Cascade
+ - sections:
+ - local: api/pipelines/stable_diffusion/overview
+ title: Overview
+ - local: api/pipelines/stable_diffusion/text2img
+ title: Text-to-image
+ - local: api/pipelines/stable_diffusion/img2img
+ title: Image-to-image
+ - local: api/pipelines/stable_diffusion/svd
+ title: Image-to-video
+ - local: api/pipelines/stable_diffusion/inpaint
+ title: Inpainting
+ - local: api/pipelines/stable_diffusion/depth2img
+ title: Depth-to-image
+ - local: api/pipelines/stable_diffusion/image_variation
+ title: Image variation
+ - local: api/pipelines/stable_diffusion/stable_diffusion_safe
+ title: Safe Stable Diffusion
+ - local: api/pipelines/stable_diffusion/stable_diffusion_2
+ title: Stable Diffusion 2
+ - local: api/pipelines/stable_diffusion/stable_diffusion_xl
+ title: Stable Diffusion XL
+ - local: api/pipelines/stable_diffusion/sdxl_turbo
+ title: SDXL Turbo
+ - local: api/pipelines/stable_diffusion/latent_upscale
+ title: Latent upscaler
+ - local: api/pipelines/stable_diffusion/upscale
+ title: Super-resolution
+ - local: api/pipelines/stable_diffusion/k_diffusion
+ title: K-Diffusion
+ - local: api/pipelines/stable_diffusion/ldm3d_diffusion
+ title: LDM3D Text-to-(RGB, Depth), Text-to-(RGB-pano, Depth-pano), LDM3D Upscaler
+ - local: api/pipelines/stable_diffusion/adapter
+ title: Stable Diffusion T2I-Adapter
+ - local: api/pipelines/stable_diffusion/gligen
+ title: GLIGEN (Grounded Language-to-Image Generation)
+ title: Stable Diffusion
+ - local: api/pipelines/stable_unclip
+ title: Stable unCLIP
+ - local: api/pipelines/text_to_video
+ title: Text-to-video
+ - local: api/pipelines/text_to_video_zero
+ title: Text2Video-Zero
+ - local: api/pipelines/unclip
+ title: unCLIP
+ - local: api/pipelines/unidiffuser
+ title: UniDiffuser
+ - local: api/pipelines/value_guided_sampling
+ title: Value-guided sampling
+ - local: api/pipelines/wuerstchen
+ title: Wuerstchen
+ title: Pipelines
+ - sections:
+ - local: api/schedulers/overview
+ title: Overview
+ - local: api/schedulers/cm_stochastic_iterative
+ title: CMStochasticIterativeScheduler
+ - local: api/schedulers/consistency_decoder
+ title: ConsistencyDecoderScheduler
+ - local: api/schedulers/ddim_inverse
+ title: DDIMInverseScheduler
+ - local: api/schedulers/ddim
+ title: DDIMScheduler
+ - local: api/schedulers/ddpm
+ title: DDPMScheduler
+ - local: api/schedulers/deis
+ title: DEISMultistepScheduler
+ - local: api/schedulers/multistep_dpm_solver_inverse
+ title: DPMSolverMultistepInverse
+ - local: api/schedulers/multistep_dpm_solver
+ title: DPMSolverMultistepScheduler
+ - local: api/schedulers/dpm_sde
+ title: DPMSolverSDEScheduler
+ - local: api/schedulers/singlestep_dpm_solver
+ title: DPMSolverSinglestepScheduler
+ - local: api/schedulers/euler_ancestral
+ title: EulerAncestralDiscreteScheduler
+ - local: api/schedulers/euler
+ title: EulerDiscreteScheduler
+ - local: api/schedulers/heun
+ title: HeunDiscreteScheduler
+ - local: api/schedulers/ipndm
+ title: IPNDMScheduler
+ - local: api/schedulers/stochastic_karras_ve
+ title: KarrasVeScheduler
+ - local: api/schedulers/dpm_discrete_ancestral
+ title: KDPM2AncestralDiscreteScheduler
+ - local: api/schedulers/dpm_discrete
+ title: KDPM2DiscreteScheduler
+ - local: api/schedulers/lcm
+ title: LCMScheduler
+ - local: api/schedulers/lms_discrete
+ title: LMSDiscreteScheduler
+ - local: api/schedulers/pndm
+ title: PNDMScheduler
+ - local: api/schedulers/repaint
+ title: RePaintScheduler
+ - local: api/schedulers/score_sde_ve
+ title: ScoreSdeVeScheduler
+ - local: api/schedulers/score_sde_vp
+ title: ScoreSdeVpScheduler
+ - local: api/schedulers/tcd
+ title: TCDScheduler
+ - local: api/schedulers/unipc
+ title: UniPCMultistepScheduler
+ - local: api/schedulers/vq_diffusion
+ title: VQDiffusionScheduler
+ title: Schedulers
+ - sections:
+ - local: api/internal_classes_overview
+ title: Overview
+ - local: api/attnprocessor
+ title: Attention Processor
+ - local: api/activations
+ title: Custom activation functions
+ - local: api/normalization
+ title: Custom normalization layers
+ - local: api/utilities
+ title: Utilities
+ - local: api/image_processor
+ title: VAE Image Processor
+ title: Internal classes
+ title: API
diff --git a/docs/source/en/api/activations.md b/docs/source/en/api/activations.md
new file mode 100644
index 0000000..3bef28a
--- /dev/null
+++ b/docs/source/en/api/activations.md
@@ -0,0 +1,27 @@
+
+
+# Activation functions
+
+Customized activation functions for supporting various models in 🤗 Diffusers.
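+
+As a minimal sketch (layer sizes are arbitrary), these activations can be used like any other `torch.nn.Module`:
+
+```python
+import torch
+from diffusers.models.activations import GEGLU
+
+# GEGLU projects to twice the output width and gates one half with GELU of the other
+act = GEGLU(dim_in=320, dim_out=1280)
+hidden_states = torch.randn(1, 77, 320)
+out = act(hidden_states)  # shape: (1, 77, 1280)
+```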
+
+## GELU
+
+[[autodoc]] models.activations.GELU
+
+## GEGLU
+
+[[autodoc]] models.activations.GEGLU
+
+## ApproximateGELU
+
+[[autodoc]] models.activations.ApproximateGELU
diff --git a/docs/source/en/api/attnprocessor.md b/docs/source/en/api/attnprocessor.md
new file mode 100644
index 0000000..ab89d4d
--- /dev/null
+++ b/docs/source/en/api/attnprocessor.md
@@ -0,0 +1,57 @@
+
+
+# Attention Processor
+
+An attention processor is a class for applying different types of attention mechanisms.
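+
+As a minimal sketch (the checkpoint id is illustrative), a processor can be swapped into a model with `set_attn_processor`:
+
+```python
+from diffusers import UNet2DConditionModel
+from diffusers.models.attention_processor import AttnProcessor2_0
+
+unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
+# replace every attention processor in the model with the PyTorch 2.0 scaled dot-product version
+unet.set_attn_processor(AttnProcessor2_0())
+```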
+
+## AttnProcessor
+[[autodoc]] models.attention_processor.AttnProcessor
+
+## AttnProcessor2_0
+[[autodoc]] models.attention_processor.AttnProcessor2_0
+
+## AttnAddedKVProcessor
+[[autodoc]] models.attention_processor.AttnAddedKVProcessor
+
+## AttnAddedKVProcessor2_0
+[[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0
+
+## CrossFrameAttnProcessor
+[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor
+
+## CustomDiffusionAttnProcessor
+[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor
+
+## CustomDiffusionAttnProcessor2_0
+[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0
+
+## CustomDiffusionXFormersAttnProcessor
+[[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor
+
+## FusedAttnProcessor2_0
+[[autodoc]] models.attention_processor.FusedAttnProcessor2_0
+
+## LoRAAttnAddedKVProcessor
+[[autodoc]] models.attention_processor.LoRAAttnAddedKVProcessor
+
+## LoRAXFormersAttnProcessor
+[[autodoc]] models.attention_processor.LoRAXFormersAttnProcessor
+
+## SlicedAttnProcessor
+[[autodoc]] models.attention_processor.SlicedAttnProcessor
+
+## SlicedAttnAddedKVProcessor
+[[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor
+
+## XFormersAttnProcessor
+[[autodoc]] models.attention_processor.XFormersAttnProcessor
diff --git a/docs/source/en/api/configuration.md b/docs/source/en/api/configuration.md
new file mode 100644
index 0000000..31d7023
--- /dev/null
+++ b/docs/source/en/api/configuration.md
@@ -0,0 +1,30 @@
+
+
+# Configuration
+
+Schedulers from [`~schedulers.scheduling_utils.SchedulerMixin`] and models from [`ModelMixin`] inherit from [`ConfigMixin`] which stores all the parameters that are passed to their respective `__init__` methods in a JSON-configuration file.
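+
+As an illustrative sketch (using the `google/ddpm-cat-256` scheduler from the quickstart), the configuration can be inspected, saved to disk, and used to recreate the object:
+
+```python
+from diffusers import DDPMScheduler
+
+scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
+# every argument passed to __init__ is recorded in the config
+print(scheduler.config.num_train_timesteps)
+
+# the config round-trips through a JSON file on disk
+scheduler.save_config("my-scheduler")
+config = DDPMScheduler.load_config("my-scheduler")
+scheduler = DDPMScheduler.from_config(config)
+```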
+
+
+
+To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`.
+
+
+
+## ConfigMixin
+
+[[autodoc]] ConfigMixin
+ - load_config
+ - from_config
+ - save_config
+ - to_json_file
+ - to_json_string
diff --git a/docs/source/en/api/image_processor.md b/docs/source/en/api/image_processor.md
new file mode 100644
index 0000000..e5aba85
--- /dev/null
+++ b/docs/source/en/api/image_processor.md
@@ -0,0 +1,27 @@
+
+
+# VAE Image Processor
+
+The [`VaeImageProcessor`] provides a unified API for [`StableDiffusionPipeline`]s to prepare image inputs for VAE encoding and post-processing outputs once they're decoded. This includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays.
+
+All pipelines with [`VaeImageProcessor`] accept PIL Image, PyTorch tensor, or NumPy arrays as image inputs and return outputs based on the `output_type` argument by the user. You can pass encoded image latents directly to the pipeline and return latents from the pipeline as a specific output with the `output_type` argument (for example `output_type="latent"`). This allows you to take the generated latents from one pipeline and pass them to another pipeline as input without leaving the latent space. It also makes it much easier to use multiple pipelines together by passing PyTorch tensors directly between different pipelines.
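+
+For example, here is a minimal sketch of that workflow, pairing a text-to-image pipeline with the latent upscaler (the checkpoint ids are illustrative):
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
+    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
+).to("cuda")
+
+prompt = "a photo of an astronaut riding a horse"
+# keep the output in latent space instead of decoding it to a PIL image
+latents = pipeline(prompt, output_type="latent").images
+# the second pipeline consumes the latents directly
+image = upscaler(prompt=prompt, image=latents, num_inference_steps=20, guidance_scale=0).images[0]
+```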
+
+## VaeImageProcessor
+
+[[autodoc]] image_processor.VaeImageProcessor
+
+## VaeImageProcessorLDM3D
+
+The [`VaeImageProcessorLDM3D`] accepts RGB and depth inputs and returns RGB and depth outputs.
+
+[[autodoc]] image_processor.VaeImageProcessorLDM3D
diff --git a/docs/source/en/api/internal_classes_overview.md b/docs/source/en/api/internal_classes_overview.md
new file mode 100644
index 0000000..38e8124
--- /dev/null
+++ b/docs/source/en/api/internal_classes_overview.md
@@ -0,0 +1,15 @@
+
+
+# Overview
+
+The APIs in this section are more experimental and prone to breaking changes. Most of them are used internally for development, but they may also be useful to you if you're interested in building a diffusion model with some custom parts or if you're interested in some of our helper utilities for working with 🤗 Diffusers.
diff --git a/docs/source/en/api/loaders/ip_adapter.md b/docs/source/en/api/loaders/ip_adapter.md
new file mode 100644
index 0000000..a10f30e
--- /dev/null
+++ b/docs/source/en/api/loaders/ip_adapter.md
@@ -0,0 +1,29 @@
+
+
+# IP-Adapter
+
+[IP-Adapter](https://hf.co/papers/2308.06721) is a lightweight adapter that enables prompting a diffusion model with an image. This method decouples the cross-attention layers of the image and text features. The image features are generated from an image encoder.
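+
+As a minimal sketch of what this looks like in practice (the adapter checkpoint follows the layout used in the loading guide; the reference image path is a placeholder):
+
+```python
+import torch
+from diffusers import AutoPipelineForText2Image
+from diffusers.utils import load_image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+# load the adapter weights and the matching image encoder
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+# control how strongly the image prompt steers generation
+pipeline.set_ip_adapter_scale(0.6)
+
+image_prompt = load_image("path/or/url/to/reference.png")  # placeholder reference image
+image = pipeline(prompt="a polar bear", ip_adapter_image=image_prompt).images[0]
+```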
+
+
+
+Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading](../../using-diffusers/loading_adapters#ip-adapter) guide, and you can see how to use it in the [usage](../../using-diffusers/ip_adapter) guide.
+
+
+
+## IPAdapterMixin
+
+[[autodoc]] loaders.ip_adapter.IPAdapterMixin
+
+## IPAdapterMaskProcessor
+
+[[autodoc]] image_processor.IPAdapterMaskProcessor
\ No newline at end of file
diff --git a/docs/source/en/api/loaders/lora.md b/docs/source/en/api/loaders/lora.md
new file mode 100644
index 0000000..3a4d21c
--- /dev/null
+++ b/docs/source/en/api/loaders/lora.md
@@ -0,0 +1,32 @@
+
+
+# LoRA
+
+LoRA is a fast and lightweight training method that inserts and trains a significantly smaller number of parameters instead of all the model parameters. This produces a smaller file (~100 MBs) and makes it easier to quickly train a model to learn a new concept. LoRA weights are typically loaded into the UNet, text encoder or both. There are two classes for loading LoRA weights:
+
+- [`LoraLoaderMixin`] provides functions for loading and unloading, fusing and unfusing, enabling and disabling, and more functions for managing LoRA weights. This class can be used with any model.
+- [`StableDiffusionXLLoraLoaderMixin`] is a [Stable Diffusion (SDXL)](../../api/pipelines/stable_diffusion/stable_diffusion_xl) version of the [`LoraLoaderMixin`] class for loading and saving LoRA weights. It can only be used with the SDXL model.
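+
+As a minimal sketch of the loading workflow (the LoRA path and file name below are placeholders for your own checkpoint):
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+# load LoRA weights into the UNet (and the text encoder, if the checkpoint contains them)
+pipeline.load_lora_weights("path/to/lora", weight_name="pytorch_lora_weights.safetensors")
+# optionally merge the LoRA weights into the base model for faster inference
+pipeline.fuse_lora()
+```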
+
+
+
+To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.
+
+
+
+## LoraLoaderMixin
+
+[[autodoc]] loaders.lora.LoraLoaderMixin
+
+## StableDiffusionXLLoraLoaderMixin
+
+[[autodoc]] loaders.lora.StableDiffusionXLLoraLoaderMixin
\ No newline at end of file
diff --git a/docs/source/en/api/loaders/peft.md b/docs/source/en/api/loaders/peft.md
new file mode 100644
index 0000000..ecb82c4
--- /dev/null
+++ b/docs/source/en/api/loaders/peft.md
@@ -0,0 +1,25 @@
+
+
+# PEFT
+
+Diffusers supports loading adapters such as [LoRA](../../using-diffusers/loading_adapters) with the [PEFT](https://huggingface.co/docs/peft/index) library with the [`~loaders.peft.PeftAdapterMixin`] class. This allows modeling classes in Diffusers like [`UNet2DConditionModel`] to load an adapter.
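+
+As a minimal sketch (the adapter name and LoRA hyperparameters are illustrative), a fresh LoRA adapter can be attached to a model with PEFT's `LoraConfig`:
+
+```python
+from diffusers import UNet2DConditionModel
+from peft import LoraConfig
+
+unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
+# target the attention projection layers of the UNet
+lora_config = LoraConfig(
+    r=4,
+    lora_alpha=4,
+    init_lora_weights="gaussian",
+    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
+)
+unet.add_adapter(lora_config, adapter_name="my_adapter")
+```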
+
+
+
+Refer to the [Inference with PEFT](../../tutorials/using_peft_for_inference.md) tutorial for an overview of how to use PEFT in Diffusers for inference.
+
+
+
+## PeftAdapterMixin
+
+[[autodoc]] loaders.peft.PeftAdapterMixin
diff --git a/docs/source/en/api/loaders/single_file.md b/docs/source/en/api/loaders/single_file.md
new file mode 100644
index 0000000..359ed60
--- /dev/null
+++ b/docs/source/en/api/loaders/single_file.md
@@ -0,0 +1,37 @@
+
+
+# Single files
+
+Diffusers supports loading pretrained pipeline (or model) weights stored in a single file, such as a `ckpt` or `safetensors` file. These single file types are typically produced from community trained models. There are three classes for loading single file weights:
+
+- [`FromSingleFileMixin`] supports loading pretrained pipeline weights stored in a single file, which can either be a `ckpt` or `safetensors` file.
+- [`FromOriginalVAEMixin`] supports loading a pretrained [`AutoencoderKL`] from pretrained VAE weights stored in a single file, which can either be a `ckpt` or `safetensors` file.
+- [`FromOriginalControlnetMixin`] supports loading pretrained ControlNet weights stored in a single file, which can either be a `ckpt` or `safetensors` file.
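+
+As a minimal sketch (the file path is a placeholder for a locally downloaded checkpoint):
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+
+# load an entire pipeline from a single original-format checkpoint file
+pipeline = StableDiffusionPipeline.from_single_file(
+    "path/to/model.safetensors", torch_dtype=torch.float16
+).to("cuda")
+```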
+
+
+
+To learn more about how to load single file weights, see the [Load different Stable Diffusion formats](../../using-diffusers/other-formats) loading guide.
+
+
+
+## FromSingleFileMixin
+
+[[autodoc]] loaders.single_file.FromSingleFileMixin
+
+## FromOriginalVAEMixin
+
+[[autodoc]] loaders.autoencoder.FromOriginalVAEMixin
+
+## FromOriginalControlnetMixin
+
+[[autodoc]] loaders.controlnet.FromOriginalControlNetMixin
\ No newline at end of file
diff --git a/docs/source/en/api/loaders/textual_inversion.md b/docs/source/en/api/loaders/textual_inversion.md
new file mode 100644
index 0000000..c900e22
--- /dev/null
+++ b/docs/source/en/api/loaders/textual_inversion.md
@@ -0,0 +1,27 @@
+
+
+# Textual Inversion
+
+Textual Inversion is a training method for personalizing models by learning new text embeddings from a few example images. The file produced from training is extremely small (a few KBs) and the new embeddings can be loaded into the text encoder.
+
+[`TextualInversionLoaderMixin`] provides a function for loading Textual Inversion embeddings from Diffusers and Automatic1111 into the text encoder and loading a special token to activate the embeddings.
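+
+As a minimal sketch (using the publicly available `sd-concepts-library/cat-toy` embedding as an example):
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+# load the learned embedding; its activation token is registered with the tokenizer
+pipeline.load_textual_inversion("sd-concepts-library/cat-toy")
+image = pipeline("A <cat-toy> sitting on a park bench").images[0]
+```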
+
+
+
+To learn more about how to load Textual Inversion embeddings, see the [Textual Inversion](../../using-diffusers/loading_adapters#textual-inversion) loading guide.
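+
+A minimal sketch (the embedding path and trigger token below are placeholders):
+
+```python
+from diffusers import StableDiffusionPipeline
+import torch
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+# load the learned embedding and register its trigger token with the tokenizer
+pipeline.load_textual_inversion("path/to/embedding", token="<my-concept>")
+image = pipeline("a photo of <my-concept> on a beach").images[0]
+```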
+
+
+
+## TextualInversionLoaderMixin
+
+[[autodoc]] loaders.textual_inversion.TextualInversionLoaderMixin
\ No newline at end of file
diff --git a/docs/source/en/api/loaders/unet.md b/docs/source/en/api/loaders/unet.md
new file mode 100644
index 0000000..d8cfab6
--- /dev/null
+++ b/docs/source/en/api/loaders/unet.md
@@ -0,0 +1,27 @@
+
+
+# UNet
+
+Some training methods - like LoRA and Custom Diffusion - typically target the UNet's attention layers, but these training methods can also target other non-attention layers. Instead of training all of a model's parameters, only a subset of the parameters are trained, which is faster and more efficient. The class described here is useful if you're *only* loading weights into a UNet. If you need to load weights into the text encoder or both a text encoder and UNet, use the [`~loaders.LoraLoaderMixin.load_lora_weights`] function instead.
+
+The [`UNet2DConditionLoadersMixin`] class provides functions for loading and saving weights, fusing and unfusing LoRAs, disabling and enabling LoRAs, and setting and deleting adapters.
+
+
+
+To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.
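+
+For instance, a LoRA that only targets the UNet can be loaded with `load_attn_procs()` (the path below is a placeholder):
+
+```python
+from diffusers import StableDiffusionPipeline
+import torch
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+# load LoRA weights into the UNet only
+pipeline.unet.load_attn_procs("path/to/unet-lora")
+image = pipeline("a cup of coffee on a wooden table").images[0]
+```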
+
+
+
+## UNet2DConditionLoadersMixin
+
+[[autodoc]] loaders.unet.UNet2DConditionLoadersMixin
\ No newline at end of file
diff --git a/docs/source/en/api/logging.md b/docs/source/en/api/logging.md
new file mode 100644
index 0000000..1b21964
--- /dev/null
+++ b/docs/source/en/api/logging.md
@@ -0,0 +1,96 @@
+
+
+# Logging
+
+๐ค Diffusers has a centralized logging system to easily manage the verbosity of the library. The default verbosity is set to `WARNING`.
+
+To change the verbosity level, use one of the direct setters. For instance, here is how to change the verbosity to the `INFO` level:
+
+```python
+import diffusers
+
+diffusers.logging.set_verbosity_info()
+```
+
+You can also use the environment variable `DIFFUSERS_VERBOSITY` to override the default verbosity. You can set it
+to one of the following: `debug`, `info`, `warning`, `error`, `critical`. For example:
+
+```bash
+DIFFUSERS_VERBOSITY=error ./myprogram.py
+```
+
+Additionally, some `warnings` can be disabled by setting the environment variable
+`DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like `1`. This disables any warning logged by
+[`logger.warning_advice`]. For example:
+
+```bash
+DIFFUSERS_NO_ADVISORY_WARNINGS=1 ./myprogram.py
+```
+
+Here is an example of how to use the same logger as the library in your own module or script:
+
+```python
+from diffusers.utils import logging
+
+logging.set_verbosity_info()
+logger = logging.get_logger("diffusers")
+logger.info("INFO")
+logger.warning("WARN")
+```
+
+
+All methods of the logging module are documented below. The main methods are
+[`logging.get_verbosity`] to get the current level of verbosity in the logger and
+[`logging.set_verbosity`] to set the verbosity to the level of your choice.
+
+In order from the least verbose to the most verbose:
+
+| Method | Integer value | Description |
+|----------------------------------------------------------:|--------------:|----------------------------------------------------:|
+| `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL` | 50 | only report the most critical errors |
+| `diffusers.logging.ERROR` | 40 | only report errors |
+| `diffusers.logging.WARNING` or `diffusers.logging.WARN` | 30 | only report errors and warnings (default) |
+| `diffusers.logging.INFO` | 20 | only report errors, warnings, and basic information |
+| `diffusers.logging.DEBUG` | 10 | report all information |
+
+By default, `tqdm` progress bars are displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] are used to enable or disable this behavior.
+
+## Base setters
+
+[[autodoc]] utils.logging.set_verbosity_error
+
+[[autodoc]] utils.logging.set_verbosity_warning
+
+[[autodoc]] utils.logging.set_verbosity_info
+
+[[autodoc]] utils.logging.set_verbosity_debug
+
+## Other functions
+
+[[autodoc]] utils.logging.get_verbosity
+
+[[autodoc]] utils.logging.set_verbosity
+
+[[autodoc]] utils.logging.get_logger
+
+[[autodoc]] utils.logging.enable_default_handler
+
+[[autodoc]] utils.logging.disable_default_handler
+
+[[autodoc]] utils.logging.enable_explicit_format
+
+[[autodoc]] utils.logging.reset_format
+
+[[autodoc]] utils.logging.enable_progress_bar
+
+[[autodoc]] utils.logging.disable_progress_bar
diff --git a/docs/source/en/api/normalization.md b/docs/source/en/api/normalization.md
new file mode 100644
index 0000000..ef4b694
--- /dev/null
+++ b/docs/source/en/api/normalization.md
@@ -0,0 +1,31 @@
+
+
+# Normalization layers
+
+Customized normalization layers for supporting various models in ๐ค Diffusers.
+
+## AdaLayerNorm
+
+[[autodoc]] models.normalization.AdaLayerNorm
+
+## AdaLayerNormZero
+
+[[autodoc]] models.normalization.AdaLayerNormZero
+
+## AdaLayerNormSingle
+
+[[autodoc]] models.normalization.AdaLayerNormSingle
+
+## AdaGroupNorm
+
+[[autodoc]] models.normalization.AdaGroupNorm
diff --git a/docs/source/en/api/outputs.md b/docs/source/en/api/outputs.md
new file mode 100644
index 0000000..7594448
--- /dev/null
+++ b/docs/source/en/api/outputs.md
@@ -0,0 +1,67 @@
+
+
+# Outputs
+
+All model outputs are subclasses of [`~utils.BaseOutput`], data structures containing all the information returned by the model. The outputs can also be used as tuples or dictionaries.
+
+For example:
+
+```python
+from diffusers import DDIMPipeline
+
+pipeline = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32")
+outputs = pipeline()
+```
+
+The `outputs` object is a [`~pipelines.ImagePipelineOutput`], which means it has an `images` attribute.
+
+You can access each attribute as you normally would or with a keyword lookup, and if that attribute is not returned by the model, you will get `None`:
+
+```python
+outputs.images
+outputs["images"]
+```
+
+When considering the `outputs` object as a tuple, it only considers the attributes that don't have `None` values.
+For instance, slicing it with `outputs[:1]` returns the tuple `(outputs.images,)`:
+
+```python
+outputs[:1]
+```
+
+
+
+To check a specific pipeline or model output, refer to its corresponding API documentation.
+
+
+
+## BaseOutput
+
+[[autodoc]] utils.BaseOutput
+ - to_tuple
+
+## ImagePipelineOutput
+
+[[autodoc]] pipelines.ImagePipelineOutput
+
+## FlaxImagePipelineOutput
+
+[[autodoc]] pipelines.pipeline_flax_utils.FlaxImagePipelineOutput
+
+## AudioPipelineOutput
+
+[[autodoc]] pipelines.AudioPipelineOutput
+
+## ImageTextPipelineOutput
+
+[[autodoc]] ImageTextPipelineOutput
diff --git a/docs/source/en/api/pipelines/amused.md b/docs/source/en/api/pipelines/amused.md
new file mode 100644
index 0000000..d25e20d
--- /dev/null
+++ b/docs/source/en/api/pipelines/amused.md
@@ -0,0 +1,48 @@
+
+
+# aMUSEd
+
+aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface.co/papers/2401.01808) by Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen.
+
+aMUSEd is a lightweight text-to-image model based on the [MUSE](https://arxiv.org/abs/2301.00704) architecture. aMUSEd is particularly useful in applications that require a lightweight and fast model, such as generating many images quickly at once.
+
+aMUSEd is a VQ-VAE token-based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with MUSE, it uses the smaller text encoder CLIP-L/14 instead of T5-XXL. Due to its small parameter count and few-forward-pass generation process, aMUSEd can generate many images quickly, particularly at larger batch sizes.
+
+The abstract from the paper is:
+
+*We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code. We also release checkpoints for two models which directly produce images at 256x256 and 512x512 resolutions.*
+
+| Model | Params |
+|-------|--------|
+| [amused-256](https://huggingface.co/amused/amused-256) | 603M |
+| [amused-512](https://huggingface.co/amused/amused-512) | 608M |
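+
+A rough usage sketch (the prompt and seed are arbitrary; `amused/amused-256` is the smaller checkpoint listed above):
+
+```python
+import torch
+from diffusers import AmusedPipeline
+
+pipe = AmusedPipeline.from_pretrained("amused/amused-256", torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+# a fixed seed makes the generation reproducible
+image = pipe(
+    "a cowboy cat riding a horse on mars",
+    generator=torch.Generator("cpu").manual_seed(0),
+).images[0]
+image.save("amused.png")
+```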
+
+## AmusedPipeline
+
+[[autodoc]] AmusedPipeline
+ - __call__
+ - all
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+
+[[autodoc]] AmusedImg2ImgPipeline
+ - __call__
+ - all
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+
+[[autodoc]] AmusedInpaintPipeline
+ - __call__
+ - all
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
\ No newline at end of file
diff --git a/docs/source/en/api/pipelines/animatediff.md b/docs/source/en/api/pipelines/animatediff.md
new file mode 100644
index 0000000..e3a8237
--- /dev/null
+++ b/docs/source/en/api/pipelines/animatediff.md
@@ -0,0 +1,510 @@
+
+
+# Text-to-Video Generation with AnimateDiff
+
+## Overview
+
+[AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) by Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai.
+
+The abstract of the paper is the following:
+
+*With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at [this https URL](https://animatediff.github.io/).*
+
+## Available Pipelines
+
+| Pipeline | Tasks | Demo |
+|---|---|:---:|
+| [AnimateDiffPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff.py) | *Text-to-Video Generation with AnimateDiff* |
+| [AnimateDiffVideoToVideoPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py) | *Video-to-Video Generation with AnimateDiff* |
+
+## Available checkpoints
+
+Motion Adapter checkpoints can be found under [guoyww](https://huggingface.co/guoyww/). These checkpoints are meant to work with any model based on Stable Diffusion 1.4/1.5.
+
+## Usage example
+
+### AnimateDiffPipeline
+
+AnimateDiff works with a MotionAdapter checkpoint and a Stable Diffusion model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames. These modules are applied after the ResNet and attention blocks in the Stable Diffusion UNet.
+
+The following example demonstrates how to use a *MotionAdapter* checkpoint with Diffusers for inference based on Stable Diffusion 1.4/1.5.
+
+```python
+import torch
+from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
+from diffusers.utils import export_to_gif
+
+# Load the motion adapter
+adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
+# load SD 1.5 based finetuned model
+model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
+pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
+scheduler = DDIMScheduler.from_pretrained(
+ model_id,
+ subfolder="scheduler",
+ clip_sample=False,
+ timestep_spacing="linspace",
+ beta_schedule="linear",
+ steps_offset=1,
+)
+pipe.scheduler = scheduler
+
+# enable memory savings
+pipe.enable_vae_slicing()
+pipe.enable_model_cpu_offload()
+
+output = pipe(
+ prompt=(
+ "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
+ "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
+ "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
+ "golden hour, coastal landscape, seaside scenery"
+ ),
+ negative_prompt="bad quality, worse quality",
+ num_frames=16,
+ guidance_scale=7.5,
+ num_inference_steps=25,
+ generator=torch.Generator("cpu").manual_seed(42),
+)
+frames = output.frames[0]
+export_to_gif(frames, "animation.gif")
+
+```
+
+Here are some sample outputs:
+
+*Sample output (GIF): masterpiece, bestquality, sunset.*
+
+
+AnimateDiff tends to work better with finetuned Stable Diffusion models. If you plan on using a scheduler that can clip samples, make sure to disable it by setting `clip_sample=False` in the scheduler as this can also have an adverse effect on generated samples. Additionally, the AnimateDiff checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting this to `linear`.
+
+
+
+### AnimateDiffVideoToVideoPipeline
+
+AnimateDiff can also be used to generate visually similar videos or enable style/character/background or other edits starting from an initial video, allowing you to seamlessly explore creative possibilities.
+
+```python
+import imageio
+import requests
+import torch
+from diffusers import AnimateDiffVideoToVideoPipeline, DDIMScheduler, MotionAdapter
+from diffusers.utils import export_to_gif
+from io import BytesIO
+from PIL import Image
+
+# Load the motion adapter
+adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
+# load SD 1.5 based finetuned model
+model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
+pipe = AnimateDiffVideoToVideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16).to("cuda")
+scheduler = DDIMScheduler.from_pretrained(
+ model_id,
+ subfolder="scheduler",
+ clip_sample=False,
+ timestep_spacing="linspace",
+ beta_schedule="linear",
+ steps_offset=1,
+)
+pipe.scheduler = scheduler
+
+# enable memory savings
+pipe.enable_vae_slicing()
+pipe.enable_model_cpu_offload()
+
+# helper function to load videos
+def load_video(file_path: str):
+ images = []
+
+ if file_path.startswith(('http://', 'https://')):
+ # If the file_path is a URL
+ response = requests.get(file_path)
+ response.raise_for_status()
+ content = BytesIO(response.content)
+ vid = imageio.get_reader(content)
+ else:
+ # Assuming it's a local file path
+ vid = imageio.get_reader(file_path)
+
+ for frame in vid:
+ pil_image = Image.fromarray(frame)
+ images.append(pil_image)
+
+ return images
+
+video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif")
+
+output = pipe(
+ video=video,
+ prompt="panda playing a guitar, on a boat, in the ocean, high quality",
+ negative_prompt="bad quality, worse quality",
+ guidance_scale=7.5,
+ num_inference_steps=25,
+ strength=0.5,
+ generator=torch.Generator("cpu").manual_seed(42),
+)
+frames = output.frames[0]
+export_to_gif(frames, "animation.gif")
+```
+
+Here are some sample outputs:
+
+| Source Video | Output Video |
+|---|---|
+| raccoon playing a guitar | panda playing a guitar |
+| closeup of margot robbie, fireworks in the background, high quality | closeup of tony stark, robert downey jr, fireworks |
+
+
+## Using Motion LoRAs
+
+Motion LoRAs are a collection of LoRAs that work with the `guoyww/animatediff-motion-adapter-v1-5-2` checkpoint. These LoRAs are responsible for adding specific types of motion to the animations.
+
+```python
+import torch
+from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
+from diffusers.utils import export_to_gif
+
+# Load the motion adapter
+adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
+# load SD 1.5 based finetuned model
+model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
+pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
+pipe.load_lora_weights(
+ "guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out"
+)
+
+scheduler = DDIMScheduler.from_pretrained(
+ model_id,
+ subfolder="scheduler",
+ clip_sample=False,
+ beta_schedule="linear",
+ timestep_spacing="linspace",
+ steps_offset=1,
+)
+pipe.scheduler = scheduler
+
+# enable memory savings
+pipe.enable_vae_slicing()
+pipe.enable_model_cpu_offload()
+
+output = pipe(
+ prompt=(
+ "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
+ "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
+ "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
+ "golden hour, coastal landscape, seaside scenery"
+ ),
+ negative_prompt="bad quality, worse quality",
+ num_frames=16,
+ guidance_scale=7.5,
+ num_inference_steps=25,
+ generator=torch.Generator("cpu").manual_seed(42),
+)
+frames = output.frames[0]
+export_to_gif(frames, "animation.gif")
+
+```
+
+
+
+## Using FreeInit
+
+[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://arxiv.org/abs/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu.
+
+FreeInit is an effective method that improves the temporal consistency and overall quality of videos generated with video diffusion models, without any additional training. It can be applied to AnimateDiff, ModelScope, VideoCrafter, and various other video generation models seamlessly at inference time, and works by iteratively refining the latent initialization noise. More details can be found in the paper.
+
+The following example demonstrates the usage of FreeInit.
+
+```python
+import torch
+from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
+from diffusers.utils import export_to_gif
+
+adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
+model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
+pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16).to("cuda")
+pipe.scheduler = DDIMScheduler.from_pretrained(
+ model_id,
+ subfolder="scheduler",
+ beta_schedule="linear",
+ clip_sample=False,
+ timestep_spacing="linspace",
+ steps_offset=1
+)
+
+# enable memory savings
+pipe.enable_vae_slicing()
+pipe.enable_vae_tiling()
+
+# enable FreeInit
+# Refer to the enable_free_init documentation for a full list of configurable parameters
+pipe.enable_free_init(method="butterworth", use_fast_sampling=True)
+
+# run inference
+output = pipe(
+ prompt="a panda playing a guitar, on a boat, in the ocean, high quality",
+ negative_prompt="bad quality, worse quality",
+ num_frames=16,
+ guidance_scale=7.5,
+ num_inference_steps=20,
+ generator=torch.Generator("cpu").manual_seed(666),
+)
+
+# disable FreeInit
+pipe.disable_free_init()
+
+frames = output.frames[0]
+export_to_gif(frames, "animation.gif")
+```
+
+
+
+FreeInit is not really free - the improved quality comes at the cost of extra computation. It requires sampling a few extra times depending on the `num_iters` parameter that is set when enabling it. Setting `use_fast_sampling=True` speeds up the process (at the cost of lower quality compared to `use_fast_sampling=False`, but still better results than vanilla video generation models).
+
+
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## Using AnimateLCM
+
+[AnimateLCM](https://animatelcm.github.io/) is a motion module checkpoint and an [LCM LoRA](https://huggingface.co/docs/diffusers/using-diffusers/inference_with_lcm_lora) that have been created using a consistency learning strategy that decouples the distillation of the image generation priors and the motion generation priors.
+
+```python
+import torch
+from diffusers import AnimateDiffPipeline, LCMScheduler, MotionAdapter
+from diffusers.utils import export_to_gif
+
+adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM")
+pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter)
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")
+
+pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="sd15_lora_beta.safetensors", adapter_name="lcm-lora")
+
+pipe.enable_vae_slicing()
+pipe.enable_model_cpu_offload()
+
+output = pipe(
+ prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution",
+ negative_prompt="bad quality, worse quality, low resolution",
+ num_frames=16,
+ guidance_scale=1.5,
+ num_inference_steps=6,
+ generator=torch.Generator("cpu").manual_seed(0),
+)
+frames = output.frames[0]
+export_to_gif(frames, "animatelcm.gif")
+```
+
+*Sample output (GIF): A space rocket, 4K.*
+
+
+AnimateLCM is also compatible with existing [Motion LoRAs](https://huggingface.co/collections/dn6/animatediff-motion-loras-654cb8ad732b9e3cf4d3c17e).
+
+```python
+import torch
+from diffusers import AnimateDiffPipeline, LCMScheduler, MotionAdapter
+from diffusers.utils import export_to_gif
+
+adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM")
+pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter)
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")
+
+pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="sd15_lora_beta.safetensors", adapter_name="lcm-lora")
+pipe.load_lora_weights("guoyww/animatediff-motion-lora-tilt-up", adapter_name="tilt-up")
+
+pipe.set_adapters(["lcm-lora", "tilt-up"], [1.0, 0.8])
+pipe.enable_vae_slicing()
+pipe.enable_model_cpu_offload()
+
+output = pipe(
+ prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution",
+ negative_prompt="bad quality, worse quality, low resolution",
+ num_frames=16,
+ guidance_scale=1.5,
+ num_inference_steps=6,
+ generator=torch.Generator("cpu").manual_seed(0),
+)
+frames = output.frames[0]
+export_to_gif(frames, "animatelcm-motion-lora.gif")
+```
+
+*Sample output (GIF): A space rocket, 4K.*
+
+
+
+## AnimateDiffPipeline
+
+[[autodoc]] AnimateDiffPipeline
+ - all
+ - __call__
+
+## AnimateDiffVideoToVideoPipeline
+
+[[autodoc]] AnimateDiffVideoToVideoPipeline
+ - all
+ - __call__
+
+## AnimateDiffPipelineOutput
+
+[[autodoc]] pipelines.animatediff.AnimateDiffPipelineOutput
diff --git a/docs/source/en/api/pipelines/attend_and_excite.md b/docs/source/en/api/pipelines/attend_and_excite.md
new file mode 100644
index 0000000..fd8dd95
--- /dev/null
+++ b/docs/source/en/api/pipelines/attend_and_excite.md
@@ -0,0 +1,37 @@
+
+
+# Attend-and-Excite
+
+Attend-and-Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over image generation.
+
+The abstract from the paper is:
+
+*Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.*
+
+You can find additional information about Attend-and-Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite).
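+
+As a rough sketch, the pipeline takes the indices of the prompt tokens whose attention should be strengthened (the indices below are illustrative; `get_indices()` can be used to look them up for a given prompt):
+
+```python
+import torch
+from diffusers import StableDiffusionAttendAndExcitePipeline
+
+pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
+).to("cuda")
+
+prompt = "a cat and a frog"
+
+# print the token/index mapping to pick the subject tokens ("cat" and "frog")
+print(pipe.get_indices(prompt))
+
+image = pipe(
+    prompt,
+    token_indices=[2, 5],
+    guidance_scale=7.5,
+    num_inference_steps=50,
+    max_iter_to_alter=25,
+).images[0]
+```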
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## StableDiffusionAttendAndExcitePipeline
+
+[[autodoc]] StableDiffusionAttendAndExcitePipeline
+ - all
+ - __call__
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/audioldm.md b/docs/source/en/api/pipelines/audioldm.md
new file mode 100644
index 0000000..95d41b9
--- /dev/null
+++ b/docs/source/en/api/pipelines/audioldm.md
@@ -0,0 +1,50 @@
+
+
+# AudioLDM
+
+AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
+is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
+latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
+sound effects, human speech and music.
+
+The abstract from the paper is:
+
+*Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at [this https URL](https://audioldm.github.io/).*
+
+The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM).
+
+## Tips
+
+When constructing a prompt, keep in mind:
+
+* Descriptive prompt inputs work best; you can use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific (for example, "water stream in a forest" instead of "stream").
+* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
+
+During inference:
+
+* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
+* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
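+
+Putting these together, a minimal sketch could look like the following (the checkpoint name is one of the published AudioLDM checkpoints; the prompt and step count are illustrative):
+
+```python
+import torch
+from scipy.io import wavfile
+from diffusers import AudioLDMPipeline
+
+pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16).to("cuda")
+
+prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
+audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
+
+# the pipeline returns a 16 kHz waveform as a NumPy array
+wavfile.write("techno.wav", rate=16000, data=audio)
+```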
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## AudioLDMPipeline
+[[autodoc]] AudioLDMPipeline
+ - all
+ - __call__
+
+## AudioPipelineOutput
+[[autodoc]] pipelines.AudioPipelineOutput
diff --git a/docs/source/en/api/pipelines/audioldm2.md b/docs/source/en/api/pipelines/audioldm2.md
new file mode 100644
index 0000000..b29bea9
--- /dev/null
+++ b/docs/source/en/api/pipelines/audioldm2.md
@@ -0,0 +1,78 @@
+
+
+# AudioLDM 2
+
+AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
+
+Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2 is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap) and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel). A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel) of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention conditioning, as in most other LDMs.
+
+The abstract of the paper is the following:
+
+*Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at [this https URL](https://audioldm.github.io/audioldm2).*
+
+This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2).
+
+## Tips
+
+### Choosing a checkpoint
+
+AudioLDM 2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio generation. The third checkpoint is trained exclusively on text-to-music generation.
+
+All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet.
+See the table below for details on the three checkpoints:
+
+| Checkpoint | Task | UNet Model Size | Total Model Size | Training Data / h |
+|-----------------------------------------------------------------|---------------|-----------------|------------------|-------------------|
+| [audioldm2](https://huggingface.co/cvssp/audioldm2) | Text-to-audio | 350M | 1.1B | 1150k |
+| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 750M | 1.5B | 1150k |
+| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 350M | 1.1B | 665k |
+
+### Constructing a prompt
+
+* Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream").
+* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
+* Using a **negative prompt** can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of "Low quality."
+
+### Controlling inference
+
+* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
+* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
+
+### Evaluating generated waveforms
+
+* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation.
+* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
+
+For a complete example that applies these tips to music generation, see this [example](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.example).
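+
+As a quick sketch (checkpoint, prompt, and generation settings below are illustrative):
+
+```python
+import torch
+from scipy.io import wavfile
+from diffusers import AudioLDM2Pipeline
+
+pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16).to("cuda")
+
+# generate three candidate waveforms; they are returned ranked from best to worst
+audios = pipe(
+    "The sound of a hammer hitting a wooden surface",
+    negative_prompt="Low quality.",
+    num_inference_steps=200,
+    audio_length_in_s=10.0,
+    num_waveforms_per_prompt=3,
+    generator=torch.Generator("cpu").manual_seed(0),
+).audios
+
+# keep the highest-ranked waveform (a 16 kHz NumPy array)
+wavfile.write("hammer.wav", rate=16000, data=audios[0])
+```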
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## AudioLDM2Pipeline
+[[autodoc]] AudioLDM2Pipeline
+ - all
+ - __call__
+
+## AudioLDM2ProjectionModel
+[[autodoc]] AudioLDM2ProjectionModel
+ - forward
+
+## AudioLDM2UNet2DConditionModel
+[[autodoc]] AudioLDM2UNet2DConditionModel
+ - forward
+
+## AudioPipelineOutput
+[[autodoc]] pipelines.AudioPipelineOutput
diff --git a/docs/source/en/api/pipelines/auto_pipeline.md b/docs/source/en/api/pipelines/auto_pipeline.md
new file mode 100644
index 0000000..ce1e18e
--- /dev/null
+++ b/docs/source/en/api/pipelines/auto_pipeline.md
@@ -0,0 +1,71 @@
+
+
+# AutoPipeline
+
+`AutoPipeline` is designed to:
+
+1. make it easy for you to load a checkpoint for a task without knowing the specific pipeline class to use
+2. use multiple pipelines in your workflow
+
+Based on the task, the `AutoPipeline` class automatically retrieves the relevant pipeline given the name or path to the pretrained weights with the `from_pretrained()` method.
+
+To seamlessly switch between tasks with the same checkpoint without reallocating additional memory, use the `from_pipe()` method to transfer the components from the original pipeline to the new one.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+).to("cuda")
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+image = pipeline(prompt, num_inference_steps=25).images[0]
+```
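+
+As a sketch of the second point, `from_pipe()` can then reuse those same components for another task without reallocating memory (continuing from the snippet above):
+
+```python
+from diffusers import AutoPipelineForImage2Image
+
+# reuse the already-loaded components instead of loading the checkpoint again
+pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline)
+
+# refine the text-to-image result with an image-to-image pass
+image = pipeline_img2img(prompt, image=image, strength=0.6).images[0]
+```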
+
+
+
+Check out the [AutoPipeline](../../tutorials/autopipeline) tutorial to learn how to use this API!
+
+
+
+`AutoPipeline` supports text-to-image, image-to-image, and inpainting for the following diffusion models:
+
+- [Stable Diffusion](./stable_diffusion/overview)
+- [ControlNet](./controlnet)
+- [Stable Diffusion XL (SDXL)](./stable_diffusion/stable_diffusion_xl)
+- [DeepFloyd IF](./deepfloyd_if)
+- [Kandinsky 2.1](./kandinsky)
+- [Kandinsky 2.2](./kandinsky_v22)
+
+
+## AutoPipelineForText2Image
+
+[[autodoc]] AutoPipelineForText2Image
+ - all
+ - from_pretrained
+ - from_pipe
+
+## AutoPipelineForImage2Image
+
+[[autodoc]] AutoPipelineForImage2Image
+ - all
+ - from_pretrained
+ - from_pipe
+
+## AutoPipelineForInpainting
+
+[[autodoc]] AutoPipelineForInpainting
+ - all
+ - from_pretrained
+ - from_pipe
diff --git a/docs/source/en/api/pipelines/blip_diffusion.md b/docs/source/en/api/pipelines/blip_diffusion.md
new file mode 100644
index 0000000..ada47ca
--- /dev/null
+++ b/docs/source/en/api/pipelines/blip_diffusion.md
@@ -0,0 +1,41 @@
+
+
+# BLIP-Diffusion
+
+BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.
+
+
+The abstract from the paper is:
+
+*Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at [this https URL](https://dxli94.github.io/BLIP-Diffusion-website/).*
+
+The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP-Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization.
+
+`BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/).
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+
+## BlipDiffusionPipeline
+[[autodoc]] BlipDiffusionPipeline
+ - all
+ - __call__
+
+## BlipDiffusionControlNetPipeline
+[[autodoc]] BlipDiffusionControlNetPipeline
+ - all
+ - __call__
diff --git a/docs/source/en/api/pipelines/consistency_models.md b/docs/source/en/api/pipelines/consistency_models.md
new file mode 100644
index 0000000..680abaa
--- /dev/null
+++ b/docs/source/en/api/pipelines/consistency_models.md
@@ -0,0 +1,56 @@
+
+
+# Consistency Models
+
+Consistency Models were proposed in [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
+
+The abstract from the paper is:
+
+*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.*
+
+The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models), and additional checkpoints are available at [openai](https://huggingface.co/openai).
+
+The pipeline was contributed by [dg845](https://github.com/dg845) and [ayushtues](https://huggingface.co/ayushtues). โค๏ธ
+
+## Tips
+
+For an additional speed-up, use `torch.compile` to generate multiple images in <1 second:
+
+```diff
+ import torch
+ from diffusers import ConsistencyModelPipeline
+
+ device = "cuda"
+ # Load the cd_bedroom256_lpips checkpoint.
+ model_id_or_path = "openai/diffusers-cd_bedroom256_lpips"
+ pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
+ pipe.to(device)
+
++ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+ # Multistep sampling
+ # Timesteps can be explicitly specified; the particular timesteps below are from the original GitHub repo:
+ # https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83
+ for _ in range(10):
+ image = pipe(timesteps=[17, 0]).images[0]
+ image.show()
+```
+
+
+## ConsistencyModelPipeline
+[[autodoc]] ConsistencyModelPipeline
+ - all
+ - __call__
+
+## ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
diff --git a/docs/source/en/api/pipelines/controlnet.md b/docs/source/en/api/pipelines/controlnet.md
new file mode 100644
index 0000000..6b00902
--- /dev/null
+++ b/docs/source/en/api/pipelines/controlnet.md
@@ -0,0 +1,78 @@
+
+
+# ControlNet
+
+ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
+
+With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
+
+The abstract from the paper is:
+
+*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
+
+This model was contributed by [takuma104](https://huggingface.co/takuma104). โค๏ธ
+
+The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet), and you can find official ControlNet checkpoints on [lllyasviel's](https://huggingface.co/lllyasviel) Hub profile.
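+
+A minimal sketch of conditioning Stable Diffusion on a precomputed canny edge map (the control image path below is a placeholder):
+
+```python
+import torch
+from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
+from diffusers.utils import load_image
+
+# a canny edge map of the desired composition is used as the control image
+control_image = load_image("path/to/canny_edge_map.png")
+
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
+pipe = StableDiffusionControlNetPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
+)
+pipe.enable_model_cpu_offload()
+
+image = pipe("a futuristic city at night", image=control_image, num_inference_steps=30).images[0]
+```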
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## StableDiffusionControlNetPipeline
+[[autodoc]] StableDiffusionControlNetPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+ - load_textual_inversion
+
+## StableDiffusionControlNetImg2ImgPipeline
+[[autodoc]] StableDiffusionControlNetImg2ImgPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+ - load_textual_inversion
+
+## StableDiffusionControlNetInpaintPipeline
+[[autodoc]] StableDiffusionControlNetInpaintPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+ - load_textual_inversion
+
+## StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+
+## FlaxStableDiffusionControlNetPipeline
+[[autodoc]] FlaxStableDiffusionControlNetPipeline
+ - all
+ - __call__
+
+## FlaxStableDiffusionControlNetPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/controlnet_sdxl.md b/docs/source/en/api/pipelines/controlnet_sdxl.md
new file mode 100644
index 0000000..2de7cbf
--- /dev/null
+++ b/docs/source/en/api/pipelines/controlnet_sdxl.md
@@ -0,0 +1,55 @@
+
+
+# ControlNet with Stable Diffusion XL
+
+ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
+
+With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
+
+The abstract from the paper is:
+
+*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
+
+You can find additional smaller Stable Diffusion XL (SDXL) ControlNet checkpoints from the ๐ค [Diffusers](https://huggingface.co/diffusers) Hub organization, and browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) checkpoints on the Hub.
+
+
+
+๐งช Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve!
+
+
+
+If you don't see a checkpoint you're interested in, you can train your own SDXL ControlNet with our [training script](../../../../../examples/controlnet/README_sdxl).
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## StableDiffusionXLControlNetPipeline
+[[autodoc]] StableDiffusionXLControlNetPipeline
+ - all
+ - __call__
+
+## StableDiffusionXLControlNetImg2ImgPipeline
+[[autodoc]] StableDiffusionXLControlNetImg2ImgPipeline
+ - all
+ - __call__
+
+## StableDiffusionXLControlNetInpaintPipeline
+[[autodoc]] StableDiffusionXLControlNetInpaintPipeline
+ - all
+ - __call__
+
+## StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/dance_diffusion.md b/docs/source/en/api/pipelines/dance_diffusion.md
new file mode 100644
index 0000000..efba3c3
--- /dev/null
+++ b/docs/source/en/api/pipelines/dance_diffusion.md
@@ -0,0 +1,32 @@
+
+
+# Dance Diffusion
+
+[Dance Diffusion](https://github.com/Harmonai-org/sample-generator) is by Zach Evans.
+
+Dance Diffusion is the first in a suite of generative audio tools for producers and musicians released by [Harmonai](https://github.com/Harmonai-org).
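+
+A short usage sketch (the checkpoint is one of the Harmonai releases; the output length is arbitrary):
+
+```python
+from scipy.io import wavfile
+from diffusers import DanceDiffusionPipeline
+
+pipe = DanceDiffusionPipeline.from_pretrained("harmonai/maestro-150k").to("cuda")
+
+# generate roughly four seconds of audio
+audio = pipe(audio_length_in_s=4.0).audios[0]
+
+# the result is a (channels, samples) NumPy array at the model's sampling rate
+wavfile.write("dance.wav", rate=pipe.unet.config.sample_rate, data=audio.T)
+```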
+
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## DanceDiffusionPipeline
+[[autodoc]] DanceDiffusionPipeline
+ - all
+ - __call__
+
+## AudioPipelineOutput
+[[autodoc]] pipelines.AudioPipelineOutput
diff --git a/docs/source/en/api/pipelines/ddim.md b/docs/source/en/api/pipelines/ddim.md
new file mode 100644
index 0000000..6802da7
--- /dev/null
+++ b/docs/source/en/api/pipelines/ddim.md
@@ -0,0 +1,29 @@
+
+
+# DDIM
+
+[Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
+
+The abstract from the paper is:
+
+*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10ร to 50ร faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*
+
+The original codebase can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim).
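+
+A short sketch of sampling with the pipeline (the checkpoint and step count are illustrative; DDIM can reuse DDPM-trained weights and sample in far fewer steps):
+
+```python
+from diffusers import DDIMPipeline
+
+pipeline = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32").to("cuda")
+
+# eta=0.0 makes sampling deterministic for a fixed starting noise
+image = pipeline(num_inference_steps=50, eta=0.0).images[0]
+image.save("ddim_sample.png")
+```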
+
+## DDIMPipeline
+[[autodoc]] DDIMPipeline
+ - all
+ - __call__
+
+## ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
diff --git a/docs/source/en/api/pipelines/ddpm.md b/docs/source/en/api/pipelines/ddpm.md
new file mode 100644
index 0000000..81ddb5e
--- /dev/null
+++ b/docs/source/en/api/pipelines/ddpm.md
@@ -0,0 +1,35 @@
+
+
+# DDPM
+
+[Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion based model of the same name. In the ๐ค Diffusers library, DDPM refers to the *discrete denoising scheduler* from the paper as well as the pipeline.
+
+The abstract from the paper is:
+
+*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.*
+
+The original codebase can be found at [hojonathanho/diffusion](https://github.com/hojonathanho/diffusion).
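+
+A short usage sketch (the checkpoint is one of the original DDPM releases):
+
+```python
+from diffusers import DDPMPipeline
+
+pipeline = DDPMPipeline.from_pretrained("google/ddpm-cat-256").to("cuda")
+
+# DDPM runs the full ancestral sampling chain by default, so this is slower than DDIM
+image = pipeline().images[0]
+image.save("ddpm_cat.png")
+```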
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## DDPMPipeline
+[[autodoc]] DDPMPipeline
+ - all
+ - __call__
+
+## ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
diff --git a/docs/source/en/api/pipelines/deepfloyd_if.md b/docs/source/en/api/pipelines/deepfloyd_if.md
new file mode 100644
index 0000000..0044198
--- /dev/null
+++ b/docs/source/en/api/pipelines/deepfloyd_if.md
@@ -0,0 +1,506 @@
+
+
+# DeepFloyd IF
+
+## Overview
+
+DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding.
+The model is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules:
+- Stage 1: a base model that generates a 64x64 px image based on a text prompt,
+- Stage 2: a 64x64 px => 256x256 px super-resolution model, and
+- Stage 3: a 256x256 px => 1024x1024 px super-resolution model.
+Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling.
+Stage 3 is [Stability AI's x4 Upscaling model](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler).
+The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset.
+Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.
+
+## Usage
+
+Before you can use IF, you need to accept its usage conditions. To do so:
+1. Make sure to have a [Hugging Face account](https://huggingface.co/join) and be logged in.
+2. Accept the license on the model card of [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0). Accepting the license on the stage I model card will auto-accept the license for the other IF models.
+3. Make sure to log in locally. Install `huggingface_hub`:
+```sh
+pip install huggingface_hub --upgrade
+```
+
+run the login function in a Python shell:
+
+```py
+from huggingface_hub import login
+
+login()
+```
+
+and enter your [Hugging Face Hub access token](https://huggingface.co/docs/hub/security-tokens#what-are-user-access-tokens).
+
+Next we install `diffusers` and dependencies:
+
+```sh
+pip install -q diffusers accelerate transformers
+```
+
+The following sections give more detailed examples of how to use IF. Specifically:
+
+- [Text-to-Image Generation](#text-to-image-generation)
+- [Image-to-Image Generation](#text-guided-image-to-image-generation)
+- [Inpainting](#text-guided-inpainting-generation)
+- [Reusing model weights](#converting-between-different-pipelines)
+- [Speed optimization](#optimizing-for-speed)
+- [Memory optimization](#optimizing-for-memory)
+
+**Available checkpoints**
+- *Stage-1*
+ - [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0)
+ - [DeepFloyd/IF-I-L-v1.0](https://huggingface.co/DeepFloyd/IF-I-L-v1.0)
+ - [DeepFloyd/IF-I-M-v1.0](https://huggingface.co/DeepFloyd/IF-I-M-v1.0)
+
+- *Stage-2*
+ - [DeepFloyd/IF-II-L-v1.0](https://huggingface.co/DeepFloyd/IF-II-L-v1.0)
+ - [DeepFloyd/IF-II-M-v1.0](https://huggingface.co/DeepFloyd/IF-II-M-v1.0)
+
+- *Stage-3*
+ - [stabilityai/stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler)
+
+
+**Google Colab**
+[](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
+
+### Text-to-Image Generation
+
+By default, diffusers makes use of [model cpu offloading](../../optimization/memory#model-offloading) to run the whole IF pipeline with as little as 14 GB of VRAM.
+
+```python
+from diffusers import DiffusionPipeline
+from diffusers.utils import pt_to_pil, make_image_grid
+import torch
+
+# stage 1
+stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
+stage_1.enable_model_cpu_offload()
+
+# stage 2
+stage_2 = DiffusionPipeline.from_pretrained(
+ "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
+)
+stage_2.enable_model_cpu_offload()
+
+# stage 3
+safety_modules = {
+ "feature_extractor": stage_1.feature_extractor,
+ "safety_checker": stage_1.safety_checker,
+ "watermarker": stage_1.watermarker,
+}
+stage_3 = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
+)
+stage_3.enable_model_cpu_offload()
+
+prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
+generator = torch.manual_seed(1)
+
+# text embeds
+prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
+
+# stage 1
+stage_1_output = stage_1(
+ prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
+).images
+#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")
+
+# stage 2
+stage_2_output = stage_2(
+ image=stage_1_output,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_embeds,
+ generator=generator,
+ output_type="pt",
+).images
+#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")
+
+# stage 3
+stage_3_output = stage_3(prompt=prompt, image=stage_2_output, noise_level=100, generator=generator).images
+#stage_3_output[0].save("./if_stage_III.png")
+make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=3)
+```
+
+### Text Guided Image-to-Image Generation
+
+The same IF model weights can be used for text-guided image-to-image translation or image variation.
+In this case just make sure to load the weights using the [`IFImg2ImgPipeline`] and [`IFImg2ImgSuperResolutionPipeline`] pipelines.
+
+**Note**: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines
+without loading them twice by making use of the [`~DiffusionPipeline.components`] property as explained [here](#converting-between-different-pipelines).
+
+```python
+from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
+from diffusers.utils import pt_to_pil, load_image, make_image_grid
+import torch
+
+# download image
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+original_image = load_image(url)
+original_image = original_image.resize((768, 512))
+
+# stage 1
+stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
+stage_1.enable_model_cpu_offload()
+
+# stage 2
+stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained(
+ "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
+)
+stage_2.enable_model_cpu_offload()
+
+# stage 3
+safety_modules = {
+ "feature_extractor": stage_1.feature_extractor,
+ "safety_checker": stage_1.safety_checker,
+ "watermarker": stage_1.watermarker,
+}
+stage_3 = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
+)
+stage_3.enable_model_cpu_offload()
+
+prompt = "A fantasy landscape in style minecraft"
+generator = torch.manual_seed(1)
+
+# text embeds
+prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
+
+# stage 1
+stage_1_output = stage_1(
+ image=original_image,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_embeds,
+ generator=generator,
+ output_type="pt",
+).images
+#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")
+
+# stage 2
+stage_2_output = stage_2(
+ image=stage_1_output,
+ original_image=original_image,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_embeds,
+ generator=generator,
+ output_type="pt",
+).images
+#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")
+
+# stage 3
+stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images
+#stage_3_output[0].save("./if_stage_III.png")
+make_image_grid([original_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=4)
+```
+
+### Text Guided Inpainting Generation
+
+The same IF model weights can be used for text-guided inpainting.
+In this case just make sure to load the weights using the [`IFInpaintingPipeline`] and [`IFInpaintingSuperResolutionPipeline`] pipelines.
+
+**Note**: You can also directly move the weights of the text-to-image pipelines to the inpainting pipelines
+without loading them twice by making use of the [`~DiffusionPipeline.components`] property as explained [here](#converting-between-different-pipelines).
+
+```python
+from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
+from diffusers.utils import pt_to_pil, load_image, make_image_grid
+import torch
+
+# download image
+url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png"
+original_image = load_image(url)
+
+# download mask
+url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png"
+mask_image = load_image(url)
+
+# stage 1
+stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
+stage_1.enable_model_cpu_offload()
+
+# stage 2
+stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained(
+ "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
+)
+stage_2.enable_model_cpu_offload()
+
+# stage 3
+safety_modules = {
+ "feature_extractor": stage_1.feature_extractor,
+ "safety_checker": stage_1.safety_checker,
+ "watermarker": stage_1.watermarker,
+}
+stage_3 = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
+)
+stage_3.enable_model_cpu_offload()
+
+prompt = "blue sunglasses"
+generator = torch.manual_seed(1)
+
+# text embeds
+prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
+
+# stage 1
+stage_1_output = stage_1(
+ image=original_image,
+ mask_image=mask_image,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_embeds,
+ generator=generator,
+ output_type="pt",
+).images
+#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")
+
+# stage 2
+stage_2_output = stage_2(
+ image=stage_1_output,
+ original_image=original_image,
+ mask_image=mask_image,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_embeds,
+ generator=generator,
+ output_type="pt",
+).images
+#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")
+
+# stage 3
+stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images
+#stage_3_output[0].save("./if_stage_III.png")
+make_image_grid([original_image, mask_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=5)
+```
+
+### Converting between different pipelines
+
+In addition to being loaded with `from_pretrained`, pipelines can also be loaded directly from each other.
+
+```python
+from diffusers import IFPipeline, IFSuperResolutionPipeline
+
+pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
+pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")
+
+
+from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline
+
+pipe_1 = IFImg2ImgPipeline(**pipe_1.components)
+pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)
+
+
+from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline
+
+pipe_1 = IFInpaintingPipeline(**pipe_1.components)
+pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)
+```
+
+### Optimizing for speed
+
+The simplest optimization to run IF faster is to move all model components to the GPU.
+
+```py
+pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
+pipe.to("cuda")
+```
+
+You can also run the diffusion process for fewer timesteps.
+
+This can either be done with the `num_inference_steps` argument:
+
+```py
+pipe("", num_inference_steps=30)
+```
+
+Or with the `timesteps` argument:
+
+```py
+from diffusers.pipelines.deepfloyd_if import fast27_timesteps
+
+pipe("", timesteps=fast27_timesteps)
+```
+
+When doing image variation or inpainting, you can also decrease the number of timesteps
+with the `strength` argument. The `strength` argument is the amount of noise added to the input image, which also determines how many steps to run in the denoising process.
+A smaller number varies the image less but runs faster.
+
+```py
+pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = pipe(image=image, prompt="", strength=0.3).images
+```
+
+You can also use [`torch.compile`](../../optimization/torch2.0). Note that we have not exhaustively tested `torch.compile`
+with IF and it might not give expected results.
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
+pipe.to("cuda")
+
+pipe.text_encoder = torch.compile(pipe.text_encoder, mode="reduce-overhead", fullgraph=True)
+pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+### Optimizing for memory
+
+When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs.
+
+Either the model-based CPU offloading,
+
+```py
+pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+```
+
+or the more aggressive layer-based CPU offloading.
+
+```py
+pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
+pipe.enable_sequential_cpu_offload()
+```
+
+Additionally, the T5 text encoder can be loaded in 8-bit precision:
+
+```py
+from transformers import T5EncoderModel
+
+text_encoder = T5EncoderModel.from_pretrained(
+ "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
+)
+
+from diffusers import DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained(
+ "DeepFloyd/IF-I-XL-v1.0",
+ text_encoder=text_encoder, # pass the previously instantiated 8bit text encoder
+ unet=None,
+ device_map="auto",
+)
+
+prompt_embeds, negative_embeds = pipe.encode_prompt("")
+```
+
+For CPU RAM constrained machines like the free tier of Google Colab, where we can't load all model components to the CPU at once, we can manually load the pipeline with
+only the text encoder or the UNet when the respective model components are needed.
+
+```py
+from diffusers import DiffusionPipeline, IFPipeline, IFSuperResolutionPipeline
+import torch
+import gc
+from transformers import T5EncoderModel
+from diffusers.utils import pt_to_pil, make_image_grid
+
+text_encoder = T5EncoderModel.from_pretrained(
+ "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
+)
+
+# text to image
+pipe = DiffusionPipeline.from_pretrained(
+ "DeepFloyd/IF-I-XL-v1.0",
+ text_encoder=text_encoder, # pass the previously instantiated 8bit text encoder
+ unet=None,
+ device_map="auto",
+)
+
+prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
+prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
+
+# Remove the pipeline so we can re-load the pipeline with the unet
+del text_encoder
+del pipe
+gc.collect()
+torch.cuda.empty_cache()
+
+pipe = IFPipeline.from_pretrained(
+ "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
+)
+
+generator = torch.Generator().manual_seed(0)
+stage_1_output = pipe(
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_embeds,
+ output_type="pt",
+ generator=generator,
+).images
+
+#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")
+
+# Remove the pipeline so we can load the super-resolution pipeline
+del pipe
+gc.collect()
+torch.cuda.empty_cache()
+
+# First super resolution
+
+pipe = IFSuperResolutionPipeline.from_pretrained(
+ "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
+)
+
+generator = torch.Generator().manual_seed(0)
+stage_2_output = pipe(
+ image=stage_1_output,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_embeds,
+ output_type="pt",
+ generator=generator,
+).images
+
+#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")
+make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0]], rows=1, cols=2)
+```
+
+## Available Pipelines:
+
+| Pipeline | Tasks | Colab |
+|---|---|:---:|
+| [pipeline_if.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if.py) | *Text-to-Image Generation* | - |
+| [pipeline_if_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_superresolution.py) | *Text-to-Image Generation* | - |
+| [pipeline_if_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py) | *Image-to-Image Generation* | - |
+| [pipeline_if_img2img_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img_superresolution.py) | *Image-to-Image Generation* | - |
+| [pipeline_if_inpainting.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py) | *Image-to-Image Generation* | - |
+| [pipeline_if_inpainting_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting_superresolution.py) | *Image-to-Image Generation* | - |
+
+## IFPipeline
+[[autodoc]] IFPipeline
+ - all
+ - __call__
+
+## IFSuperResolutionPipeline
+[[autodoc]] IFSuperResolutionPipeline
+ - all
+ - __call__
+
+## IFImg2ImgPipeline
+[[autodoc]] IFImg2ImgPipeline
+ - all
+ - __call__
+
+## IFImg2ImgSuperResolutionPipeline
+[[autodoc]] IFImg2ImgSuperResolutionPipeline
+ - all
+ - __call__
+
+## IFInpaintingPipeline
+[[autodoc]] IFInpaintingPipeline
+ - all
+ - __call__
+
+## IFInpaintingSuperResolutionPipeline
+[[autodoc]] IFInpaintingSuperResolutionPipeline
+ - all
+ - __call__
diff --git a/docs/source/en/api/pipelines/diffedit.md b/docs/source/en/api/pipelines/diffedit.md
new file mode 100644
index 0000000..97cbdcb
--- /dev/null
+++ b/docs/source/en/api/pipelines/diffedit.md
@@ -0,0 +1,55 @@
+
+
+# DiffEdit
+
+[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.
+
+The abstract from the paper is:
+
+*Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.*
+
+The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).
+
+This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️
+
+## Tips
+
+* The pipeline can generate masks that can be fed into other inpainting pipelines.
+* In order to generate an image using this pipeline, both an image mask (generated from manually specified or generated source and target prompts passed to [`~StableDiffusionDiffEditPipeline.generate_mask`])
+and a set of partially inverted latents (generated using [`~StableDiffusionDiffEditPipeline.invert`]) _must_ be provided as arguments when calling the pipeline to generate the final edited image (see the sketch after these tips).
+* The [`~StableDiffusionDiffEditPipeline.generate_mask`] function exposes two prompt arguments, `source_prompt` and `target_prompt`,
+that let you control the locations of the semantic edits in the final image. Let's say
+you want to translate from "cat" to "dog". In this case, the edit direction is "cat -> dog". To reflect
+this in the generated mask, you simply have to set the embeddings related to the phrases including "cat" to
+`source_prompt` and "dog" to `target_prompt`.
+* When generating partially inverted latents using `invert`, assign a caption or text embedding describing the
+overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the
+source concept is sufficiently descriptive to yield good results, but feel free to explore alternatives.
+* When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt`
+and the target concept to `prompt`. Taking the above example, you simply have to set the embeddings related to
+the phrases including "cat" to `negative_prompt` and "dog" to `prompt`.
+* If you want to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
+ * Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
+ * Change the input prompt in [`~StableDiffusionDiffEditPipeline.invert`] to include "dog".
+ * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image.
+* The source and target prompts, or their corresponding embeddings, can also be automatically generated. Please refer to the [DiffEdit](../../using-diffusers/diffedit) guide for more details.
+
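+Putting these tips together, here is a minimal, hedged sketch of the mask-generation, inversion, and editing loop; the checkpoint choice and the local image path are illustrative assumptions:
+
+```python
+import torch
+from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
+from diffusers.utils import load_image
+
+# illustrative checkpoint -- other Stable Diffusion checkpoints should also work
+pipeline = StableDiffusionDiffEditPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
+pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
+pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
+pipeline.enable_model_cpu_offload()
+
+# placeholder path -- substitute your own input image
+raw_image = load_image("path/to/fruit_bowl.png").resize((768, 768))
+
+source_prompt = "a bowl of fruits"
+target_prompt = "a bowl of pears"
+
+# 1. contrast predictions for the source and target prompts to get an edit mask
+mask_image = pipeline.generate_mask(image=raw_image, source_prompt=source_prompt, target_prompt=target_prompt)
+
+# 2. partially invert the input image into latents, guided by a caption of the source image
+inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents
+
+# 3. generate the edited image: target concept as prompt, source concept as negative prompt
+image = pipeline(
+    prompt=target_prompt,
+    mask_image=mask_image,
+    image_latents=inv_latents,
+    negative_prompt=source_prompt,
+).images[0]
+```
+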
+## StableDiffusionDiffEditPipeline
+[[autodoc]] StableDiffusionDiffEditPipeline
+ - all
+ - generate_mask
+ - invert
+ - __call__
+
+## StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/dit.md b/docs/source/en/api/pipelines/dit.md
new file mode 100644
index 0000000..1d04458
--- /dev/null
+++ b/docs/source/en/api/pipelines/dit.md
@@ -0,0 +1,35 @@
+
+
+# DiT
+
+[Scalable Diffusion Models with Transformers](https://huggingface.co/papers/2212.09748) (DiT) is by William Peebles and Saining Xie.
+
+The abstract from the paper is:
+
+*We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.*
+
+The original codebase can be found at [facebookresearch/dit](https://github.com/facebookresearch/dit).
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
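+As a brief, hedged sketch of class-conditional sampling (assuming the `facebook/DiT-XL-2-256` checkpoint), human-readable labels can be mapped to ImageNet class ids with `get_label_ids`:
+
+```python
+import torch
+from diffusers import DiTPipeline, DPMSolverMultistepScheduler
+
+pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+pipe = pipe.to("cuda")
+
+# map human-readable ImageNet labels to class ids
+class_ids = pipe.get_label_ids(["white shark", "golden retriever"])
+
+generator = torch.manual_seed(33)
+images = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator).images
+```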
+
+## DiTPipeline
+[[autodoc]] DiTPipeline
+ - all
+ - __call__
+
+## ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
diff --git a/docs/source/en/api/pipelines/i2vgenxl.md b/docs/source/en/api/pipelines/i2vgenxl.md
new file mode 100644
index 0000000..cafffaa
--- /dev/null
+++ b/docs/source/en/api/pipelines/i2vgenxl.md
@@ -0,0 +1,57 @@
+
+
+# I2VGen-XL
+
+[I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://hf.co/papers/2311.04145.pdf) by Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou.
+
+The abstract from the paper is:
+
+*Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at [this https URL](https://i2vgen-xl.github.io/).*
+
+The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/).
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to learn more about reducing the memory usage of this pipeline, refer to the "Reduce memory usage" section [here](../../using-diffusers/svd#reduce-memory-usage).
+
+
+
+Sample output with I2VGenXL (caption: "library."):
+
+## Notes
+
+* I2VGenXL always uses a `clip_skip` value of 1. This means it leverages the penultimate layer representations from the text encoder of CLIP.
+* It can generate videos of a quality that is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD).
+* Unlike SVD, it additionally accepts text prompts as inputs.
+* It can generate higher resolution videos.
+* When using the [`DDIMScheduler`] (which is the default for this pipeline), fewer than 50 inference steps lead to bad results. A minimal usage sketch is shown below.
+
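+The following is a hedged image-to-video sketch; the `ali-vilab/i2vgen-xl` checkpoint and the local image path are assumptions for illustration:
+
+```python
+import torch
+from diffusers import I2VGenXLPipeline
+from diffusers.utils import export_to_gif, load_image
+
+pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+# placeholder path -- substitute your own conditioning image
+image = load_image("path/to/input_image.png").convert("RGB")
+prompt = "Papers were floating in the air on a table in the library"
+negative_prompt = "Distorted, discontinuous, ugly, blurry, low resolution, motionless, static"
+generator = torch.manual_seed(8888)
+
+frames = pipeline(
+    prompt=prompt,
+    image=image,
+    num_inference_steps=50,
+    negative_prompt=negative_prompt,
+    guidance_scale=9.0,
+    generator=generator,
+).frames[0]
+export_to_gif(frames, "i2v.gif")
+```
+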
+## I2VGenXLPipeline
+[[autodoc]] I2VGenXLPipeline
+ - all
+ - __call__
+
+## I2VGenXLPipelineOutput
+[[autodoc]] pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput
\ No newline at end of file
diff --git a/docs/source/en/api/pipelines/kandinsky.md b/docs/source/en/api/pipelines/kandinsky.md
new file mode 100644
index 0000000..9ea3cd4
--- /dev/null
+++ b/docs/source/en/api/pipelines/kandinsky.md
@@ -0,0 +1,73 @@
+
+
+# Kandinsky 2.1
+
+Kandinsky 2.1 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), and [Denis Dimitrov](https://github.com/denndimitrov).
+
+The description from its GitHub page is:
+
+*Kandinsky 2.1 inherits best practicies from Dall-E 2 and Latent diffusion, while introducing some new ideas. As text and image encoder it uses CLIP model and diffusion image prior (mapping) between latent spaces of CLIP modalities. This approach increases the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation.*
+
+The original codebase can be found at [ai-forever/Kandinsky-2](https://github.com/ai-forever/Kandinsky-2).
+
+
+
+Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.
+
+
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
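+As a quick, hedged sketch (assuming the `kandinsky-community/kandinsky-2-1` checkpoint), the combined pipeline can be loaded through `AutoPipelineForText2Image`, which wires the prior and decoder together:
+
+```python
+import torch
+from diffusers import AutoPipelineForText2Image
+
+pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+
+prompt = "A portrait of an astronaut riding a horse, 4k photo"
+image = pipe(prompt, num_inference_steps=25).images[0]
+```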
+
+## KandinskyPriorPipeline
+
+[[autodoc]] KandinskyPriorPipeline
+ - all
+ - __call__
+ - interpolate
+
+## KandinskyPipeline
+
+[[autodoc]] KandinskyPipeline
+ - all
+ - __call__
+
+## KandinskyCombinedPipeline
+
+[[autodoc]] KandinskyCombinedPipeline
+ - all
+ - __call__
+
+## KandinskyImg2ImgPipeline
+
+[[autodoc]] KandinskyImg2ImgPipeline
+ - all
+ - __call__
+
+## KandinskyImg2ImgCombinedPipeline
+
+[[autodoc]] KandinskyImg2ImgCombinedPipeline
+ - all
+ - __call__
+
+## KandinskyInpaintPipeline
+
+[[autodoc]] KandinskyInpaintPipeline
+ - all
+ - __call__
+
+## KandinskyInpaintCombinedPipeline
+
+[[autodoc]] KandinskyInpaintCombinedPipeline
+ - all
+ - __call__
diff --git a/docs/source/en/api/pipelines/kandinsky3.md b/docs/source/en/api/pipelines/kandinsky3.md
new file mode 100644
index 0000000..07c533e
--- /dev/null
+++ b/docs/source/en/api/pipelines/kandinsky3.md
@@ -0,0 +1,49 @@
+
+
+# Kandinsky 3
+
+Kandinsky 3 is created by [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Anastasia Maltseva](https://github.com/NastyaMittseva), [Igor Pavlov](https://github.com/boomb0om), [Andrei Filatov](https://github.com/anvilarth), [Arseniy Shakhmatov](https://github.com/cene555), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), [Denis Dimitrov](https://github.com/denndimitrov), and [Zein Shaheen](https://github.com/zeinsh).
+
+The description from its GitHub page is:
+
+*Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. In comparison to its predecessors, enhancements have been made to the text understanding and visual quality of the model, achieved by increasing the size of the text encoder and Diffusion U-Net models, respectively.*
+
+Its architecture includes 3 main components:
+1. [FLAN-UL2](https://huggingface.co/google/flan-ul2), an encoder-decoder model based on the T5 architecture.
+2. A new U-Net architecture featuring BigGAN-deep blocks, which doubles depth while maintaining the same number of parameters.
+3. Sber-MoVQGAN, a decoder proven to have superior results in image restoration.
+
+
+
+The original codebase can be found at [ai-forever/Kandinsky-3](https://github.com/ai-forever/Kandinsky-3).
+
+
+
+Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.
+
+
+
+
+
+Make sure to check out the schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
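+As a quick, hedged sketch (assuming the `kandinsky-community/kandinsky-3` checkpoint), text-to-image generation looks like this:
+
+```python
+import torch
+from diffusers import AutoPipelineForText2Image
+
+pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-3", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+
+prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats."
+image = pipe(prompt, num_inference_steps=25).images[0]
+```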
+
+## Kandinsky3Pipeline
+
+[[autodoc]] Kandinsky3Pipeline
+ - all
+ - __call__
+
+## Kandinsky3Img2ImgPipeline
+
+[[autodoc]] Kandinsky3Img2ImgPipeline
+ - all
+ - __call__
diff --git a/docs/source/en/api/pipelines/kandinsky_v22.md b/docs/source/en/api/pipelines/kandinsky_v22.md
new file mode 100644
index 0000000..13a6ca8
--- /dev/null
+++ b/docs/source/en/api/pipelines/kandinsky_v22.md
@@ -0,0 +1,92 @@
+
+
+# Kandinsky 2.2
+
+Kandinsky 2.2 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), and [Denis Dimitrov](https://github.com/denndimitrov).
+
+The description from its GitHub page is:
+
+*Kandinsky 2.2 brings substantial improvements upon its predecessor, Kandinsky 2.1, by introducing a new, more powerful image encoder - CLIP-ViT-G and the ControlNet support. The switch to CLIP-ViT-G as the image encoder significantly increases the model's capability to generate more aesthetic pictures and better understand text, thus enhancing the model's overall performance. The addition of the ControlNet mechanism allows the model to effectively control the process of generating images. This leads to more accurate and visually appealing outputs and opens new possibilities for text-guided image manipulation.*
+
+The original codebase can be found at [ai-forever/Kandinsky-2](https://github.com/ai-forever/Kandinsky-2).
+
+
+
+Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.
+
+
+
+
+
+Make sure to check out the schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
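+As a quick, hedged sketch (assuming the `kandinsky-community/kandinsky-2-2-decoder` checkpoint), the combined text-to-image pipeline can be loaded through `AutoPipelineForText2Image`:
+
+```python
+import torch
+from diffusers import AutoPipelineForText2Image
+
+pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+
+prompt = "A portrait of a robot reading a book, 4k photo"
+image = pipe(prompt, guidance_scale=4.0, height=768, width=768).images[0]
+```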
+
+## KandinskyV22PriorPipeline
+
+[[autodoc]] KandinskyV22PriorPipeline
+ - all
+ - __call__
+ - interpolate
+
+## KandinskyV22Pipeline
+
+[[autodoc]] KandinskyV22Pipeline
+ - all
+ - __call__
+
+## KandinskyV22CombinedPipeline
+
+[[autodoc]] KandinskyV22CombinedPipeline
+ - all
+ - __call__
+
+## KandinskyV22ControlnetPipeline
+
+[[autodoc]] KandinskyV22ControlnetPipeline
+ - all
+ - __call__
+
+## KandinskyV22PriorEmb2EmbPipeline
+
+[[autodoc]] KandinskyV22PriorEmb2EmbPipeline
+ - all
+ - __call__
+ - interpolate
+
+## KandinskyV22Img2ImgPipeline
+
+[[autodoc]] KandinskyV22Img2ImgPipeline
+ - all
+ - __call__
+
+## KandinskyV22Img2ImgCombinedPipeline
+
+[[autodoc]] KandinskyV22Img2ImgCombinedPipeline
+ - all
+ - __call__
+
+## KandinskyV22ControlnetImg2ImgPipeline
+
+[[autodoc]] KandinskyV22ControlnetImg2ImgPipeline
+ - all
+ - __call__
+
+## KandinskyV22InpaintPipeline
+
+[[autodoc]] KandinskyV22InpaintPipeline
+ - all
+ - __call__
+
+## KandinskyV22InpaintCombinedPipeline
+
+[[autodoc]] KandinskyV22InpaintCombinedPipeline
+ - all
+ - __call__
diff --git a/docs/source/en/api/pipelines/latent_consistency_models.md b/docs/source/en/api/pipelines/latent_consistency_models.md
new file mode 100644
index 0000000..4d94451
--- /dev/null
+++ b/docs/source/en/api/pipelines/latent_consistency_models.md
@@ -0,0 +1,52 @@
+
+
+# Latent Consistency Models
+
+Latent Consistency Models (LCMs) were proposed in [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://huggingface.co/papers/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao.
+
+The abstract of the paper is as follows:
+
+*Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: [this https URL](https://latent-consistency-models.github.io/).*
+
+A demo for the [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) checkpoint can be found [here](https://huggingface.co/spaces/SimianLuo/Latent_Consistency_Model).
+
+The pipelines were contributed by [luosiallen](https://luosiallen.github.io/), [nagolinc](https://github.com/nagolinc), and [dg845](https://github.com/dg845).
+
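+As a quick, hedged sketch (assuming the `SimianLuo/LCM_Dreamshaper_v7` checkpoint mentioned above), text-to-image generation with very few steps looks like this:
+
+```python
+import torch
+from diffusers import LatentConsistencyModelPipeline
+
+pipe = LatentConsistencyModelPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16)
+pipe.to("cuda")
+
+prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
+# LCMs are designed for very few inference steps; 4-8 is typical
+image = pipe(prompt, num_inference_steps=4, guidance_scale=8.0).images[0]
+```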
+
+## LatentConsistencyModelPipeline
+
+[[autodoc]] LatentConsistencyModelPipeline
+ - all
+ - __call__
+ - enable_freeu
+ - disable_freeu
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_vae_tiling
+ - disable_vae_tiling
+
+## LatentConsistencyModelImg2ImgPipeline
+
+[[autodoc]] LatentConsistencyModelImg2ImgPipeline
+ - all
+ - __call__
+ - enable_freeu
+ - disable_freeu
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_vae_tiling
+ - disable_vae_tiling
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/latent_diffusion.md b/docs/source/en/api/pipelines/latent_diffusion.md
new file mode 100644
index 0000000..ab50fae
--- /dev/null
+++ b/docs/source/en/api/pipelines/latent_diffusion.md
@@ -0,0 +1,40 @@
+
+
+# Latent Diffusion
+
+Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
+
+The abstract from the paper is:
+
+*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.*
+
+The original codebase can be found at [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion).
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
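+As a quick, hedged sketch (assuming the `CompVis/ldm-text2im-large-256` checkpoint), text-to-image generation looks like this:
+
+```python
+from diffusers import DiffusionPipeline
+
+# DiffusionPipeline dispatches to LDMTextToImagePipeline for this checkpoint
+ldm = DiffusionPipeline.from_pretrained("CompVis/ldm-text2im-large-256").to("cuda")
+
+prompt = "A painting of a squirrel eating a burger"
+image = ldm(prompt, num_inference_steps=50, eta=0.3, guidance_scale=6).images[0]
+```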
+
+## LDMTextToImagePipeline
+[[autodoc]] LDMTextToImagePipeline
+ - all
+ - __call__
+
+## LDMSuperResolutionPipeline
+[[autodoc]] LDMSuperResolutionPipeline
+ - all
+ - __call__
+
+## ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
diff --git a/docs/source/en/api/pipelines/musicldm.md b/docs/source/en/api/pipelines/musicldm.md
new file mode 100644
index 0000000..3ffb654
--- /dev/null
+++ b/docs/source/en/api/pipelines/musicldm.md
@@ -0,0 +1,52 @@
+
+
+# MusicLDM
+
+MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
+MusicLDM takes a text prompt as input and predicts the corresponding music sample.
+
+Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm),
+MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
+latents.
+
+MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies encourages the model to interpolate between the training samples, but stay within the domain of the training data. The result is generated music that is more diverse while staying faithful to the corresponding style.
+
+The abstract of the paper is the following:
+
+*Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embeddings space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.*
+
+This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi).
+
+## Tips
+
+When constructing a prompt, keep in mind:
+
+* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
+* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".
+
+During inference:
+
+* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
+* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
+* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument.
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
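+Putting the tips above together, here is a minimal, hedged text-to-music sketch (assuming the `ucsd-reach/musicldm` checkpoint):
+
+```python
+import torch
+from diffusers import MusicLDMPipeline
+
+pipe = MusicLDMPipeline.from_pretrained("ucsd-reach/musicldm", torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
+negative_prompt = "low quality, average quality"
+
+# higher num_inference_steps trades speed for quality; audio_length_in_s controls clip length
+audio = pipe(
+    prompt,
+    negative_prompt=negative_prompt,
+    num_inference_steps=200,
+    audio_length_in_s=10.0,
+).audios[0]
+```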
+
+## MusicLDMPipeline
+[[autodoc]] MusicLDMPipeline
+ - all
+ - __call__
diff --git a/docs/source/en/api/pipelines/overview.md b/docs/source/en/api/pipelines/overview.md
new file mode 100644
index 0000000..5849152
--- /dev/null
+++ b/docs/source/en/api/pipelines/overview.md
@@ -0,0 +1,105 @@
+
+
+# Pipelines
+
+Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different schedulers or even model components.
+
+All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components. Specific pipeline types (for example [`StableDiffusionPipeline`]) loaded with [`~DiffusionPipeline.from_pretrained`] are automatically detected and the pipeline components are loaded and passed to the `__init__` function of the pipeline.
+
+
+
+You shouldn't use the [`DiffusionPipeline`] class for training. Individual components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead.
+
+
+
+Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../../training/overview) guides instead!
+
+
+
+The table below lists all the pipelines currently available in 🤗 Diffusers and the tasks they support. Click on a pipeline to view its abstract and published paper.
+
+| Pipeline | Tasks |
+|---|---|
+| [AltDiffusion](alt_diffusion) | image2image |
+| [AnimateDiff](animatediff) | text2video |
+| [Attend-and-Excite](attend_and_excite) | text2image |
+| [Audio Diffusion](audio_diffusion) | image2audio |
+| [AudioLDM](audioldm) | text2audio |
+| [AudioLDM2](audioldm2) | text2audio |
+| [BLIP Diffusion](blip_diffusion) | text2image |
+| [Consistency Models](consistency_models) | unconditional image generation |
+| [ControlNet](controlnet) | text2image, image2image, inpainting |
+| [ControlNet with Stable Diffusion XL](controlnet_sdxl) | text2image |
+| [ControlNet-XS](controlnetxs) | text2image |
+| [ControlNet-XS with Stable Diffusion XL](controlnetxs_sdxl) | text2image |
+| [Cycle Diffusion](cycle_diffusion) | image2image |
+| [Dance Diffusion](dance_diffusion) | unconditional audio generation |
+| [DDIM](ddim) | unconditional image generation |
+| [DDPM](ddpm) | unconditional image generation |
+| [DeepFloyd IF](deepfloyd_if) | text2image, image2image, inpainting, super-resolution |
+| [DiffEdit](diffedit) | inpainting |
+| [DiT](dit) | text2image |
+| [GLIGEN](stable_diffusion/gligen) | text2image |
+| [InstructPix2Pix](pix2pix) | image editing |
+| [Kandinsky 2.1](kandinsky) | text2image, image2image, inpainting, interpolation |
+| [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting |
+| [Kandinsky 3](kandinsky3) | text2image, image2image |
+| [Latent Consistency Models](latent_consistency_models) | text2image |
+| [Latent Diffusion](latent_diffusion) | text2image, super-resolution |
+| [LDM3D](stable_diffusion/ldm3d_diffusion) | text2image, text-to-3D, text-to-pano, upscaling |
+| [MultiDiffusion](panorama) | text2image |
+| [MusicLDM](musicldm) | text2audio |
+| [Paint by Example](paint_by_example) | inpainting |
+| [ParaDiGMS](paradigms) | text2image |
+| [Pix2Pix Zero](pix2pix_zero) | image editing |
+| [PixArt-ฮฑ](pixart) | text2image |
+| [PNDM](pndm) | unconditional image generation |
+| [RePaint](repaint) | inpainting |
+| [Score SDE VE](score_sde_ve) | unconditional image generation |
+| [Self-Attention Guidance](self_attention_guidance) | text2image |
+| [Semantic Guidance](semantic_stable_diffusion) | text2image |
+| [Shap-E](shap_e) | text-to-3D, image-to-3D |
+| [Spectrogram Diffusion](spectrogram_diffusion) | |
+| [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution |
+| [Stable Diffusion Model Editing](model_editing) | model editing |
+| [Stable Diffusion XL](stable_diffusion/stable_diffusion_xl) | text2image, image2image, inpainting |
+| [Stable Diffusion XL Turbo](stable_diffusion/sdxl_turbo) | text2image, image2image, inpainting |
+| [Stable unCLIP](stable_unclip) | text2image, image variation |
+| [Stochastic Karras VE](stochastic_karras_ve) | unconditional image generation |
+| [T2I-Adapter](stable_diffusion/adapter) | text2image |
+| [Text2Video](text_to_video) | text2video, video2video |
+| [Text2Video-Zero](text_to_video_zero) | text2video |
+| [unCLIP](unclip) | text2image, image variation |
+| [Unconditional Latent Diffusion](latent_diffusion_uncond) | unconditional image generation |
+| [UniDiffuser](unidiffuser) | text2image, image2text, image variation, text variation, unconditional image generation, unconditional audio generation |
+| [Value-guided planning](value_guided_sampling) | value guided sampling |
+| [Versatile Diffusion](versatile_diffusion) | text2image, image variation |
+| [VQ Diffusion](vq_diffusion) | text2image |
+| [Wuerstchen](wuerstchen) | text2image |
+
+## DiffusionPipeline
+
+[[autodoc]] DiffusionPipeline
+ - all
+ - __call__
+ - device
+ - to
+ - components
+
+## FlaxDiffusionPipeline
+
+[[autodoc]] pipelines.pipeline_flax_utils.FlaxDiffusionPipeline
+
+## PushToHubMixin
+
+[[autodoc]] utils.PushToHubMixin
diff --git a/docs/source/en/api/pipelines/paint_by_example.md b/docs/source/en/api/pipelines/paint_by_example.md
new file mode 100644
index 0000000..effd608
--- /dev/null
+++ b/docs/source/en/api/pipelines/paint_by_example.md
@@ -0,0 +1,39 @@
+
+
+# Paint by Example
+
+[Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://huggingface.co/papers/2211.13227) is by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen.
+
+The abstract from the paper is:
+
+*Language-guided image editing has achieved great success recently. In this paper, for the first time, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.*
+
+The original codebase can be found at [Fantasy-Studio/Paint-by-Example](https://github.com/Fantasy-Studio/Paint-by-Example), and you can try it out in a [demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example).
+
+## Tips
+
+Paint by Example is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint is warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to inpaint partly masked images conditioned on example and reference images.
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
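+A minimal, hedged sketch of exemplar-guided inpainting with this checkpoint is shown below; the local image paths are placeholders:
+
+```python
+import torch
+from diffusers import PaintByExamplePipeline
+from diffusers.utils import load_image
+
+pipe = PaintByExamplePipeline.from_pretrained("Fantasy-Studio/Paint-by-Example", torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+# placeholder paths -- substitute your own source image, mask, and exemplar image
+init_image = load_image("path/to/source.png").resize((512, 512))
+mask_image = load_image("path/to/mask.png").resize((512, 512))
+example_image = load_image("path/to/exemplar.png").resize((512, 512))
+
+image = pipe(image=init_image, mask_image=mask_image, example_image=example_image).images[0]
+```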
+
+## PaintByExamplePipeline
+[[autodoc]] PaintByExamplePipeline
+ - all
+ - __call__
+
+## StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/panorama.md b/docs/source/en/api/pipelines/panorama.md
new file mode 100644
index 0000000..b34008a
--- /dev/null
+++ b/docs/source/en/api/pipelines/panorama.md
@@ -0,0 +1,50 @@
+
+
+# MultiDiffusion
+
+[MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://huggingface.co/papers/2302.08113) is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.
+
+The abstract from the paper is:
+
+*Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.*
+
+You can find additional information about MultiDiffusion on the [project page](https://multidiffusion.github.io/), [original codebase](https://github.com/omerbt/MultiDiffusion), and try it out in a [demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion).
+
+## Tips
+
+While calling [`StableDiffusionPanoramaPipeline`], it's possible to specify the `view_batch_size` parameter to be > 1.
+For some GPUs with high performance, this can speed up the generation process and increase VRAM usage.
+
+To generate panorama-like images, make sure you pass the `width` parameter accordingly. We recommend a `width` value of 2048, which is the default.
+
+Circular padding is applied to ensure there are no stitching artifacts when working with panoramas, providing a seamless transition from the rightmost part to the leftmost part. By enabling circular padding (set `circular_padding=True`), the operation applies additional crops after the rightmost point of the image, allowing the model to "see" the transition from the rightmost part to the leftmost part. This helps maintain visual consistency in a 360-degree sense and creates a proper "panorama" that can be viewed using 360-degree panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied to ensure that the decoded latents match in the RGB space.
+
+For example, without circular padding, there is a stitching artifact (default):
+
+
+But with circular padding, the right and the left parts are matching (`circular_padding=True`):
+
+
+
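+Putting the tips above together, here is a minimal, hedged sketch (assuming the `stabilityai/stable-diffusion-2-base` checkpoint) that generates a wide panorama with circular padding enabled:
+
+```python
+import torch
+from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler
+
+model_ckpt = "stabilityai/stable-diffusion-2-base"
+scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
+pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+prompt = "a photo of the dolomites"
+image = pipe(prompt, width=2048, circular_padding=True).images[0]
+```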
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## StableDiffusionPanoramaPipeline
+[[autodoc]] StableDiffusionPanoramaPipeline
+ - __call__
+ - all
+
+## StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/pia.md b/docs/source/en/api/pipelines/pia.md
new file mode 100644
index 0000000..8ba7825
--- /dev/null
+++ b/docs/source/en/api/pipelines/pia.md
@@ -0,0 +1,167 @@
+
+
+# Image-to-Video Generation with PIA (Personalized Image Animator)
+
+## Overview
+
+[PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models](https://arxiv.org/abs/2312.13964) by Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen
+
+Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation, empowering non-experts to generate stunning images with unique styles. While promising, adding realistic motions into these personalized images by text poses significant challenges in preserving distinct styles, high-fidelity details, and achieving motion controllability by text. In this paper, we present PIA, a Personalized Image Animator that excels in aligning with condition images, achieving motion controllability by text, and the compatibility with various personalized T2I models without specific tuning. To achieve these goals, PIA builds upon a base T2I model with well-trained temporal alignment layers, allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the introduction of the condition module, which utilizes the condition frame and inter-frame affinity as input to transfer appearance information guided by the affinity hint for individual frame synthesis in the latent space. This design mitigates the challenges of appearance-related image alignment within and allows for a stronger focus on aligning with motion-related guidance.
+
+[Project page](https://pi-animator.github.io/)
+
+## Available Pipelines
+
+| Pipeline | Tasks | Demo |
+|---|---|:---:|
+| [PIAPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pia/pipeline_pia.py) | *Image-to-Video Generation with PIA* | |
+
+## Available checkpoints
+
+Motion Adapter checkpoints for PIA can be found under the [OpenMMLab org](https://huggingface.co/openmmlab/PIA-condition-adapter). These checkpoints are meant to work with any model based on Stable Diffusion 1.5.
+
+## Usage example
+
+PIA works with a MotionAdapter checkpoint and a Stable Diffusion 1.5 model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames. These modules are applied after the ResNet and Attention blocks in the Stable Diffusion UNet. In addition to the motion modules, PIA also replaces the input convolution layer of the SD 1.5 UNet model with a 9-channel input convolution layer.
+
+The following example demonstrates how to use PIA to generate a video from a single image.
+
+```python
+import torch
+from diffusers import (
+ EulerDiscreteScheduler,
+ MotionAdapter,
+ PIAPipeline,
+)
+from diffusers.utils import export_to_gif, load_image
+
+adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
+pipe = PIAPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16)
+
+pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
+pipe.enable_model_cpu_offload()
+pipe.enable_vae_slicing()
+
+image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
+)
+image = image.resize((512, 512))
+prompt = "cat in a field"
+negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality"
+
+generator = torch.Generator("cpu").manual_seed(0)
+output = pipe(image=image, prompt=prompt, generator=generator)
+frames = output.frames[0]
+export_to_gif(frames, "pia-animation.gif")
+```
+
+Here is a sample output for the prompt "cat in a field":
+
+If you plan on using a scheduler that can clip samples, make sure to disable it by setting `clip_sample=False` in the scheduler as this can also have an adverse effect on generated samples. Additionally, the PIA checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting this to `linear`.
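+
+For example, a possible scheduler configuration (a sketch; `pipe` is assumed to be a loaded [`PIAPipeline`] as in the example above):
+
+```python
+from diffusers import DDIMScheduler
+
+# sketch: `pipe` is assumed to be a PIAPipeline loaded as in the example above
+pipe.scheduler = DDIMScheduler.from_config(
+    pipe.scheduler.config,
+    clip_sample=False,       # sample clipping can degrade PIA outputs
+    beta_schedule="linear",  # PIA checkpoints can be sensitive to the beta schedule
+)
+```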
+
+
+
+## Using FreeInit
+
+[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://arxiv.org/abs/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu.
+
+FreeInit is an effective method that improves temporal consistency and overall quality of videos generated using video diffusion models without any additional training. It can be applied to PIA, AnimateDiff, ModelScope, VideoCrafter, and various other video generation models seamlessly at inference time, and works by iteratively refining the latent initialization noise. More details can be found in the paper.
+
+The following example demonstrates the usage of FreeInit.
+
+```python
+import torch
+from diffusers import (
+ DDIMScheduler,
+ MotionAdapter,
+ PIAPipeline,
+)
+from diffusers.utils import export_to_gif, load_image
+
+adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
+pipe = PIAPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter)
+
+# enable FreeInit
+# Refer to the enable_free_init documentation for a full list of configurable parameters
+pipe.enable_free_init(method="butterworth", use_fast_sampling=True)
+
+# Memory saving options
+pipe.enable_model_cpu_offload()
+pipe.enable_vae_slicing()
+
+pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
+)
+image = image.resize((512, 512))
+prompt = "cat in a field"
+negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality"
+
+generator = torch.Generator("cpu").manual_seed(0)
+
+output = pipe(image=image, prompt=prompt, generator=generator)
+frames = output.frames[0]
+export_to_gif(frames, "pia-freeinit-animation.gif")
+```
+
+
+Here is a sample FreeInit output for the prompt "cat in a field":
+
+FreeInit is not really free: the improved quality comes at the cost of extra computation. It requires sampling a few extra times depending on the `num_iters` parameter that is set when enabling it. Setting `use_fast_sampling=True` improves overall runtime, at the cost of lower quality compared to `use_fast_sampling=False`, but the results are still better than those of vanilla video generation models.
+
+
+
+## PIAPipeline
+
+[[autodoc]] PIAPipeline
+ - all
+ - __call__
+ - enable_freeu
+ - disable_freeu
+ - enable_free_init
+ - disable_free_init
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_vae_tiling
+ - disable_vae_tiling
+
+## PIAPipelineOutput
+
+[[autodoc]] pipelines.pia.PIAPipelineOutput
\ No newline at end of file
diff --git a/docs/source/en/api/pipelines/pix2pix.md b/docs/source/en/api/pipelines/pix2pix.md
new file mode 100644
index 0000000..52767a9
--- /dev/null
+++ b/docs/source/en/api/pipelines/pix2pix.md
@@ -0,0 +1,40 @@
+
+
+# InstructPix2Pix
+
+[InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/papers/2211.09800) is by Tim Brooks, Aleksander Holynski and Alexei A. Efros.
+
+The abstract from the paper is:
+
+*We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.*
+
+You can find additional information about InstructPix2Pix on the [project page](https://www.timothybrooks.com/instruct-pix2pix), [original codebase](https://github.com/timothybrooks/instruct-pix2pix), and try it out in a [demo](https://huggingface.co/spaces/timbrooks/instruct-pix2pix).
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## StableDiffusionInstructPix2PixPipeline
+[[autodoc]] StableDiffusionInstructPix2PixPipeline
+ - __call__
+ - all
+ - load_textual_inversion
+ - load_lora_weights
+ - save_lora_weights
+
+## StableDiffusionXLInstructPix2PixPipeline
+[[autodoc]] StableDiffusionXLInstructPix2PixPipeline
+ - __call__
+ - all
diff --git a/docs/source/en/api/pipelines/pixart.md b/docs/source/en/api/pipelines/pixart.md
new file mode 100644
index 0000000..ef50b17
--- /dev/null
+++ b/docs/source/en/api/pipelines/pixart.md
@@ -0,0 +1,149 @@
+
+
+# PixArt-α
+
+
+
+[PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis](https://huggingface.co/papers/2310.00426) is by Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li.
+
+The abstract from the paper is:
+
+*The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.*
+
+You can find the original codebase at [PixArt-alpha/PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha) and all the available checkpoints at [PixArt-alpha](https://huggingface.co/PixArt-alpha).
+
+Some notes about this pipeline:
+
+* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](./dit).
+* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details.
+* It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-alpha/blob/08fbbd281ec96866109bdd2cdb75f2f58fb17610/diffusion/data/datasets/utils.py).
+* It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them.
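+
+A minimal text-to-image sketch with [`PixArtAlphaPipeline`], assuming the `PixArt-alpha/PixArt-XL-2-1024-MS` checkpoint used later on this page:
+
+```python
+import torch
+from diffusers import PixArtAlphaPipeline
+
+pipe = PixArtAlphaPipeline.from_pretrained(
+    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
+).to("cuda")
+
+image = pipe("A small cactus with a happy face in the Sahara desert.").images[0]
+image.save("cactus.png")
+```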
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## Inference with under 8GB GPU VRAM
+
+Run the [`PixArtAlphaPipeline`] with under 8GB GPU VRAM by loading the text encoder in 8-bit precision. Let's walk through a full-fledged example.
+
+First, install the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library:
+
+```bash
+pip install -U bitsandbytes
+```
+
+Then load the text encoder in 8-bit:
+
+```python
+from transformers import T5EncoderModel
+from diffusers import PixArtAlphaPipeline
+import torch
+
+text_encoder = T5EncoderModel.from_pretrained(
+    "PixArt-alpha/PixArt-XL-2-1024-MS",
+    subfolder="text_encoder",
+    load_in_8bit=True,
+    device_map="auto",
+)
+pipe = PixArtAlphaPipeline.from_pretrained(
+ "PixArt-alpha/PixArt-XL-2-1024-MS",
+ text_encoder=text_encoder,
+ transformer=None,
+ device_map="auto"
+)
+```
+
+Now, use the `pipe` to encode a prompt:
+
+```python
+with torch.no_grad():
+ prompt = "cute cat"
+ prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(prompt)
+```
+
+Since the text embeddings have been computed, remove the `text_encoder` and `pipe` from memory, and free up some GPU VRAM:
+
+```python
+import gc
+
+def flush():
+ gc.collect()
+ torch.cuda.empty_cache()
+
+del text_encoder
+del pipe
+flush()
+```
+
+Then compute the latents with the prompt embeddings as inputs:
+
+```python
+pipe = PixArtAlphaPipeline.from_pretrained(
+ "PixArt-alpha/PixArt-XL-2-1024-MS",
+ text_encoder=None,
+ torch_dtype=torch.float16,
+).to("cuda")
+
+latents = pipe(
+ negative_prompt=None,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_embeds,
+ prompt_attention_mask=prompt_attention_mask,
+ negative_prompt_attention_mask=negative_prompt_attention_mask,
+ num_images_per_prompt=1,
+ output_type="latent",
+).images
+
+del pipe.transformer
+flush()
+```
+
+
+
+Notice that while initializing `pipe`, you're setting `text_encoder` to `None` so that it's not loaded.
+
+
+
+Once the latents are computed, pass them to the VAE to decode into a real image:
+
+```python
+with torch.no_grad():
+ image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
+image = pipe.image_processor.postprocess(image, output_type="pil")[0]
+image.save("cat.png")
+```
+
+By deleting components you aren't using and flushing the GPU VRAM, you should be able to run [`PixArtAlphaPipeline`] with under 8GB GPU VRAM.
+
+
+
+If you want a report of your memory-usage, run this [script](https://gist.github.com/sayakpaul/3ae0f847001d342af27018a96f467e4e).
+
+
+
+Text embeddings computed in 8-bit can impact the quality of the generated images because of the information loss in the representation space caused by the reduced precision. It's recommended to compare the outputs with and without 8-bit.
+
+
+
+While loading the `text_encoder`, you set `load_in_8bit` to `True`. You could also specify `load_in_4bit` to bring your memory requirements down even further to under 7GB.
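+
+As a sketch, the 4-bit variant only changes how the text encoder is loaded (it still relies on bitsandbytes):
+
+```python
+from transformers import T5EncoderModel
+
+# sketch of the 4-bit variant; requires bitsandbytes to be installed
+text_encoder = T5EncoderModel.from_pretrained(
+    "PixArt-alpha/PixArt-XL-2-1024-MS",
+    subfolder="text_encoder",
+    load_in_4bit=True,
+    device_map="auto",
+)
+```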
+
+## PixArtAlphaPipeline
+
+[[autodoc]] PixArtAlphaPipeline
+ - all
+ - __call__
+
\ No newline at end of file
diff --git a/docs/source/en/api/pipelines/self_attention_guidance.md b/docs/source/en/api/pipelines/self_attention_guidance.md
new file mode 100644
index 0000000..e56aae2
--- /dev/null
+++ b/docs/source/en/api/pipelines/self_attention_guidance.md
@@ -0,0 +1,35 @@
+
+
+# Self-Attention Guidance
+
+[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://huggingface.co/papers/2210.00939) is by Susung Hong et al.
+
+The abstract from the paper is:
+
+*Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.*
+
+You can find additional information about Self-Attention Guidance on the [project page](https://ku-cvlab.github.io/Self-Attention-Guidance), [original codebase](https://github.com/KU-CVLAB/Self-Attention-Guidance), and try it out in a [demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) or [notebook](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
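+
+A minimal usage sketch, assuming a standard Stable Diffusion 1.5 checkpoint; `sag_scale` controls the strength of self-attention guidance (`0.0` disables it):
+
+```python
+import torch
+from diffusers import StableDiffusionSAGPipeline
+
+pipe = StableDiffusionSAGPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+# sag_scale adds self-attention guidance on top of the usual classifier-free guidance
+image = pipe(
+    "a photo of an astronaut riding a horse on mars",
+    guidance_scale=7.5,
+    sag_scale=0.75,
+).images[0]
+image.save("astronaut_sag.png")
+```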
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## StableDiffusionSAGPipeline
+[[autodoc]] StableDiffusionSAGPipeline
+ - __call__
+ - all
+
+## StableDiffusionOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/semantic_stable_diffusion.md b/docs/source/en/api/pipelines/semantic_stable_diffusion.md
new file mode 100644
index 0000000..7634363
--- /dev/null
+++ b/docs/source/en/api/pipelines/semantic_stable_diffusion.md
@@ -0,0 +1,35 @@
+
+
+# Semantic Guidance
+
+Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Text-to-Image Models using Semantic Guidance](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation.
+Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, while staying true to the original image composition.
+
+The abstract from the paper is:
+
+*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.*
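+
+A minimal editing sketch, assuming a standard Stable Diffusion 1.5 checkpoint; `editing_prompt` and the related `edit_*` parameters steer the generation toward or away from a concept:
+
+```python
+import torch
+from diffusers import SemanticStableDiffusionPipeline
+
+pipe = SemanticStableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+out = pipe(
+    prompt="a photo of the face of a woman",
+    guidance_scale=7,
+    editing_prompt=["smiling, smile"],   # concept to steer toward
+    reverse_editing_direction=[False],   # set to True to steer away from the concept
+    edit_guidance_scale=[4],             # strength of the semantic edit
+    edit_warmup_steps=[10],              # diffusion steps before the edit kicks in
+)
+image = out.images[0]
+```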
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## SemanticStableDiffusionPipeline
+[[autodoc]] SemanticStableDiffusionPipeline
+ - all
+ - __call__
+
+## StableDiffusionSafePipelineOutput
+[[autodoc]] pipelines.semantic_stable_diffusion.pipeline_output.SemanticStableDiffusionPipelineOutput
+ - all
diff --git a/docs/source/en/api/pipelines/shap_e.md b/docs/source/en/api/pipelines/shap_e.md
new file mode 100644
index 0000000..9f9155c
--- /dev/null
+++ b/docs/source/en/api/pipelines/shap_e.md
@@ -0,0 +1,37 @@
+
+
+# Shap-E
+
+The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewoo Jun from [OpenAI](https://github.com/openai).
+
+The abstract from the paper is:
+
+*We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.*
+
+The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e).
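+
+A minimal text-to-3D sketch with the `openai/shap-e` checkpoint; the rendered views can be stitched into a GIF:
+
+```python
+import torch
+from diffusers import ShapEPipeline
+from diffusers.utils import export_to_gif
+
+pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16).to("cuda")
+
+images = pipe(
+    "a firecracker",
+    guidance_scale=15.0,
+    num_inference_steps=64,
+    frame_size=256,  # resolution of each rendered view
+).images
+export_to_gif(images[0], "firecracker_3d.gif")
+```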
+
+
+
+See the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## ShapEPipeline
+[[autodoc]] ShapEPipeline
+ - all
+ - __call__
+
+## ShapEImg2ImgPipeline
+[[autodoc]] ShapEImg2ImgPipeline
+ - all
+ - __call__
+
+## ShapEPipelineOutput
+[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_cascade.md b/docs/source/en/api/pipelines/stable_cascade.md
new file mode 100644
index 0000000..37df68b
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_cascade.md
@@ -0,0 +1,88 @@
+
+
+# Stable Cascade
+
+This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture and its main
+difference to other models like Stable Diffusion is that it works in a much smaller latent space. Why is this
+important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
+How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being
+encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a
+1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the
+highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
+Diffusion 1.5.
+
+Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known extensions
+like finetuning, LoRA, ControlNet, IP-Adapter, LCM, etc. are possible with this method as well.
+
+The original codebase can be found at [Stability-AI/StableCascade](https://github.com/Stability-AI/StableCascade).
+
+## Model Overview
+Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images,
+hence the name "Stable Cascade".
+
+Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion.
+However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
+spatial compression factor of 8, encoding an image with resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves
+a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the
+image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible
+for generating the small 24 x 24 latents given a text prompt.
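+
+A minimal two-stage sketch (the prior produces the 24 x 24 image embeddings, which the decoder then turns into an image); the checkpoint names and the `bf16` variant are assumptions based on the official Stability AI releases:
+
+```python
+import torch
+from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline
+
+prompt = "an image of a shiba inu, donning a spacesuit and helmet"
+
+# assumption: official Stability AI checkpoints on the Hub
+prior = StableCascadePriorPipeline.from_pretrained(
+    "stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16
+)
+decoder = StableCascadeDecoderPipeline.from_pretrained(
+    "stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16
+)
+
+# Stage C: generate the highly compressed image embeddings from the prompt
+prior.enable_model_cpu_offload()
+prior_output = prior(
+    prompt=prompt,
+    height=1024,
+    width=1024,
+    guidance_scale=4.0,
+    num_inference_steps=20,
+)
+
+# Stages B and A: decode the embeddings back into a full-resolution image
+decoder.enable_model_cpu_offload()
+image = decoder(
+    image_embeddings=prior_output.image_embeddings.to(torch.float16),
+    prompt=prompt,
+    guidance_scale=0.0,
+    num_inference_steps=10,
+).images[0]
+image.save("cascade.png")
+```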
+
+## Uses
+
+### Direct Use
+
+The model is intended for research purposes for now. Possible research areas and tasks include
+
+- Research on generative models.
+- Safe deployment of models which have the potential to generate harmful content.
+- Probing and understanding the limitations and biases of generative models.
+- Generation of artworks and use in design and other artistic processes.
+- Applications in educational or creative tools.
+
+Excluded uses are described below.
+
+### Out-of-Scope Use
+
+The model was not trained to produce factual or true representations of people or events,
+so using the model to generate such content is out of scope for its abilities.
+The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).
+
+## Limitations and Bias
+
+### Limitations
+- Faces and people in general may not be generated properly.
+- The autoencoding part of the model is lossy.
+
+
+## StableCascadeCombinedPipeline
+
+[[autodoc]] StableCascadeCombinedPipeline
+ - all
+ - __call__
+
+## StableCascadePriorPipeline
+
+[[autodoc]] StableCascadePriorPipeline
+ - all
+ - __call__
+
+## StableCascadePriorPipelineOutput
+
+[[autodoc]] pipelines.stable_cascade.pipeline_stable_cascade_prior.StableCascadePriorPipelineOutput
+
+## StableCascadeDecoderPipeline
+
+[[autodoc]] StableCascadeDecoderPipeline
+ - all
+ - __call__
+
diff --git a/docs/source/en/api/pipelines/stable_diffusion/adapter.md b/docs/source/en/api/pipelines/stable_diffusion/adapter.md
new file mode 100644
index 0000000..aa38e3d
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/adapter.md
@@ -0,0 +1,259 @@
+
+
+# Text-to-Image Generation with Adapter Conditioning
+
+## Overview
+
+[T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.08453) by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie.
+
+Using the pretrained models, we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.
+
+The abstract of the paper is the following:
+
+*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.*
+
+This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❤️.
+
+## Available Pipelines:
+
+| Pipeline | Tasks | Demo
+|---|---|:---:|
+| [StableDiffusionAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning* | -
+| [StableDiffusionXLAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_xl_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning on StableDiffusion-XL* | -
+
+## Usage example with the base model of StableDiffusion-1.4/1.5
+
+In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
+All adapters use the same pipeline.
+
+ 1. Images are first converted into the appropriate *control image* format.
+ 2. The *control image* and *prompt* are passed to the [`StableDiffusionAdapterPipeline`].
+
+Let's have a look at a simple example using the [Color Adapter](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1).
+
+```python
+from diffusers.utils import load_image, make_image_grid
+
+image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png")
+```
+
+
+
+
+Then we can create our color palette by simply resizing it to 8 by 8 pixels and then scaling it back to its original size.
+
+```python
+from PIL import Image
+
+color_palette = image.resize((8, 8))
+color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)
+```
+
+Let's take a look at the processed image.
+
+
+
+
+Next, create the adapter pipeline
+
+```py
+import torch
+from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
+
+adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
+pipe = StableDiffusionAdapterPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ adapter=adapter,
+ torch_dtype=torch.float16,
+)
+pipe.to("cuda")
+```
+
+Finally, pass the prompt and control image to the pipeline
+
+```py
+# fix the random seed, so you will get the same result as the example
+generator = torch.Generator("cuda").manual_seed(7)
+
+out_image = pipe(
+ "At night, glowing cubes in front of the beach",
+ image=color_palette,
+ generator=generator,
+).images[0]
+make_image_grid([image, color_palette, out_image], rows=1, cols=3)
+```
+
+
+
+## Usage example with the base model of StableDiffusion-XL
+
+In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
+All adapters use the same pipeline.
+
+ 1. Images are first converted into the appropriate *control image* format.
+ 2. The *control image* and *prompt* are passed to the [`StableDiffusionXLAdapterPipeline`].
+
+Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0).
+
+```python
+from diffusers.utils import load_image, make_image_grid
+
+sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")
+```
+
+
+
+Then, create the adapter pipeline
+
+```py
+import torch
+from diffusers import (
+ T2IAdapter,
+ StableDiffusionXLAdapterPipeline,
+ DDPMScheduler
+)
+
+model_id = "stabilityai/stable-diffusion-xl-base-1.0"
+adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0", torch_dtype=torch.float16, adapter_type="full_adapter_xl")
+scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
+
+pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
+ model_id, adapter=adapter, safety_checker=None, torch_dtype=torch.float16, variant="fp16", scheduler=scheduler
+)
+
+pipe.to("cuda")
+```
+
+Finally, pass the prompt and control image to the pipeline
+
+```py
+# fix the random seed, so you will get the same result as the example
+generator = torch.Generator().manual_seed(42)
+
+sketch_image_out = pipe(
+ prompt="a photo of a dog in real world, high quality",
+ negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
+ image=sketch_image,
+ generator=generator,
+ guidance_scale=7.5
+).images[0]
+make_image_grid([sketch_image, sketch_image_out], rows=1, cols=2)
+```
+
+
+
+## Available checkpoints
+
+Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://huggingface.co/TencentARC/T2I-Adapter/tree/main/models).
+
+### T2I-Adapter with Stable Diffusion 1.4
+
+| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
+|---|---|---|---|
+|[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1) *Trained with spatial color palette* | An image with 8x8 color palette.|||
+|[TencentARC/t2iadapter_canny_sd14v1](https://huggingface.co/TencentARC/t2iadapter_canny_sd14v1) *Trained with canny edge detection* | A monochrome image with white edges on a black background.|||
+|[TencentARC/t2iadapter_sketch_sd14v1](https://huggingface.co/TencentARC/t2iadapter_sketch_sd14v1) *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|||
+|[TencentARC/t2iadapter_depth_sd14v1](https://huggingface.co/TencentARC/t2iadapter_depth_sd14v1) *Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.|||
+|[TencentARC/t2iadapter_openpose_sd14v1](https://huggingface.co/TencentARC/t2iadapter_openpose_sd14v1) *Trained with OpenPose bone image* | A [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|||
+|[TencentARC/t2iadapter_keypose_sd14v1](https://huggingface.co/TencentARC/t2iadapter_keypose_sd14v1) *Trained with mmpose skeleton image* | A [mmpose skeleton](https://github.com/open-mmlab/mmpose) image.|||
+|[TencentARC/t2iadapter_seg_sd14v1](https://huggingface.co/TencentARC/t2iadapter_seg_sd14v1) *Trained with semantic segmentation* | An [custom](https://github.com/TencentARC/T2I-Adapter/discussions/25) segmentation protocol image.|| |
+|[TencentARC/t2iadapter_canny_sd15v2](https://huggingface.co/TencentARC/t2iadapter_canny_sd15v2)||
+|[TencentARC/t2iadapter_depth_sd15v2](https://huggingface.co/TencentARC/t2iadapter_depth_sd15v2)||
+|[TencentARC/t2iadapter_sketch_sd15v2](https://huggingface.co/TencentARC/t2iadapter_sketch_sd15v2)||
+|[TencentARC/t2iadapter_zoedepth_sd15v1](https://huggingface.co/TencentARC/t2iadapter_zoedepth_sd15v1)||
+|[Adapter/t2iadapter, subfolder='sketch_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0)||
+|[Adapter/t2iadapter, subfolder='canny_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/canny_sdxl_1.0)||
+|[Adapter/t2iadapter, subfolder='openpose_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/openpose_sdxl_1.0)||
+
+## Combining multiple adapters
+
+[`MultiAdapter`] can be used for applying multiple conditionings at once.
+
+Here we use the keypose adapter for the character posture and the depth adapter for creating the scene.
+
+```py
+from diffusers.utils import load_image, make_image_grid
+
+cond_keypose = load_image(
+ "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"
+)
+cond_depth = load_image(
+ "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"
+)
+cond = [cond_keypose, cond_depth]
+
+prompt = ["A man walking in an office room with a nice view"]
+```
+
+The two control images look as such:
+
+
+
+
+
+`MultiAdapter` combines keypose and depth adapters.
+
+`adapter_conditioning_scale` balances the relative influence of the different adapters.
+
+```py
+import torch
+from diffusers import StableDiffusionAdapterPipeline, MultiAdapter, T2IAdapter
+
+adapters = MultiAdapter(
+ [
+ T2IAdapter.from_pretrained("TencentARC/t2iadapter_keypose_sd14v1"),
+ T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd14v1"),
+ ]
+)
+adapters = adapters.to(torch.float16)
+
+pipe = StableDiffusionAdapterPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ torch_dtype=torch.float16,
+ adapter=adapters,
+).to("cuda")
+
+image = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8]).images[0]
+make_image_grid([cond_keypose, cond_depth, image], rows=1, cols=3)
+```
+
+
+
+
+## T2I-Adapter vs ControlNet
+
+T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet).
+T2I-Adapter uses a smaller auxiliary network which is only run once for the entire diffusion process.
+However, T2I-Adapter performs slightly worse than ControlNet.
+
+## StableDiffusionAdapterPipeline
+[[autodoc]] StableDiffusionAdapterPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+
+## StableDiffusionXLAdapterPipeline
+[[autodoc]] StableDiffusionXLAdapterPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
diff --git a/docs/source/en/api/pipelines/stable_diffusion/depth2img.md b/docs/source/en/api/pipelines/stable_diffusion/depth2img.md
new file mode 100644
index 0000000..84dae80
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/depth2img.md
@@ -0,0 +1,40 @@
+
+
+# Depth-to-image
+
+The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure.
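+
+A minimal sketch with the `stabilityai/stable-diffusion-2-depth` checkpoint; if no `depth_map` is passed, depth is estimated from the input image with MiDaS:
+
+```python
+import torch
+from diffusers import StableDiffusionDepth2ImgPipeline
+from diffusers.utils import load_image
+
+pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
+).to("cuda")
+
+init_image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")
+image = pipe(
+    prompt="two tigers",
+    image=init_image,
+    negative_prompt="bad, deformed, ugly, bad anatomy",
+    strength=0.7,  # how much the initial image is altered
+).images[0]
+image.save("tigers.png")
+```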
+
+
+
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
+
+
+
+## StableDiffusionDepth2ImgPipeline
+
+[[autodoc]] StableDiffusionDepth2ImgPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+ - load_textual_inversion
+ - load_lora_weights
+ - save_lora_weights
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_diffusion/gligen.md b/docs/source/en/api/pipelines/stable_diffusion/gligen.md
new file mode 100644
index 0000000..c675444
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/gligen.md
@@ -0,0 +1,59 @@
+
+
+# GLIGEN (Grounded Language-to-Image Generation)
+
+The GLIGEN model was created by researchers and engineers from [University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN). The [`StableDiffusionGLIGENPipeline`] and [`StableDiffusionGLIGENTextImagePipeline`] can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes with [`StableDiffusionGLIGENPipeline`], if input images are given, [`StableDiffusionGLIGENTextImagePipeline`] can insert objects described by text at the region defined by bounding boxes. Otherwise, it'll generate an image described by the caption/prompt and insert objects described by text at the region defined by bounding boxes. It's trained on COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.
+
+The abstract from the [paper](https://huggingface.co/papers/2301.07093) is:
+
+*Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zeroshot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.*
+
+
+
+Make sure to check out the Stable Diffusion [Tips](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently!
+
+If you want to use one of the official checkpoints for a task, explore the [gligen](https://huggingface.co/gligen) Hub organization!
+
+
+
+[`StableDiffusionGLIGENPipeline`] was contributed by [Nikhil Gajendrakumar](https://github.com/nikhil-masterful) and [`StableDiffusionGLIGENTextImagePipeline`] was contributed by [Nguyễn Công Tú Anh](https://github.com/tuanh123789).
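+
+A minimal text-box grounding sketch; the checkpoint name is an assumption based on the GLIGEN Hub organization, and the boxes are normalized `xyxy` coordinates:
+
+```python
+import torch
+from diffusers import StableDiffusionGLIGENPipeline
+
+# assumption: a text-box grounded generation checkpoint from the GLIGEN Hub org
+pipe = StableDiffusionGLIGENPipeline.from_pretrained(
+    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
+).to("cuda")
+
+image = pipe(
+    prompt="a waterfall and a modern high speed train in a beautiful forest with fall foliage",
+    gligen_phrases=["a waterfall", "a modern high speed train"],  # entities to ground
+    gligen_boxes=[[0.1387, 0.2051, 0.4277, 0.7090], [0.4980, 0.4355, 0.8516, 0.7266]],  # normalized xyxy boxes
+    gligen_scheduled_sampling_beta=1,  # fraction of steps that use the grounding tokens
+    num_inference_steps=50,
+).images[0]
+image.save("gligen_text_box.png")
+```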
+
+## StableDiffusionGLIGENPipeline
+
+[[autodoc]] StableDiffusionGLIGENPipeline
+ - all
+ - __call__
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_vae_tiling
+ - disable_vae_tiling
+ - enable_model_cpu_offload
+ - prepare_latents
+ - enable_fuser
+
+## StableDiffusionGLIGENTextImagePipeline
+
+[[autodoc]] StableDiffusionGLIGENTextImagePipeline
+ - all
+ - __call__
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_vae_tiling
+ - disable_vae_tiling
+ - enable_model_cpu_offload
+ - prepare_latents
+ - enable_fuser
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_diffusion/image_variation.md b/docs/source/en/api/pipelines/stable_diffusion/image_variation.md
new file mode 100644
index 0000000..57dd2f0
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/image_variation.md
@@ -0,0 +1,37 @@
+
+
+# Image variation
+
+The Stable Diffusion model can also generate variations from an input image. It uses a fine-tuned version of a Stable Diffusion model by [Justin Pinkney](https://www.justinpinkney.com/) from [Lambda](https://lambdalabs.com/).
+
+The original codebase can be found at [LambdaLabsML/lambda-diffusers](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) and additional official checkpoints for image variation can be found at [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers).
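+
+A minimal sketch with the `lambdalabs/sd-image-variations-diffusers` checkpoint mentioned above; the conditioning image URL is only an example, any PIL image works:
+
+```python
+import torch
+from diffusers import StableDiffusionImageVariationPipeline
+from diffusers.utils import load_image
+
+pipe = StableDiffusionImageVariationPipeline.from_pretrained(
+    "lambdalabs/sd-image-variations-diffusers"
+).to("cuda")
+
+# any PIL image works as the conditioning input
+init_image = load_image(
+    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png"
+)
+image = pipe(init_image, guidance_scale=3.0).images[0]
+image.save("cat_variation.png")
+```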
+
+
+
+Make sure to check out the Stable Diffusion [Tips](./overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+
+
+## StableDiffusionImageVariationPipeline
+
+[[autodoc]] StableDiffusionImageVariationPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_diffusion/img2img.md b/docs/source/en/api/pipelines/stable_diffusion/img2img.md
new file mode 100644
index 0000000..1a62a5a
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/img2img.md
@@ -0,0 +1,55 @@
+
+
+# Image-to-image
+
+The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images.
+
+The [`StableDiffusionImg2ImgPipeline`] uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon.
+
+The abstract from the paper is:
+
+*Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. Existing GAN-based methods attempt to achieve such balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with user guide of any type, SDEdit first adds noise to the input, then subsequently denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversions and can naturally achieve the balance between realism and faithfulness. SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores, according to a human perception study, on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.*
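+
+A minimal image-to-image sketch, assuming a standard Stable Diffusion 1.5 checkpoint and reusing an example image from earlier in these docs; `strength` controls how much noise is added to the initial image:
+
+```python
+import torch
+from diffusers import StableDiffusionImg2ImgPipeline
+from diffusers.utils import load_image
+
+pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+init_image = load_image(
+    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png"
+).resize((512, 512))
+
+image = pipe(
+    prompt="an oil painting of a cat in a field, impressionist style",
+    image=init_image,
+    strength=0.75,       # lower values stay closer to the initial image
+    guidance_scale=7.5,
+).images[0]
+image.save("cat_img2img.png")
+```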
+
+
+
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+
+
+## StableDiffusionImg2ImgPipeline
+
+[[autodoc]] StableDiffusionImg2ImgPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+ - load_textual_inversion
+ - from_single_file
+ - load_lora_weights
+ - save_lora_weights
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+
+## FlaxStableDiffusionImg2ImgPipeline
+
+[[autodoc]] FlaxStableDiffusionImg2ImgPipeline
+ - all
+ - __call__
+
+## FlaxStableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_diffusion/inpaint.md b/docs/source/en/api/pipelines/stable_diffusion/inpaint.md
new file mode 100644
index 0000000..9842b58
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/inpaint.md
@@ -0,0 +1,57 @@
+
+
+# Inpainting
+
+The Stable Diffusion model can also be applied to inpainting, which lets you edit specific parts of an image by providing a mask and a text prompt.
+
+## Tips
+
+It is recommended to use this pipeline with checkpoints that have been specifically fine-tuned for inpainting, such
+as [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting). Default
+text-to-image Stable Diffusion checkpoints, such as
+[runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), are also compatible but might be less performant.
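+
+A minimal inpainting sketch with the fine-tuned checkpoint above; the example image and mask URLs are assumptions, so substitute your own image and binary mask:
+
+```python
+import torch
+from diffusers import StableDiffusionInpaintPipeline
+from diffusers.utils import load_image
+
+pipe = StableDiffusionInpaintPipeline.from_pretrained(
+    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
+).to("cuda")
+
+# assumption: example image/mask URLs; substitute your own 512x512 image and binary mask
+img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+init_image = load_image(img_url).resize((512, 512))
+mask_image = load_image(mask_url).resize((512, 512))
+
+image = pipe(
+    prompt="Face of a yellow cat, high resolution, sitting on a park bench",
+    image=init_image,
+    mask_image=mask_image,  # white pixels are repainted, black pixels are kept
+).images[0]
+image.save("inpaint.png")
+```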
+
+
+
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
+
+
+
+## StableDiffusionInpaintPipeline
+
+[[autodoc]] StableDiffusionInpaintPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+ - load_textual_inversion
+ - load_lora_weights
+ - save_lora_weights
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+
+## FlaxStableDiffusionInpaintPipeline
+
+[[autodoc]] FlaxStableDiffusionInpaintPipeline
+ - all
+ - __call__
+
+## FlaxStableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_diffusion/k_diffusion.md b/docs/source/en/api/pipelines/stable_diffusion/k_diffusion.md
new file mode 100644
index 0000000..07e34bd
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/k_diffusion.md
@@ -0,0 +1,27 @@
+
+
+# K-Diffusion
+
+[k-diffusion](https://github.com/crowsonkb/k-diffusion) is a popular library created by [Katherine Crowson](https://github.com/crowsonkb/). We provide `StableDiffusionKDiffusionPipeline` and `StableDiffusionXLKDiffusionPipeline` that allow you to run Stable Diffusion with samplers from k-diffusion.
+
+Note that most of the samplers from k-diffusion are implemented in Diffusers and we recommend using the existing schedulers. You can find a mapping between k-diffusion samplers and schedulers in Diffusers [here](https://huggingface.co/docs/diffusers/api/schedulers/overview).
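+
+As a sketch, a k-diffusion sampler can be selected by name with `set_scheduler` (the `k-diffusion` package must be installed):
+
+```python
+# pip install k-diffusion
+import torch
+from diffusers import StableDiffusionKDiffusionPipeline
+
+pipe = StableDiffusionKDiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
+).to("cuda")
+
+# pick any sampler exposed by k-diffusion, e.g. DPM++ 2M
+pipe.set_scheduler("sample_dpmpp_2m")
+image = pipe("an astronaut riding a horse on mars", num_inference_steps=25).images[0]
+image.save("astronaut_kdiffusion.png")
+```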
+
+
+## StableDiffusionKDiffusionPipeline
+
+[[autodoc]] StableDiffusionKDiffusionPipeline
+
+
+## StableDiffusionXLKDiffusionPipeline
+
+[[autodoc]] StableDiffusionXLKDiffusionPipeline
\ No newline at end of file
diff --git a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md
new file mode 100644
index 0000000..9abccd6
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md
@@ -0,0 +1,38 @@
+
+
+# Latent upscaler
+
+The Stable Diffusion latent upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). It is used to enhance the output image resolution by a factor of 2 (see this demo [notebook](https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4) for a demonstration of the original implementation).
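+
+A minimal sketch chaining a base Stable Diffusion pipeline with the upscaler, assuming the `stabilityai/sd-x2-latent-upscaler` checkpoint; keeping the base output in latent space lets the upscaler consume it directly:
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
+).to("cuda")
+upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
+    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
+).to("cuda")
+
+prompt = "a photo of an astronaut, high resolution, ultra realistic"
+generator = torch.manual_seed(33)
+
+# stay in latent space so the upscaler can consume the output directly
+low_res_latents = pipe(prompt, generator=generator, output_type="latent").images
+
+upscaled_image = upscaler(
+    prompt=prompt,
+    image=low_res_latents,
+    num_inference_steps=20,
+    guidance_scale=0,
+    generator=generator,
+).images[0]
+upscaled_image.save("astronaut_1024.png")
+```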
+
+
+
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
+
+
+
+## StableDiffusionLatentUpscalePipeline
+
+[[autodoc]] StableDiffusionLatentUpscalePipeline
+ - all
+ - __call__
+ - enable_sequential_cpu_offload
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md
new file mode 100644
index 0000000..64cfdde
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md
@@ -0,0 +1,55 @@
+
+
+# Text-to-(RGB, depth)
+
+LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. Unlike existing text-to-image diffusion models such as [Stable Diffusion](./overview), which only generate an image, LDM3D generates both an image and a depth map from a given text prompt. With almost the same number of parameters, LDM3D manages to create a latent space that can compress both the RGB images and the depth maps.
+
+Two checkpoints are available for use:
+- [ldm3d-original](https://huggingface.co/Intel/ldm3d). The original checkpoint used in the [paper](https://arxiv.org/pdf/2305.10853.pdf).
+- [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c). The new version of LDM3D, which uses 4-channel inputs instead of 6-channel inputs and is finetuned on higher resolution images.
+
+
+The abstract from the paper is:
+
+*This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).*
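+
+A minimal usage sketch with the `ldm3d-4c` checkpoint; the pipeline output exposes the RGB images and depth maps separately:
+
+```python
+import torch
+from diffusers import StableDiffusionLDM3DPipeline
+
+pipe = StableDiffusionLDM3DPipeline.from_pretrained(
+    "Intel/ldm3d-4c", torch_dtype=torch.float16
+).to("cuda")
+
+output = pipe("A picture of some lemons on a table")
+rgb_image, depth_image = output.rgb[0], output.depth[0]
+rgb_image.save("lemons_rgb.png")
+depth_image.save("lemons_depth.png")
+```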
+
+
+
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+
+
+## StableDiffusionLDM3DPipeline
+
+[[autodoc]] pipelines.stable_diffusion_ldm3d.pipeline_stable_diffusion_ldm3d.StableDiffusionLDM3DPipeline
+ - all
+ - __call__
+
+
+## LDM3DPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion_ldm3d.pipeline_stable_diffusion_ldm3d.LDM3DPipelineOutput
+ - all
+ - __call__
+
+# Upscaler
+
+[LDM3D-VR](https://arxiv.org/pdf/2311.03226.pdf) is an extended version of LDM3D.
+
+The abstract from the paper is:
+*Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods*
+
+Two checkpoints are available for use:
+- [ldm3d-pano](https://huggingface.co/Intel/ldm3d-pano). This checkpoint enables the generation of panoramic images and requires the StableDiffusionLDM3DPipeline pipeline to be used.
+- [ldm3d-sr](https://huggingface.co/Intel/ldm3d-sr). This checkpoint enables the upscaling of RGB and depth images. It can be used in cascade after the original LDM3D pipeline using the StableDiffusionUpscaleLDM3DPipeline community pipeline.
+
diff --git a/docs/source/en/api/pipelines/stable_diffusion/overview.md b/docs/source/en/api/pipelines/stable_diffusion/overview.md
new file mode 100644
index 0000000..182aae4
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/overview.md
@@ -0,0 +1,174 @@
+
+
+# Stable Diffusion pipelines
+
+Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
+
+Stable Diffusion is trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs.
+
+For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, take a look at the Stability AI [announcement](https://stability.ai/blog/stable-diffusion-announcement) and our own [blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) for more technical details.
+
+You can find the original codebase for Stable Diffusion v1.0 at [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) and Stable Diffusion v2.0 at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) as well as their original scripts for various tasks. Additional official checkpoints for the different Stable Diffusion versions and tasks can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations. Explore these organizations to find the best checkpoint for your use-case!
+
+The table below summarizes the available Stable Diffusion pipelines, their supported tasks, and an interactive demo:
+
+
+
+## Tips
+
+To help you get the most out of the Stable Diffusion pipelines, here are a few tips for improving performance and usability. These tips are applicable to all Stable Diffusion pipelines.
+
+### Explore tradeoff between speed and quality
+
+[`StableDiffusionPipeline`] uses the [`PNDMScheduler`] by default, but ๐ค Diffusers provides many other schedulers (some of which are faster or output better quality) that are compatible. For example, if you want to use the [`EulerDiscreteScheduler`] instead of the default:
+
+```py
+from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
+
+pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
+
+# or
+euler_scheduler = EulerDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
+pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=euler_scheduler)
+```
+
+### Reuse pipeline components to save memory
+
+To save memory and use the same components across multiple pipelines, use the `.components` property so the weights are not loaded into RAM more than once.
+
+```py
+from diffusers import (
+ StableDiffusionPipeline,
+ StableDiffusionImg2ImgPipeline,
+ StableDiffusionInpaintPipeline,
+)
+
+text2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+img2img = StableDiffusionImg2ImgPipeline(**text2img.components)
+inpaint = StableDiffusionInpaintPipeline(**text2img.components)
+
+# now you can use text2img(...), img2img(...), inpaint(...) just like the call methods of each respective pipeline
+```
diff --git a/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md b/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md
new file mode 100644
index 0000000..764685a
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md
@@ -0,0 +1,35 @@
+
+
+# SDXL Turbo
+
+Stable Diffusion XL (SDXL) Turbo was proposed in [Adversarial Diffusion Distillation](https://stability.ai/research/adversarial-diffusion-distillation) by Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach.
+
+The abstract from the paper is:
+
+*We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs, Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models.*
+
+## Tips
+
+- SDXL Turbo uses the exact same architecture as [SDXL](./stable_diffusion_xl), which means it also has the same API. Please refer to the [SDXL](./stable_diffusion_xl) API reference for more details.
+- SDXL Turbo should disable the guidance scale by setting `guidance_scale=0.0`.
+- SDXL Turbo should use `timestep_spacing='trailing'` for the scheduler and between 1 and 4 inference steps (see the sketch below).
+- SDXL Turbo has been trained to generate images of size 512x512.
+- SDXL Turbo is open-access, but not open-source, meaning that you might have to buy a model license in order to use it for commercial applications. Make sure to read the [official model card](https://huggingface.co/stabilityai/sdxl-turbo) to learn more.
+
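+Putting these tips together, a minimal text-to-image sketch could look like the following (the generic `AutoPipelineForText2Image` loader and the prompt are only assumptions for illustration; the dedicated SDXL pipelines work the same way):
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipe = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
+)
+pipe = pipe.to("cuda")
+
+prompt = "A cinematic shot of a baby racoon wearing an intricate Italian priest robe."
+
+# guidance is disabled and a single denoising step is usually enough
+image = pipe(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
+image
+```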
+
+
+To learn how to use SDXL Turbo for various tasks, how to optimize performance, and other usage examples, take a look at the [SDXL Turbo](../../../using-diffusers/sdxl_turbo) guide.
+
+Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints!
+
+
diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md
new file mode 100644
index 0000000..d148545
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md
@@ -0,0 +1,125 @@
+
+
+# Stable Diffusion 2
+
+Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of the original [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), and it was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/).
+
+*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels.
+These models are trained on an aesthetic subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using [LAION's NSFW filter](https://openreview.net/forum?id=M3Y74vmsMcY).*
+
+For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release).
+
+The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img), so check out its API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it gives a reasonable speed/quality trade-off and can be run with as little as 20 steps.
+
+Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image:
+
+| Task | Repository |
+|-------------------------|---------------------------------------------------------------------------------------------------------------|
+| text-to-image (512x512) | [stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) |
+| text-to-image (768x768) | [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) |
+| inpainting | [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) |
+| super-resolution | [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) |
+| depth-to-image | [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) |
+
+Here are some examples of how to use Stable Diffusion 2 for each task:
+
+
+
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
+
+
+
+## Text-to-image
+
+```py
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
+import torch
+
+repo_id = "stabilityai/stable-diffusion-2-base"
+pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
+
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+pipe = pipe.to("cuda")
+
+prompt = "High quality photo of an astronaut riding a horse in space"
+image = pipe(prompt, num_inference_steps=25).images[0]
+image
+```
+
+## Inpainting
+
+```py
+import torch
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
+from diffusers.utils import load_image, make_image_grid
+
+img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+
+init_image = load_image(img_url).resize((512, 512))
+mask_image = load_image(mask_url).resize((512, 512))
+
+repo_id = "stabilityai/stable-diffusion-2-inpainting"
+pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
+
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+pipe = pipe.to("cuda")
+
+prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
+image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+## Super-resolution
+
+```py
+from diffusers import StableDiffusionUpscalePipeline
+from diffusers.utils import load_image, make_image_grid
+import torch
+
+# load model and scheduler
+model_id = "stabilityai/stable-diffusion-x4-upscaler"
+pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float16)
+pipeline = pipeline.to("cuda")
+
+# let's download an image
+url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
+low_res_img = load_image(url)
+low_res_img = low_res_img.resize((128, 128))
+prompt = "a white cat"
+upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
+make_image_grid([low_res_img.resize((512, 512)), upscaled_image.resize((512, 512))], rows=1, cols=2)
+```
+
+## Depth-to-image
+
+```py
+import torch
+from diffusers import StableDiffusionDepth2ImgPipeline
+from diffusers.utils import load_image, make_image_grid
+
+pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-depth",
+ torch_dtype=torch.float16,
+).to("cuda")
+
+
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+init_image = load_image(url)
+prompt = "two tigers"
+negative_prompt = "bad, deformed, ugly, bad anatomy"
+image = pipe(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md
new file mode 100644
index 0000000..97c11bf
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md
@@ -0,0 +1,61 @@
+
+
+# Safe Stable Diffusion
+
+Safe Stable Diffusion was proposed in [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://huggingface.co/papers/2211.05105) and mitigates inappropriate degeneration from Stable Diffusion models because they're trained on unfiltered web-crawled datasets. For instance Stable Diffusion may unexpectedly generate nudity, violence, images depicting self-harm, and otherwise offensive content. Safe Stable Diffusion is an extension of Stable Diffusion that drastically reduces this type of content.
+
+The abstract from the paper is:
+
+*Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed-inappropriate image prompts (I2P)-containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.*
+
+## Tips
+
+Use the `safety_concept` property of [`StableDiffusionPipelineSafe`] to check and edit the current safety concept:
+
+```python
+>>> from diffusers import StableDiffusionPipelineSafe
+
+>>> pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe")
+>>> pipeline.safety_concept
+'an image showing hate, harassment, violence, suffering, humiliation, harm, suicide, sexual, nudity, bodily fluids, blood, obscene gestures, illegal activity, drug use, theft, vandalism, weapons, child abuse, brutality, cruelty'
+```
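+
+Since the concept is a plain string, it can also be edited before generating. This is a minimal sketch; the appended phrase is purely illustrative and assumes the property is writable:
+
+```python
+>>> from diffusers import StableDiffusionPipelineSafe
+
+>>> pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe")
+>>> # append an additional (illustrative) concept to suppress
+>>> pipeline.safety_concept = pipeline.safety_concept + ", cartoon violence"
+```
+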
+For each image generation the active concept is also contained in [`StableDiffusionSafePipelineOutput`].
+
+There are 4 configurations (`SafetyConfig.WEAK`, `SafetyConfig.MEDIUM`, `SafetyConfig.STRONG`, and `SafetyConfig.MAX`) that can be applied:
+
+```python
+>>> from diffusers import StableDiffusionPipelineSafe
+>>> from diffusers.pipelines.stable_diffusion_safe import SafetyConfig
+
+>>> pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe")
+>>> prompt = "the four horsewomen of the apocalypse, painting by tom of finland, gaston bussiere, craig mullins, j. c. leyendecker"
+>>> out = pipeline(prompt=prompt, **SafetyConfig.MAX)
+```
+
+
+
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+
+
+## StableDiffusionPipelineSafe
+
+[[autodoc]] StableDiffusionPipelineSafe
+ - all
+ - __call__
+
+## StableDiffusionSafePipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion_safe.StableDiffusionSafePipelineOutput
+ - all
+ - __call__
diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md
new file mode 100644
index 0000000..c5433c0
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md
@@ -0,0 +1,55 @@
+
+
+# Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.
+
+The abstract from the paper is:
+
+*We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.*
+
+## Tips
+
+- Using SDXL with a DPM++ scheduler for less than 50 steps is known to produce [visual artifacts](https://github.com/huggingface/diffusers/issues/5433) because the solver becomes numerically unstable. To fix this issue, take a look at this [PR](https://github.com/huggingface/diffusers/pull/5541) which recommends for ODE/SDE solvers:
+ - set `use_karras_sigmas=True` or `use_lu_lambdas=True` to improve image quality
+ - set `euler_at_final=True` if you're using a solver with uniform step sizes (DPM++2M or DPM++2M SDE)
+- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't be for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
+- SDXL can take a different prompt for each of the text encoders it was trained on, and you can even pass different parts of the same prompt to each text encoder (see the sketch after this list).
+- SDXL output images can be improved by making use of a refiner model in an image-to-image setting.
+- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters.
+
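+As a rough sketch of the scheduler and dual-prompt tips above (the prompts and step count are only illustrative):
+
+```py
+from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
+import torch
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+
+# DPM++ 2M with Karras sigmas, as recommended above for lower step counts
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
+
+# `prompt` is routed to the first text encoder and `prompt_2` to the second one
+image = pipe(
+    prompt="A majestic lion jumping from a big stone at night, highly detailed",
+    prompt_2="oil painting in the style of Van Gogh",
+    num_inference_steps=30,
+).images[0]
+image
+```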
+
+
+To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl) guide.
+
+Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints!
+
+
+
+## StableDiffusionXLPipeline
+
+[[autodoc]] StableDiffusionXLPipeline
+ - all
+ - __call__
+
+## StableDiffusionXLImg2ImgPipeline
+
+[[autodoc]] StableDiffusionXLImg2ImgPipeline
+ - all
+ - __call__
+
+## StableDiffusionXLInpaintPipeline
+
+[[autodoc]] StableDiffusionXLInpaintPipeline
+ - all
+ - __call__
diff --git a/docs/source/en/api/pipelines/stable_diffusion/svd.md b/docs/source/en/api/pipelines/stable_diffusion/svd.md
new file mode 100644
index 0000000..87a9c2a
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/svd.md
@@ -0,0 +1,43 @@
+
+
+# Stable Video Diffusion
+
+Stable Video Diffusion was proposed in [Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://hf.co/papers/2311.15127) by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach.
+
+The abstract from the paper is:
+
+*We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at this https URL.*
+
+
+
+To learn how to use Stable Video Diffusion, take a look at the [Stable Video Diffusion](../../../using-diffusers/svd) guide.
+
+
+
+Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the [base](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [extended frame](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) checkpoints!
+
+
+
+## Tips
+
+Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient.
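+
+For example, a minimal image-to-video sketch, assuming the `stabilityai/stable-video-diffusion-img2vid-xt` checkpoint and an example conditioning image (any 1024x576 image works here):
+
+```py
+import torch
+from diffusers import StableVideoDiffusionPipeline
+from diffusers.utils import load_image, export_to_video
+
+pipe = StableVideoDiffusionPipeline.from_pretrained(
+    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
+)
+pipe.enable_model_cpu_offload()
+# run the feedforward layers in smaller chunks instead of all at once
+pipe.unet.enable_forward_chunking()
+
+# any 1024x576 conditioning image can be used instead of this example asset
+image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/svd/rocket.png")
+image = image.resize((1024, 576))
+
+frames = pipe(image, decode_chunk_size=2, num_frames=25).frames[0]
+export_to_video(frames, "generated.mp4", fps=7)
+```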
+
+Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
+
+## StableVideoDiffusionPipeline
+
+[[autodoc]] StableVideoDiffusionPipeline
+
+## StableVideoDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_diffusion/text2img.md b/docs/source/en/api/pipelines/stable_diffusion/text2img.md
new file mode 100644
index 0000000..86f3090
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.md
@@ -0,0 +1,59 @@
+
+
+# Text-to-image
+
+The Stable Diffusion model was created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [Runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`] is capable of generating photorealistic images given any text input. It's trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. Latent diffusion is the research on top of which Stable Diffusion was built. It was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
+
+The abstract from the paper is:
+
+*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion.*
+
+
+
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
+
+
+
+## StableDiffusionPipeline
+
+[[autodoc]] StableDiffusionPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+ - enable_vae_tiling
+ - disable_vae_tiling
+ - load_textual_inversion
+ - from_single_file
+ - load_lora_weights
+ - save_lora_weights
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+
+## FlaxStableDiffusionPipeline
+
+[[autodoc]] FlaxStableDiffusionPipeline
+ - all
+ - __call__
+
+## FlaxStableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_diffusion/upscale.md b/docs/source/en/api/pipelines/stable_diffusion/upscale.md
new file mode 100644
index 0000000..b188c29
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_diffusion/upscale.md
@@ -0,0 +1,37 @@
+
+
+# Super-resolution
+
+The Stable Diffusion upscaler diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/). It is used to enhance the resolution of input images by a factor of 4.
+
+
+
+Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
+
+If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
+
+
+
+## StableDiffusionUpscalePipeline
+
+[[autodoc]] StableDiffusionUpscalePipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
diff --git a/docs/source/en/api/pipelines/stable_unclip.md b/docs/source/en/api/pipelines/stable_unclip.md
new file mode 100644
index 0000000..3067ba9
--- /dev/null
+++ b/docs/source/en/api/pipelines/stable_unclip.md
@@ -0,0 +1,129 @@
+
+
+# Stable unCLIP
+
+Stable unCLIP checkpoints are finetuned from [Stable Diffusion 2.1](./stable_diffusion/stable_diffusion_2) checkpoints to condition on CLIP image embeddings.
+Stable unCLIP still conditions on text embeddings. Given the two separate conditionings, stable unCLIP can be used
+for text guided image variation. When combined with an unCLIP prior, it can also be used for full text to image generation.
+
+The abstract from the paper is:
+
+*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*
+
+## Tips
+
+Stable unCLIP takes `noise_level` as input during inference which determines how much noise is added to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default, we do not add any additional noise to the image embeddings (`noise_level = 0`).
+
+### Text-to-Image Generation
+Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha):
+
+```python
+import torch
+from diffusers import UnCLIPScheduler, DDPMScheduler, StableUnCLIPPipeline
+from diffusers.models import PriorTransformer
+from transformers import CLIPTokenizer, CLIPTextModelWithProjection
+
+prior_model_id = "kakaobrain/karlo-v1-alpha"
+data_type = torch.float16
+prior = PriorTransformer.from_pretrained(prior_model_id, subfolder="prior", torch_dtype=data_type)
+
+prior_text_model_id = "openai/clip-vit-large-patch14"
+prior_tokenizer = CLIPTokenizer.from_pretrained(prior_text_model_id)
+prior_text_model = CLIPTextModelWithProjection.from_pretrained(prior_text_model_id, torch_dtype=data_type)
+prior_scheduler = UnCLIPScheduler.from_pretrained(prior_model_id, subfolder="prior_scheduler")
+prior_scheduler = DDPMScheduler.from_config(prior_scheduler.config)
+
+stable_unclip_model_id = "stabilityai/stable-diffusion-2-1-unclip-small"
+
+pipe = StableUnCLIPPipeline.from_pretrained(
+ stable_unclip_model_id,
+ torch_dtype=data_type,
+ variant="fp16",
+ prior_tokenizer=prior_tokenizer,
+ prior_text_encoder=prior_text_model,
+ prior=prior,
+ prior_scheduler=prior_scheduler,
+)
+
+pipe = pipe.to("cuda")
+wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular"
+
+image = pipe(prompt=wave_prompt).images[0]
+image
+```
+
+
+For text-to-image, we use `stabilityai/stable-diffusion-2-1-unclip-small` since it was trained on CLIP ViT-L/14 embeddings, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend using it here.
+
+
+
+### Text guided Image-to-Image Variation
+
+```python
+from diffusers import StableUnCLIPImg2ImgPipeline
+from diffusers.utils import load_image
+import torch
+
+pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
+)
+pipe = pipe.to("cuda")
+
+url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
+init_image = load_image(url)
+
+images = pipe(init_image).images
+images[0].save("variation_image.png")
+```
+
+Optionally, you can also pass a prompt to `pipe` such as:
+
+```python
+prompt = "A fantasy landscape, trending on artstation"
+
+image = pipe(init_image, prompt=prompt).images[0]
+image
+```
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## StableUnCLIPPipeline
+
+[[autodoc]] StableUnCLIPPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+
+## StableUnCLIPImg2ImgPipeline
+
+[[autodoc]] StableUnCLIPImg2ImgPipeline
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
+
+## ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
diff --git a/docs/source/en/api/pipelines/text_to_video.md b/docs/source/en/api/pipelines/text_to_video.md
new file mode 100644
index 0000000..7522264
--- /dev/null
+++ b/docs/source/en/api/pipelines/text_to_video.md
@@ -0,0 +1,193 @@
+
+
+
+
+🧪 This pipeline is for research purposes only.
+
+
+
+# Text-to-video
+
+[ModelScope Text-to-Video Technical Report](https://arxiv.org/abs/2308.06571) is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang.
+
+The abstract from the paper is:
+
+*This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model could adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), totally comprising 1.7 billion parameters, in which 0.5 billion parameters are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.*
+
+You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense).
+
+## Usage example
+
+### `text-to-video-ms-1.7b`
+
+Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):
+
+```python
+import torch
+from diffusers import DiffusionPipeline
+from diffusers.utils import export_to_video
+
+pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
+pipe = pipe.to("cuda")
+
+prompt = "Spiderman is surfing"
+video_frames = pipe(prompt).frames[0]
+video_path = export_to_video(video_frames)
+video_path
+```
+
+Diffusers supports different optimization techniques to improve the latency
+and memory footprint of a pipeline. Since videos are often more memory-heavy than images,
+we can enable CPU offloading and VAE slicing to keep the memory footprint at bay.
+
+Let's generate a video of 8 seconds (64 frames) on the same GPU using CPU offloading and VAE slicing:
+
+```python
+import torch
+from diffusers import DiffusionPipeline
+from diffusers.utils import export_to_video
+
+pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
+pipe.enable_model_cpu_offload()
+
+# memory optimization
+pipe.enable_vae_slicing()
+
+prompt = "Darth Vader surfing a wave"
+video_frames = pipe(prompt, num_frames=64).frames[0]
+video_path = export_to_video(video_frames)
+video_path
+```
+
+It just takes **7 GB of GPU memory** to generate the 64 video frames using PyTorch 2.0, "fp16" precision, and the techniques mentioned above.
+
+We can also use a different scheduler easily, using the same method we'd use for Stable Diffusion:
+
+```python
+import torch
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
+from diffusers.utils import export_to_video
+
+pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+pipe.enable_model_cpu_offload()
+
+prompt = "Spiderman is surfing"
+video_frames = pipe(prompt, num_inference_steps=25).frames[0]
+video_path = export_to_video(video_frames)
+video_path
+```
+
+Here are some sample outputs:
+
+
+
+
+ An astronaut riding a horse.
+
+
+
+
+ Darth vader surfing in waves.
+
+
+
+
+
+
+### `cerspense/zeroscope_v2_576w` & `cerspense/zeroscope_v2_XL`
+
+The Zeroscope models are watermark-free and have been trained on specific sizes such as `576x320` and `1024x576`.
+You should first generate a video using the lower-resolution checkpoint [`cerspense/zeroscope_v2_576w`](https://huggingface.co/cerspense/zeroscope_v2_576w) with [`TextToVideoSDPipeline`],
+which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zeroscope_v2_XL`](https://huggingface.co/cerspense/zeroscope_v2_XL).
+
+
+```py
+import torch
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
+from diffusers.utils import export_to_video
+from PIL import Image
+
+pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+
+# memory optimization
+pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
+pipe.enable_vae_slicing()
+
+prompt = "Darth Vader surfing a wave"
+video_frames = pipe(prompt, num_frames=24).frames[0]
+video_path = export_to_video(video_frames)
+video_path
+```
+
+Now the video can be upscaled:
+
+```py
+pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+pipe.enable_model_cpu_offload()
+
+# memory optimization
+pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
+pipe.enable_vae_slicing()
+
+video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]
+
+video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
+video_path = export_to_video(video_frames)
+video_path
+```
+
+Here are some sample outputs:
+
+
+
+
+ Darth vader surfing in waves.
+
+
+
+
+
+
+## Tips
+
+Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient.
+
+Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## TextToVideoSDPipeline
+[[autodoc]] TextToVideoSDPipeline
+ - all
+ - __call__
+
+## VideoToVideoSDPipeline
+[[autodoc]] VideoToVideoSDPipeline
+ - all
+ - __call__
+
+## TextToVideoSDPipelineOutput
+[[autodoc]] pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput
diff --git a/docs/source/en/api/pipelines/text_to_video_zero.md b/docs/source/en/api/pipelines/text_to_video_zero.md
new file mode 100644
index 0000000..1f8688a
--- /dev/null
+++ b/docs/source/en/api/pipelines/text_to_video_zero.md
@@ -0,0 +1,301 @@
+
+
+# Text2Video-Zero
+
+[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).
+
+Text2Video-Zero enables zero-shot video generation using either:
+1. A textual prompt
+2. A prompt combined with guidance from poses or edges
+3. Video Instruct-Pix2Pix (instruction-guided video editing)
+
+Results are temporally consistent and closely follow the guidance and textual prompts.
+
+
+
+The abstract from the paper is:
+
+*Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain.
+Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object.
+Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing.
+As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.*
+
+You can find additional information about Text2Video-Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero).
+
+## Usage example
+
+### Text-To-Video
+
+To generate a video from a prompt, run the following Python code:
+```python
+import torch
+import imageio
+from diffusers import TextToVideoZeroPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+
+prompt = "A panda is playing guitar on times square"
+result = pipe(prompt=prompt).images
+result = [(r * 255).astype("uint8") for r in result]
+imageio.mimsave("video.mp4", result, fps=4)
+```
+You can change these parameters in the pipeline call (see the sketch after this list):
+* Motion field strength (see the [paper](https://arxiv.org/abs/2303.13439), Sect. 3.3.1):
+ * `motion_field_strength_x` and `motion_field_strength_y`. Default: `motion_field_strength_x=12`, `motion_field_strength_y=12`
+* `T` and `T'` (see the [paper](https://arxiv.org/abs/2303.13439), Sect. 3.3.1)
+ * `t0` and `t1` in the range `{0, ..., num_inference_steps}`. Default: `t0=45`, `t1=48`
+* Video length:
+ * `video_length`, the number of frames to be generated. Default: `video_length=8`
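+
+For example, a minimal sketch passing these arguments explicitly (the values shown are only illustrative):
+
+```python
+import torch
+import imageio
+from diffusers import TextToVideoZeroPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+
+# explicit values for the motion field strength, t0/t1, and video length (see the descriptions above)
+result = pipe(
+    prompt="A panda is playing guitar on times square",
+    video_length=8,
+    motion_field_strength_x=12,
+    motion_field_strength_y=12,
+    t0=45,
+    t1=48,
+).images
+result = [(r * 255).astype("uint8") for r in result]
+imageio.mimsave("video.mp4", result, fps=4)
+```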
+
+We can also generate longer videos by doing the processing in a chunk-by-chunk manner:
+```python
+import torch
+import imageio
+import numpy as np
+from diffusers import TextToVideoZeroPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+seed = 0
+video_length = 24  # 24 frames ÷ 4 fps = 6 seconds
+chunk_size = 8
+prompt = "A panda is playing guitar on times square"
+
+# Generate the video chunk-by-chunk
+result = []
+chunk_ids = np.arange(0, video_length, chunk_size - 1)
+generator = torch.Generator(device="cuda")
+for i in range(len(chunk_ids)):
+ print(f"Processing chunk {i + 1} / {len(chunk_ids)}")
+ ch_start = chunk_ids[i]
+ ch_end = video_length if i == len(chunk_ids) - 1 else chunk_ids[i + 1]
+ # Attach the first frame for Cross Frame Attention
+ frame_ids = [0] + list(range(ch_start, ch_end))
+ # Fix the seed for the temporal consistency
+ generator.manual_seed(seed)
+ output = pipe(prompt=prompt, video_length=len(frame_ids), generator=generator, frame_ids=frame_ids)
+ result.append(output.images[1:])
+
+# Concatenate chunks and save
+result = np.concatenate(result)
+result = [(r * 255).astype("uint8") for r in result]
+imageio.mimsave("video.mp4", result, fps=4)
+```
+
+
+- #### SDXL Support
+To use the SDXL model when generating a video from a prompt, use the `TextToVideoZeroSDXLPipeline`:
+
+```python
+import torch
+from diffusers import TextToVideoZeroSDXLPipeline
+
+model_id = "stabilityai/stable-diffusion-xl-base-1.0"
+pipe = TextToVideoZeroSDXLPipeline.from_pretrained(
+ model_id, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+```
+
+### Text-To-Video with Pose Control
+To generate a video from a prompt with additional pose control:
+
+1. Download a demo video
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4"
+ repo_id = "PAIR/Text2Video-Zero"
+ video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
+ ```
+
+
+2. Read video containing extracted pose images
+ ```python
+ from PIL import Image
+ import imageio
+
+ reader = imageio.get_reader(video_path, "ffmpeg")
+ frame_count = 8
+ pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
+ ```
+ To extract pose from actual video, read [ControlNet documentation](controlnet).
+
+3. Run `StableDiffusionControlNetPipeline` with our custom attention processor
+
+ ```python
+ import torch
+ from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+ from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
+
+ model_id = "runwayml/stable-diffusion-v1-5"
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ model_id, controlnet=controlnet, torch_dtype=torch.float16
+ ).to("cuda")
+
+ # Set the attention processor
+ pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+ pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+
+ # fix latents for all frames
+ latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)
+
+ prompt = "Darth Vader dancing in a desert"
+ result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
+ imageio.mimsave("video.mp4", result, fps=4)
+ ```
+- #### SDXL Support
+
+ Since our attention processor also works with SDXL, it can be utilized to generate a video from prompt using ControlNet models powered by SDXL:
+ ```python
+ import torch
+ from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
+ from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
+
+ controlnet_model_id = 'thibaud/controlnet-openpose-sdxl-1.0'
+ model_id = 'stabilityai/stable-diffusion-xl-base-1.0'
+
+ controlnet = ControlNetModel.from_pretrained(controlnet_model_id, torch_dtype=torch.float16)
+ pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+ model_id, controlnet=controlnet, torch_dtype=torch.float16
+ ).to('cuda')
+
+ # Set the attention processor
+ pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+ pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+
+ # fix latents for all frames
+ latents = torch.randn((1, 4, 128, 128), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)
+
+ prompt = "Darth Vader dancing in a desert"
+ result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
+ imageio.mimsave("video.mp4", result, fps=4)
+ ```
+
+### Text-To-Video with Edge Control
+
+To generate a video from a prompt with additional Canny edge control, follow the same steps described above for pose-guided generation using the [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny).
+
+
+### Video Instruct-Pix2Pix
+
+To perform text-guided video editing (with [InstructPix2Pix](pix2pix)):
+
+1. Download a demo video
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ filename = "__assets__/pix2pix video/camel.mp4"
+ repo_id = "PAIR/Text2Video-Zero"
+ video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
+ ```
+
+2. Read video from path
+ ```python
+ from PIL import Image
+ import imageio
+
+ reader = imageio.get_reader(video_path, "ffmpeg")
+ frame_count = 8
+ video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
+ ```
+
+3. Run `StableDiffusionInstructPix2PixPipeline` with our custom attention processor
+ ```python
+ import torch
+ from diffusers import StableDiffusionInstructPix2PixPipeline
+ from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
+
+ model_id = "timbrooks/instruct-pix2pix"
+ pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+ pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=3))
+
+ prompt = "make it Van Gogh Starry Night style"
+ result = pipe(prompt=[prompt] * len(video), image=video).images
+ imageio.mimsave("edited_video.mp4", result, fps=4)
+ ```
+
+
+### DreamBooth specialization
+
+Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control**
+can run with custom [DreamBooth](../../training/dreambooth) models, as shown below for
+[Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and
+[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model:
+
+1. Download a demo video
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ filename = "__assets__/canny_videos_mp4/girl_turning.mp4"
+ repo_id = "PAIR/Text2Video-Zero"
+ video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
+ ```
+
+2. Read video from path
+ ```python
+ from PIL import Image
+ import imageio
+
+ reader = imageio.get_reader(video_path, "ffmpeg")
+ frame_count = 8
+ canny_edges = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
+ ```
+
+3. Run `StableDiffusionControlNetPipeline` with custom trained DreamBooth model
+ ```python
+ import torch
+ from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+ from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
+
+ # set model id to custom model
+ model_id = "PAIR/text2video-zero-controlnet-canny-avatar"
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ model_id, controlnet=controlnet, torch_dtype=torch.float16
+ ).to("cuda")
+
+ # Set the attention processor
+ pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+ pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+
+ # fix latents for all frames
+ latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(canny_edges), 1, 1, 1)
+
+ prompt = "oil painting of a beautiful girl avatar style"
+ result = pipe(prompt=[prompt] * len(canny_edges), image=canny_edges, latents=latents).images
+ imageio.mimsave("video.mp4", result, fps=4)
+ ```
+
+You can filter out some available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth).
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## TextToVideoZeroPipeline
+[[autodoc]] TextToVideoZeroPipeline
+ - all
+ - __call__
+
+## TextToVideoZeroSDXLPipeline
+[[autodoc]] TextToVideoZeroSDXLPipeline
+ - all
+ - __call__
+
+## TextToVideoPipelineOutput
+[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput
diff --git a/docs/source/en/api/pipelines/unclip.md b/docs/source/en/api/pipelines/unclip.md
new file mode 100644
index 0000000..f379ffd
--- /dev/null
+++ b/docs/source/en/api/pipelines/unclip.md
@@ -0,0 +1,37 @@
+
+
+# unCLIP
+
+[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo](https://github.com/kakaobrain/karlo).
+
+The abstract from the paper is:
+
+*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*
+
+You can find lucidrains' DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch).
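+
+A minimal text-to-image sketch with the karlo checkpoint (the prompt is only an example):
+
+```python
+import torch
+from diffusers import UnCLIPPipeline
+
+pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+prompt = "a high-resolution photograph of a big red frog on a green leaf"
+image = pipe(prompt).images[0]
+image.save("frog.png")
+```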
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## UnCLIPPipeline
+[[autodoc]] UnCLIPPipeline
+ - all
+ - __call__
+
+## UnCLIPImageVariationPipeline
+[[autodoc]] UnCLIPImageVariationPipeline
+ - all
+ - __call__
+
+## ImagePipelineOutput
+[[autodoc]] pipelines.ImagePipelineOutput
diff --git a/docs/source/en/api/pipelines/unidiffuser.md b/docs/source/en/api/pipelines/unidiffuser.md
new file mode 100644
index 0000000..553a6d3
--- /dev/null
+++ b/docs/source/en/api/pipelines/unidiffuser.md
@@ -0,0 +1,205 @@
+
+
+# UniDiffuser
+
+The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.
+
+The abstract from the paper is:
+
+*This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).*
+
+You can find the original codebase at [thu-ml/unidiffuser](https://github.com/thu-ml/unidiffuser) and additional checkpoints at [thu-ml](https://huggingface.co/thu-ml).
+
+
+
+There is currently an issue on PyTorch 1.X where the output images are all black or the pixel values become `NaNs`. This issue can be mitigated by switching to PyTorch 2.X.
+
+
+
+This pipeline was contributed by [dg845](https://github.com/dg845). ❤️
+
+## Usage Examples
+
+Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it is capable of performing a diverse range of generation tasks:
+
+### Unconditional Image and Text Generation
+
+Unconditional generation (where we start from only latents sampled from a standard Gaussian prior) from a [`UniDiffuserPipeline`] will produce an (image, text) pair:
+
+```python
+import torch
+
+from diffusers import UniDiffuserPipeline
+
+device = "cuda"
+model_id_or_path = "thu-ml/unidiffuser-v1"
+pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
+pipe.to(device)
+
+# Unconditional image and text generation. The generation task is automatically inferred.
+sample = pipe(num_inference_steps=20, guidance_scale=8.0)
+image = sample.images[0]
+text = sample.text[0]
+image.save("unidiffuser_joint_sample_image.png")
+print(text)
+```
+
+This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution.
+
+Note that the generation task is inferred from the inputs used when calling the pipeline.
+It is also possible to specify the unconditional generation task ("mode") manually with [`UniDiffuserPipeline.set_joint_mode`]:
+
+```python
+# Equivalent to the above.
+pipe.set_joint_mode()
+sample = pipe(num_inference_steps=20, guidance_scale=8.0)
+```
+
+When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting to infer the mode.
+You can reset the mode with [`UniDiffuserPipeline.reset_mode`], after which the pipeline will once again infer the mode.
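+
+For example, a minimal sketch (continuing from the pipeline created above):
+
+```python
+# Pin the joint generation mode, sample, then return to automatic mode inference.
+pipe.set_joint_mode()
+sample = pipe(num_inference_steps=20, guidance_scale=8.0)
+
+pipe.reset_mode()
+# With the mode reset, this call is inferred as text-to-image generation.
+sample = pipe(prompt="an elephant under the sea", num_inference_steps=20, guidance_scale=8.0)
+```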
+
+You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation since we sample from the marginal distribution of images and text, respectively):
+
+```python
+# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance
+# Image-only generation
+pipe.set_image_mode()
+sample_image = pipe(num_inference_steps=20).images[0]
+# Text-only generation
+pipe.set_text_mode()
+sample_text = pipe(num_inference_steps=20).text[0]
+```
+
+### Text-to-Image Generation
+
+UniDiffuser is also capable of sampling from conditional distributions; that is, the distribution of images conditioned on a text prompt or the distribution of texts conditioned on an image.
+Here is an example of sampling from the conditional image distribution (text-to-image generation or text-conditioned image generation):
+
+```python
+import torch
+
+from diffusers import UniDiffuserPipeline
+
+device = "cuda"
+model_id_or_path = "thu-ml/unidiffuser-v1"
+pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
+pipe.to(device)
+
+# Text-to-image generation
+prompt = "an elephant under the sea"
+
+sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
+t2i_image = sample.images[0]
+t2i_image
+```
+
+The `text2img` mode requires that either an input `prompt` or `prompt_embeds` be supplied. You can set the `text2img` mode manually with [`UniDiffuserPipeline.set_text_to_image_mode`].
+
+### Image-to-Text Generation
+
+Similarly, UniDiffuser can also produce text samples given an image (image-to-text or image-conditioned text generation):
+
+```python
+import torch
+
+from diffusers import UniDiffuserPipeline
+from diffusers.utils import load_image
+
+device = "cuda"
+model_id_or_path = "thu-ml/unidiffuser-v1"
+pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
+pipe.to(device)
+
+# Image-to-text generation
+image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
+init_image = load_image(image_url).resize((512, 512))
+
+sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
+i2t_text = sample.text[0]
+print(i2t_text)
+```
+
+The `img2text` mode requires that an input `image` be supplied. You can set the `img2text` mode manually with [`UniDiffuserPipeline.set_image_to_text_mode`].
+
+### Image Variation
+
+The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the outputs of the first generation.
+This produces a new image which is semantically similar to the input image:
+
+```python
+import torch
+
+from diffusers import UniDiffuserPipeline
+from diffusers.utils import load_image
+
+device = "cuda"
+model_id_or_path = "thu-ml/unidiffuser-v1"
+pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
+pipe.to(device)
+
+# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
+# 1. Image-to-text generation
+image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
+init_image = load_image(image_url).resize((512, 512))
+
+sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
+i2t_text = sample.text[0]
+print(i2t_text)
+
+# 2. Text-to-image generation
+sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
+final_image = sample.images[0]
+final_image.save("unidiffuser_image_variation_sample.png")
+```
+
+### Text Variation
+
+Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by an image-to-text generation:
+
+```python
+import torch
+
+from diffusers import UniDiffuserPipeline
+
+device = "cuda"
+model_id_or_path = "thu-ml/unidiffuser-v1"
+pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
+pipe.to(device)
+
+# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
+# 1. Text-to-image generation
+prompt = "an elephant under the sea"
+
+sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
+t2i_image = sample.images[0]
+t2i_image.save("unidiffuser_text2img_sample_image.png")
+
+# 2. Image-to-text generation
+sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
+final_prompt = sample.text[0]
+print(final_prompt)
+```
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## UniDiffuserPipeline
+[[autodoc]] UniDiffuserPipeline
+ - all
+ - __call__
+
+## ImageTextPipelineOutput
+[[autodoc]] pipelines.ImageTextPipelineOutput
diff --git a/docs/source/en/api/pipelines/value_guided_sampling.md b/docs/source/en/api/pipelines/value_guided_sampling.md
new file mode 100644
index 0000000..d21dbf0
--- /dev/null
+++ b/docs/source/en/api/pipelines/value_guided_sampling.md
@@ -0,0 +1,38 @@
+
+
+# Value-guided planning
+
+
+
+🧪 This is an experimental pipeline for reinforcement learning!
+
+
+
+This pipeline is based on the [Planning with Diffusion for Flexible Behavior Synthesis](https://huggingface.co/papers/2205.09991) paper by Michael Janner, Yilun Du, Joshua B. Tenenbaum, Sergey Levine.
+
+The abstract from the paper is:
+
+*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.*
+
+You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/drive/1rXm8CX4ZdN5qivjJ2lhwhkOmt_m0CvU0#scrollTo=6HXJvhyqcITc&uniqifier=1).
+
+The script to run the model is available [here](https://github.com/huggingface/diffusers/tree/main/examples/reinforcement_learning).
+
+
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+
+
+## ValueGuidedRLPipeline
+[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline
diff --git a/docs/source/en/api/pipelines/wuerstchen.md b/docs/source/en/api/pipelines/wuerstchen.md
new file mode 100644
index 0000000..4d90ad4
--- /dev/null
+++ b/docs/source/en/api/pipelines/wuerstchen.md
@@ -0,0 +1,163 @@
+
+
+# Würstchen
+
+
+
+[Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher Pal, and Marc Aubreville.
+
+The abstract from the paper is:
+
+*We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.*
+
+## Würstchen Overview
+Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32 images. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was previously unseen, because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637)). A third model, Stage C, is learned in that highly compressed latent space. This training requires a fraction of the compute used for current top-performing models, while also allowing cheaper and faster inference.
+
+## Würstchen v2 comes to Diffusers
+
+After the initial paper release, we have improved numerous things in the architecture, training, and sampling, making Würstchen competitive with current state-of-the-art models in many ways. We are excited to release this new version together with Diffusers. Here is a list of the improvements:
+
+- Higher resolution (1024x1024 up to 2048x2048)
+- Faster inference
+- Multi Aspect Resolution Sampling
+- Better quality
+
+
+We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are:
+
+- v2-base
+- v2-aesthetic
+- **(default)** v2-interpolated (50% interpolation between v2-base and v2-aesthetic)
+
+We recommend using v2-interpolated, as it has a nice touch of both photorealism and aesthetics. Use v2-base for fine-tuning, as it does not have a style bias, and use v2-aesthetic for very artistic generations.
+A comparison can be seen here:
+
+
+
+## Text-to-Image Generation
+
+For the sake of usability, Würstchen can be used with a single pipeline, as follows:
+
+```python
+import torch
+from diffusers import AutoPipelineForText2Image
+from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS
+
+pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda")
+
+caption = "Anthropomorphic cat dressed as a fire fighter"
+images = pipe(
+ caption,
+ width=1024,
+ height=1536,
+ prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
+ prior_guidance_scale=4.0,
+ num_images_per_prompt=2,
+).images
+```
+
+For explanation purposes, we can also initialize the two main pipelines of Würstchen individually. Würstchen consists of 3 stages: Stage C, Stage B, and Stage A. They each have a different job and only work together. When generating text-conditional images, Stage C first generates the latents in a very compressed latent space. This is what happens in the `prior_pipeline`. Afterwards, the generated latents are passed to Stage B, which decompresses them into the larger latent space of a VQGAN. These latents are then decoded by Stage A, which is a VQGAN, into pixel space. Stage B and Stage A are both encapsulated in the `decoder_pipeline`. For more details, take a look at the [paper](https://huggingface.co/papers/2306.00637).
+
+```python
+import torch
+from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline
+from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS
+
+device = "cuda"
+dtype = torch.float16
+num_images_per_prompt = 2
+
+prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
+ "warp-ai/wuerstchen-prior", torch_dtype=dtype
+).to(device)
+decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
+ "warp-ai/wuerstchen", torch_dtype=dtype
+).to(device)
+
+caption = "Anthropomorphic cat dressed as a fire fighter"
+negative_prompt = ""
+
+prior_output = prior_pipeline(
+ prompt=caption,
+ height=1024,
+ width=1536,
+ timesteps=DEFAULT_STAGE_C_TIMESTEPS,
+ negative_prompt=negative_prompt,
+ guidance_scale=4.0,
+ num_images_per_prompt=num_images_per_prompt,
+)
+decoder_output = decoder_pipeline(
+ image_embeddings=prior_output.image_embeddings,
+ prompt=caption,
+ negative_prompt=negative_prompt,
+ guidance_scale=0.0,
+ output_type="pil",
+).images[0]
+decoder_output
+```
+
+## Speed-Up Inference
+You can make use of the `torch.compile` function and gain a speed-up of about 2-3x:
+
+```python
+prior_pipeline.prior = torch.compile(prior_pipeline.prior, mode="reduce-overhead", fullgraph=True)
+decoder_pipeline.decoder = torch.compile(decoder_pipeline.decoder, mode="reduce-overhead", fullgraph=True)
+```
+
+## Limitations
+
+- Due to the high compression employed by Würstchen, generations can lack a good amount of detail. To the human eye, this is especially noticeable in faces, hands, etc.
+- **Images can only be generated in 128-pixel steps**, e.g. the next higher resolution after 1024x1024 is 1152x1152
+- The model lacks the ability to render correct text in images
+- The model often does not achieve photorealism
+- Difficult compositional prompts are hard for the model
+
+The original codebase, as well as experimental ideas, can be found at [dome272/Wuerstchen](https://github.com/dome272/Wuerstchen).
+
+
+## WuerstchenCombinedPipeline
+
+[[autodoc]] WuerstchenCombinedPipeline
+ - all
+ - __call__
+
+## WuerstchenPriorPipeline
+
+[[autodoc]] WuerstchenPriorPipeline
+ - all
+ - __call__
+
+## WuerstchenPriorPipelineOutput
+
+[[autodoc]] pipelines.wuerstchen.pipeline_wuerstchen_prior.WuerstchenPriorPipelineOutput
+
+## WuerstchenDecoderPipeline
+
+[[autodoc]] WuerstchenDecoderPipeline
+ - all
+ - __call__
+
+## Citation
+
+```bibtex
+ @misc{pernias2023wuerstchen,
+ title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
+ author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville},
+ year={2023},
+ eprint={2306.00637},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+ }
+```
diff --git a/docs/source/en/api/schedulers/cm_stochastic_iterative.md b/docs/source/en/api/schedulers/cm_stochastic_iterative.md
new file mode 100644
index 0000000..89e50b5
--- /dev/null
+++ b/docs/source/en/api/schedulers/cm_stochastic_iterative.md
@@ -0,0 +1,27 @@
+
+
+# CMStochasticIterativeScheduler
+
+[Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever introduced a multistep and one-step scheduler (Algorithm 1) that is capable of generating good samples in one or a small number of steps.
+
+The abstract from the paper is:
+
+*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.*
+
+The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models).
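+
+Below is a hedged sketch of using this scheduler through [`ConsistencyModelPipeline`]; the checkpoint name and the explicit timestep values are assumptions and may need to be adjusted:
+
+```python
+import torch
+from diffusers import ConsistencyModelPipeline
+
+# The checkpoint name is an assumption (a distilled class-conditional ImageNet 64x64 model).
+pipe = ConsistencyModelPipeline.from_pretrained("openai/diffusers-cd_imagenet64_l2", torch_dtype=torch.float16)
+pipe.to("cuda")
+
+# One-step sampling.
+image = pipe(num_inference_steps=1).images[0]
+
+# Multistep sampling: pass explicit timesteps instead of num_inference_steps.
+image = pipe(num_inference_steps=None, timesteps=[22, 0], class_labels=145).images[0]
+```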
+
+## CMStochasticIterativeScheduler
+[[autodoc]] CMStochasticIterativeScheduler
+
+## CMStochasticIterativeSchedulerOutput
+[[autodoc]] schedulers.scheduling_consistency_models.CMStochasticIterativeSchedulerOutput
diff --git a/docs/source/en/api/schedulers/consistency_decoder.md b/docs/source/en/api/schedulers/consistency_decoder.md
new file mode 100644
index 0000000..a9eaa53
--- /dev/null
+++ b/docs/source/en/api/schedulers/consistency_decoder.md
@@ -0,0 +1,21 @@
+
+
+# ConsistencyDecoderScheduler
+
+This scheduler is a part of the [`ConsistencyDecoderPipeline`] and was introduced in [DALL-E 3](https://openai.com/dall-e-3).
+
+The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models).
+
+
+## ConsistencyDecoderScheduler
+[[autodoc]] schedulers.scheduling_consistency_decoder.ConsistencyDecoderScheduler
diff --git a/docs/source/en/api/schedulers/ddim.md b/docs/source/en/api/schedulers/ddim.md
new file mode 100644
index 0000000..952855d
--- /dev/null
+++ b/docs/source/en/api/schedulers/ddim.md
@@ -0,0 +1,82 @@
+
+
+# DDIMScheduler
+
+[Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
+
+The abstract from the paper is:
+
+*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample.
+To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models
+with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process.
+We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from.
+We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*
+
+The original codebase of this paper can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author on [tsong.me](https://tsong.me/).
+
+## Tips
+
+The paper [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion. To fix this, the authors propose:
+
+
+
+🧪 This is an experimental feature!
+
+
+
+1. rescale the noise schedule to enforce zero terminal signal-to-noise ratio (SNR)
+
+```py
+pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)
+```
+
+2. train a model with `v_prediction` (add the following argument to the [train_text_to_image.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [train_text_to_image_lora.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) scripts)
+
+```bash
+--prediction_type="v_prediction"
+```
+
+3. change the sampler to always start from the last timestep
+
+```py
+pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
+```
+
+4. rescale classifier-free guidance to prevent over-exposure
+
+```py
+image = pipe(prompt, guidance_rescale=0.7).images[0]
+```
+
+For example:
+
+```py
+from diffusers import DiffusionPipeline, DDIMScheduler
+import torch
+
+pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16)
+pipe.scheduler = DDIMScheduler.from_config(
+ pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
+)
+pipe.to("cuda")
+
+prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
+image = pipe(prompt, guidance_rescale=0.7).images[0]
+image
+```
+
+## DDIMScheduler
+[[autodoc]] DDIMScheduler
+
+## DDIMSchedulerOutput
+[[autodoc]] schedulers.scheduling_ddim.DDIMSchedulerOutput
diff --git a/docs/source/en/api/schedulers/ddim_inverse.md b/docs/source/en/api/schedulers/ddim_inverse.md
new file mode 100644
index 0000000..82069cc
--- /dev/null
+++ b/docs/source/en/api/schedulers/ddim_inverse.md
@@ -0,0 +1,19 @@
+
+
+# DDIMInverseScheduler
+
+`DDIMInverseScheduler` is the inverted scheduler from [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
+The implementation is mostly based on the DDIM inversion definition from [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794).
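+
+As a hedged sketch, the inverse scheduler is typically attached to a pipeline that performs latent inversion (here [`StableDiffusionDiffEditPipeline`] and the `stabilityai/stable-diffusion-2-1` checkpoint are used as examples), alongside a matching forward [`DDIMScheduler`]:
+
+```python
+import torch
+from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
+
+pipe = StableDiffusionDiffEditPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
+).to("cuda")
+
+# Forward scheduler for sampling and inverse scheduler for inverting real images into latents.
+pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
+```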
+
+## DDIMInverseScheduler
+[[autodoc]] DDIMInverseScheduler
diff --git a/docs/source/en/api/schedulers/ddpm.md b/docs/source/en/api/schedulers/ddpm.md
new file mode 100644
index 0000000..cfe3815
--- /dev/null
+++ b/docs/source/en/api/schedulers/ddpm.md
@@ -0,0 +1,25 @@
+
+
+# DDPMScheduler
+
+[Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion-based model of the same name. In the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline.
+
+The abstract from the paper is:
+
+*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at [this https URL](https://github.com/hojonathanho/diffusion).*
+
+## DDPMScheduler
+[[autodoc]] DDPMScheduler
+
+## DDPMSchedulerOutput
+[[autodoc]] schedulers.scheduling_ddpm.DDPMSchedulerOutput
diff --git a/docs/source/en/api/schedulers/deis.md b/docs/source/en/api/schedulers/deis.md
new file mode 100644
index 0000000..4a449b3
--- /dev/null
+++ b/docs/source/en/api/schedulers/deis.md
@@ -0,0 +1,34 @@
+
+
+# DEISMultistepScheduler
+
+Diffusion Exponential Integrator Sampler (DEIS) is proposed in [Fast Sampling of Diffusion Models with Exponential Integrator](https://huggingface.co/papers/2204.13902) by Qinsheng Zhang and Yongxin Chen. `DEISMultistepScheduler` is a fast high order solver for diffusion ordinary differential equations (ODEs).
+
+This implementation modifies the polynomial fitting formula to work in log-rho space instead of the original linear `t` space from the DEIS paper. The modification enjoys closed-form coefficients for exponential multistep updates instead of relying on a numerical solver.
+
+The abstract from the paper is:
+
+*The past few years have witnessed the great success of Diffusion models~(DMs) in generating high-fidelity samples in generative modeling tasks. A major limitation of the DM is its notoriously slow sampling procedure which normally requires hundreds to thousands of time discretization steps of the learned diffusion process to reach the desired accuracy. Our goal is to develop a fast sampling method for DMs with a much less number of steps while retaining high sample quality. To this end, we systematically analyze the sampling procedure in DMs and identify key factors that affect the sample quality, among which the method of discretization is most crucial. By carefully examining the learned diffusion process, we propose Diffusion Exponential Integrator Sampler~(DEIS). It is based on the Exponential Integrator designed for discretizing ordinary differential equations (ODEs) and leverages a semilinear structure of the learned diffusion process to reduce the discretization error. The proposed method can be applied to any DMs and can generate high-fidelity samples in as few as 10 steps. In our experiments, it takes about 3 minutes on one A6000 GPU to generate 50k images from CIFAR10. Moreover, by directly using pre-trained DMs, we achieve the state-of-art sampling performance when the number of score function evaluation~(NFE) is limited, e.g., 4.17 FID with 10 NFEs, 3.37 FID, and 9.74 IS with only 15 NFEs on CIFAR10. Code is available at [this https URL](https://github.com/qsh-zh/deis).*
+
+## Tips
+
+It is recommended to set `solver_order` to 2 or 3, while `solver_order=1` is equivalent to [`DDIMScheduler`].
+
+Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space
+diffusion models, you can set `thresholding=True` to use dynamic thresholding.
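+
+A minimal sketch of these settings (the model id is only an example):
+
+```python
+import torch
+from diffusers import DEISMultistepScheduler, DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
+# Second-order DEIS; `thresholding=True` should only be enabled for pixel-space models,
+# not for a latent-space model like this one.
+pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config, solver_order=2)
+image = pipe("a photo of an astronaut riding a horse on mars", num_inference_steps=20).images[0]
+```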
+
+## DEISMultistepScheduler
+[[autodoc]] DEISMultistepScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/dpm_discrete.md b/docs/source/en/api/schedulers/dpm_discrete.md
new file mode 100644
index 0000000..cb95f37
--- /dev/null
+++ b/docs/source/en/api/schedulers/dpm_discrete.md
@@ -0,0 +1,23 @@
+
+
+# KDPM2DiscreteScheduler
+
+The `KDPM2DiscreteScheduler` is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/).
+
+The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion).
+
+## KDPM2DiscreteScheduler
+[[autodoc]] KDPM2DiscreteScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/dpm_discrete_ancestral.md b/docs/source/en/api/schedulers/dpm_discrete_ancestral.md
new file mode 100644
index 0000000..97d205b
--- /dev/null
+++ b/docs/source/en/api/schedulers/dpm_discrete_ancestral.md
@@ -0,0 +1,23 @@
+
+
+# KDPM2AncestralDiscreteScheduler
+
+The `KDPM2DiscreteScheduler` with ancestral sampling is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/).
+
+The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion).
+
+## KDPM2AncestralDiscreteScheduler
+[[autodoc]] KDPM2AncestralDiscreteScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/dpm_sde.md b/docs/source/en/api/schedulers/dpm_sde.md
new file mode 100644
index 0000000..fe87bb9
--- /dev/null
+++ b/docs/source/en/api/schedulers/dpm_sde.md
@@ -0,0 +1,21 @@
+
+
+# DPMSolverSDEScheduler
+
+The `DPMSolverSDEScheduler` is inspired by the stochastic sampler from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/).
+
+## DPMSolverSDEScheduler
+[[autodoc]] DPMSolverSDEScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/euler.md b/docs/source/en/api/schedulers/euler.md
new file mode 100644
index 0000000..9c98118
--- /dev/null
+++ b/docs/source/en/api/schedulers/euler.md
@@ -0,0 +1,22 @@
+
+
+# EulerDiscreteScheduler
+
+The Euler scheduler (Algorithm 2) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by [Katherine Crowson](https://github.com/crowsonkb/).
+
+
+## EulerDiscreteScheduler
+[[autodoc]] EulerDiscreteScheduler
+
+## EulerDiscreteSchedulerOutput
+[[autodoc]] schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput
diff --git a/docs/source/en/api/schedulers/euler_ancestral.md b/docs/source/en/api/schedulers/euler_ancestral.md
new file mode 100644
index 0000000..eba9b06
--- /dev/null
+++ b/docs/source/en/api/schedulers/euler_ancestral.md
@@ -0,0 +1,21 @@
+
+
+# EulerAncestralDiscreteScheduler
+
+A scheduler that uses ancestral sampling with Euler method steps. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L72) implementation by [Katherine Crowson](https://github.com/crowsonkb/).
+
+## EulerAncestralDiscreteScheduler
+[[autodoc]] EulerAncestralDiscreteScheduler
+
+## EulerAncestralDiscreteSchedulerOutput
+[[autodoc]] schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput
diff --git a/docs/source/en/api/schedulers/heun.md b/docs/source/en/api/schedulers/heun.md
new file mode 100644
index 0000000..bca5cf7
--- /dev/null
+++ b/docs/source/en/api/schedulers/heun.md
@@ -0,0 +1,21 @@
+
+
+# HeunDiscreteScheduler
+
+The Heun scheduler (Algorithm 1) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. The scheduler is ported from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library and created by [Katherine Crowson](https://github.com/crowsonkb/).
+
+## HeunDiscreteScheduler
+[[autodoc]] HeunDiscreteScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/ipndm.md b/docs/source/en/api/schedulers/ipndm.md
new file mode 100644
index 0000000..eeeee8a
--- /dev/null
+++ b/docs/source/en/api/schedulers/ipndm.md
@@ -0,0 +1,21 @@
+
+
+# IPNDMScheduler
+
+`IPNDMScheduler` is a fourth-order Improved Pseudo Linear Multistep scheduler. The original implementation can be found at [crowsonkb/v-diffusion-pytorch](https://github.com/crowsonkb/v-diffusion-pytorch/blob/987f8985e38208345c1959b0ea767a625831cc9b/diffusion/sampling.py#L296).
+
+## IPNDMScheduler
+[[autodoc]] IPNDMScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/lcm.md b/docs/source/en/api/schedulers/lcm.md
new file mode 100644
index 0000000..93e80ea
--- /dev/null
+++ b/docs/source/en/api/schedulers/lcm.md
@@ -0,0 +1,21 @@
+
+
+# Latent Consistency Model Multistep Scheduler
+
+## Overview
+
+A multistep and one-step scheduler (Algorithm 3) introduced alongside latent consistency models in the paper [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao.
+This scheduler should be able to generate good samples from [`LatentConsistencyModelPipeline`] in 1-8 steps.
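+
+A hedged sketch, assuming a latent consistency checkpoint such as `SimianLuo/LCM_Dreamshaper_v7` whose config already selects [`LCMScheduler`]:
+
+```python
+import torch
+from diffusers import DiffusionPipeline
+
+# The checkpoint name is an assumption; LCM checkpoints are expected to ship with an LCMScheduler config.
+pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16).to("cuda")
+
+# Few-step sampling (LCM applies guidance through an embedding rather than classifier-free guidance).
+image = pipe("a photo of a cat", num_inference_steps=4, guidance_scale=8.0).images[0]
+```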
+
+## LCMScheduler
+[[autodoc]] LCMScheduler
diff --git a/docs/source/en/api/schedulers/lms_discrete.md b/docs/source/en/api/schedulers/lms_discrete.md
new file mode 100644
index 0000000..a0f4aea
--- /dev/null
+++ b/docs/source/en/api/schedulers/lms_discrete.md
@@ -0,0 +1,21 @@
+
+
+# LMSDiscreteScheduler
+
+`LMSDiscreteScheduler` is a linear multistep scheduler for discrete beta schedules. The scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/), and the original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181).
+
+## LMSDiscreteScheduler
+[[autodoc]] LMSDiscreteScheduler
+
+## LMSDiscreteSchedulerOutput
+[[autodoc]] schedulers.scheduling_lms_discrete.LMSDiscreteSchedulerOutput
diff --git a/docs/source/en/api/schedulers/multistep_dpm_solver.md b/docs/source/en/api/schedulers/multistep_dpm_solver.md
new file mode 100644
index 0000000..3487f99
--- /dev/null
+++ b/docs/source/en/api/schedulers/multistep_dpm_solver.md
@@ -0,0 +1,35 @@
+
+
+# DPMSolverMultistepScheduler
+
+`DPMSolverMultistep` is a multistep scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu.
+
+DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality
+samples, and it can generate quite good samples even in 10 steps.
+
+## Tips
+
+It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling.
+
+Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space
+diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic
+thresholding. This thresholding method is unsuitable for latent-space diffusion models such as
+Stable Diffusion.
+
+The SDE variant of DPMSolver and DPM-Solver++ is also supported, but only for the first and second-order solvers. This is a fast SDE solver for the reverse diffusion SDE. It is recommended to use the second-order `sde-dpmsolver++`.
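+
+For example, a minimal sketch of swapping in the SDE variant (the model id is only an example):
+
+```python
+import torch
+from diffusers import DPMSolverMultistepScheduler, DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
+# Second-order SDE variant of DPM-Solver++.
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(
+    pipe.scheduler.config, algorithm_type="sde-dpmsolver++", solver_order=2
+)
+image = pipe("a photo of an astronaut riding a horse on mars", num_inference_steps=20).images[0]
+```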
+
+## DPMSolverMultistepScheduler
+[[autodoc]] DPMSolverMultistepScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md b/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md
new file mode 100644
index 0000000..b77a5cf
--- /dev/null
+++ b/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md
@@ -0,0 +1,30 @@
+
+
+# DPMSolverMultistepInverse
+
+`DPMSolverMultistepInverse` is the inverted scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu.
+
+The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794) and notebook implementation of the [`DiffEdit`] latent inversion from [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/diffedit.ipynb).
+
+## Tips
+
+Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space
+diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use the dynamic
+thresholding. This thresholding method is unsuitable for latent-space diffusion models such as
+Stable Diffusion.
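+
+A minimal sketch of constructing the scheduler with these options for a pixel-space model:
+
+```python
+from diffusers import DPMSolverMultistepInverseScheduler
+
+# Dynamic thresholding is only appropriate for pixel-space diffusion models.
+inverse_scheduler = DPMSolverMultistepInverseScheduler(algorithm_type="dpmsolver++", thresholding=True)
+```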
+
+## DPMSolverMultistepInverseScheduler
+[[autodoc]] DPMSolverMultistepInverseScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/overview.md b/docs/source/en/api/schedulers/overview.md
new file mode 100644
index 0000000..28db9f3
--- /dev/null
+++ b/docs/source/en/api/schedulers/overview.md
@@ -0,0 +1,64 @@
+
+
+# Schedulers
+
+🤗 Diffusers provides many scheduler functions for the diffusion process. A scheduler takes a model's output (the sample which the diffusion process is iterating on) and a timestep to return a denoised sample. The timestep is important because it dictates where in the diffusion process the step is; data is generated by iterating forward *n* timesteps and inference occurs by propagating backward through the timesteps. Based on the timestep, a scheduler may be *discrete*, in which case the timestep is an `int`, or *continuous*, in which case the timestep is a `float`.
+
+Depending on the context, a scheduler defines how to iteratively add noise to an image or how to update a sample based on a model's output:
+
+- during *training*, a scheduler adds noise (there are different algorithms for how to add noise) to a sample to train a diffusion model
+- during *inference*, a scheduler defines how to update a sample based on a pretrained model's output
+
+Many schedulers are implemented from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library by [Katherine Crowson](https://github.com/crowsonkb/), and they're also widely used in A1111. To help you map the schedulers from k-diffusion and A1111 to the schedulers in 🤗 Diffusers, take a look at the table below:
+
+| A1111/k-diffusion | 🤗 Diffusers | Usage |
+|---------------------|-------------------------------------|---------------------------------------------------------------------------------------------------------------|
+| DPM++ 2M | [`DPMSolverMultistepScheduler`] | |
+| DPM++ 2M Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` |
+| DPM++ 2M SDE | [`DPMSolverMultistepScheduler`] | init with `algorithm_type="sde-dpmsolver++"` |
+| DPM++ 2M SDE Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` and `algorithm_type="sde-dpmsolver++"` |
+| DPM++ 2S a | N/A | very similar to `DPMSolverSinglestepScheduler` |
+| DPM++ 2S a Karras | N/A | very similar to `DPMSolverSinglestepScheduler(use_karras_sigmas=True, ...)` |
+| DPM++ SDE | [`DPMSolverSinglestepScheduler`] | |
+| DPM++ SDE Karras | [`DPMSolverSinglestepScheduler`] | init with `use_karras_sigmas=True` |
+| DPM2 | [`KDPM2DiscreteScheduler`] | |
+| DPM2 Karras | [`KDPM2DiscreteScheduler`] | init with `use_karras_sigmas=True` |
+| DPM2 a | [`KDPM2AncestralDiscreteScheduler`] | |
+| DPM2 a Karras | [`KDPM2AncestralDiscreteScheduler`] | init with `use_karras_sigmas=True` |
+| DPM adaptive | N/A | |
+| DPM fast | N/A | |
+| Euler | [`EulerDiscreteScheduler`] | |
+| Euler a | [`EulerAncestralDiscreteScheduler`] | |
+| Heun | [`HeunDiscreteScheduler`] | |
+| LMS | [`LMSDiscreteScheduler`] | |
+| LMS Karras | [`LMSDiscreteScheduler`] | init with `use_karras_sigmas=True` |
+| N/A | [`DEISMultistepScheduler`] | |
+| N/A | [`UniPCMultistepScheduler`] | |
+
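+For example, a minimal sketch (the model id is only an example) of applying the "init with ..." arguments from the table above when swapping a pipeline's scheduler, here to match A1111's "DPM++ 2M Karras":
+
+```python
+import torch
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
+
+pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
+# Pass the init arguments from the table as overrides to `from_config`.
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
+```
+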
+All schedulers are built from the base [`SchedulerMixin`] class which implements low level utilities shared by all schedulers.
+
+## SchedulerMixin
+[[autodoc]] SchedulerMixin
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
+
+## KarrasDiffusionSchedulers
+
+[`KarrasDiffusionSchedulers`] are a broad generalization of schedulers in 🤗 Diffusers. The schedulers in this class are distinguished at a high level by their noise sampling strategy, the type of network and scaling, the training strategy, and how the loss is weighted.
+
+The different schedulers in this class, depending on the ordinary differential equations (ODE) solver type, fall into the above taxonomy and provide a good abstraction for the design of the main schedulers implemented in 🤗 Diffusers. The schedulers in this class are given [here](https://github.com/huggingface/diffusers/blob/a69754bb879ed55b9b6dc9dd0b3cf4fa4124c765/src/diffusers/schedulers/scheduling_utils.py#L32).
+
+## PushToHubMixin
+
+[[autodoc]] utils.PushToHubMixin
diff --git a/docs/source/en/api/schedulers/pndm.md b/docs/source/en/api/schedulers/pndm.md
new file mode 100644
index 0000000..ed959d5
--- /dev/null
+++ b/docs/source/en/api/schedulers/pndm.md
@@ -0,0 +1,21 @@
+
+
+# PNDMScheduler
+
+`PNDMScheduler`, or pseudo numerical methods for diffusion models, uses more advanced ODE integration techniques such as the Runge-Kutta and linear multistep methods. The original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181).
+
+## PNDMScheduler
+[[autodoc]] PNDMScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/repaint.md b/docs/source/en/api/schedulers/repaint.md
new file mode 100644
index 0000000..3b19e34
--- /dev/null
+++ b/docs/source/en/api/schedulers/repaint.md
@@ -0,0 +1,27 @@
+
+
+# RePaintScheduler
+
+`RePaintScheduler` is a DDPM-based inpainting scheduler for unsupervised inpainting with extreme masks. It is designed to be used with the [`RePaintPipeline`], and it is based on the paper [RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2201.09865) by Andreas Lugmayr et al.
+
+The abstract from the paper is:
+
+*Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks. RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions. GitHub Repository: [this http URL](http://git.io/RePaint).*
+
+The original implementation can be found at [andreas128/RePaint](https://github.com/andreas128/RePaint).
+
+## RePaintScheduler
+[[autodoc]] RePaintScheduler
+
+## RePaintSchedulerOutput
+[[autodoc]] schedulers.scheduling_repaint.RePaintSchedulerOutput
diff --git a/docs/source/en/api/schedulers/score_sde_ve.md b/docs/source/en/api/schedulers/score_sde_ve.md
new file mode 100644
index 0000000..43bce14
--- /dev/null
+++ b/docs/source/en/api/schedulers/score_sde_ve.md
@@ -0,0 +1,25 @@
+
+
+# ScoreSdeVeScheduler
+
+`ScoreSdeVeScheduler` is a variance exploding stochastic differential equation (SDE) scheduler. It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole.
+
+The abstract from the paper is:
+
+*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.*
+
+## ScoreSdeVeScheduler
+[[autodoc]] ScoreSdeVeScheduler
+
+## SdeVeOutput
+[[autodoc]] schedulers.scheduling_sde_ve.SdeVeOutput
diff --git a/docs/source/en/api/schedulers/score_sde_vp.md b/docs/source/en/api/schedulers/score_sde_vp.md
new file mode 100644
index 0000000..4b25b25
--- /dev/null
+++ b/docs/source/en/api/schedulers/score_sde_vp.md
@@ -0,0 +1,28 @@
+
+
+# ScoreSdeVpScheduler
+
+`ScoreSdeVpScheduler` is a variance preserving stochastic differential equation (SDE) scheduler. It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole.
+
+The abstract from the paper is:
+
+*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.*
+
+
+
+🚧 This scheduler is under construction!
+
+
+
+## ScoreSdeVpScheduler
+[[autodoc]] schedulers.deprecated.scheduling_sde_vp.ScoreSdeVpScheduler
diff --git a/docs/source/en/api/schedulers/singlestep_dpm_solver.md b/docs/source/en/api/schedulers/singlestep_dpm_solver.md
new file mode 100644
index 0000000..063678f
--- /dev/null
+++ b/docs/source/en/api/schedulers/singlestep_dpm_solver.md
@@ -0,0 +1,35 @@
+
+
+# DPMSolverSinglestepScheduler
+
+`DPMSolverSinglestepScheduler` is a single step scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu.
+
+DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality
+samples, and it can generate quite good samples even in 10 steps.
+
+The original implementation can be found at [LuChengTHU/dpm-solver](https://github.com/LuChengTHU/dpm-solver).
+
+## Tips
+
+It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling.
+
+Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space
+diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic
+thresholding. This thresholding method is unsuitable for latent-space diffusion models such as
+Stable Diffusion.
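+
+A minimal sketch of these settings:
+
+```python
+from diffusers import DPMSolverSinglestepScheduler
+
+# Second-order single-step DPM-Solver++ for guided sampling; enable `thresholding=True`
+# only for pixel-space models.
+scheduler = DPMSolverSinglestepScheduler(solver_order=2, algorithm_type="dpmsolver++")
+```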
+
+## DPMSolverSinglestepScheduler
+[[autodoc]] DPMSolverSinglestepScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/stochastic_karras_ve.md b/docs/source/en/api/schedulers/stochastic_karras_ve.md
new file mode 100644
index 0000000..2d08b32
--- /dev/null
+++ b/docs/source/en/api/schedulers/stochastic_karras_ve.md
@@ -0,0 +1,21 @@
+
+
+# KarrasVeScheduler
+
+`KarrasVeScheduler` is a stochastic sampler tailored to variance exploding (VE) models. It is based on the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) and [Score-based generative modeling through stochastic differential equations](https://huggingface.co/papers/2011.13456) papers.
+
+## KarrasVeScheduler
+[[autodoc]] KarrasVeScheduler
+
+## KarrasVeOutput
+[[autodoc]] schedulers.deprecated.scheduling_karras_ve.KarrasVeOutput
\ No newline at end of file
diff --git a/docs/source/en/api/schedulers/tcd.md b/docs/source/en/api/schedulers/tcd.md
new file mode 100644
index 0000000..3df7390
--- /dev/null
+++ b/docs/source/en/api/schedulers/tcd.md
@@ -0,0 +1,29 @@
+
+
+# TCDScheduler
+
+[Trajectory Consistency Distillation](https://huggingface.co/papers/2402.19159) by Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao and Tat-Jen Cham introduced Strategic Stochastic Sampling (Algorithm 4), which is capable of generating good samples in a small number of steps. As an advanced iteration of the multistep scheduler (Algorithm 1) from [Consistency Models](https://huggingface.co/papers/2303.01469), Strategic Stochastic Sampling is specifically tailored to the trajectory consistency function.
+
+The abstract from the paper is:
+
+*Latent Consistency Model (LCM) extends the Consistency Model to the latent space and leverages the guided consistency distillation technique to achieve impressive performance in accelerating text-to-image synthesis. However, we observed that LCM struggles to generate images with both clarity and detailed intricacy. To address this limitation, we initially delve into and elucidate the underlying causes. Our investigation identifies that the primary issue stems from errors in three distinct areas. Consequently, we introduce Trajectory Consistency Distillation (TCD), which encompasses trajectory consistency function and strategic stochastic sampling. The trajectory consistency function diminishes the distillation errors by broadening the scope of the self-consistency boundary condition and endowing the TCD with the ability to accurately trace the entire trajectory of the Probability Flow ODE. Additionally, strategic stochastic sampling is specifically designed to circumvent the accumulated errors inherent in multi-step consistency sampling, which is meticulously tailored to complement the TCD model. Experiments demonstrate that TCD not only significantly enhances image quality at low NFEs but also yields more detailed results compared to the teacher model at high NFEs.*
+
+The original codebase can be found at [jabir-zheng/TCD](https://github.com/jabir-zheng/TCD).
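+
+A hedged sketch of pairing the scheduler with a TCD LoRA; the checkpoint names (`stabilityai/stable-diffusion-xl-base-1.0`, `h1t/TCD-SDXL-LoRA`) are assumptions and may need to be adjusted:
+
+```python
+import torch
+from diffusers import StableDiffusionXLPipeline, TCDScheduler
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
+
+# The TCD LoRA checkpoint name is an assumption.
+pipe.load_lora_weights("h1t/TCD-SDXL-LoRA")
+pipe.fuse_lora()
+
+# `eta` controls the stochasticity of strategic stochastic sampling.
+image = pipe("a photo of a cat", num_inference_steps=4, guidance_scale=0, eta=0.3).images[0]
+```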
+
+## TCDScheduler
+[[autodoc]] TCDScheduler
+
+
+## TCDSchedulerOutput
+[[autodoc]] schedulers.scheduling_tcd.TCDSchedulerOutput
+
diff --git a/docs/source/en/api/schedulers/unipc.md b/docs/source/en/api/schedulers/unipc.md
new file mode 100644
index 0000000..d823459
--- /dev/null
+++ b/docs/source/en/api/schedulers/unipc.md
@@ -0,0 +1,35 @@
+
+
+# UniPCMultistepScheduler
+
+`UniPCMultistepScheduler` is a training-free framework designed for fast sampling of diffusion models. It was introduced in [UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models](https://huggingface.co/papers/2302.04867) by Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu.
+
+It consists of a corrector (UniC) and a predictor (UniP) that share a unified analytical form and support arbitrary orders.
+UniPC is by design model-agnostic, supporting pixel-space/latent-space DPMs on unconditional/conditional sampling. It can also be applied to both noise prediction and data prediction models. The corrector UniC can also be applied after any off-the-shelf solver to increase the order of accuracy.
+
+The abstract from the paper is:
+
+*Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM is time-consuming due to the multiple evaluations of the denoising network, making it more and more important to accelerate the sampling of DPMs. Despite recent progress in designing fast samplers, existing methods still cannot generate satisfying images in many applications where fewer steps (e.g., <10) are favored. In this paper, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods, especially in extremely few steps. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256ร256 (conditional) with only 10 function evaluations. Code is available at [this https URL](https://github.com/wl-zhao/UniPC).*
+
+## Tips
+
+It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling.
+
+Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported. For pixel-space
+diffusion models, you can enable it by setting both `predict_x0=True` and `thresholding=True`. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion.
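+
+Below is a minimal sketch (the checkpoint name is only an example) of how these options might be passed when swapping the scheduler into an existing pipeline:
+
+```python
+import torch
+from diffusers import DiffusionPipeline, UniPCMultistepScheduler
+
+pipeline = DiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+# Reuse the pipeline's scheduler configuration and override the solver order
+# (2 is suggested for guided sampling, 3 for unconditional sampling).
+# `thresholding=True` is left out here because Stable Diffusion is a latent-space model.
+pipeline.scheduler = UniPCMultistepScheduler.from_config(
+    pipeline.scheduler.config, solver_order=2
+)
+
+image = pipeline("An astronaut riding a horse", num_inference_steps=10).images[0]
+```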
+
+## UniPCMultistepScheduler
+[[autodoc]] UniPCMultistepScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/vq_diffusion.md b/docs/source/en/api/schedulers/vq_diffusion.md
new file mode 100644
index 0000000..b21cba9
--- /dev/null
+++ b/docs/source/en/api/schedulers/vq_diffusion.md
@@ -0,0 +1,25 @@
+
+
+# VQDiffusionScheduler
+
+`VQDiffusionScheduler` converts the transformer model's output into a sample for the unnoised image at the previous diffusion timestep. It was introduced in [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://huggingface.co/papers/2111.14822) by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo.
+
+The abstract from the paper is:
+
+*We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.*
+
+## VQDiffusionScheduler
+[[autodoc]] VQDiffusionScheduler
+
+## VQDiffusionSchedulerOutput
+[[autodoc]] schedulers.scheduling_vq_diffusion.VQDiffusionSchedulerOutput
diff --git a/docs/source/en/api/utilities.md b/docs/source/en/api/utilities.md
new file mode 100644
index 0000000..71253db
--- /dev/null
+++ b/docs/source/en/api/utilities.md
@@ -0,0 +1,39 @@
+
+
+# Utilities
+
+Utility and helper functions for working with ๐ค Diffusers.
+
+## numpy_to_pil
+
+[[autodoc]] utils.numpy_to_pil
+
+## pt_to_pil
+
+[[autodoc]] utils.pt_to_pil
+
+## load_image
+
+[[autodoc]] utils.load_image
+
+## export_to_gif
+
+[[autodoc]] utils.export_to_gif
+
+## export_to_video
+
+[[autodoc]] utils.export_to_video
+
+## make_image_grid
+
+[[autodoc]] utils.make_image_grid
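+
+As a brief illustration (the URLs below are placeholders), several of these helpers can be combined, for example to load images and arrange them into a grid:
+
+```python
+from diffusers.utils import load_image, make_image_grid
+
+# Load two images from URLs or local paths (placeholder paths shown here).
+image_1 = load_image("https://example.com/image_1.png")
+image_2 = load_image("https://example.com/image_2.png")
+
+# Arrange them into a 1x2 grid and save the result.
+grid = make_image_grid([image_1, image_2], rows=1, cols=2)
+grid.save("image_grid.png")
+```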
diff --git a/docs/source/en/conceptual/contribution.md b/docs/source/en/conceptual/contribution.md
new file mode 100644
index 0000000..24ac52b
--- /dev/null
+++ b/docs/source/en/conceptual/contribution.md
@@ -0,0 +1,525 @@
+
+
+# How to contribute to Diffusers ๐งจ
+
+We โค๏ธ contributions from the open-source community! Everyone is welcome, and all types of participation, not just code, are valued and appreciated. Answering questions, helping others, reaching out, and improving the documentation are all immensely valuable to the community, so don't be afraid to get involved if you're up for it!
+
+Everyone is encouraged to start by saying ๐ in our public Discord channel. We discuss the latest trends in diffusion models, ask questions, show off personal projects, help each other with contributions, or just hang out โ.
+
+Whichever way you choose to contribute, we strive to be part of an open, welcoming, and kind community. Please, read our [code of conduct](https://github.com/huggingface/diffusers/blob/main/CODE_OF_CONDUCT.md) and be mindful to respect it during your interactions. We also recommend you become familiar with the [ethical guidelines](https://huggingface.co/docs/diffusers/conceptual/ethical_guidelines) that guide our project and ask you to adhere to the same principles of transparency and responsibility.
+
+We enormously value feedback from the community, so please do not be afraid to speak up if you believe you have valuable feedback that can help improve the library - every message, comment, issue, and pull request (PR) is read and considered.
+
+## Overview
+
+You can contribute in many ways ranging from answering questions on issues to adding new diffusion models to
+the core library.
+
+In the following, we give an overview of different ways to contribute, ranked by difficulty in ascending order. All of them are valuable to the community.
+
+* 1. Asking and answering questions on [the Diffusers discussion forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers) or on [Discord](https://discord.gg/G7tWnz98XR).
+* 2. Opening new issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues/new/choose).
+* 3. Answering issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues).
+* 4. Fix a simple issue, marked by the "Good first issue" label, see [here](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).
+* 5. Contribute to the [documentation](https://github.com/huggingface/diffusers/tree/main/docs/source).
+* 6. Contribute a [Community Pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3Acommunity-examples).
+* 7. Contribute to the [examples](https://github.com/huggingface/diffusers/tree/main/examples).
+* 8. Fix a more difficult issue, marked by the "Good second issue" label, see [here](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22).
+* 9. Add a new pipeline, model, or scheduler, see ["New Pipeline/Model"](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) and ["New scheduler"](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22) issues. For this contribution, please have a look at [Design Philosophy](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md).
+
+As said before, **all contributions are valuable to the community**.
+In the following, we will explain each contribution a bit more in detail.
+
+For all contributions 4 - 9, you will need to open a PR. It is explained in detail how to do so in [Opening a pull request](#how-to-open-a-pr).
+
+### 1. Asking and answering questions on the Diffusers discussion forum or on the Diffusers Discord
+
+Any question or comment related to the Diffusers library can be asked on the [discussion forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/) or on [Discord](https://discord.gg/G7tWnz98XR). Such questions and comments include (but are not limited to):
+- Reports of training or inference experiments in an attempt to share knowledge
+- Presentation of personal projects
+- Questions about non-official training examples
+- Project proposals
+- General feedback
+- Paper summaries
+- Asking for help on personal projects that build on top of the Diffusers library
+- General questions
+- Ethical questions regarding diffusion models
+- ...
+
+Every question that is asked on the forum or on Discord actively encourages the community to publicly
+share knowledge and might very well help a beginner in the future who has the same question you're
+having. Please do pose any questions you might have.
+In the same spirit, you are of immense help to the community by answering such questions because this way you are publicly documenting knowledge for everybody to learn from.
+
+**Please** keep in mind that the more effort you put into asking or answering a question, the higher
+the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database.
+In short, a high-quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formatted/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
+
+**NOTE about channels**:
+[*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that were posted some time ago.
+In addition, questions and answers posted in the forum can easily be linked to.
+In contrast, *Discord* has a chat-like format that invites fast back-and-forth communication.
+While it will most likely take less time for you to get an answer to your question on Discord, your
+question won't be visible anymore over time. Also, it's much harder to find information that was posted a while back on Discord. We therefore strongly recommend using the forum for high-quality questions and answers in an attempt to create long-lasting knowledge for the community. If discussions on Discord lead to very interesting answers and conclusions, we recommend posting the results on the forum to make the information more available for future readers.
+
+### 2. Opening new issues on the GitHub issues tab
+
+The ๐งจ Diffusers library is robust and reliable thanks to the users who notify us of
+the problems they encounter. So thank you for reporting an issue.
+
+Remember, GitHub issues are reserved for technical questions directly related to the Diffusers library, bug reports, feature requests, or feedback on the library design.
+
+In a nutshell, this means that everything that is **not** related to the **code of the Diffusers library** (including the documentation) should **not** be asked on GitHub, but rather on either the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or [Discord](https://discord.gg/G7tWnz98XR).
+
+**Please consider the following guidelines when opening a new issue**:
+- Make sure you have searched whether your issue has already been asked before (use the search bar on GitHub under Issues).
+- Please never report a new issue on another (related) issue. If another issue is highly related, please
+open a new issue nevertheless and link to the related issue.
+- Make sure your issue is written in English. Please use one of the great, free online translation services, such as [DeepL](https://www.deepl.com/translator) to translate from your native language to English if you are not comfortable in English.
+- Check whether your issue might be solved by updating to the newest Diffusers version. Before posting your issue, please make sure that `python -c "import diffusers; print(diffusers.__version__)"` is higher or matches the latest Diffusers version.
+- Remember that the more effort you put into opening a new issue, the higher the quality of your answer will be and the better the overall quality of the Diffusers issues.
+
+New issues usually include the following.
+
+#### 2.1. Reproducible, minimal bug reports
+
+A bug report should always have a reproducible code snippet and be as minimal and concise as possible.
+This means in more detail:
+- Narrow the bug down as much as you can, **do not just dump your whole code file**.
+- Format your code.
+- Do not include any external libraries other than Diffusers and the libraries it depends on.
+- **Always** provide all necessary information about your environment; for this, you can run: `diffusers-cli env` in your shell and copy-paste the displayed information to the issue.
+- Explain the issue. If the reader doesn't know what the issue is and why it is an issue, she cannot solve it.
+- **Always** make sure the reader can reproduce your issue with as little effort as possible. If your code snippet cannot be run because of missing libraries or undefined variables, the reader cannot help you. Make sure your reproducible code snippet is as minimal as possible and can be copy-pasted into a simple Python shell.
+- If in order to reproduce your issue a model and/or dataset is required, make sure the reader has access to that model or dataset. You can always upload your model or dataset to the [Hub](https://huggingface.co) to make it easily downloadable. Try to keep your model and dataset as small as possible, to make the reproduction of your issue as effortless as possible.
+
+For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
+
+You can open a bug report [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=bug&projects=&template=bug-report.yml).
+
+#### 2.2. Feature requests
+
+A world-class feature request addresses the following points:
+
+1. Motivation first:
+* Is it related to a problem/frustration with the library? If so, please explain
+why. Providing a code snippet that demonstrates the problem is best.
+* Is it related to something you would need for a project? We'd love to hear
+about it!
+* Is it something you worked on and think could benefit the community?
+Awesome! Tell us what problem it solved for you.
+2. Write a *full paragraph* describing the feature;
+3. Provide a **code snippet** that demonstrates its future use;
+4. In case this is related to a paper, please attach a link;
+5. Attach any additional information (drawings, screenshots, etc.) you think may help.
+
+You can open a feature request [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=).
+
+#### 2.3 Feedback
+
+Feedback about the library design and why it is good or bad helps the core maintainers immensely to build a user-friendly library. To understand the reasoning behind the current design philosophy, please have a look [here](https://huggingface.co/docs/diffusers/conceptual/philosophy). If you feel like a certain design choice does not fit with the current design philosophy, please explain why and how it should be changed. If a certain design choice follows the design philosophy too strictly and hence restricts use cases, explain why and how it should be changed.
+If a certain design choice is very useful for you, please also leave a note as this is great feedback for future design decisions.
+
+You can open an issue about feedback [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=).
+
+#### 2.4 Technical questions
+
+Technical questions are mainly about why certain code of the library was written in a certain way, or what a certain part of the code does. Please make sure to link to the code in question and please provide details on
+why this part of the code is difficult to understand.
+
+You can open an issue about a technical question [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=bug&template=bug-report.yml).
+
+#### 2.5 Proposal to add a new model, scheduler, or pipeline
+
+If the diffusion model community released a new model, pipeline, or scheduler that you would like to see in the Diffusers library, please provide the following information:
+
+* Short description of the diffusion pipeline, model, or scheduler and link to the paper or public release.
+* Link to any of its open-source implementation(s).
+* Link to the model weights if they are available.
+
+If you are willing to contribute the model yourself, let us know so we can best guide you. Also, don't forget
+to tag the original author of the component (model, scheduler, pipeline, etc.) by their GitHub handle if you can find it.
+
+You can open a request for a model/pipeline/scheduler [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=New+model%2Fpipeline%2Fscheduler&template=new-model-addition.yml).
+
+### 3. Answering issues on the GitHub issues tab
+
+Answering issues on GitHub might require some technical knowledge of Diffusers, but we encourage everybody to give it a try even if you are not 100% certain that your answer is correct.
+Some tips to give a high-quality answer to an issue:
+- Be as concise and minimal as possible.
+- Stay on topic. An answer to the issue should concern the issue and only the issue.
+- Provide links to code, papers, or other sources that prove or encourage your point.
+- Answer in code. If a simple code snippet is the answer to the issue or shows how the issue can be solved, please provide a fully reproducible code snippet.
+
+Also, many issues tend to be simply off-topic, duplicates of other issues, or irrelevant. It is of great
+help to the maintainers if you can answer such issues, encouraging the author of the issue to be
+more precise, provide the link to a duplicated issue or redirect them to [the forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or [Discord](https://discord.gg/G7tWnz98XR).
+
+If you have verified that a reported bug is correct and requires a fix in the source code,
+please have a look at the next sections.
+
+For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull request](#how-to-open-a-pr) section.
+
+### 4. Fixing a "Good first issue"
+
+*Good first issues* are marked by the [Good first issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) label. Usually, the issue already
+explains how a potential solution should look so that it is easier to fix.
+If the issue hasn't been closed and you would like to try to fix this issue, you can just leave a message "I would like to try this issue.". There are usually three scenarios:
+- a.) The issue description already proposes a fix. In this case and if the solution makes sense to you, you can open a PR or draft PR to fix it.
+- b.) The issue description does not propose a fix. In this case, you can ask what a proposed fix could look like and someone from the Diffusers team should answer shortly. If you have a good idea of how to fix it, feel free to directly open a PR.
+- c.) There is already an open PR to fix the issue, but the issue hasn't been closed yet. If the PR has gone stale, you can simply open a new PR and link to the stale PR. PRs often go stale if the original contributor who wanted to fix the issue suddenly cannot find the time anymore to proceed. This often happens in open-source and is very normal. In this case, the community will be very happy if you give it a new try and leverage the knowledge of the existing PR. If there is already a PR and it is active, you can help the author by giving suggestions, reviewing the PR or even asking whether you can contribute to the PR.
+
+
+### 5. Contribute to the documentation
+
+A good library **always** has good documentation! The official documentation is often one of the first points of contact for new users of the library, and therefore contributing to the documentation is a **highly
+valuable contribution**.
+
+Contributing to the documentation can take many forms:
+
+- Correcting spelling or grammatical errors.
+- Correcting incorrect formatting of a docstring. If you see that the official documentation is displayed incorrectly or that a link is broken, we would be very happy if you took some time to correct it.
+- Correcting the shape or dimensions of a docstring input or output tensor.
+- Clarifying documentation that is hard to understand or incorrect.
+- Updating outdated code examples.
+- Translating the documentation to another language.
+
+Anything displayed on [the official Diffusers doc page](https://huggingface.co/docs/diffusers/index) is part of the official documentation and can be corrected or adjusted in the respective [documentation source](https://github.com/huggingface/diffusers/tree/main/docs/source).
+
+Please have a look at [this page](https://github.com/huggingface/diffusers/tree/main/docs) on how to verify changes made to the documentation locally.
+
+
+### 6. Contribute a community pipeline
+
+[Pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview) are usually the first point of contact between the Diffusers library and the user.
+Pipelines are examples of how to use Diffusers [models](https://huggingface.co/docs/diffusers/api/models/overview) and [schedulers](https://huggingface.co/docs/diffusers/api/schedulers/overview).
+We support two types of pipelines:
+
+- Official Pipelines
+- Community Pipelines
+
+Both official and community pipelines follow the same design and consist of the same type of components.
+
+Official pipelines are tested and maintained by the core maintainers of Diffusers. Their code
+resides in [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines).
+In contrast, community pipelines are contributed and maintained purely by the **community** and are **not** tested.
+They reside in [examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) and while they can be accessed via the [PyPI diffusers package](https://pypi.org/project/diffusers/), their code is not part of the PyPI distribution.
+
+The reason for the distinction is that the core maintainers of the Diffusers library cannot maintain and test all
+possible ways diffusion models can be used for inference, but some of them may be of interest to the community.
+Officially released diffusion pipelines,
+such as Stable Diffusion, are added to the core `src/diffusers/pipelines` package, which ensures
+a high quality of maintenance, no backward-breaking code changes, and testing.
+More bleeding-edge pipelines should be added as community pipelines. If usage for a community pipeline is high, the pipeline can be moved to the official pipelines upon request from the community. This is one of the ways we strive to be a community-driven library.
+
+To add a community pipeline, one should add a .py file to [examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) and adapt the [examples/community/README.md](https://github.com/huggingface/diffusers/tree/main/examples/community/README.md) to include an example of the new pipeline.
+
+An example can be seen [here](https://github.com/huggingface/diffusers/pull/2400).
+
+Community pipeline PRs are only checked at a superficial level and ideally they should be maintained by their original authors.
+
+Contributing a community pipeline is a great way to understand how Diffusers models and schedulers work. Having contributed a community pipeline is usually the first stepping stone to contributing an official pipeline to the
+core package.
+
+### 7. Contribute to training examples
+
+Diffusers examples are a collection of training scripts that reside in [examples](https://github.com/huggingface/diffusers/tree/main/examples).
+
+We support two types of training examples:
+
+- Official training examples
+- Research training examples
+
+Research training examples are located in [examples/research_projects](https://github.com/huggingface/diffusers/tree/main/examples/research_projects) whereas official training examples include all folders under [examples](https://github.com/huggingface/diffusers/tree/main/examples) except the `research_projects` and `community` folders.
+The official training examples are maintained by the Diffusers' core maintainers whereas the research training examples are maintained by the community.
+This is because of the same reasons put forward in [6. Contribute a community pipeline](#6-contribute-a-community-pipeline) for official pipelines vs. community pipelines: It is not feasible for the core maintainers to maintain all possible training methods for diffusion models.
+If the Diffusers core maintainers and the community consider a certain training paradigm to be too experimental or not popular enough, the corresponding training code should be put in the `research_projects` folder and maintained by the author.
+
+Both official training and research examples consist of a directory that contains one or more training scripts, a requirements.txt file, and a README.md file. In order for the user to make use of the
+training examples, it is required to clone the repository:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+```
+
+as well as to install all additional dependencies required for training:
+
+```bash
+pip install -r examples/<your-example-folder>/requirements.txt
+```
+
+Therefore, when adding an example, the `requirements.txt` file should define all pip dependencies required for your training example so that once they are installed, the user can run the example's training script. See, for example, the [DreamBooth `requirements.txt` file](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/requirements.txt).
+
+Training examples of the Diffusers library should adhere to the following philosophy:
+- All the code necessary to run the examples should be found in a single Python file.
+- One should be able to run the example from the command line with `python <your-example>.py --args`.
+- Examples should be kept simple and serve as **an example** on how to use Diffusers for training. The purpose of example scripts is **not** to create state-of-the-art diffusion models, but rather to reproduce known training schemes without adding too much custom logic. As a byproduct of this point, our examples also strive to serve as good educational materials.
+
+To contribute an example, it is highly recommended to look at already existing examples such as [dreambooth](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) to get an idea of what they should look like.
+We strongly advise contributors to make use of the [Accelerate library](https://github.com/huggingface/accelerate) as it's tightly integrated
+with Diffusers.
+Once an example script works, please make sure to add a comprehensive `README.md` that states how to use the example exactly. This README should include:
+- An example command on how to run the example script as shown [here](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#running-locally-with-pytorch).
+- A link to some training results (logs, models, etc.) that show what the user can expect as shown [here](https://api.wandb.ai/report/patrickvonplaten/xm6cd5q5).
+- If you are adding a non-official/research training example, **please don't forget** to add a sentence that you are maintaining this training example which includes your git handle as shown [here](https://github.com/huggingface/diffusers/tree/main/examples/research_projects/intel_opts#diffusers-examples-with-intel-optimizations).
+
+If you are contributing to the official training examples, please also make sure to add a test to [examples/test_examples.py](https://github.com/huggingface/diffusers/blob/main/examples/test_examples.py). This is not necessary for non-official training examples.
+
+### 8. Fixing a "Good second issue"
+
+*Good second issues* are marked by the [Good second issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) label. Good second issues are
+usually more complicated to solve than [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).
+The issue description usually gives less guidance on how to fix the issue and requires
+a decent understanding of the library by the interested contributor.
+If you are interested in tackling a good second issue, feel free to open a PR to fix it and link the PR to the issue. If you see that a PR has already been opened for this issue but did not get merged, have a look to understand why it wasn't merged and try to open an improved PR.
+Good second issues are usually more difficult to get merged compared to good first issues, so don't hesitate to ask for help from the core maintainers. If your PR is almost finished, the core maintainers can also jump in and commit to it in order to get it merged.
+
+### 9. Adding pipelines, models, schedulers
+
+Pipelines, models, and schedulers are the most important pieces of the Diffusers library.
+They provide easy access to state-of-the-art diffusion technologies and thus allow the community to
+build powerful generative AI applications.
+
+By adding a new model, pipeline, or scheduler you might enable a new powerful use case for any of the user interfaces relying on Diffusers which can be of immense value for the whole generative AI ecosystem.
+
+Diffusers has a couple of open feature requests for all three components - feel free to look through them
+if you don't yet know which specific component you would like to add:
+- [Model or pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22)
+- [Scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22)
+
+Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](philosophy) a read to better understand the design of any of the three components. Please be aware that we cannot merge model, scheduler, or pipeline additions that strongly diverge from our design philosophy
+as it will lead to API inconsistencies. If you fundamentally disagree with a design choice, please open a [Feedback issue](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=) instead so that it can be discussed whether a certain design pattern/design choice shall be changed everywhere in the library and whether we shall update our design philosophy. Consistency across the library is very important for us.
+
+Please make sure to add links to the original codebase/paper to the PR and ideally also ping the original author directly on the PR so that they can follow the progress and potentially help with questions.
+
+If you are unsure or stuck in the PR, don't hesitate to leave a message to ask for a first review or help.
+
+#### Copied from mechanism
+
+A unique and important feature to understand when adding any pipeline, model or scheduler code is the `# Copied from` mechanism. You'll see this all over the Diffusers codebase, and the reason we use it is to keep the codebase easy to understand and maintain. Marking code with the `# Copied from` mechanism forces the marked code to be identical to the code it was copied from. This makes it easy to update and propagate changes across many files whenever you run `make fix-copies`.
+
+For example, in the code example below, [`~diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is the original code and `AltDiffusionPipelineOutput` uses the `# Copied from` mechanism to copy it. The only difference is changing the class prefix from `Stable` to `Alt`.
+
+```py
+# Copied from diffusers.pipelines.stable_diffusion.pipeline_output.StableDiffusionPipelineOutput with Stable->Alt
+class AltDiffusionPipelineOutput(BaseOutput):
+ """
+ Output class for Alt Diffusion pipelines.
+
+ Args:
+ images (`List[PIL.Image.Image]` or `np.ndarray`)
+ List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
+ num_channels)`.
+ nsfw_content_detected (`List[bool]`)
+ List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or
+ `None` if safety checking could not be performed.
+ """
+```
+
+To learn more, read this section of the [~Don't~ Repeat Yourself*](https://huggingface.co/blog/transformers-design-philosophy#4-machine-learning-models-are-static) blog post.
+
+## How to write a good issue
+
+**The better your issue is written, the higher the chances that it will be quickly resolved.**
+
+1. Make sure that you've used the correct template for your issue. You can pick between *Bug Report*, *Feature Request*, *Feedback about API Design*, *New model/pipeline/scheduler addition*, *Forum*, or a blank issue. Make sure to pick the correct one when opening [a new issue](https://github.com/huggingface/diffusers/issues/new/choose).
+2. **Be precise**: Give your issue a fitting title. Try to formulate your issue description as simple as possible. The more precise you are when submitting an issue, the less time it takes to understand the issue and potentially solve it. Make sure to open an issue for one issue only and not for multiple issues. If you found multiple issues, simply open multiple issues. If your issue is a bug, try to be as precise as possible about what bug it is - you should not just write "Error in diffusers".
+3. **Reproducibility**: No reproducible code snippet == no solution. If you encounter a bug, maintainers **have to be able to reproduce** it. Make sure that you include a code snippet that can be copy-pasted into a Python interpreter to reproduce the issue. Make sure that your code snippet works, *i.e.* that there are no missing imports or missing links to images, ... Your issue should contain an error message **and** a code snippet that can be copy-pasted without any changes to reproduce the exact same error message. If your issue is using local model weights or local data that cannot be accessed by the reader, the issue cannot be solved. If you cannot share your data or model, try to make a dummy model or dummy data.
+4. **Minimalistic**: Try to help the reader as much as you can to understand the issue as quickly as possible by staying as concise as possible. Remove all code / all information that is irrelevant to the issue. If you have found a bug, try to create the easiest code example you can to demonstrate your issue, do not just dump your whole workflow into the issue as soon as you have found a bug. E.g., if you train a model and get an error at some point during the training, you should first try to understand what part of the training code is responsible for the error and try to reproduce it with a couple of lines. Try to use dummy data instead of full datasets.
+5. Add links. If you are referring to a certain naming, method, or model make sure to provide a link so that the reader can better understand what you mean. If you are referring to a specific PR or issue, make sure to link it to your issue. Do not assume that the reader knows what you are talking about. The more links you add to your issue the better.
+6. Formatting. Make sure to nicely format your issue by formatting code into Python code syntax, and error messages into normal code syntax. See the [official GitHub formatting docs](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) for more information.
+7. Think of your issue not as a ticket to be solved, but rather as a beautiful entry to a well-written encyclopedia. Every added issue is a contribution to publicly available knowledge. By adding a nicely written issue you not only make it easier for maintainers to solve your issue, but you are helping the whole community to better understand a certain aspect of the library.
+
+## How to write a good PR
+
+1. Be a chameleon. Understand existing design patterns and syntax and make sure your code additions flow seamlessly into the existing code base. Pull requests that significantly diverge from existing design patterns or user interfaces will not be merged.
+2. Be laser focused. A pull request should solve one problem and one problem only. Make sure to not fall into the trap of "also fixing another problem while we're adding it". It is much more difficult to review pull requests that solve multiple, unrelated problems at once.
+3. If helpful, try to add a code snippet that displays an example of how your addition can be used.
+4. The title of your pull request should be a summary of its contribution.
+5. If your pull request addresses an issue, please mention the issue number in
+the pull request description to make sure they are linked (and people
+consulting the issue know you are working on it);
+6. To indicate a work in progress please prefix the title with `[WIP]`. These
+are useful to avoid duplicated work, and to differentiate it from PRs ready
+to be merged;
+7. Try to formulate and format your text as explained in [How to write a good issue](#how-to-write-a-good-issue).
+8. Make sure existing tests pass;
+9. Add high-coverage tests. No quality testing = no merge.
+- If you are adding new `@slow` tests, make sure they pass using
+`RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`.
+CircleCI does not run the slow tests, but GitHub Actions does every night!
+10. All public methods must have informative docstrings that work nicely with markdown. See [`pipeline_latent_diffusion.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py) for an example.
+11. Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos, and other non-text files. We prefer to leverage a hf.co hosted `dataset` like
+[`hf-internal-testing`](https://huggingface.co/hf-internal-testing) or [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images) to place these files.
+If you are an external contributor, feel free to add the images to your PR and ask a Hugging Face member to migrate your images
+to this dataset.
+
+## How to open a PR
+
+Before writing code, we strongly advise you to search through the existing PRs or
+issues to make sure that nobody is already working on the same thing. If you are
+unsure, it is always a good idea to open an issue to get some feedback.
+
+You will need basic `git` proficiency to be able to contribute to
+๐งจ Diffusers. `git` is not the easiest tool to use but it has the greatest
+manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
+Git](https://git-scm.com/book/en/v2) is a very good reference.
+
+Follow these steps to start contributing ([supported Python versions](https://github.com/huggingface/diffusers/blob/main/setup.py#L244)):
+
+1. Fork the [repository](https://github.com/huggingface/diffusers) by
+clicking on the 'Fork' button on the repository's page. This creates a copy of the code
+under your GitHub user account.
+
+2. Clone your fork to your local disk, and add the base repository as a remote:
+
+ ```bash
+ $ git clone git@github.com:<your-github-handle>/diffusers.git
+ $ cd diffusers
+ $ git remote add upstream https://github.com/huggingface/diffusers.git
+ ```
+
+3. Create a new branch to hold your development changes:
+
+ ```bash
+ $ git checkout -b a-descriptive-name-for-my-changes
+ ```
+
+**Do not** work on the `main` branch.
+
+4. Set up a development environment by running the following command in a virtual environment:
+
+ ```bash
+ $ pip install -e ".[dev]"
+ ```
+
+If you have already cloned the repo, you might need to `git pull` to get the most recent changes in the
+library.
+
+5. Develop the features on your branch.
+
+As you work on the features, you should make sure that the test suite
+passes. You should run the tests impacted by your changes like this:
+
+ ```bash
+ $ pytest tests/<TEST_TO_RUN>.py
+ ```
+
+Before you run the tests, please make sure you install the dependencies required for testing. You can do so
+with this command:
+
+ ```bash
+ $ pip install -e ".[test]"
+ ```
+
+You can also run the full test suite with the following command, but it takes
+a beefy machine to produce a result in a decent amount of time now that
+Diffusers has grown a lot. Here is the command for it:
+
+ ```bash
+ $ make test
+ ```
+
+๐งจ Diffusers relies on `black` and `isort` to format its source code
+consistently. After you make changes, apply automatic style corrections and code verifications
+that can't be automated in one go with:
+
+ ```bash
+ $ make style
+ ```
+
+๐งจ Diffusers also uses `ruff` and a few custom scripts to check for coding mistakes. Quality
+control runs in CI; however, you can also run the same checks locally with:
+
+ ```bash
+ $ make quality
+ ```
+
+Once you're happy with your changes, add changed files using `git add` and
+make a commit with `git commit` to record your changes locally:
+
+ ```bash
+ $ git add modified_file.py
+ $ git commit -m "A descriptive message about your changes."
+ ```
+
+It is a good idea to sync your copy of the code with the original
+repository regularly. This way you can quickly account for changes:
+
+ ```bash
+ $ git pull upstream main
+ ```
+
+Push the changes to your account using:
+
+ ```bash
+ $ git push -u origin a-descriptive-name-for-my-changes
+ ```
+
+6. Once you are satisfied, go to the
+webpage of your fork on GitHub. Click on 'Pull request' to send your changes
+to the project maintainers for review.
+
+7. It's OK if maintainers ask you for changes. It happens to core contributors
+too! So everyone can see the changes in the Pull request, work in your local
+branch and push the changes to your fork. They will automatically appear in
+the pull request.
+
+### Tests
+
+An extensive test suite is included to test the library behavior and several examples. Library tests can be found in
+the [tests folder](https://github.com/huggingface/diffusers/tree/main/tests).
+
+We like `pytest` and `pytest-xdist` because it makes test runs faster. From the root of the
+repository, here's how to run tests with `pytest` for the library:
+
+```bash
+$ python -m pytest -n auto --dist=loadfile -s -v ./tests/
+```
+
+In fact, that's how `make test` is implemented!
+
+You can specify a smaller set of tests in order to test only the feature
+you're working on.
+
+By default, slow tests are skipped. Set the `RUN_SLOW` environment variable to
+`yes` to run them. This will download many gigabytes of models โ make sure you
+have enough disk space and a good Internet connection, or a lot of patience!
+
+```bash
+$ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/
+```
+
+`unittest` is fully supported, here's how to run tests with it:
+
+```bash
+$ python -m unittest discover -s tests -t . -v
+$ python -m unittest discover -s examples -t examples -v
+```
+
+### Syncing forked main with upstream (HuggingFace) main
+
+To avoid pinging the upstream repository which adds reference notes to each upstream PR and sends unnecessary notifications to the developers involved in these PRs,
+when syncing the main branch of a forked repository, please, follow these steps:
+1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main.
+2. If a PR is absolutely necessary, use the following steps after checking out your branch:
+```bash
+$ git checkout -b your-branch-for-syncing
+$ git pull --squash --no-commit upstream main
+$ git commit -m '<commit message>'
+$ git push --set-upstream origin your-branch-for-syncing
+```
+
+### Style guide
+
+For documentation strings, ๐งจ Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html).
diff --git a/docs/source/en/conceptual/ethical_guidelines.md b/docs/source/en/conceptual/ethical_guidelines.md
new file mode 100644
index 0000000..426aed0
--- /dev/null
+++ b/docs/source/en/conceptual/ethical_guidelines.md
@@ -0,0 +1,63 @@
+
+
+# ๐งจ Diffusers' Ethical Guidelines
+
+## Preamble
+
+[Diffusers](https://huggingface.co/docs/diffusers/index) provides pre-trained diffusion models and serves as a modular toolbox for inference and training.
+
+Given its real-world applications and potential negative impacts on society, we think it is important to provide the project with ethical guidelines to guide the development, users' contributions, and usage of the Diffusers library.
+
+The risks associated with using this technology are still being examined, but to name a few: copyright issues for artists; deep-fake exploitation; sexual content generation in inappropriate contexts; non-consensual impersonation; harmful social biases perpetuating the oppression of marginalized groups.
+We will keep tracking risks and adapt the following guidelines based on the community's responsiveness and valuable feedback.
+
+
+## Scope
+
+The Diffusers community will apply the following ethical guidelines to the project's development and help coordinate how the community will integrate the contributions, especially concerning sensitive topics related to ethical concerns.
+
+
+## Ethical guidelines
+
+The following ethical guidelines apply generally, but we will primarily implement them when dealing with ethically sensitive issues while making a technical choice. Furthermore, we commit to adapting those ethical principles over time following emerging harms related to the state of the art of the technology in question.
+
+- **Transparency**: we are committed to being transparent in managing PRs, explaining our choices to users, and making technical decisions.
+
+- **Consistency**: we are committed to guaranteeing our users the same level of attention in project management, keeping it technically stable and consistent.
+
+- **Simplicity**: with a desire to make it easy to use and exploit the Diffusers library, we are committed to keeping the project's goals lean and coherent.
+
+- **Accessibility**: the Diffusers project helps lower the barrier to entry for contributors, even those without technical expertise. Doing so makes research artifacts more accessible to the community.
+
+- **Reproducibility**: we aim to be transparent about the reproducibility of upstream code, models, and datasets when made available through the Diffusers library.
+
+- **Responsibility**: as a community and through teamwork, we hold a collective responsibility to our users by anticipating and mitigating this technology's potential risks and dangers.
+
+
+## Examples of implementations: Safety features and Mechanisms
+
+The team works daily to make the technical and non-technical tools available to deal with the potential ethical and social risks associated with diffusion technology. Moreover, the community's input is invaluable in ensuring these features' implementation and raising awareness with us.
+
+- [**Community tab**](https://huggingface.co/docs/hub/repositories-pull-requests-discussions): it enables the community to discuss and better collaborate on a project.
+
+- **Bias exploration and evaluation**: the Hugging Face team provides a [space](https://huggingface.co/spaces/society-ethics/DiffusionBiasExplorer) to demonstrate the biases in Stable Diffusion interactively. In this sense, we support and encourage bias explorers and evaluations.
+
+- **Encouraging safety in deployment**
+
+ - [**Safe Stable Diffusion**](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_safe): It mitigates the well-known issue that models, like Stable Diffusion, that are trained on unfiltered, web-crawled datasets tend to suffer from inappropriate degeneration. Related paper: [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://arxiv.org/abs/2211.05105).
+
+ - [**Safety Checker**](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py): after an image has been generated, it compares the class probabilities of a set of hard-coded harmful concepts against the image in the embedding space. The harmful concepts are intentionally hidden to prevent reverse engineering of the checker.
+
+- **Staged releases on the Hub**: in particularly sensitive situations, access to some repositories should be restricted. A staged release is an intermediary step that allows the repository's authors to have more control over its use.
+
+- **Licensing**: [OpenRAIL](https://huggingface.co/blog/open_rail) licenses, a new type of licensing, allow us to ensure free access while including a set of restrictions that promote more responsible use.
diff --git a/docs/source/en/conceptual/evaluation.md b/docs/source/en/conceptual/evaluation.md
new file mode 100644
index 0000000..d4dd94e
--- /dev/null
+++ b/docs/source/en/conceptual/evaluation.md
@@ -0,0 +1,567 @@
+
+
+# Evaluating Diffusion Models
+
+
+
+
+
+Evaluation of generative models like [Stable Diffusion](https://huggingface.co/docs/diffusers/stable_diffusion) is subjective in nature. But as practitioners and researchers, we often have to make careful choices amongst many different possibilities. So, when working with different generative models (like GANs, Diffusion, etc.), how do we choose one over the other?
+
+Qualitative evaluation of such models can be error-prone and might incorrectly influence a decision.
+However, quantitative metrics don't necessarily correspond to image quality. So, usually, a combination
+of both qualitative and quantitative evaluations provides a stronger signal when choosing one model
+over the other.
+
+In this document, we provide a non-exhaustive overview of qualitative and quantitative methods to evaluate Diffusion models. For quantitative methods, we specifically focus on how to implement them alongside `diffusers`.
+
+The methods shown in this document can also be used to evaluate different [noise schedulers](https://huggingface.co/docs/diffusers/main/en/api/schedulers/overview) keeping the underlying generation model fixed.
+
+## Scenarios
+
+We cover Diffusion models with the following pipelines:
+
+- Text-guided image generation (such as the [`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img)).
+- Text-guided image generation, additionally conditioned on an input image (such as the [`StableDiffusionImg2ImgPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/img2img) and [`StableDiffusionInstructPix2PixPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pix2pix)).
+- Class-conditioned image generation models (such as the [`DiTPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/dit)).
+
+## Qualitative Evaluation
+
+Qualitative evaluation typically involves human assessment of generated images. Quality is measured across aspects such as compositionality, image-text alignment, and spatial relations. Using a common set of prompts lends a degree of uniformity to these subjective assessments.
+DrawBench and PartiPrompts are prompt datasets used for qualitative benchmarking; they were introduced by [Imagen](https://imagen.research.google/) and [Parti](https://parti.research.google/) respectively.
+
+From the [official Parti website](https://parti.research.google/):
+
+> PartiPrompts (P2) is a rich set of over 1600 prompts in English that we release as part of this work. P2 can be used to measure model capabilities across various categories and challenge aspects.
+
+
+
+PartiPrompts has the following columns:
+
+- Prompt
+- Category of the prompt (such as "Abstract", "World Knowledge", etc.)
+- Challenge reflecting the difficulty (such as "Basic", "Complex", "Writing & Symbols", etc.)
+
+These benchmarks allow for side-by-side human evaluation of different image generation models.
+
+For this, the ๐งจ Diffusers team has built **Open Parti Prompts**, which is a community-driven qualitative benchmark based on Parti Prompts to compare state-of-the-art open-source diffusion models:
+- [Open Parti Prompts Game](https://huggingface.co/spaces/OpenGenAI/open-parti-prompts): For 10 parti prompts, 4 generated images are shown and the user selects the image that suits the prompt best.
+- [Open Parti Prompts Leaderboard](https://huggingface.co/spaces/OpenGenAI/parti-prompts-leaderboard): The leaderboard comparing the currently best open-sourced diffusion models to each other.
+
+To manually compare images, let's see how we can use `diffusers` on a couple of PartiPrompts.
+
+Below we show some prompts sampled across different challenges: Basic, Complex, Linguistic Structures, Imagination, and Writing & Symbols. Here we are using PartiPrompts as a [dataset](https://huggingface.co/datasets/nateraw/parti-prompts).
+
+```python
+from datasets import load_dataset
+
+# prompts = load_dataset("nateraw/parti-prompts", split="train")
+# prompts = prompts.shuffle()
+# sample_prompts = [prompts[i]["Prompt"] for i in range(5)]
+
+# Fixing these sample prompts in the interest of reproducibility.
+sample_prompts = [
+ "a corgi",
+ "a hot air balloon with a yin-yang symbol, with the moon visible in the daytime sky",
+ "a car with no windows",
+ "a cube made of porcupine",
+ 'The saying "BE EXCELLENT TO EACH OTHER" written on a red brick wall with a graffiti image of a green alien wearing a tuxedo. A yellow fire hydrant is on a sidewalk in the foreground.',
+]
+```
+
+Now we can use these prompts to generate some images using Stable Diffusion ([v1-4 checkpoint](https://huggingface.co/CompVis/stable-diffusion-v1-4)):
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+
+# Load the pipeline that is used throughout this guide (it is defined again in
+# the quantitative section below).
+sd_pipeline = StableDiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
+).to("cuda")
+
+seed = 0
+generator = torch.manual_seed(seed)
+
+images = sd_pipeline(sample_prompts, num_images_per_prompt=1, generator=generator).images
+```
+
+
+
+We can also set `num_images_per_prompt` accordingly to compare different images for the same prompt. Running the same pipeline but with a different checkpoint ([v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5)) yields:
+
+
+
+Once several images are generated from all the prompts using multiple models (under evaluation), these results are presented to human evaluators for scoring. For
+more details on the DrawBench and PartiPrompts benchmarks, refer to their respective papers.
+
+
+
+It is useful to look at some inference samples while a model is training to measure the
+training progress. In our [training scripts](https://github.com/huggingface/diffusers/tree/main/examples/), we support this utility with additional support for
+logging to TensorBoard and Weights & Biases.
+
+
+
+## Quantitative Evaluation
+
+In this section, we will walk you through how to evaluate three different diffusion pipelines using:
+
+- CLIP score
+- CLIP directional similarity
+- FID
+
+### Text-guided image generation
+
+[CLIP score](https://arxiv.org/abs/2104.08718) measures the compatibility of image-caption pairs. Higher CLIP scores imply higher compatibility ๐ผ. The CLIP score is a quantitative measurement of the qualitative concept "compatibility". Image-caption pair compatibility can also be thought of as the semantic similarity between the image and the caption. CLIP score was found to have high correlation with human judgement.
+
+Let's first load a [`StableDiffusionPipeline`]:
+
+```python
+from diffusers import StableDiffusionPipeline
+import torch
+
+model_ckpt = "CompVis/stable-diffusion-v1-4"
+device = "cuda"
+weight_dtype = torch.float16
+sd_pipeline = StableDiffusionPipeline.from_pretrained(model_ckpt, torch_dtype=weight_dtype).to(device)
+```
+
+Generate some images with multiple prompts:
+
+```python
+prompts = [
+ "a photo of an astronaut riding a horse on mars",
+ "A high tech solarpunk utopia in the Amazon rainforest",
+ "A pikachu fine dining with a view to the Eiffel Tower",
+ "A mecha robot in a favela in expressionist style",
+ "an insect robot preparing a delicious meal",
+ "A small cabin on top of a snowy mountain in the style of Disney, artstation",
+]
+
+images = sd_pipeline(prompts, num_images_per_prompt=1, output_type="np").images
+
+print(images.shape)
+# (6, 512, 512, 3)
+```
+
+And then, we calculate the CLIP score.
+
+```python
+from torchmetrics.functional.multimodal import clip_score
+from functools import partial
+
+clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16")
+
+def calculate_clip_score(images, prompts):
+ images_int = (images * 255).astype("uint8")
+ clip_score = clip_score_fn(torch.from_numpy(images_int).permute(0, 3, 1, 2), prompts).detach()
+ return round(float(clip_score), 4)
+
+sd_clip_score = calculate_clip_score(images, prompts)
+print(f"CLIP score: {sd_clip_score}")
+# CLIP score: 35.7038
+```
+
+In the above example, we generated one image per prompt. If we generated multiple images per prompt, we would take the average score across the images generated for each prompt.
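+
+If you do generate several images per prompt, one way to aggregate is sketched below; it assumes the images for each prompt appear contiguously in the output batch:
+
+```python
+import numpy as np
+
+images_per_prompt = 4
+images = sd_pipeline(prompts, num_images_per_prompt=images_per_prompt, output_type="np").images
+
+per_prompt_scores = []
+for i, prompt in enumerate(prompts):
+    # Slice out the images belonging to this prompt and score them against it.
+    prompt_images = images[i * images_per_prompt : (i + 1) * images_per_prompt]
+    per_prompt_scores.append(calculate_clip_score(prompt_images, [prompt] * images_per_prompt))
+
+print(f"Average CLIP score: {np.mean(per_prompt_scores):.4f}")
+```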
+
+Now, if we want to compare two checkpoints compatible with the [`StableDiffusionPipeline`], we should pass a generator while calling the pipeline. First, we generate images with a
+fixed seed with the [v1-4 Stable Diffusion checkpoint](https://huggingface.co/CompVis/stable-diffusion-v1-4):
+
+```python
+seed = 0
+generator = torch.manual_seed(seed)
+
+images = sd_pipeline(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images
+```
+
+Then we load the [v1-5 checkpoint](https://huggingface.co/runwayml/stable-diffusion-v1-5) to generate images:
+
+```python
+model_ckpt_1_5 = "runwayml/stable-diffusion-v1-5"
+sd_pipeline_1_5 = StableDiffusionPipeline.from_pretrained(model_ckpt_1_5, torch_dtype=weight_dtype).to(device)
+
+images_1_5 = sd_pipeline_1_5(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images
+```
+
+And finally, we compare their CLIP scores:
+
+```python
+sd_clip_score_1_4 = calculate_clip_score(images, prompts)
+print(f"CLIP Score with v-1-4: {sd_clip_score_1_4}")
+# CLIP Score with v-1-4: 34.9102
+
+sd_clip_score_1_5 = calculate_clip_score(images_1_5, prompts)
+print(f"CLIP Score with v-1-5: {sd_clip_score_1_5}")
+# CLIP Score with v-1-5: 36.2137
+```
+
+It seems like the [v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint performs better than its predecessor. Note, however, that the number of prompts we used to compute the CLIP scores is quite low. For a more practical evaluation, this number should be way higher, and the prompts should be diverse.
+
+
+
+By construction, there are some limitations in this score. The captions in the training dataset
+were crawled from the web and extracted from `alt` and similar tags associated with an image on the internet.
+They are not necessarily representative of what a human being would use to describe an image. Hence we
+had to "engineer" some prompts here.
+
+
+
+### Image-conditioned text-to-image generation
+
+In this case, we condition the generation pipeline with an input image as well as a text prompt. Let's take the [`StableDiffusionInstructPix2PixPipeline`] as an example. It takes an edit instruction as an input prompt and an input image to be edited.
+
+Here is one example:
+
+
+
+One strategy to evaluate such a model is to measure the consistency of the change between the two images (in [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) space) with the change between the two image captions (as shown in [CLIP-Guided Domain Adaptation of Image Generators](https://arxiv.org/abs/2108.00946)). This is referred to as the "**CLIP directional similarity**".
+
+- Caption 1 corresponds to the input image (image 1) that is to be edited.
+- Caption 2 corresponds to the edited image (image 2). It should reflect the edit instruction.
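+
+In other words, if `E_img` and `E_txt` denote the CLIP image and text encoders, the metric is the cosine similarity between the image-space edit direction `E_img(image 2) - E_img(image 1)` and the text-space edit direction `E_txt(caption 2) - E_txt(caption 1)`; this is exactly what the `DirectionalSimilarity` module defined below computes.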
+
+Following is a pictorial overview:
+
+
+
+We have prepared a mini dataset to implement this metric. Let's first load the dataset.
+
+```python
+from datasets import load_dataset
+
+dataset = load_dataset("sayakpaul/instructpix2pix-demo", split="train")
+dataset.features
+```
+
+```bash
+{'input': Value(dtype='string', id=None),
+ 'edit': Value(dtype='string', id=None),
+ 'output': Value(dtype='string', id=None),
+ 'image': Image(decode=True, id=None)}
+```
+
+Here we have:
+
+- `input` is a caption corresponding to the `image`.
+- `edit` denotes the edit instruction.
+- `output` denotes the modified caption reflecting the `edit` instruction.
+
+Let's take a look at a sample.
+
+```python
+idx = 0
+print(f"Original caption: {dataset[idx]['input']}")
+print(f"Edit instruction: {dataset[idx]['edit']}")
+print(f"Modified caption: {dataset[idx]['output']}")
+```
+
+```bash
+Original caption: 2. FAROE ISLANDS: An archipelago of 18 mountainous isles in the North Atlantic Ocean between Norway and Iceland, the Faroe Islands has 'everything you could hope for', according to Big 7 Travel. It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills'
+Edit instruction: make the isles all white marble
+Modified caption: 2. WHITE MARBLE ISLANDS: An archipelago of 18 mountainous white marble isles in the North Atlantic Ocean between Norway and Iceland, the White Marble Islands has 'everything you could hope for', according to Big 7 Travel. It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills'
+```
+
+And here is the image:
+
+```python
+dataset[idx]["image"]
+```
+
+
+
+We will first edit the images of our dataset with the edit instruction and compute the directional similarity.
+
+Let's first load the [`StableDiffusionInstructPix2PixPipeline`]:
+
+```python
+from diffusers import StableDiffusionInstructPix2PixPipeline
+
+instruct_pix2pix_pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained(
+ "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
+).to(device)
+```
+
+Now, we perform the edits:
+
+```python
+import numpy as np
+
+
+def edit_image(input_image, instruction):
+ image = instruct_pix2pix_pipeline(
+ instruction,
+ image=input_image,
+ output_type="np",
+ generator=generator,
+ ).images[0]
+ return image
+
+input_images = []
+original_captions = []
+modified_captions = []
+edited_images = []
+
+for idx in range(len(dataset)):
+ input_image = dataset[idx]["image"]
+ edit_instruction = dataset[idx]["edit"]
+ edited_image = edit_image(input_image, edit_instruction)
+
+ input_images.append(np.array(input_image))
+ original_captions.append(dataset[idx]["input"])
+ modified_captions.append(dataset[idx]["output"])
+ edited_images.append(edited_image)
+```
+
+To measure the directional similarity, we first load CLIP's image and text encoders:
+
+```python
+from transformers import (
+ CLIPTokenizer,
+ CLIPTextModelWithProjection,
+ CLIPVisionModelWithProjection,
+ CLIPImageProcessor,
+)
+
+clip_id = "openai/clip-vit-large-patch14"
+tokenizer = CLIPTokenizer.from_pretrained(clip_id)
+text_encoder = CLIPTextModelWithProjection.from_pretrained(clip_id).to(device)
+image_processor = CLIPImageProcessor.from_pretrained(clip_id)
+image_encoder = CLIPVisionModelWithProjection.from_pretrained(clip_id).to(device)
+```
+
+Notice that we are using a particular CLIP checkpoint, i.e., `openai/clip-vit-large-patch14`. This is because the Stable Diffusion pre-training was performed with this CLIP variant. For more details, refer to the [documentation](https://huggingface.co/docs/transformers/model_doc/clip).
+
+Next, we prepare a PyTorch `nn.Module` to compute directional similarity:
+
+```python
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class DirectionalSimilarity(nn.Module):
+ def __init__(self, tokenizer, text_encoder, image_processor, image_encoder):
+ super().__init__()
+ self.tokenizer = tokenizer
+ self.text_encoder = text_encoder
+ self.image_processor = image_processor
+ self.image_encoder = image_encoder
+
+ def preprocess_image(self, image):
+ image = self.image_processor(image, return_tensors="pt")["pixel_values"]
+ return {"pixel_values": image.to(device)}
+
+ def tokenize_text(self, text):
+ inputs = self.tokenizer(
+ text,
+ max_length=self.tokenizer.model_max_length,
+ padding="max_length",
+ truncation=True,
+ return_tensors="pt",
+ )
+ return {"input_ids": inputs.input_ids.to(device)}
+
+ def encode_image(self, image):
+ preprocessed_image = self.preprocess_image(image)
+ image_features = self.image_encoder(**preprocessed_image).image_embeds
+ image_features = image_features / image_features.norm(dim=1, keepdim=True)
+ return image_features
+
+ def encode_text(self, text):
+ tokenized_text = self.tokenize_text(text)
+ text_features = self.text_encoder(**tokenized_text).text_embeds
+ text_features = text_features / text_features.norm(dim=1, keepdim=True)
+ return text_features
+
+ def compute_directional_similarity(self, img_feat_one, img_feat_two, text_feat_one, text_feat_two):
+ sim_direction = F.cosine_similarity(img_feat_two - img_feat_one, text_feat_two - text_feat_one)
+ return sim_direction
+
+ def forward(self, image_one, image_two, caption_one, caption_two):
+ img_feat_one = self.encode_image(image_one)
+ img_feat_two = self.encode_image(image_two)
+ text_feat_one = self.encode_text(caption_one)
+ text_feat_two = self.encode_text(caption_two)
+ directional_similarity = self.compute_directional_similarity(
+ img_feat_one, img_feat_two, text_feat_one, text_feat_two
+ )
+ return directional_similarity
+```
+
+Let's put `DirectionalSimilarity` to use now.
+
+```python
+dir_similarity = DirectionalSimilarity(tokenizer, text_encoder, image_processor, image_encoder)
+scores = []
+
+for i in range(len(input_images)):
+ original_image = input_images[i]
+ original_caption = original_captions[i]
+ edited_image = edited_images[i]
+ modified_caption = modified_captions[i]
+
+ similarity_score = dir_similarity(original_image, edited_image, original_caption, modified_caption)
+ scores.append(float(similarity_score.detach().cpu()))
+
+print(f"CLIP directional similarity: {np.mean(scores)}")
+# CLIP directional similarity: 0.0797976553440094
+```
+
+Like the CLIP Score, the higher the CLIP directional similarity, the better it is.
+
+It should be noted that the `StableDiffusionInstructPix2PixPipeline` exposes two arguments, `image_guidance_scale` and `guidance_scale`, that let you control the quality of the final edited image. We encourage you to experiment with these two arguments and see their impact on the directional similarity.
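+
+For example, here is a minimal sketch of such a sweep on the first sample; the candidate values are arbitrary:
+
+```python
+# Sweep the two guidance arguments and record the resulting directional similarity.
+for image_guidance_scale in [1.0, 1.5, 2.0]:
+    for guidance_scale in [5.0, 7.5, 10.0]:
+        edited = instruct_pix2pix_pipeline(
+            dataset[0]["edit"],
+            image=dataset[0]["image"],
+            image_guidance_scale=image_guidance_scale,
+            guidance_scale=guidance_scale,
+            generator=torch.manual_seed(0),
+            output_type="np",
+        ).images[0]
+        score = dir_similarity(np.array(dataset[0]["image"]), edited, dataset[0]["input"], dataset[0]["output"])
+        print(f"image_guidance_scale={image_guidance_scale}, guidance_scale={guidance_scale}: {float(score.detach().cpu()):.4f}")
+```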
+
+We can extend the idea of this metric to measure how similar the original image and its edited version are. To do that, we can simply compute `F.cosine_similarity(img_feat_two, img_feat_one)`. For these kinds of edits, we would still want the primary semantics of the images to be preserved as much as possible, i.e., a high similarity score.
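+
+A minimal sketch of that image-image similarity check, reusing the CLIP image encoder wrapped in `DirectionalSimilarity`:
+
+```python
+preservation_scores = []
+for original_image, edited_image in zip(input_images, edited_images):
+    # Encode both images with CLIP and compare them directly.
+    img_feat_one = dir_similarity.encode_image(original_image)
+    img_feat_two = dir_similarity.encode_image(edited_image)
+    preservation_scores.append(float(F.cosine_similarity(img_feat_two, img_feat_one).detach().cpu()))
+
+print(f"Mean image-image CLIP similarity: {np.mean(preservation_scores):.4f}")
+```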
+
+We can use these metrics for similar pipelines such as the [`StableDiffusionPix2PixZeroPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pix2pix_zero#diffusers.StableDiffusionPix2PixZeroPipeline).
+
+
+
+Both the CLIP score and the CLIP directional similarity rely on the CLIP model, which can make the evaluations biased.
+
+
+
+***Extending metrics like IS, FID (discussed later), or KID can be difficult*** when the model under evaluation was pre-trained on a large image-captioning dataset (such as the [LAION-5B dataset](https://laion.ai/blog/laion-5b/)). This is because underlying these metrics is an InceptionNet (pre-trained on the ImageNet-1k dataset) used for extracting intermediate image features. The pre-training dataset of Stable Diffusion may have limited overlap with the pre-training dataset of InceptionNet, so InceptionNet is not a good candidate here for feature extraction.
+
+***Metrics like IS, FID, and KID are better suited for evaluating class-conditioned models, such as [DiT](https://huggingface.co/docs/diffusers/main/en/api/pipelines/dit), which was pre-trained conditioned on the ImageNet-1k classes.***
+
+### Class-conditioned image generation
+
+Class-conditioned generative models are usually pre-trained on a class-labeled dataset such as [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k). Popular metrics for evaluating these models include Frรฉchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). In this document, we focus on FID ([Heusel et al.](https://arxiv.org/abs/1706.08500)). We show how to compute it with the [`DiTPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/dit), which uses the [DiT model](https://arxiv.org/abs/2212.09748) under the hood.
+
+FID aims to measure how similar two datasets of images are. As per [this resource](https://mmgeneration.readthedocs.io/en/latest/quick_run.html#fid):
+
+> Frรฉchet Inception Distance is a measure of similarity between two datasets of images. It was shown to correlate well with the human judgment of visual quality and is most often used to evaluate the quality of samples of Generative Adversarial Networks. FID is calculated by computing the Frรฉchet distance between two Gaussians fitted to feature representations of the Inception network.
+
+These two datasets are essentially the dataset of real images and the dataset of fake images (generated images in our case). FID is usually calculated with two large datasets. However, for this document, we will work with two mini datasets.
+
+Let's first download a few images from the ImageNet-1k training set:
+
+```python
+from zipfile import ZipFile
+import requests
+
+
+def download(url, local_filepath):
+ r = requests.get(url)
+ with open(local_filepath, "wb") as f:
+ f.write(r.content)
+ return local_filepath
+
+dummy_dataset_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/sample-imagenet-images.zip"
+local_filepath = download(dummy_dataset_url, dummy_dataset_url.split("/")[-1])
+
+with ZipFile(local_filepath, "r") as zipper:
+ zipper.extractall(".")
+```
+
+```python
+from PIL import Image
+import os
+
+dataset_path = "sample-imagenet-images"
+image_paths = sorted([os.path.join(dataset_path, x) for x in os.listdir(dataset_path)])
+
+real_images = [np.array(Image.open(path).convert("RGB")) for path in image_paths]
+```
+
+These are 10 images from the following ImageNet-1k classes: "cassette_player", "chain_saw" (x2), "church", "gas_pump" (x3), "parachute" (x2), and "tench".
+
+
+
+ Real images.
+
+
+Now that the images are loaded, let's apply some lightweight pre-processing on them to use them for FID calculation.
+
+```python
+from torchvision.transforms import functional as F  # note: this rebinds the `F` alias previously used for `torch.nn.functional`
+
+
+def preprocess_image(image):
+ image = torch.tensor(image).unsqueeze(0)
+ image = image.permute(0, 3, 1, 2) / 255.0
+ return F.center_crop(image, (256, 256))
+
+real_images = torch.cat([preprocess_image(image) for image in real_images])
+print(real_images.shape)
+# torch.Size([10, 3, 256, 256])
+```
+
+We now load the [`DiTPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/dit) to generate images conditioned on the above-mentioned classes.
+
+```python
+from diffusers import DiTPipeline, DPMSolverMultistepScheduler
+
+dit_pipeline = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
+dit_pipeline.scheduler = DPMSolverMultistepScheduler.from_config(dit_pipeline.scheduler.config)
+dit_pipeline = dit_pipeline.to("cuda")
+
+words = [
+ "cassette player",
+ "chainsaw",
+ "chainsaw",
+ "church",
+ "gas pump",
+ "gas pump",
+ "gas pump",
+ "parachute",
+ "parachute",
+ "tench",
+]
+
+class_ids = dit_pipeline.get_label_ids(words)
+output = dit_pipeline(class_labels=class_ids, generator=generator, output_type="np")
+
+fake_images = output.images
+fake_images = torch.tensor(fake_images)
+fake_images = fake_images.permute(0, 3, 1, 2)
+print(fake_images.shape)
+# torch.Size([10, 3, 256, 256])
+```
+
+Now, we can compute the FID using [`torchmetrics`](https://torchmetrics.readthedocs.io/).
+
+```python
+from torchmetrics.image.fid import FrechetInceptionDistance
+
+fid = FrechetInceptionDistance(normalize=True)
+fid.update(real_images, real=True)
+fid.update(fake_images, real=False)
+
+print(f"FID: {float(fid.compute())}")
+# FID: 177.7147216796875
+```
+
+The lower the FID, the better it is. Several things can influence FID here:
+
+- Number of images (both real and fake)
+- Randomness induced in the diffusion process
+- Number of inference steps in the diffusion process
+- The scheduler being used in the diffusion process
+
+Given the last two points especially, it is good practice to run the evaluation across different seeds and inference steps, and then report an average result.
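+
+A minimal sketch of averaging FID over a few seeds (the seed values are arbitrary, and with only 10 images per set the estimate remains noisy):
+
+```python
+fid_values = []
+for seed in [0, 1, 2]:
+    generator = torch.manual_seed(seed)
+    # You could additionally vary `num_inference_steps` in the pipeline call.
+    output = dit_pipeline(class_labels=class_ids, generator=generator, output_type="np")
+    fake_images = torch.tensor(output.images).permute(0, 3, 1, 2)
+
+    # Recompute FID against the same set of real images for this seed.
+    fid = FrechetInceptionDistance(normalize=True)
+    fid.update(real_images, real=True)
+    fid.update(fake_images, real=False)
+    fid_values.append(float(fid.compute()))
+
+print(f"Mean FID over seeds: {sum(fid_values) / len(fid_values)}")
+```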
+
+
+
+FID results tend to be fragile as they depend on a lot of factors:
+
+* The specific Inception model used during computation.
+* The implementation accuracy of the computation.
+* The image format (not the same if we start from PNGs vs JPGs).
+
+Keeping that in mind, FID is often most useful when comparing similar runs, but it is
+hard to reproduce paper results unless the authors carefully disclose the FID
+measurement code.
+
+These points apply to other related metrics too, such as KID and IS.
+
+
+
+As a final step, let's visually inspect theย `fake_images`.
+
+
+
+ Fake images.
+
diff --git a/docs/source/en/conceptual/philosophy.md b/docs/source/en/conceptual/philosophy.md
new file mode 100644
index 0000000..29df833
--- /dev/null
+++ b/docs/source/en/conceptual/philosophy.md
@@ -0,0 +1,110 @@
+
+
+# Philosophy
+
+๐งจ Diffusers provides **state-of-the-art** pretrained diffusion models across multiple modalities.
+Its purpose is to serve as a **modular toolbox** for both inference and training.
+
+We aim at building a library that stands the test of time and therefore take API design very seriously.
+
+In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefore, most of our design choices are based on [PyTorch's Design Principles](https://pytorch.org/docs/stable/community/design.html#pytorch-design-philosophy). Let's go over the most important ones:
+
+## Usability over Performance
+
+- While Diffusers has many built-in performance-enhancing features (see [Memory and Speed](https://huggingface.co/docs/diffusers/optimization/fp16)), models are always loaded with the highest precision and lowest optimization. Therefore, by default diffusion pipelines are always instantiated on CPU with float32 precision if not otherwise defined by the user. This ensures usability across different platforms and accelerators and means that no complex installations are required to run the library.
+- Diffusers aims to be a **light-weight** package and therefore has very few required dependencies, but many soft dependencies that can improve performance (such as `accelerate`, `safetensors`, `onnx`, etc...). We strive to keep the library as lightweight as possible so that it can be added without much concern as a dependency on other packages.
+- Diffusers prefers simple, self-explainable code over condensed, magic code. This means that shorthand syntax such as lambda functions and advanced PyTorch operators is often not desired.
+
+## Simple over easy
+
+As PyTorch states, **explicit is better than implicit** and **simple is better than complex**. This design philosophy is reflected in multiple parts of the library:
+- We follow PyTorch's API with methods like [`DiffusionPipeline.to`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.to) to let the user handle device management.
+- Raising concise error messages is preferred to silently correct erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible.
+- Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop. However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers.
+- Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the unet, and the variational autoencoder, each have their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. DreamBooth or Textual Inversion training
+is very simple thanks to Diffusers' ability to separate single components of the diffusion pipeline.
+
+## Tweakable, contributor-friendly over abstraction
+
+For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself).
+In short, just like Transformers does for modeling files, Diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers.
+Functions, long code blocks, and even classes can be copied across multiple files which at first can look like a bad, sloppy design choice that makes the library unmaintainable.
+**However**, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because:
+- Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions.
+- Machine Learning practitioners like to be able to quickly tweak existing code for ideation and research and therefore prefer self-contained code over one that contains many abstractions.
+- Open-source libraries rely on community contributions and therefore must build a library that is easy to contribute to. The more abstract the code, the more dependencies, the harder to read, and the harder to contribute to. Contributors simply stop contributing to very abstract libraries out of fear of breaking vital functionality. If contributing to a library cannot break other fundamental code, not only is it more inviting for potential new contributors, but it is also easier to review and contribute to multiple parts in parallel.
+
+At Hugging Face, we call this design the **single-file policy** which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look
+at [this blog post](https://huggingface.co/blog/transformers-design-philosophy).
+
+In Diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is that almost all diffusion pipelines, such
+as [DDPM](https://huggingface.co/docs/diffusers/api/pipelines/ddpm), [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview#stable-diffusion-pipelines), [unCLIP (DALL·E 2)](https://huggingface.co/docs/diffusers/api/pipelines/unclip), and [Imagen](https://imagen.research.google/) all rely on the same diffusion model, the [UNet](https://huggingface.co/docs/diffusers/api/models/unet2d-cond).
+
+Great, now you should have generally understood why ๐งจ Diffusers is designed the way it is ๐ค.
+We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy or some unlucky design choices. If you have feedback regarding the design, we would โค๏ธ to hear it [directly on GitHub](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=).
+
+## Design Philosophy in Details
+
+Now, let's look a bit into the nitty-gritty details of the design philosophy. Diffusers essentially consists of three major classes: [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines), [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models), and [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
+Let's walk through more in-detail design decisions for each class.
+
+### Pipelines
+
+Pipelines are designed to be easy to use (therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference.
+
+The following design principles are followed:
+- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it's done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [#Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251).
+- Pipelines all inherit from [`DiffusionPipeline`].
+- Every pipeline consists of different model and scheduler components that are documented in the [`model_index.json` file](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline, and can be shared between pipelines with the [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) function.
+- Every pipeline should be loadable via the [`DiffusionPipeline.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) function.
+- Pipelines should be used **only** for inference.
+- Pipelines should be very readable, self-explanatory, and easy to tweak.
+- Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs.
+- Pipelines are **not** intended to be feature-complete user interfaces. For future complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner).
+- Every pipeline should have one and only one way to run it via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines.
+- Pipelines should be named after the task they are intended to solve.
+- In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file.
+
+### Models
+
+Models are designed as configurable toolboxes that are natural extensions of [PyTorch's Module class](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). They only partly follow the **single-file policy**.
+
+The following design principles are followed:
+- Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context.
+- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its own file, e.g. [`unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_condition.py), [`transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformer_2d.py), etc...
+- Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy.
+- Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages.
+- Models all inherit from `ModelMixin` and `ConfigMixin`.
+- Models can be optimized for performance when it doesn't demand major code changes, keeps backward compatibility, and gives significant memory or compute gain.
+- Models should by default have the highest precision and lowest performance setting.
+- To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
+- Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
+- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and
+readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
+
+### Schedulers
+
+Schedulers are responsible for guiding the denoising process for inference as well as for defining a noise schedule for training. They are designed as individual classes with loadable configuration files and strongly follow the **single-file policy**.
+
+The following design principles are followed:
+- All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
+- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained.
+- One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper).
+- If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism.
+- Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
+- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](../using-diffusers/schedulers.md) and shown in the short sketch after this list.
+- Every scheduler has to have a `set_num_inference_steps` and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
+- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon.
+- The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).
+- Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box".
+- In almost all cases, novel schedulers shall be implemented in a new scheduling file.
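+
+To make the scheduler-swapping point above concrete, here is a minimal sketch; the checkpoint and the target scheduler are arbitrary examples:
+
+```python
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
+
+pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+# Swap the default scheduler for another one, reusing the existing scheduler configuration.
+pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
+```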
diff --git a/docs/source/en/imgs/access_request.png b/docs/source/en/imgs/access_request.png
new file mode 100644
index 0000000..33c6abc
Binary files /dev/null and b/docs/source/en/imgs/access_request.png differ
diff --git a/docs/source/en/imgs/diffusers_library.jpg b/docs/source/en/imgs/diffusers_library.jpg
new file mode 100644
index 0000000..07ba9c6
Binary files /dev/null and b/docs/source/en/imgs/diffusers_library.jpg differ
diff --git a/docs/source/en/index.md b/docs/source/en/index.md
new file mode 100644
index 0000000..957d907
--- /dev/null
+++ b/docs/source/en/index.md
@@ -0,0 +1,48 @@
+
+
+
+
+
+
+
+
+# Diffusers
+
+๐ค Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or want to train your own diffusion model, ๐ค Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](conceptual/philosophy#usability-over-performance), [simple over easy](conceptual/philosophy#simple-over-easy), and [customizability over abstractions](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction).
+
+The library has three main components:
+
+- State-of-the-art diffusion pipelines for inference with just a few lines of code. There are many pipelines in ๐ค Diffusers; check out the table in the pipeline [overview](api/pipelines/overview) for a complete list of available pipelines and the tasks they solve.
+- Interchangeable [noise schedulers](api/schedulers/overview) for balancing trade-offs between generation speed and quality.
+- Pretrained [models](api/models) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems.
+
+
diff --git a/docs/source/en/installation.md b/docs/source/en/installation.md
new file mode 100644
index 0000000..e34db1b
--- /dev/null
+++ b/docs/source/en/installation.md
@@ -0,0 +1,164 @@
+
+
+# Installation
+
+๐ค Diffusers is tested on Python 3.8+, PyTorch 1.7.0+, and Flax. Follow the installation instructions below for the deep learning library you are using:
+
+- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions
+- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions
+
+## Install with pip
+
+You should install ๐ค Diffusers in a [virtual environment](https://docs.python.org/3/library/venv.html).
+If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
+A virtual environment makes it easier to manage different projects and avoid compatibility issues between dependencies.
+
+Start by creating a virtual environment in your project directory:
+
+```bash
+python -m venv .env
+```
+
+Activate the virtual environment:
+
+```bash
+source .env/bin/activate
+```
+
+You should also install ๐ค Transformers because ๐ค Diffusers relies on its models:
+
+
+
+
+Note - PyTorch only supports Python 3.8 - 3.11 on Windows.
+```bash
+pip install diffusers["torch"] transformers
+```
+
+
+```bash
+pip install diffusers["flax"] transformers
+```
+
+
+
+## Install with conda
+
+After activating your virtual environment, with `conda` (maintained by the community):
+
+```bash
+conda install -c conda-forge diffusers
+```
+
+## Install from source
+
+Before installing ๐ค Diffusers from source, make sure you have PyTorch and ๐ค Accelerate installed.
+
+To install ๐ค Accelerate:
+
+```bash
+pip install accelerate
+```
+
+Then install ๐ค Diffusers from source:
+
+```bash
+pip install git+https://github.com/huggingface/diffusers
+```
+
+This command installs the bleeding edge `main` version rather than the latest `stable` version.
+The `main` version is useful for staying up-to-date with the latest developments.
+For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet.
+However, this means the `main` version may not always be stable.
+We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day.
+If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) so we can fix it even sooner!
+
+## Editable install
+
+You will need an editable install if you'd like to:
+
+* Use the `main` version of the source code.
+* Contribute to ๐ค Diffusers and need to test changes in the code.
+
+Clone the repository and install ๐ค Diffusers with the following commands:
+
+```bash
+git clone https://github.com/huggingface/diffusers.git
+cd diffusers
+```
+
+
+
+```bash
+pip install -e ".[torch]"
+```
+
+
+```bash
+pip install -e ".[flax]"
+```
+
+
+
+These commands create a link between the folder you cloned the repository to and your Python library paths.
+Python will now look inside the folder you cloned to in addition to the normal library paths.
+For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.8/site-packages/`, Python will also search the `~/diffusers/` folder you cloned to.
+
+
+
+You must keep the `diffusers` folder if you want to keep using the library.
+
+
+
+Now you can easily update your clone to the latest version of ๐ค Diffusers with the following command:
+
+```bash
+cd ~/diffusers/
+git pull
+```
+
+Your Python environment will find the `main` version of ๐ค Diffusers on the next run.
+
+## Cache
+
+Model weights and files are downloaded from the Hub to a cache which is usually your home directory. You can change the cache location by specifying the `HF_HOME` or `HUGGINGFACE_HUB_CACHE` environment variables or configuring the `cache_dir` parameter in methods like [`~DiffusionPipeline.from_pretrained`].
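+
+For example, a minimal sketch of redirecting the cache for a single download (the path is just an example):
+
+```python
+from diffusers import DiffusionPipeline
+
+# Download (or load) the pipeline into a project-local cache directory instead of the default cache.
+pipeline = DiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", cache_dir="./my_diffusers_cache"
+)
+```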
+
+Cached files allow you to run ๐ค Diffusers offline. To prevent ๐ค Diffusers from connecting to the internet, set the `HF_HUB_OFFLINE` environment variable to `True` and ๐ค Diffusers will only load previously downloaded files in the cache.
+
+```shell
+export HF_HUB_OFFLINE=True
+```
+
+For more details about managing and cleaning the cache, take a look at the [caching](https://huggingface.co/docs/huggingface_hub/guides/manage-cache) guide.
+
+## Telemetry logging
+
+Our library gathers telemetry information during [`~DiffusionPipeline.from_pretrained`] requests.
+The data gathered includes the version of ๐ค Diffusers and PyTorch/Flax, the requested model or pipeline class,
+and the path to a pretrained checkpoint if it is hosted on the Hugging Face Hub.
+This usage data helps us debug issues and prioritize new features.
+Telemetry is only sent when loading models and pipelines from the Hub,
+and it is not collected if you're loading local files.
+
+We understand that not everyone wants to share additional information, and we respect your privacy.
+You can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal:
+
+On Linux/MacOS:
+```bash
+export DISABLE_TELEMETRY=YES
+```
+
+On Windows:
+```bash
+set DISABLE_TELEMETRY=YES
+```
diff --git a/docs/source/en/optimization/coreml.md b/docs/source/en/optimization/coreml.md
new file mode 100644
index 0000000..ee6af9d
--- /dev/null
+++ b/docs/source/en/optimization/coreml.md
@@ -0,0 +1,164 @@
+
+
+# How to run Stable Diffusion with Core ML
+
+[Core ML](https://developer.apple.com/documentation/coreml) is the model format and machine learning library supported by Apple frameworks. If you are interested in running Stable Diffusion models inside your macOS or iOS/iPadOS apps, this guide will show you how to convert existing PyTorch checkpoints into the Core ML format and use them for inference with Python or Swift.
+
+Core ML models can leverage all the compute engines available in Apple devices: the CPU, the GPU, and the Apple Neural Engine (or ANE, a tensor-optimized accelerator available in Apple Silicon Macs and modern iPhones/iPads). Depending on the model and the device it's running on, Core ML can mix and match compute engines too, so some portions of the model may run on the CPU while others run on GPU, for example.
+
+
+
+You can also run the `diffusers` Python codebase on Apple Silicon Macs using the `mps` accelerator built into PyTorch. This approach is explained in depth in [the mps guide](mps), but it is not compatible with native apps.
+
+
+
+## Stable Diffusion Core ML Checkpoints
+
+Stable Diffusion weights (or checkpoints) are stored in the PyTorch format, so you need to convert them to the Core ML format before they can be used inside native apps.
+
+Thankfully, Apple engineers developed [a conversion tool](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) based on `diffusers` to convert the PyTorch checkpoints to Core ML.
+
+Before you convert a model, though, take a moment to explore the Hugging Face Hub โ chances are the model you're interested in is already available in Core ML format:
+
+- the [Apple](https://huggingface.co/apple) organization includes Stable Diffusion versions 1.4, 1.5, 2.0 base, and 2.1 base
+- [coreml community](https://huggingface.co/coreml-community) includes custom finetuned models
+- use this [filter](https://huggingface.co/models?pipeline_tag=text-to-image&library=coreml&p=2&sort=likes) to return all available Core ML checkpoints
+
+If you can't find the model you're interested in, we recommend you follow the instructions for [Converting Models to Core ML](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) by Apple.
+
+## Selecting the Core ML Variant to Use
+
+Stable Diffusion models can be converted to different Core ML variants intended for different purposes:
+
+- The type of attention blocks used. The attention operation is used to "pay attention" to the relationship between different areas in the image representations and to understand how the image and text representations are related. Attention is compute- and memory-intensive, so different implementations exist that consider the hardware characteristics of different devices. For Core ML Stable Diffusion models, there are two attention variants:
+ * `split_einsum` ([introduced by Apple](https://machinelearning.apple.com/research/neural-engine-transformers)) is optimized for the ANE, which is available in modern iPhones, iPads, and M-series computers.
+ * The "original" attention (the base implementation used in `diffusers`) is only compatible with CPU/GPU and not ANE. It can be *faster* to run your model on CPU + GPU using `original` attention than ANE. See [this performance benchmark](https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks) as well as some [additional measures provided by the community](https://github.com/huggingface/swift-coreml-diffusers/issues/31) for additional details.
+
+- The supported inference framework.
+ * `packages` are suitable for Python inference. This can be used to test converted Core ML models before attempting to integrate them inside native apps, or if you want to explore Core ML performance but don't need to support native apps. For example, an application with a web UI could perfectly use a Python Core ML backend.
+ * `compiled` models are required for Swift code. The `compiled` models in the Hub split the large UNet model weights into several files for compatibility with iOS and iPadOS devices. This corresponds to the [`--chunk-unet` conversion option](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml). If you want to support native apps, then you need to select the `compiled` variant.
+
+The official Core ML Stable Diffusion [models](https://huggingface.co/apple/coreml-stable-diffusion-v1-4/tree/main) include these variants, but the community ones may vary:
+
+```
+coreml-stable-diffusion-v1-4
+โโโ README.md
+โโโ original
+โ โโโ compiled
+โ โโโ packages
+โโโ split_einsum
+ โโโ compiled
+ โโโ packages
+```
+
+You can download and use the variant you need as shown below.
+
+## Core ML Inference in Python
+
+Install the following libraries to run Core ML inference in Python:
+
+```bash
+pip install huggingface_hub
+pip install git+https://github.com/apple/ml-stable-diffusion
+```
+
+### Download the Model Checkpoints
+
+To run inference in Python, use one of the versions stored in the `packages` folders because the `compiled` ones are only compatible with Swift. You may choose whether you want to use `original` or `split_einsum` attention.
+
+This is how you'd download the `original` attention variant from the Hub to a directory called `models`:
+
+```Python
+from huggingface_hub import snapshot_download
+from pathlib import Path
+
+repo_id = "apple/coreml-stable-diffusion-v1-4"
+variant = "original/packages"
+
+model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
+snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False)
+print(f"Model downloaded at {model_path}")
+```
+
+### Inference[[python-inference]]
+
+Once you have downloaded a snapshot of the model, you can test it using Apple's Python script.
+
+```shell
+python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i models/coreml-stable-diffusion-v1-4_original_packages -o output --compute-unit CPU_AND_GPU --seed 93
+```
+
+Pass the path of the downloaded checkpoint with the `-i` flag to the script. `--compute-unit` indicates the hardware you want to allow for inference. It must be one of the following options: `ALL`, `CPU_AND_GPU`, `CPU_ONLY`, `CPU_AND_NE`. You may also provide an output path with `-o`, and a seed for reproducibility.
+
+The inference script assumes you're using the original version of the Stable Diffusion model, `CompVis/stable-diffusion-v1-4`. If you use another model, you *have* to specify its Hub id in the inference command line, using the `--model-version` option. This works for models already supported and custom models you trained or fine-tuned yourself.
+
+For example, if you want to use [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5):
+
+```shell
+python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" --compute-unit ALL -o output --seed 93 -i models/coreml-stable-diffusion-v1-5_original_packages --model-version runwayml/stable-diffusion-v1-5
+```
+
+## Core ML inference in Swift
+
+Running inference in Swift is slightly faster than in Python because the models are already compiled in the `mlmodelc` format. This is noticeable on app startup when the model is loaded but shouldn't be noticeable if you run several generations afterward.
+
+### Download
+
+To run inference in Swift on your Mac, you need one of the `compiled` checkpoint versions. We recommend you download them locally using Python code similar to the previous example, but with one of the `compiled` variants:
+
+```Python
+from huggingface_hub import snapshot_download
+from pathlib import Path
+
+repo_id = "apple/coreml-stable-diffusion-v1-4"
+variant = "original/compiled"
+
+model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
+snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False)
+print(f"Model downloaded at {model_path}")
+```
+
+### Inference[[swift-inference]]
+
+To run inference, please clone Apple's repo:
+
+```bash
+git clone https://github.com/apple/ml-stable-diffusion
+cd ml-stable-diffusion
+```
+
+And then use Apple's command line tool, [Swift Package Manager](https://www.swift.org/package-manager/#):
+
+```bash
+swift run StableDiffusionSample --resource-path models/coreml-stable-diffusion-v1-4_original_compiled --compute-units all "a photo of an astronaut riding a horse on mars"
+```
+
+You have to specify in `--resource-path` one of the checkpoints downloaded in the previous step, so please make sure it contains compiled Core ML bundles with the extension `.mlmodelc`. The `--compute-units` has to be one of these values: `all`, `cpuOnly`, `cpuAndGPU`, `cpuAndNeuralEngine`.
+
+For more details, please refer to the [instructions in Apple's repo](https://github.com/apple/ml-stable-diffusion).
+
+## Supported Diffusers Features
+
+The Core ML models and inference code don't support many of the features, options, and flexibility of ๐งจ Diffusers. These are some of the limitations to keep in mind:
+
+- Core ML models are only suitable for inference. They can't be used for training or fine-tuning.
+- Only two schedulers have been ported to Swift, the default one used by Stable Diffusion and `DPMSolverMultistepScheduler`, which we ported to Swift from our `diffusers` implementation. We recommend you use `DPMSolverMultistepScheduler`, since it produces the same quality in about half the steps.
+- Negative prompts, classifier-free guidance scale, and image-to-image tasks are available in the inference code. Advanced features such as depth guidance, ControlNet, and latent upscalers are not available yet.
+
+Apple's [conversion and inference repo](https://github.com/apple/ml-stable-diffusion) and our own [swift-coreml-diffusers](https://github.com/huggingface/swift-coreml-diffusers) repos are intended as technology demonstrators to enable other developers to build upon.
+
+If you feel strongly about any missing features, please feel free to open a feature request or, better yet, a contribution PR ๐.
+
+## Native Diffusers Swift app
+
+One easy way to run Stable Diffusion on your own Apple hardware is to use [our open-source Swift repo](https://github.com/huggingface/swift-coreml-diffusers), based on `diffusers` and Apple's conversion and inference repo. You can study the code, compile it with [Xcode](https://developer.apple.com/xcode/) and adapt it for your own needs. For your convenience, there's also a [standalone Mac app in the App Store](https://apps.apple.com/app/diffusers/id1666309574), so you can play with it without having to deal with the code or IDE. If you are a developer and have determined that Core ML is the best solution to build your Stable Diffusion app, then you can use the rest of this guide to get started with your project. We can't wait to see what you'll build ๐.
diff --git a/docs/source/en/optimization/deepcache.md b/docs/source/en/optimization/deepcache.md
new file mode 100644
index 0000000..2cc3b25
--- /dev/null
+++ b/docs/source/en/optimization/deepcache.md
@@ -0,0 +1,62 @@
+
+
+# DeepCache
+[DeepCache](https://huggingface.co/papers/2312.00858) accelerates [`StableDiffusionPipeline`] and [`StableDiffusionXLPipeline`] by strategically caching and reusing high-level features while efficiently updating low-level features by taking advantage of the U-Net architecture.
+
+Start by installing [DeepCache](https://github.com/horseee/DeepCache):
+```bash
+pip install DeepCache
+```
+
+Then load and enable the [`DeepCacheSDHelper`](https://github.com/horseee/DeepCache#usage):
+
+```diff
+ import torch
+ from diffusers import StableDiffusionPipeline
+ pipe = StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16).to("cuda")
+
++ from DeepCache import DeepCacheSDHelper
++ helper = DeepCacheSDHelper(pipe=pipe)
++ helper.set_params(
++ cache_interval=3,
++ cache_branch_id=0,
++ )
++ helper.enable()
+
+ image = pipe("a photo of an astronaut on a moon").images[0]
+```
+
+The `set_params` method accepts two arguments: `cache_interval` and `cache_branch_id`. `cache_interval` means the frequency of feature caching, specified as the number of steps between each cache operation. `cache_branch_id` identifies which branch of the network (ordered from the shallowest to the deepest layer) is responsible for executing the caching processes.
+Opting for a lower `cache_branch_id` or a larger `cache_interval` can lead to faster inference speed at the expense of reduced image quality (ablation experiments of these two hyperparameters can be found in the [paper](https://arxiv.org/abs/2312.00858)). Once those arguments are set, use the `enable` or `disable` methods to activate or deactivate the `DeepCacheSDHelper`.
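+
+For instance, a minimal sketch that toggles caching on the pipeline configured above to compare outputs (the prompt is reused from the snippet above):
+
+```python
+# Generate once with DeepCache enabled and once with it disabled to compare quality and latency.
+helper.enable()
+cached_image = pipe("a photo of an astronaut on a moon").images[0]
+
+helper.disable()
+baseline_image = pipe("a photo of an astronaut on a moon").images[0]
+```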
+
+
+
+
+
+You can find more generated samples (original pipeline vs DeepCache) and the corresponding inference latency in the [WandB report](https://wandb.ai/horseee/DeepCache/runs/jwlsqqgt?workspace=user-horseee). The prompts are randomly selected from the [MS-COCO 2017](https://cocodataset.org/#home) dataset.
+
+## Benchmark
+
+We tested how much faster DeepCache accelerates [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) with 50 inference steps on an NVIDIA RTX A5000, using different configurations for resolution, batch size, cache interval (I), and cache branch (B). Each cell reports latency in seconds, with the speedup over the original pipeline in parentheses.
+
+| **Resolution** | **Batch size** | **Original** | **DeepCache(I=3, B=0)** | **DeepCache(I=5, B=0)** | **DeepCache(I=5, B=1)** |
+|----------------|----------------|--------------|-------------------------|-------------------------|-------------------------|
+| 512| 8| 15.96| 6.88(2.32x)| 5.03(3.18x)| 7.27(2.20x)|
+| | 4| 8.39| 3.60(2.33x)| 2.62(3.21x)| 3.75(2.24x)|
+| | 1| 2.61| 1.12(2.33x)| 0.81(3.24x)| 1.11(2.35x)|
+| 768| 8| 43.58| 18.99(2.29x)| 13.96(3.12x)| 21.27(2.05x)|
+| | 4| 22.24| 9.67(2.30x)| 7.10(3.13x)| 10.74(2.07x)|
+| | 1| 6.33| 2.72(2.33x)| 1.97(3.21x)| 2.98(2.12x)|
+| 1024| 8| 101.95| 45.57(2.24x)| 33.72(3.02x)| 53.00(1.92x)|
+| | 4| 49.25| 21.86(2.25x)| 16.19(3.04x)| 25.78(1.91x)|
+| | 1| 13.83| 6.07(2.28x)| 4.43(3.12x)| 7.15(1.93x)|
diff --git a/docs/source/en/optimization/fp16.md b/docs/source/en/optimization/fp16.md
new file mode 100644
index 0000000..7a2cf93
--- /dev/null
+++ b/docs/source/en/optimization/fp16.md
@@ -0,0 +1,74 @@
+
+
+# Speed up inference
+
+There are several ways to optimize ๐ค Diffusers for inference speed. As a general rule of thumb, we recommend using either [xFormers](xformers) or `torch.nn.functional.scaled_dot_product_attention` in PyTorch 2.0 for their memory-efficient attention.
+
+
+
+In many cases, optimizing for speed or memory leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about preserving memory in the [Reduce memory usage](memory) guide.
+
+
+
+The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect.
+
+| | latency | speed-up |
+| ---------------- | ------- | ------- |
+| original | 9.50s | x1 |
+| fp16 | 3.61s | x2.63 |
+| channels last | 3.30s | x2.88 |
+| traced UNet | 3.21s | x2.96 |
+| memory efficient attention | 2.63s | x3.61 |
+
+## Use TensorFloat-32
+
+On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speed up computations with a typically negligible loss in numerical accuracy.
+
+```python
+import torch
+
+torch.backends.cuda.matmul.allow_tf32 = True
+```
+
+You can learn more about TF32 in the [Mixed precision training](https://huggingface.co/docs/transformers/en/perf_train_gpu_one#tf32) guide.
+
+## Half-precision weights
+
+To save GPU memory and get more speed, try loading and running the model weights directly in half-precision or float16:
+
+```Python
+import torch
+from diffusers import DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+)
+pipe = pipe.to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+image = pipe(prompt).images[0]
+```
+
+
+
+Don't use [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines as it can lead to black images and is always slower than pure float16 precision.
+
+
+
+## Distilled model
+
+You could also use a distilled Stable Diffusion model and autoencoder to speed up inference. During distillation, many of the UNet's residual and attention blocks are shed to reduce the model size. The distilled model is faster and uses less memory while generating images of comparable quality to the full Stable Diffusion model.
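+
+For example (a minimal sketch; the `nota-ai/bk-sdm-small` distilled UNet checkpoint and the `madebyollin/taesd` tiny autoencoder are illustrative choices, not the only options), you might load a distilled model like this:
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline, AutoencoderTiny
+
+# distilled Stable Diffusion checkpoint with a smaller UNet
+distilled = StableDiffusionPipeline.from_pretrained(
+    "nota-ai/bk-sdm-small", torch_dtype=torch.float16
+).to("cuda")
+
+# optionally swap in a tiny autoencoder to speed up decoding as well
+distilled.vae = AutoencoderTiny.from_pretrained(
+    "madebyollin/taesd", torch_dtype=torch.float16
+).to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+image = distilled(prompt).images[0]
+```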
+
+Learn more about it in the [Distilled Stable Diffusion inference](../using-diffusers/distilled_sd) guide!
diff --git a/docs/source/en/optimization/habana.md b/docs/source/en/optimization/habana.md
new file mode 100644
index 0000000..a1123d9
--- /dev/null
+++ b/docs/source/en/optimization/habana.md
@@ -0,0 +1,76 @@
+
+
+# Habana Gaudi
+
+๐ค Diffusers is compatible with Habana Gaudi through ๐ค [Optimum](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion). Follow the [installation](https://docs.habana.ai/en/latest/Installation_Guide/index.html) guide to install the SynapseAI and Gaudi drivers, and then install Optimum Habana:
+
+```bash
+python -m pip install --upgrade-strategy eager optimum[habana]
+```
+
+To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two classes:
+
+- [`~optimum.habana.diffusers.GaudiStableDiffusionPipeline`], a pipeline for text-to-image generation.
+- [`~optimum.habana.diffusers.GaudiDDIMScheduler`], a Gaudi-optimized scheduler.
+
+When you initialize the pipeline, you have to specify `use_habana=True` to deploy it on HPUs. To get the fastest possible generation, enable **HPU graphs** with `use_hpu_graphs=True`.
+
+Finally, specify a [`~optimum.habana.GaudiConfig`] which can be downloaded from the [Habana](https://huggingface.co/Habana) organization on the Hub.
+
+```python
+from optimum.habana import GaudiConfig
+from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline
+
+model_name = "stabilityai/stable-diffusion-2-base"
+scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")
+pipeline = GaudiStableDiffusionPipeline.from_pretrained(
+ model_name,
+ scheduler=scheduler,
+ use_habana=True,
+ use_hpu_graphs=True,
+ gaudi_config="Habana/stable-diffusion-2",
+)
+```
+
+Now you can call the pipeline to generate images by batches from one or several prompts:
+
+```python
+outputs = pipeline(
+ prompt=[
+ "High quality photo of an astronaut riding a horse in space",
+ "Face of a yellow cat, high resolution, sitting on a park bench",
+ ],
+ num_images_per_prompt=10,
+ batch_size=4,
+)
+```
+
+For more information, check out ๐ค Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official GitHub repository.
+
+## Benchmark
+
+We benchmarked Habana's first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32) to demonstrate their performance.
+
+For [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) on 512x512 images:
+
+| | Latency (batch size = 1) | Throughput |
+| ---------------------- |:------------------------:|:---------------------------:|
+| first-generation Gaudi | 3.80s | 0.308 images/s (batch size = 8) |
+| Gaudi2 | 1.33s | 1.081 images/s (batch size = 8) |
+
+For [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) on 768x768 images:
+
+| | Latency (batch size = 1) | Throughput |
+| ---------------------- |:------------------------:|:-------------------------------:|
+| first-generation Gaudi | 10.2s | 0.108 images/s (batch size = 4) |
+| Gaudi2 | 3.17s | 0.379 images/s (batch size = 8) |
diff --git a/docs/source/en/optimization/memory.md b/docs/source/en/optimization/memory.md
new file mode 100644
index 0000000..6b2a22b
--- /dev/null
+++ b/docs/source/en/optimization/memory.md
@@ -0,0 +1,332 @@
+
+
+# Reduce memory usage
+
+A barrier to using diffusion models is the large amount of memory required. To overcome this challenge, there are several memory-reducing techniques you can use to run even some of the largest models on free-tier or consumer GPUs. Some of these techniques can even be combined to further reduce memory usage.
+
+
+
+In many cases, optimizing for memory or speed leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on minimizing memory usage, but you can also learn more about how to [Speed up inference](fp16).
+
+
+
+The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect as a result of reduced memory consumption.
+
+| | latency | speed-up |
+| ---------------- | ------- | ------- |
+| original | 9.50s | x1 |
+| fp16 | 3.61s | x2.63 |
+| channels last | 3.30s | x2.88 |
+| traced UNet | 3.21s | x2.96 |
+| memory-efficient attention | 2.63s | x3.61 |
+
+## Sliced VAE
+
+Sliced VAE enables decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time. You'll likely want to couple this with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to reduce memory use further if you have xFormers installed.
+
+To use sliced VAE, call [`~StableDiffusionPipeline.enable_vae_slicing`] on your pipeline before inference:
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+)
+pipe = pipe.to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_vae_slicing()
+#pipe.enable_xformers_memory_efficient_attention()
+images = pipe([prompt] * 32).images
+```
+
+You may see a small performance boost in VAE decoding on multi-image batches, and there should be no performance impact on single-image batches.
+
+## Tiled VAE
+
+Tiled VAE processing also enables working with large images on limited VRAM (for example, generating 4k images on 8GB of VRAM) by splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image. You should also use tiled VAE with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to reduce memory use further if you have xFormers installed.
+
+To use tiled VAE processing, call [`~StableDiffusionPipeline.enable_vae_tiling`] on your pipeline before inference:
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
+
+pipe = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+)
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+pipe = pipe.to("cuda")
+prompt = "a beautiful landscape photograph"
+pipe.enable_vae_tiling()
+#pipe.enable_xformers_memory_efficient_attention()
+
+image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0]
+```
+
+The output image has some tile-to-tile tone variation because the tiles are decoded separately, but you shouldn't see any sharp and obvious seams between the tiles. Tiling is turned off for images that are 512x512 or smaller.
+
+## CPU offloading
+
+Offloading the weights to the CPU and only loading them on the GPU when performing the forward pass can also save memory. Often, this technique can reduce memory consumption to less than 3GB.
+
+To perform CPU offloading, call [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+)
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_sequential_cpu_offload()
+image = pipe(prompt).images[0]
+```
+
+CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different UNet submodules are sequentially onloaded and offloaded as needed, resulting in a large number of memory transfers.
+
+
+
+Consider using [model offloading](#model-offloading) if you want to optimize for speed because it is much faster. The tradeoff is your memory savings won't be as large.
+
+
+
+
+
+When using [`~StableDiffusionPipeline.enable_sequential_cpu_offload`], don't move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal (see this [issue](https://github.com/huggingface/diffusers/issues/1934) for more information).
+
+[`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models.
+
+
+
+## Model offloading
+
+
+
+Model offloading requires ๐ค Accelerate version 0.17.0 or higher.
+
+
+
+[Sequential CPU offloading](#cpu-offloading) preserves a lot of memory but it makes inference slower because submodules are moved to GPU as needed, and they're immediately returned to the CPU when a new module runs.
+
+Full-model offloading is an alternative that moves whole models to the GPU, instead of handling each model's constituent *submodules*. There is a negligible impact on inference time (compared with moving the pipeline to `cuda`), and it still provides some memory savings.
+
+During model offloading, only one of the main components of the pipeline (typically the text encoder, UNet and VAE)
+is placed on the GPU while the others wait on the CPU. Components like the UNet that run for multiple iterations stay on the GPU until they're no longer needed.
+
+Enable model offloading by calling [`~StableDiffusionPipeline.enable_model_cpu_offload`] on the pipeline:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+)
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_model_cpu_offload()
+image = pipe(prompt).images[0]
+```
+
+
+
+To properly offload models after they're called, the entire pipeline must be run and its models must be called in the pipeline's expected order. Exercise caution if models are reused outside the pipeline context after hooks have been installed. See [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) for more information.
+
+[`~StableDiffusionPipeline.enable_model_cpu_offload`] is a stateful operation that installs hooks on the models and state on the pipeline.
+
+
+
+## Channels-last memory format
+
+The channels-last memory format is an alternative way of ordering NCHW tensors in memory to preserve dimension ordering. Channels-last tensors are ordered in such a way that the channels become the densest dimension (storing images pixel-per-pixel). Since not all operators currently support the channels-last format, it may result in worse performance, but you should still try it and see if it works for your model.
+
+For example, to set the pipeline's UNet to use the channels-last format:
+
+```python
+print(pipe.unet.conv_out.state_dict()["weight"].stride()) # (2880, 9, 3, 1)
+pipe.unet.to(memory_format=torch.channels_last) # in-place operation
+print(
+ pipe.unet.conv_out.state_dict()["weight"].stride()
+) # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works
+```
+
+## Tracing
+
+Tracing runs an example input tensor through the model and captures the operations that are performed on it as that input makes its way through the model's layers. The executable or `ScriptFunction` that is returned is optimized with just-in-time compilation.
+
+To trace a UNet:
+
+```python
+import time
+import torch
+from diffusers import StableDiffusionPipeline
+import functools
+
+# torch disable grad
+torch.set_grad_enabled(False)
+
+# set variables
+n_experiments = 2
+unet_runs_per_experiment = 50
+
+
+# load inputs
+def generate_inputs():
+ sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
+ timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
+ encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
+ return sample, timestep, encoder_hidden_states
+
+
+pipe = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+).to("cuda")
+unet = pipe.unet
+unet.eval()
+unet.to(memory_format=torch.channels_last) # use channels_last memory format
+unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default
+
+# warmup
+for _ in range(3):
+ with torch.inference_mode():
+ inputs = generate_inputs()
+ orig_output = unet(*inputs)
+
+# trace
+print("tracing..")
+unet_traced = torch.jit.trace(unet, inputs)
+unet_traced.eval()
+print("done tracing")
+
+
+# warmup and optimize graph
+for _ in range(5):
+ with torch.inference_mode():
+ inputs = generate_inputs()
+ orig_output = unet_traced(*inputs)
+
+
+# benchmarking
+with torch.inference_mode():
+ for _ in range(n_experiments):
+ torch.cuda.synchronize()
+ start_time = time.time()
+ for _ in range(unet_runs_per_experiment):
+ orig_output = unet_traced(*inputs)
+ torch.cuda.synchronize()
+ print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
+ for _ in range(n_experiments):
+ torch.cuda.synchronize()
+ start_time = time.time()
+ for _ in range(unet_runs_per_experiment):
+ orig_output = unet(*inputs)
+ torch.cuda.synchronize()
+ print(f"unet inference took {time.time() - start_time:.2f} seconds")
+
+# save the model
+unet_traced.save("unet_traced.pt")
+```
+
+Replace the `unet` attribute of the pipeline with the traced model:
+
+```python
+from diffusers import StableDiffusionPipeline
+import torch
+from dataclasses import dataclass
+
+
+@dataclass
+class UNet2DConditionOutput:
+ sample: torch.FloatTensor
+
+
+pipe = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+).to("cuda")
+
+# use jitted unet
+unet_traced = torch.jit.load("unet_traced.pt")
+
+
+# del pipe.unet
+class TracedUNet(torch.nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.in_channels = pipe.unet.config.in_channels
+ self.device = pipe.unet.device
+
+ def forward(self, latent_model_input, t, encoder_hidden_states):
+ sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
+ return UNet2DConditionOutput(sample=sample)
+
+
+pipe.unet = TracedUNet()
+
+prompt = "a photo of an astronaut riding a horse on mars"
+
+with torch.inference_mode():
+    image = pipe([prompt] * 1, num_inference_steps=50).images[0]
+```
+
+## Memory-efficient attention
+
+Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/abs/2205.14135) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)).
+
+
+
+If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling `xformers`.
+
+
+
+To use Flash Attention, install the following:
+
+- PyTorch > 1.12
+- CUDA available
+- [xFormers](xformers)
+
+Then call [`~ModelMixin.enable_xformers_memory_efficient_attention`] on the pipeline:
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+pipe = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+).to("cuda")
+
+pipe.enable_xformers_memory_efficient_attention()
+
+with torch.inference_mode():
+ sample = pipe("a small cat")
+
+# optional: You can disable it via
+# pipe.disable_xformers_memory_efficient_attention()
+```
+
+The iteration speed when using `xformers` should match the iteration speed of PyTorch 2.0 as described [here](torch2.0).
diff --git a/docs/source/en/optimization/mps.md b/docs/source/en/optimization/mps.md
new file mode 100644
index 0000000..d0cabfe
--- /dev/null
+++ b/docs/source/en/optimization/mps.md
@@ -0,0 +1,74 @@
+
+
+# Metal Performance Shaders (MPS)
+
+๐ค Diffusers is compatible with Apple silicon (M1/M2 chips) using the PyTorch [`mps`](https://pytorch.org/docs/stable/notes/mps.html) device, which uses the Metal framework to leverage the GPU on MacOS devices. You'll need to have:
+
+- macOS computer with Apple silicon (M1/M2) hardware
+- macOS 12.6 or later (13.0 or later recommended)
+- arm64 version of Python
+- [PyTorch 2.0](https://pytorch.org/get-started/locally/) (recommended) or 1.13 (minimum version supported for `mps`)
+
+The `mps` backend uses PyTorch's `.to()` interface to move the Stable Diffusion pipeline onto your M1 or M2 device:
+
+```python
+from diffusers import DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+pipe = pipe.to("mps")
+
+# Recommended if your computer has < 64 GB of RAM
+pipe.enable_attention_slicing()
+
+prompt = "a photo of an astronaut riding a horse on mars"
+image = pipe(prompt).images[0]
+image
+```
+
+
+
+Generating multiple prompts in a batch can [crash](https://github.com/huggingface/diffusers/issues/363) or fail to work reliably. We believe this is related to the [`mps`](https://github.com/pytorch/pytorch/issues/84039) backend in PyTorch. While this is being investigated, you should iterate instead of batching.
+
+
+
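+If you need several images, a simple workaround (a minimal sketch of the iterate-instead-of-batch approach described above) is to call the pipeline once per prompt:
+
+```python
+from diffusers import DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("mps")
+pipe.enable_attention_slicing()
+
+prompts = [
+    "a photo of an astronaut riding a horse on mars",
+    "a watercolor painting of a fox in a forest",
+]
+
+# call the pipeline once per prompt instead of passing the whole list in one batch
+images = [pipe(prompt).images[0] for prompt in prompts]
+```
+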
+If you're using **PyTorch 1.13**, you need to "prime" the pipeline with an additional one-time pass through it. This is a temporary workaround for an issue where the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and after just one inference step you can discard the result.
+
+```diff
+ from diffusers import DiffusionPipeline
+
+ pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("mps")
+ pipe.enable_attention_slicing()
+
+ prompt = "a photo of an astronaut riding a horse on mars"
+ # First-time "warmup" pass if PyTorch version is 1.13
++ _ = pipe(prompt, num_inference_steps=1)
+
+ # Results match those from the CPU device after the warmup pass.
+ image = pipe(prompt).images[0]
+```
+
+## Troubleshoot
+
+M1/M2 performance is very sensitive to memory pressure. When memory pressure builds up, the system automatically starts swapping, which significantly degrades performance.
+
+To prevent this from happening, we recommend *attention slicing* to reduce memory pressure during inference and prevent swapping. This is especially relevant if your computer has less than 64GB of system RAM, or if you generate images at non-standard resolutions larger than 512x512 pixels. Call the [`~DiffusionPipeline.enable_attention_slicing`] function on your pipeline:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("mps")
+pipeline.enable_attention_slicing()
+```
+
+Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually has a performance impact of about 20% on computers without universal memory, but we have observed *better performance* on most Apple silicon computers unless you have 64GB of RAM or more.
diff --git a/docs/source/en/optimization/onnx.md b/docs/source/en/optimization/onnx.md
new file mode 100644
index 0000000..486f450
--- /dev/null
+++ b/docs/source/en/optimization/onnx.md
@@ -0,0 +1,86 @@
+
+
+# ONNX Runtime
+
+๐ค [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime. You'll need to install ๐ค Optimum with the following command for ONNX Runtime support:
+
+```bash
+pip install -q optimum["onnxruntime"]
+```
+
+This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with ONNX Runtime.
+
+## Stable Diffusion
+
+To load and run inference, use the [`~optimum.onnxruntime.ORTStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the ONNX format on-the-fly, set `export=True`:
+
+```python
+from optimum.onnxruntime import ORTStableDiffusionPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True)
+prompt = "sailing ship in storm by Leonardo da Vinci"
+image = pipeline(prompt).images[0]
+pipeline.save_pretrained("./onnx-stable-diffusion-v1-5")
+```
+
+
+
+Generating multiple prompts in a batch seems to take too much memory. While we look into it, you may need to iterate instead of batching.
+
+
+
+To export the pipeline in the ONNX format offline and use it later for inference,
+use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:
+
+```bash
+optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 sd_v15_onnx/
+```
+
+Then to perform inference (you don't have to specify `export=True` again):
+
+```python
+from optimum.onnxruntime import ORTStableDiffusionPipeline
+
+model_id = "sd_v15_onnx"
+pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id)
+prompt = "sailing ship in storm by Leonardo da Vinci"
+image = pipeline(prompt).images[0]
+```
+
+
+
+
+
+You can find more examples in the ๐ค Optimum [documentation](https://huggingface.co/docs/optimum/), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting.
+
+## Stable Diffusion XL
+
+To load and run inference with SDXL, use the [`~optimum.onnxruntime.ORTStableDiffusionXLPipeline`]:
+
+```python
+from optimum.onnxruntime import ORTStableDiffusionXLPipeline
+
+model_id = "stabilityai/stable-diffusion-xl-base-1.0"
+pipeline = ORTStableDiffusionXLPipeline.from_pretrained(model_id)
+prompt = "sailing ship in storm by Leonardo da Vinci"
+image = pipeline(prompt).images[0]
+```
+
+To export the pipeline in the ONNX format and use it later for inference, use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:
+
+```bash
+optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl sd_xl_onnx/
+```
+
+SDXL in the ONNX format is supported for text-to-image and image-to-image.
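+
+As a brief illustration of the image-to-image case (a minimal sketch; it assumes the [`~optimum.onnxruntime.ORTStableDiffusionXLImg2ImgPipeline`] class can load the `sd_xl_onnx/` export produced by the command above):
+
+```python
+from optimum.onnxruntime import ORTStableDiffusionXLImg2ImgPipeline
+from diffusers.utils import load_image
+
+# load the previously exported SDXL model for image-to-image generation
+pipeline = ORTStableDiffusionXLImg2ImgPipeline.from_pretrained("sd_xl_onnx/")
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+init_image = load_image(url).resize((1024, 1024))
+
+prompt = "sailing ship in storm by Leonardo da Vinci"
+image = pipeline(prompt, image=init_image).images[0]
+```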
diff --git a/docs/source/en/optimization/open_vino.md b/docs/source/en/optimization/open_vino.md
new file mode 100644
index 0000000..aa51c4b
--- /dev/null
+++ b/docs/source/en/optimization/open_vino.md
@@ -0,0 +1,80 @@
+
+
+# OpenVINO
+
+๐ค [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO to perform inference on a variety of Intel processors (see the [full list](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) of supported devices).
+
+You'll need to install ๐ค Optimum Intel with the `--upgrade-strategy eager` option to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is using the latest version:
+
+```bash
+pip install --upgrade-strategy eager optimum["openvino"]
+```
+
+This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with OpenVINO.
+
+## Stable Diffusion
+
+To load and run inference, use the [`~optimum.intel.OVStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, set `export=True`:
+
+```python
+from optimum.intel import OVStableDiffusionPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
+prompt = "sailing ship in storm by Rembrandt"
+image = pipeline(prompt).images[0]
+
+# Don't forget to save the exported model
+pipeline.save_pretrained("openvino-sd-v1-5")
+```
+
+To further speed up inference, statically reshape the model. If you change any parameters such as the output height or width, you'll need to statically reshape your model again.
+
+```python
+# Define the shapes related to the inputs and desired outputs
+batch_size, num_images, height, width = 1, 1, 512, 512
+
+# Statically reshape the model
+pipeline.reshape(batch_size, height, width, num_images)
+# Compile the model before inference
+pipeline.compile()
+
+image = pipeline(
+ prompt,
+ height=height,
+ width=width,
+ num_images_per_prompt=num_images,
+).images[0]
+```
+
+
+
+
+You can find more examples in the ๐ค Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting.
+
+## Stable Diffusion XL
+
+To load and run inference with SDXL, use the [`~optimum.intel.OVStableDiffusionXLPipeline`]:
+
+```python
+from optimum.intel import OVStableDiffusionXLPipeline
+
+model_id = "stabilityai/stable-diffusion-xl-base-1.0"
+pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_id)
+prompt = "sailing ship in storm by Rembrandt"
+image = pipeline(prompt).images[0]
+```
+
+To further speed up inference, [statically reshape](#stable-diffusion) the model as shown in the Stable Diffusion section.
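+
+Continuing from the SDXL snippet above (a rough sketch, assuming the SDXL pipeline exposes the same `reshape` and `compile` methods as the Stable Diffusion pipeline):
+
+```python
+# define the static shapes (1024x1024 is the native SDXL resolution)
+batch_size, num_images, height, width = 1, 1, 1024, 1024
+
+# statically reshape and compile the model before inference
+pipeline.reshape(batch_size, height, width, num_images)
+pipeline.compile()
+
+image = pipeline(
+    prompt,
+    height=height,
+    width=width,
+    num_images_per_prompt=num_images,
+).images[0]
+```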
+
+You can find more examples in the ๐ค Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion-xl), and running SDXL in OpenVINO is supported for text-to-image and image-to-image.
diff --git a/docs/source/en/optimization/opt_overview.md b/docs/source/en/optimization/opt_overview.md
new file mode 100644
index 0000000..713f91a
--- /dev/null
+++ b/docs/source/en/optimization/opt_overview.md
@@ -0,0 +1,17 @@
+
+
+# Overview
+
+Generating high-quality outputs is computationally intensive, especially during each iterative step where you go from a noisy output to a less noisy output. One of ๐ค Diffusers' goals is to make this technology widely accessible to everyone, which includes enabling fast inference on consumer and specialized hardware.
+
+This section will cover tips and tricks - like half-precision weights and sliced attention - for optimizing inference speed and reducing memory consumption. You'll also learn how to speed up your PyTorch code with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) or [ONNX Runtime](https://onnxruntime.ai/docs/), and enable memory-efficient attention with [xFormers](https://facebookresearch.github.io/xformers/). There are also guides for running inference on specific hardware like Apple Silicon, and Intel or Habana processors.
diff --git a/docs/source/en/optimization/tome.md b/docs/source/en/optimization/tome.md
new file mode 100644
index 0000000..9f22087
--- /dev/null
+++ b/docs/source/en/optimization/tome.md
@@ -0,0 +1,96 @@
+
+
+# Token merging
+
+[Token merging](https://huggingface.co/papers/2303.17604) (ToMe) progressively merges redundant tokens/patches in the forward pass of a Transformer-based network, which can reduce the inference latency of [`StableDiffusionPipeline`].
+
+Install ToMe from `pip`:
+
+```bash
+pip install tomesd
+```
+
+You can use ToMe from the [`tomesd`](https://github.com/dbolya/tomesd) library with the [`apply_patch`](https://github.com/dbolya/tomesd?tab=readme-ov-file#usage) function:
+
+```diff
+ from diffusers import StableDiffusionPipeline
+ import torch
+ import tomesd
+
+ pipeline = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
+ ).to("cuda")
++ tomesd.apply_patch(pipeline, ratio=0.5)
+
+ image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
+```
+
+The `apply_patch` function exposes a number of [arguments](https://github.com/dbolya/tomesd#usage) to help strike a balance between pipeline inference speed and the quality of the generated images. The most important argument is `ratio`, which controls the number of tokens that are merged during the forward pass.
+
+As reported in the [paper](https://huggingface.co/papers/2303.17604), ToMe can greatly preserve the quality of the generated images while boosting inference speed. By increasing the `ratio`, you can speed up inference even further, but at the cost of some degraded image quality.
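+
+For instance (a minimal sketch; `ratio=0.75` is just an illustrative value), you could patch the pipeline more aggressively and later undo the patch with the `tomesd.remove_patch` helper:
+
+```python
+import torch
+import tomesd
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
+).to("cuda")
+
+# a higher ratio merges more tokens: faster inference, but image quality may degrade
+tomesd.apply_patch(pipeline, ratio=0.75)
+fast_image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
+
+# undo the patch to restore the original attention behavior
+tomesd.remove_patch(pipeline)
+```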
+
+To test the quality of the generated images, we sampled a few prompts from [Parti Prompts](https://parti.research.google/) and performed inference with the [`StableDiffusionPipeline`] with the following settings:
+
+
+
+
+
+We didn't notice any significant decrease in the quality of the generated samples, and you can check out the generated samples in this [WandB report](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). If you're interested in reproducing this experiment, use this [script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd).
+
+## Benchmarks
+
+We also benchmarked the impact of `tomesd` on the [`StableDiffusionPipeline`] with [xFormers](https://huggingface.co/docs/diffusers/optimization/xformers) enabled across several image resolutions. The results are obtained from A100 and V100 GPUs in the following development environment:
+
+```bash
+- `diffusers` version: 0.15.1
+- Python version: 3.8.16
+- PyTorch version (GPU?): 1.13.1+cu116 (True)
+- Huggingface_hub version: 0.13.2
+- Transformers version: 4.27.2
+- Accelerate version: 0.18.0
+- xFormers version: 0.0.16
+- tomesd version: 0.1.2
+```
+
+To reproduce this benchmark, feel free to use this [script](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). The results are reported in seconds, and where applicable we report the speed-up percentage over the vanilla pipeline when using ToMe and ToMe + xFormers.
+
+| **GPU** | **Resolution** | **Batch size** | **Vanilla** | **ToMe** | **ToMe + xFormers** |
+|----------|----------------|----------------|-------------|----------------|---------------------|
+| **A100** | 512 | 10 | 6.88 | 5.26 (+23.55%) | 4.69 (+31.83%) |
+| | 768 | 10 | OOM | 14.71 | 11 |
+| | | 8 | OOM | 11.56 | 8.84 |
+| | | 4 | OOM | 5.98 | 4.66 |
+| | | 2 | 4.99 | 3.24 (+35.07%) | 2.1 (+37.88%) |
+| | | 1 | 3.29 | 2.24 (+31.91%) | 2.03 (+38.3%) |
+| | 1024 | 10 | OOM | OOM | OOM |
+| | | 8 | OOM | OOM | OOM |
+| | | 4 | OOM | 12.51 | 9.09 |
+| | | 2 | OOM | 6.52 | 4.96 |
+| | | 1 | 6.4 | 3.61 (+43.59%) | 2.81 (+56.09%) |
+| **V100** | 512 | 10 | OOM | 10.03 | 9.29 |
+| | | 8 | OOM | 8.05 | 7.47 |
+| | | 4 | 5.7 | 4.3 (+24.56%) | 3.98 (+30.18%) |
+| | | 2 | 3.14 | 2.43 (+22.61%) | 2.27 (+27.71%) |
+| | | 1 | 1.88 | 1.57 (+16.49%) | 1.57 (+16.49%) |
+| | 768 | 10 | OOM | OOM | 23.67 |
+| | | 8 | OOM | OOM | 18.81 |
+| | | 4 | OOM | 11.81 | 9.7 |
+| | | 2 | OOM | 6.27 | 5.2 |
+| | | 1 | 5.43 | 3.38 (+37.75%) | 2.82 (+48.07%) |
+| | 1024 | 10 | OOM | OOM | OOM |
+| | | 8 | OOM | OOM | OOM |
+| | | 4 | OOM | OOM | 19.35 |
+| | | 2 | OOM | 13 | 10.78 |
+| | | 1 | OOM | 6.66 | 5.54 |
+
+As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](torch2.0).
diff --git a/docs/source/en/optimization/torch2.0.md b/docs/source/en/optimization/torch2.0.md
new file mode 100644
index 0000000..2475bb5
--- /dev/null
+++ b/docs/source/en/optimization/torch2.0.md
@@ -0,0 +1,421 @@
+
+
+# PyTorch 2.0
+
+๐ค Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) which include:
+
+1. A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers.
+2. [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled.
+
+Both of these optimizations require PyTorch 2.0 or later and ๐ค Diffusers > 0.13.0.
+
+```bash
+pip install --upgrade torch diffusers
+```
+
+## Scaled dot product attention
+
+[`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) is an optimized and memory-efficient attention (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you're using PyTorch 2.0 and the latest version of ๐ค Diffusers, so you don't need to add anything to your code.
+
+However, if you want to explicitly enable it, you can set a [`DiffusionPipeline`] to use [`~models.attention_processor.AttnProcessor2_0`]:
+
+```diff
+ import torch
+ from diffusers import DiffusionPipeline
++ from diffusers.models.attention_processor import AttnProcessor2_0
+
+ pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
++ pipe.unet.set_attn_processor(AttnProcessor2_0())
+
+ prompt = "a photo of an astronaut riding a horse on mars"
+ image = pipe(prompt).images[0]
+```
+
+SDPA should be as fast and memory efficient as `xFormers`; check the [benchmark](#benchmark) for more details.
+
+In some cases - such as making the pipeline more deterministic or converting it to other formats - it may be helpful to use the vanilla attention processor, [`~models.attention_processor.AttnProcessor`]. To revert to [`~models.attention_processor.AttnProcessor`], call the [`~UNet2DConditionModel.set_default_attn_processor`] function on the pipeline:
+
+```diff
+ import torch
+ from diffusers import DiffusionPipeline
+
+ pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
++ pipe.unet.set_default_attn_processor()
+
+ prompt = "a photo of an astronaut riding a horse on mars"
+ image = pipe(prompt).images[0]
+```
+
+## torch.compile
+
+The `torch.compile` function can often provide an additional speed-up to your PyTorch code. In ๐ค Diffusers, it is usually best to wrap the UNet with `torch.compile` because it does most of the heavy lifting in the pipeline.
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "a photo of an astronaut riding a horse on mars"
+steps, batch_size = 30, 4  # example values
+images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images
+```
+
+Depending on GPU type, `torch.compile` can provide an *additional speed-up* of **5-300x** on top of SDPA! If you're using more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100), `torch.compile` is able to squeeze even more performance out of these GPUs.
+
+Compilation requires some time to complete, so it is best suited for situations where you prepare your pipeline once and then perform the same type of inference operations multiple times. For example, calling the compiled pipeline on a different image size triggers compilation again which can be expensive.
+
+For more information and different options about `torch.compile`, refer to the [`torch_compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) tutorial.
+
+> [!TIP]
+> Learn more about other ways PyTorch 2.0 can help optimize your model in the [Accelerate inference of text-to-image diffusion models](../tutorials/fast_diffusion) tutorial.
+
+## Benchmark
+
+We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. The code is benchmarked on ๐ค Diffusers v0.17.0.dev0 to optimize `torch.compile` usage (see [here](https://github.com/huggingface/diffusers/pull/3313) for more details).
+
+The code used to benchmark each pipeline is shown below:
+
+
+
+### Stable Diffusion text-to-image
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+path = "runwayml/stable-diffusion-v1-5"
+
+run_compile = True # Set True / False
+
+pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
+pipe = pipe.to("cuda")
+pipe.unet.to(memory_format=torch.channels_last)
+
+if run_compile:
+ print("Run torch compile")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "ghibli style, a fantasy landscape with castles"
+
+for _ in range(3):
+ images = pipe(prompt=prompt).images
+```
+
+### Stable Diffusion image-to-image
+
+```python
+from diffusers import StableDiffusionImg2ImgPipeline
+from diffusers.utils import load_image
+import torch
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+
+init_image = load_image(url)
+init_image = init_image.resize((512, 512))
+
+path = "runwayml/stable-diffusion-v1-5"
+
+run_compile = True # Set True / False
+
+pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
+pipe = pipe.to("cuda")
+pipe.unet.to(memory_format=torch.channels_last)
+
+if run_compile:
+ print("Run torch compile")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "ghibli style, a fantasy landscape with castles"
+
+for _ in range(3):
+ image = pipe(prompt=prompt, image=init_image).images[0]
+```
+
+### Stable Diffusion inpainting
+
+```python
+from diffusers import StableDiffusionInpaintPipeline
+from diffusers.utils import load_image
+import torch
+
+img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+
+init_image = load_image(img_url).resize((512, 512))
+mask_image = load_image(mask_url).resize((512, 512))
+
+path = "runwayml/stable-diffusion-inpainting"
+
+run_compile = True # Set True / False
+
+pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
+pipe = pipe.to("cuda")
+pipe.unet.to(memory_format=torch.channels_last)
+
+if run_compile:
+ print("Run torch compile")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "ghibli style, a fantasy landscape with castles"
+
+for _ in range(3):
+ image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
+```
+
+### ControlNet
+
+```python
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+from diffusers.utils import load_image
+import torch
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+
+init_image = load_image(url)
+init_image = init_image.resize((512, 512))
+
+path = "runwayml/stable-diffusion-v1-5"
+
+run_compile = True # Set True / False
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
+pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
+)
+
+pipe = pipe.to("cuda")
+pipe.unet.to(memory_format=torch.channels_last)
+pipe.controlnet.to(memory_format=torch.channels_last)
+
+if run_compile:
+ print("Run torch compile")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+ pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "ghibli style, a fantasy landscape with castles"
+
+for _ in range(3):
+ image = pipe(prompt=prompt, image=init_image).images[0]
+```
+
+### DeepFloyd IF text-to-image + upscaling
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+run_compile = True # Set True / False
+
+pipe_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
+pipe_1.to("cuda")
+pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
+pipe_2.to("cuda")
+pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, use_safetensors=True)
+pipe_3.to("cuda")
+
+
+pipe_1.unet.to(memory_format=torch.channels_last)
+pipe_2.unet.to(memory_format=torch.channels_last)
+pipe_3.unet.to(memory_format=torch.channels_last)
+
+if run_compile:
+ pipe_1.unet = torch.compile(pipe_1.unet, mode="reduce-overhead", fullgraph=True)
+ pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True)
+ pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "the blue hulk"
+
+prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
+neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
+
+for _ in range(3):
+ image_1 = pipe_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
+ image_2 = pipe_2(image=image_1, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
+ image_3 = pipe_3(prompt=prompt, image=image_1, noise_level=100).images
+```
+
+
+The graph below highlights the relative speed-ups for the [`StableDiffusionPipeline`] across five GPU families with PyTorch 2.0 and `torch.compile` enabled. The benchmarks for the following graphs are measured in *number of iterations/second*.
+
+
+
+To give you an even better idea of how this speed-up holds for the other pipelines, consider the following
+graph for an A100 with PyTorch 2.0 and `torch.compile`:
+
+
+
+In the following tables, we report our findings in terms of the *number of iterations/second*.
+
+### A100 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 |
+| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 |
+| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 |
+| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 |
+| IF | 20.21 / 13.84 / 24.00 | 20.12 / 13.70 / 24.03 | โ | 97.34 / 27.23 / 111.66 |
+| SDXL - txt2img | 8.64 | 9.9 | - | - |
+
+### A100 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 |
+| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 |
+| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 |
+| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 |
+| IF | 25.02 | 18.04 | โ | 48.47 |
+| SDXL - txt2img | 2.44 | 2.74 | - | - |
+
+### A100 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 |
+| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 |
+| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 |
+| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 |
+| IF | 8.78 | 9.82 | โ | 16.77 |
+| SDXL - txt2img | 0.64 | 0.72 | - | - |
+
+### V100 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 |
+| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 |
+| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 |
+| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 |
+| IF | 20.01 / 9.08 / 23.34 | 19.79 / 8.98 / 24.10 | โ | 55.75 / 11.57 / 57.67 |
+
+### V100 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 |
+| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 |
+| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 |
+| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 |
+| IF | 15.41 | 14.76 | โ | 22.95 |
+
+### V100 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 |
+| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 |
+| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 |
+| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 |
+| IF | 5.43 | 5.29 | โ | 7.06 |
+
+### T4 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 |
+| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 |
+| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 |
+| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 |
+| IF | 17.42 / 2.47 / 18.52 | 16.96 / 2.45 / 18.69 | โ | 24.63 / 2.47 / 23.39 |
+| SDXL - txt2img | 1.15 | 1.16 | - | - |
+
+### T4 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 |
+| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 |
+| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 |
+| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 |
+| IF | 5.79 | 5.61 | โ | 7.39 |
+| SDXL - txt2img | 0.288 | 0.289 | - | - |
+
+### T4 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s |
+| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s |
+| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s |
+| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
+| IF * | 1.44 | 1.44 | โ | 1.94 |
+| SDXL - txt2img | OOM | OOM | - | - |
+
+### RTX 3090 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 |
+| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 |
+| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 |
+| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 |
+| IF | 27.08 / 9.07 / 31.23 | 26.75 / 8.92 / 31.47 | โ | 68.08 / 11.16 / 65.29 |
+
+### RTX 3090 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 |
+| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 |
+| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 |
+| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 |
+| IF | 16.81 | 16.62 | โ | 21.57 |
+
+### RTX 3090 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 |
+| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 |
+| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 |
+| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 |
+| IF | 5.01 | 5.00 | โ | 6.33 |
+
+### RTX 4090 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 |
+| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 |
+| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 |
+| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 |
+| IF | 69.71 / 18.78 / 85.49 | 69.13 / 18.80 / 85.56 | โ | 124.60 / 26.37 / 138.79 |
+| SDXL - txt2img | 6.8 | 8.18 | - | - |
+
+### RTX 4090 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 |
+| SD - img2img | 12.61 | 12.79 | 15.35 | 15.66 |
+| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
+| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
+| IF | 31.88 | 31.14 | โ | 43.92 |
+| SDXL - txt2img | 2.19 | 2.35 | - | - |
+
+### RTX 4090 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - no compile** | **torch nightly - no compile** | **torch 2.0 - compile** | **torch nightly - compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 |
+| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 |
+| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 |
+| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 |
+| IF | 9.26 | 9.2 | โ | 13.31 |
+| SDXL - txt2img | 0.52 | 0.53 | - | - |
+
+## Notes
+
+* Follow this [PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
+* For the DeepFloyd IF pipelines with batch sizes > 1, we only used a batch size greater than 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. That means the two upscaling pipelines received a batch size of 1.
+
+*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their help in improving our support of `torch.compile()` in Diffusers.*
diff --git a/docs/source/en/optimization/xformers.md b/docs/source/en/optimization/xformers.md
new file mode 100644
index 0000000..4ef0da9
--- /dev/null
+++ b/docs/source/en/optimization/xformers.md
@@ -0,0 +1,35 @@
+
+
+# xFormers
+
+We recommend [xFormers](https://github.com/facebookresearch/xformers) for both inference and training. In our tests, the optimizations performed in the attention blocks allow for both faster speed and reduced memory consumption.
+
+Install xFormers from `pip`:
+
+```bash
+pip install xformers
+```
+
+
+
+The xFormers `pip` package requires the latest version of PyTorch. If you need to use a previous version of PyTorch, then we recommend [installing xFormers from the source](https://github.com/facebookresearch/xformers#installing-xformers).
+
+
+
+After xFormers is installed, you can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption as shown in this [section](memory#memory-efficient-attention).
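+
+For example (a minimal sketch mirroring the linked section):
+
+```python
+import torch
+from diffusers import DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+).to("cuda")
+
+# enable xFormers memory-efficient attention for every attention block in the pipeline
+pipe.enable_xformers_memory_efficient_attention()
+
+image = pipe("a photo of an astronaut riding a horse on mars").images[0]
+
+# disable it again to fall back to the default attention processor
+# pipe.disable_xformers_memory_efficient_attention()
+```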
+
+
+
+According to this [issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training (fine-tune or DreamBooth) in some GPUs. If you observe this problem, please install a development version as indicated in the issue comments.
+
+
diff --git a/docs/source/en/quicktour.md b/docs/source/en/quicktour.md
new file mode 100644
index 0000000..3cc8567
--- /dev/null
+++ b/docs/source/en/quicktour.md
@@ -0,0 +1,320 @@
+
+
+[[open-in-colab]]
+
+# Quicktour
+
+Diffusion models are trained to denoise random Gaussian noise step-by-step to generate a sample of interest, such as an image or audio. This has sparked a tremendous amount of interest in generative AI, and you have probably seen examples of diffusion-generated images on the internet. ๐งจ Diffusers is a library aimed at making diffusion models widely accessible to everyone.
+
+Whether you're a developer or an everyday user, this quicktour will introduce you to ๐งจ Diffusers and help you get up and generating quickly! There are three main components of the library to know about:
+
+* The [`DiffusionPipeline`] is a high-level end-to-end class designed to rapidly generate samples from pretrained diffusion models for inference.
+* Popular pretrained [model](./api/models) architectures and modules that can be used as building blocks for creating diffusion systems.
+* Many different [schedulers](./api/schedulers/overview) - algorithms that control how noise is added for training, and how to generate denoised images during inference.
+
+The quicktour will show you how to use the [`DiffusionPipeline`] for inference, and then walk you through how to combine a model and scheduler to replicate what's happening inside the [`DiffusionPipeline`].
+
+
+
+The quicktour is a simplified version of the introductory ๐งจ Diffusers [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb) to help you get started quickly. If you want to learn more about ๐งจ Diffusers' goal, design philosophy, and additional details about its core API, check out the notebook!
+
+
+
+Before you begin, make sure you have all the necessary libraries installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install --upgrade diffusers accelerate transformers
+```
+
+- [๐ค Accelerate](https://huggingface.co/docs/accelerate/index) speeds up model loading for inference and training.
+- [๐ค Transformers](https://huggingface.co/docs/transformers/index) is required to run the most popular diffusion models, such as [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview).
+
+## DiffusionPipeline
+
+The [`DiffusionPipeline`] is the easiest way to use a pretrained diffusion system for inference. It is an end-to-end system containing the model and the scheduler. You can use the [`DiffusionPipeline`] out-of-the-box for many tasks. Take a look at the table below for some supported tasks, and for a complete list of supported tasks, check out the [๐งจ Diffusers Summary](./api/pipelines/overview#diffusers-summary) table.
+
+| **Task** | **Description** | **Pipeline** |
+|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|
+| Unconditional Image Generation | generate an image from Gaussian noise | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) |
+| Text-Guided Image Generation | generate an image given a text prompt | [conditional_image_generation](./using-diffusers/conditional_image_generation) |
+| Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | [img2img](./using-diffusers/img2img) |
+| Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | [inpaint](./using-diffusers/inpaint) |
+| Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | [depth2img](./using-diffusers/depth2img) |
+
+Start by creating an instance of a [`DiffusionPipeline`] and specify which pipeline checkpoint you would like to download.
+You can use the [`DiffusionPipeline`] for any [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) stored on the Hugging Face Hub.
+In this quicktour, you'll load the [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint for text-to-image generation.
+
+
+
+For [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) models, please carefully read the [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) first before running the model. ๐งจ Diffusers implements a [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) to prevent offensive or harmful content, but the model's improved image generation capabilities can still produce potentially harmful content.
+
+
+
+Load the model with the [`~DiffusionPipeline.from_pretrained`] method:
+
+```python
+>>> from diffusers import DiffusionPipeline
+
+>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
+```
+
+The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the [`UNet2DConditionModel`] and [`PNDMScheduler`] among other things:
+
+```py
+>>> pipeline
+StableDiffusionPipeline {
+ "_class_name": "StableDiffusionPipeline",
+ "_diffusers_version": "0.21.4",
+ ...,
+ "scheduler": [
+ "diffusers",
+ "PNDMScheduler"
+ ],
+ ...,
+ "unet": [
+ "diffusers",
+ "UNet2DConditionModel"
+ ],
+ "vae": [
+ "diffusers",
+ "AutoencoderKL"
+ ]
+}
+```
+
+We strongly recommend running the pipeline on a GPU because the model consists of roughly 1.4 billion parameters.
+You can move the generator object to a GPU, just like you would in PyTorch:
+
+```python
+>>> pipeline.to("cuda")
+```
+
+Now you can pass a text prompt to the `pipeline` to generate an image, and then access the denoised image. By default, the image output is wrapped in a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object.
+
+```python
+>>> image = pipeline("An image of a squirrel in Picasso style").images[0]
+>>> image
+```
+
+
+
+
+
+Save the image by calling `save`:
+
+```python
+>>> image.save("image_of_squirrel_painting.png")
+```
+
+### Local pipeline
+
+You can also use the pipeline locally. The only difference is you need to download the weights first:
+
+```bash
+!git lfs install
+!git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
+```
+
+Then load the saved weights into the pipeline:
+
+```python
+>>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True)
+```
+
+Now, you can run the pipeline as you would in the section above.
+
+### Swapping schedulers
+
+Different schedulers come with different denoising speeds and quality trade-offs. The best way to find out which one works best for you is to try them out! One of the main features of ๐งจ Diffusers is to allow you to easily switch between schedulers. For example, to replace the default [`PNDMScheduler`] with the [`EulerDiscreteScheduler`], load it with the [`~diffusers.ConfigMixin.from_config`] method:
+
+```py
+>>> from diffusers import EulerDiscreteScheduler
+
+>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
+>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
+```
+
+Try generating an image with the new scheduler and see if you notice a difference!
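+
+For example, here's a minimal sketch (assuming a CUDA GPU is available) that reuses the prompt from earlier with the swapped-in scheduler:
+
+```py
+>>> pipeline.to("cuda")
+>>> image = pipeline("An image of a squirrel in Picasso style").images[0]
+>>> image.save("image_of_squirrel_euler.png")
+```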
+
+In the next section, you'll take a closer look at the components - the model and scheduler - that make up the [`DiffusionPipeline`] and learn how to use these components to generate an image of a cat.
+
+## Models
+
+Most models take a noisy sample, and at each timestep they predict the *noise residual* (other models learn to predict the previous sample directly, or the velocity or [`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)), which is the difference between a less noisy image and the input image. You can mix and match models to create other diffusion systems.
+
+Models are initialized with the [`~ModelMixin.from_pretrained`] method, which also locally caches the model weights so it is faster the next time you load the model. For the quicktour, you'll load the [`UNet2DModel`], a basic unconditional image generation model with a checkpoint trained on cat images:
+
+```py
+>>> from diffusers import UNet2DModel
+
+>>> repo_id = "google/ddpm-cat-256"
+>>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True)
+```
+
+To access the model parameters, call `model.config`:
+
+```py
+>>> model.config
+```
+
+The model configuration is a ๐ง frozen ๐ง dictionary, which means those parameters can't be changed after the model is created. This is intentional and ensures that the parameters used to define the model architecture at the start remain the same, while other parameters can still be adjusted during inference.
+
+Some of the most important parameters are:
+
+* `sample_size`: the height and width dimension of the input sample.
+* `in_channels`: the number of input channels of the input sample.
+* `down_block_types` and `up_block_types`: the type of down- and upsampling blocks used to create the UNet architecture.
+* `block_out_channels`: the number of output channels of the downsampling blocks; also used in reverse order for the number of input channels of the upsampling blocks.
+* `layers_per_block`: the number of ResNet blocks present in each UNet block.
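+
+For example, you can read a couple of these values directly from the frozen config; the values below correspond to the checkpoint loaded above:
+
+```py
+>>> model.config.sample_size, model.config.in_channels
+(256, 3)
+```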
+
+To use the model for inference, create a tensor of random Gaussian noise with the shape of the image you want. It should have a `batch` axis because the model can receive multiple random noises, a `channel` axis corresponding to the number of input channels, and a `sample_size` axis for the height and width of the image:
+
+```py
+>>> import torch
+
+>>> torch.manual_seed(0)
+
+>>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
+>>> noisy_sample.shape
+torch.Size([1, 3, 256, 256])
+```
+
+For inference, pass the noisy image and a `timestep` to the model. The `timestep` indicates how noisy the input image is, with more noise at the beginning and less at the end. This helps the model determine its position in the diffusion process, whether it is closer to the start or the end. Access the model output through its `sample` attribute:
+
+```py
+>>> with torch.no_grad():
+...     noisy_residual = model(sample=noisy_sample, timestep=2).sample
+```
+
+To generate actual examples though, you'll need a scheduler to guide the denoising process. In the next section, you'll learn how to couple a model with a scheduler.
+
+## Schedulers
+
+Schedulers manage going from a noisy sample to a less noisy sample given the model output - in this case, it is the `noisy_residual`.
+
+
+
+๐งจ Diffusers is a toolbox for building diffusion systems. While the [`DiffusionPipeline`] is a convenient way to get started with a pre-built diffusion system, you can also choose your own model and scheduler components separately to build a custom diffusion system.
+
+
+
+For the quicktour, you'll instantiate the [`DDPMScheduler`] with its [`~diffusers.SchedulerMixin.from_pretrained`] method:
+
+```py
+>>> from diffusers import DDPMScheduler
+
+>>> scheduler = DDPMScheduler.from_pretrained(repo_id)
+>>> scheduler
+DDPMScheduler {
+ "_class_name": "DDPMScheduler",
+ "_diffusers_version": "0.21.4",
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "beta_start": 0.0001,
+ "clip_sample": true,
+ "clip_sample_range": 1.0,
+ "dynamic_thresholding_ratio": 0.995,
+ "num_train_timesteps": 1000,
+ "prediction_type": "epsilon",
+ "sample_max_value": 1.0,
+ "steps_offset": 0,
+ "thresholding": false,
+ "timestep_spacing": "leading",
+ "trained_betas": null,
+ "variance_type": "fixed_small"
+}
+```
+
+
+
+๐ก Unlike a model, a scheduler does not have trainable weights and is parameter-free!
+
+
+
+Some of the most important parameters are:
+
+* `num_train_timesteps`: the length of the denoising process or, in other words, the number of timesteps required to process random Gaussian noise into a data sample.
+* `beta_schedule`: the type of noise schedule to use for inference and training.
+* `beta_start` and `beta_end`: the start and end noise values for the noise schedule.
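+
+These values can also be read from the scheduler's config, for example (matching the configuration printed above):
+
+```py
+>>> scheduler.config.num_train_timesteps, scheduler.config.beta_start, scheduler.config.beta_end
+(1000, 0.0001, 0.02)
+```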
+
+To predict a slightly less noisy image, pass the following to the scheduler's [`~diffusers.DDPMScheduler.step`] method: model output, `timestep`, and current `sample`.
+
+```py
+>>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample
+>>> less_noisy_sample.shape
+torch.Size([1, 3, 256, 256])
+```
+
+The `less_noisy_sample` can be passed to the next `timestep` where it'll get even less noisy! Let's bring it all together now and visualize the entire denoising process.
+
+First, create a function that postprocesses and displays the denoised image as a `PIL.Image`:
+
+```py
+>>> import PIL.Image
+>>> import numpy as np
+
+
+>>> def display_sample(sample, i):
+...     image_processed = sample.cpu().permute(0, 2, 3, 1)
+...     image_processed = (image_processed + 1.0) * 127.5
+...     image_processed = image_processed.numpy().astype(np.uint8)
+
+...     image_pil = PIL.Image.fromarray(image_processed[0])
+...     display(f"Image at step {i}")
+...     display(image_pil)
+```
+
+To speed up the denoising process, move the input and model to a GPU:
+
+```py
+>>> model.to("cuda")
+>>> noisy_sample = noisy_sample.to("cuda")
+```
+
+Now create a denoising loop that predicts the residual of the less noisy sample, and computes the less noisy sample with the scheduler:
+
+```py
+>>> import tqdm
+
+>>> sample = noisy_sample
+
+>>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
+...     # 1. predict noise residual
+...     with torch.no_grad():
+...         residual = model(sample, t).sample
+
+...     # 2. compute less noisy image and set x_t -> x_t-1
+...     sample = scheduler.step(residual, t, sample).prev_sample
+
+...     # 3. optionally look at image
+...     if (i + 1) % 50 == 0:
+...         display_sample(sample, i + 1)
+```
+
+Sit back and watch as a cat is generated from nothing but noise! ๐ป
+
+
+
+
+
+## Next steps
+
+Hopefully, you generated some cool images with ๐งจ Diffusers in this quicktour! For your next steps, you can:
+
+* Train or finetune a model to generate your own images in the [training](./tutorials/basic_training) tutorial.
+* See example official and community [training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples) for a variety of use cases.
+* Learn more about loading, accessing, changing, and comparing schedulers in the [Using different Schedulers](./using-diffusers/schedulers) guide.
+* Explore prompt engineering, speed and memory optimizations, and tips and tricks for generating higher-quality images with the [Stable Diffusion](./stable_diffusion) guide.
+* Dive deeper into speeding up ๐งจ Diffusers with guides on [optimized PyTorch on a GPU](./optimization/fp16), and inference guides for running [Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps) and [ONNX Runtime](./optimization/onnx).
diff --git a/docs/source/en/stable_diffusion.md b/docs/source/en/stable_diffusion.md
new file mode 100644
index 0000000..877a4ac
--- /dev/null
+++ b/docs/source/en/stable_diffusion.md
@@ -0,0 +1,261 @@
+
+
+# Effective and efficient diffusion
+
+[[open-in-colab]]
+
+Getting the [`DiffusionPipeline`] to generate images in a certain style or include what you want can be tricky. Oftentimes, you have to run the [`DiffusionPipeline`] several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again.
+
+This is why it's important to get the most *computational* (speed) and *memory* (GPU vRAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster.
+
+This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`].
+
+Begin by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model:
+
+```python
+from diffusers import DiffusionPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True)
+```
+
+The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt:
+
+```python
+prompt = "portrait photo of a old warrior chief"
+```
+
+## Speed
+
+
+
+๐ก If you don't have access to a GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)!
+
+
+
+One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any PyTorch module:
+
+```python
+pipeline = pipeline.to("cuda")
+```
+
+To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility):
+
+```python
+import torch
+
+generator = torch.Generator("cuda").manual_seed(0)
+```
+
+Now you can generate an image:
+
+```python
+image = pipeline(prompt, generator=generator).images[0]
+image
+```
+
+
+
+
+
+This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps.
+
+Let's start by loading the model in `float16` and generate an image:
+
+```python
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True)
+pipeline = pipeline.to("cuda")
+generator = torch.Generator("cuda").manual_seed(0)
+image = pipeline(prompt, generator=generator).images[0]
+image
+```
+
+
+
+
+
+This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before!
+
+
+
+๐ก We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality.
+
+
+
+Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`] by checking the `compatibles` attribute:
+
+```python
+pipeline.scheduler.compatibles
+[
+ diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler,
+ diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler,
+ diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler,
+ diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler,
+ diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler,
+ diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler,
+ diffusers.schedulers.scheduling_ddpm.DDPMScheduler,
+ diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler,
+ diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler,
+ diffusers.utils.dummy_torch_and_torchsde_objects.DPMSolverSDEScheduler,
+ diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler,
+ diffusers.schedulers.scheduling_pndm.PNDMScheduler,
+ diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler,
+ diffusers.schedulers.scheduling_ddim.DDIMScheduler,
+]
+```
+
+The Stable Diffusion model uses the [`PNDMScheduler`] by default, which usually requires ~50 inference steps, but more performant schedulers like the [`DPMSolverMultistepScheduler`] require only ~20 or 25 inference steps. Use the [`~ConfigMixin.from_config`] method to load a new scheduler:
+
+```python
+from diffusers import DPMSolverMultistepScheduler
+
+pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
+```
+
+Now set the `num_inference_steps` to 20:
+
+```python
+generator = torch.Generator("cuda").manual_seed(0)
+image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0]
+image
+```
+
+
+
+
+
+Great, you've managed to cut the inference time to just 4 seconds! โก๏ธ
+
+## Memory
+
+The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try out different batch sizes until you get an `OutOfMemoryError` (OOM).
+
+Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result.
+
+```python
+def get_inputs(batch_size=1):
+ generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)]
+ prompts = batch_size * [prompt]
+ num_inference_steps = 20
+
+ return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
+```
+
+Start with `batch_size=4` and see how much memory you've consumed:
+
+```python
+from diffusers.utils import make_image_grid
+
+images = pipeline(**get_inputs(batch_size=4)).images
+make_image_grid(images, 2, 2)
+```
+
+Unless you have a GPU with more vRAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function:
+
+```python
+pipeline.enable_attention_slicing()
+```
+
+Now try increasing the `batch_size` to 8!
+
+```python
+images = pipeline(**get_inputs(batch_size=8)).images
+make_image_grid(images, rows=2, cols=4)
+```
+
+
+
+
+
+Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality.
+
+## Quality
+
+In the last two sections, you learned how to optimize the speed of your pipeline by using `fp16`, reducing the number of inference steps by using a more performant scheduler, and enabling attention slicing to reduce memory consumption. Now you're going to focus on how to improve the quality of generated images.
+
+### Better checkpoints
+
+The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official launch, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll still have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results.
+
+As the field grows, there are more and more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find one you're interested in!
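+
+As a quick, hedged sketch of the negative prompt idea (the Stable Diffusion pipeline call accepts a `negative_prompt` argument; the text below is only an illustration):
+
+```python
+generator = torch.Generator("cuda").manual_seed(0)
+image = pipeline(
+    prompt,
+    negative_prompt="blurry, low quality, deformed",  # illustrative negative prompt; tune it for your own use case
+    generator=generator,
+    num_inference_steps=20,
+).images[0]
+image
+```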
+
+### Better pipeline components
+
+You can also try replacing the current pipeline components with a newer version. Let's try loading the latest [autoencoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images:
+
+```python
+from diffusers import AutoencoderKL
+
+vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda")
+pipeline.vae = vae
+images = pipeline(**get_inputs(batch_size=8)).images
+make_image_grid(images, rows=2, cols=4)
+```
+
+
+
+
+
+### Better prompt engineering
+
+The text prompt you use to generate an image is super important, so much so that crafting it has its own name: *prompt engineering*. Some considerations to keep in mind during prompt engineering are:
+
+- How are images similar to the one I want to generate described on the internet?
+- What additional detail can I give that steers the model towards the style I want?
+
+With this in mind, let's improve the prompt to include color and higher quality details:
+
+```python
+prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes"
+prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta"
+```
+
+Generate a batch of images with the new prompt:
+
+```python
+images = pipeline(**get_inputs(batch_size=8)).images
+make_image_grid(images, rows=2, cols=4)
+```
+
+
+
+
+
+Pretty impressive! Let's tweak the second image - corresponding to the `Generator` with a seed of `1` - a bit more by adding some text about the age of the subject:
+
+```python
+prompts = [
+ "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+]
+
+generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
+images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
+make_image_grid(images, 2, 2)
+```
+
+
+
+
+
+## Next steps
+
+In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources:
+
+- Learn how [PyTorch 2.0](./optimization/torch2.0) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed (a minimal sketch follows this list). On an A100 GPU, inference can be up to 50% faster!
+- If you can't use PyTorch 2, we recommend you install [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption.
+- Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16).
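+
+As a minimal, hedged sketch of the `torch.compile` idea (this assumes PyTorch 2.0+; the first call pays a one-time compilation cost):
+
+```python
+# compile the UNet, which dominates inference time; subsequent pipeline calls reuse the compiled graph
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+
+generator = torch.Generator("cuda").manual_seed(0)
+image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0]
+```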
diff --git a/docs/source/en/training/adapt_a_model.md b/docs/source/en/training/adapt_a_model.md
new file mode 100644
index 0000000..57bc1a3
--- /dev/null
+++ b/docs/source/en/training/adapt_a_model.md
@@ -0,0 +1,47 @@
+# Adapt a model to a new task
+
+Many diffusion systems share the same components, allowing you to adapt a pretrained model for one task to an entirely different task.
+
+This guide will show you how to adapt a pretrained text-to-image model for inpainting by initializing and modifying the architecture of a pretrained [`UNet2DConditionModel`].
+
+## Configure UNet2DConditionModel parameters
+
+A [`UNet2DConditionModel`] by default accepts 4 channels in the [input sample](https://huggingface.co/docs/diffusers/v0.16.0/en/api/models#diffusers.UNet2DConditionModel.in_channels). For example, load a pretrained text-to-image model like [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) and take a look at the number of `in_channels`:
+
+```py
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
+pipeline.unet.config["in_channels"]
+4
+```
+
+Inpainting requires 9 channels in the input sample. You can check this value in a pretrained inpainting model like [`runwayml/stable-diffusion-inpainting`](https://huggingface.co/runwayml/stable-diffusion-inpainting):
+
+```py
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-inpainting", use_safetensors=True)
+pipeline.unet.config["in_channels"]
+9
+```
+
+To adapt your text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9.
+
+Initialize a [`UNet2DConditionModel`] with the pretrained text-to-image model weights, and change `in_channels` to 9. Changing the number of `in_channels` means you need to set `ignore_mismatched_sizes=True` and `low_cpu_mem_usage=False` to avoid a size mismatch error because the shape is different now.
+
+```py
+from diffusers import UNet2DConditionModel
+
+model_id = "runwayml/stable-diffusion-v1-5"
+unet = UNet2DConditionModel.from_pretrained(
+ model_id,
+ subfolder="unet",
+ in_channels=9,
+ low_cpu_mem_usage=False,
+ ignore_mismatched_sizes=True,
+ use_safetensors=True,
+)
+```
+
+The pretrained weights of the other components from the text-to-image model are initialized from their checkpoints, but the input channel weights (`conv_in.weight`) of the `unet` are randomly initialized. It is important to finetune the model for inpainting because otherwise the model returns noise.
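+
+As a quick sanity check (a minimal sketch), the adapted UNet should now report 9 input channels:
+
+```py
+unet.config["in_channels"]
+9
+```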
diff --git a/docs/source/en/training/controlnet.md b/docs/source/en/training/controlnet.md
new file mode 100644
index 0000000..00cc626
--- /dev/null
+++ b/docs/source/en/training/controlnet.md
@@ -0,0 +1,366 @@
+
+
+# ControlNet
+
+[ControlNet](https://hf.co/papers/2302.05543) models are adapters trained on top of another pretrained model. They allow for a greater degree of control over image generation by conditioning the model with an additional input image. The input image can be a canny edge, depth map, human pose, and many more.
+
+If you're training on a GPU with limited vRAM, you should try enabling the `gradient_checkpointing`, `gradient_accumulation_steps`, and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing or xFormers. You should have a GPU with >30GB of memory if you want to train faster with Flax.
+
+This guide will explore the [train_controlnet.py](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py) training script to help you become familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:
+
+
+
+```bash
+cd examples/controlnet
+pip install -r requirements.txt
+```
+
+
+
+If you have access to a TPU, the Flax training script runs even faster! Let's run the training script on the [Google Cloud TPU VM](https://cloud.google.com/tpu/docs/run-calculation-jax). Create a single TPU v4-8 VM and connect to it:
+
+```bash
+ZONE=us-central2-b
+TPU_TYPE=v4-8
+VM_NAME=hg_flax
+
+gcloud alpha compute tpus tpu-vm create $VM_NAME \
+ --zone $ZONE \
+ --accelerator-type $TPU_TYPE \
+ --version tpu-vm-v4-base
+
+gcloud alpha compute tpus tpu-vm ssh $VM_NAME --zone $ZONE
+```
+
+Install JAX 0.4.5:
+
+```bash
+pip install "jax[tpu]==0.4.5" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
+```
+
+Then install the required dependencies for the Flax script:
+
+```bash
+cd examples/controlnet
+pip install -r requirements_flax.txt
+```
+
+
+
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py) and let us know if you have any questions or concerns.
+
+
+
+## Script parameters
+
+The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L231) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+
+For example, to speed up training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command:
+
+```bash
+accelerate launch train_controlnet.py \
+ --mixed_precision="fp16"
+```
+
+Many of the basic and important parameters are described in the [Text-to-image](text2image#script-parameters) training guide, so this guide just focuses on the relevant parameters for ControlNet:
+
+- `--max_train_samples`: the number of training samples; this can be lowered for faster training, but if you want to stream really large datasets, you'll need to include this parameter and the `--streaming` parameter in your training command
+- `--gradient_accumulation_steps`: number of update steps to accumulate before the backward pass; this allows you to train with a bigger batch size than your GPU memory can typically handle
+
+### Min-SNR weighting
+
+The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.
+
+Add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
+
+```bash
+accelerate launch train_controlnet.py \
+ --snr_gamma=5.0
+```
+
+## Training script
+
+As with the script parameters, a general walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. This guide instead takes a look at the relevant parts of the ControlNet script.
+
+The training script has a [`make_train_dataset`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L582) function for preprocessing the dataset with image transforms and caption tokenization. You'll see that in addition to the usual caption tokenization and image transforms, the script also includes transforms for the conditioning image.
+
+
+
+If you're streaming a dataset on a TPU, performance may be bottlenecked by the ๐ค Datasets library which is not optimized for images. To ensure maximum throughput, you're encouraged to explore other dataset formats like [WebDataset](https://webdataset.github.io/webdataset/), [TorchData](https://github.com/pytorch/data), and [TensorFlow Datasets](https://www.tensorflow.org/datasets/tfless_tfds).
+
+
+
+```py
+conditioning_image_transforms = transforms.Compose(
+ [
+ transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
+ transforms.CenterCrop(args.resolution),
+ transforms.ToTensor(),
+ ]
+)
+```
+
+Within the [`main()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L713) function, you'll find the code for loading the tokenizer, text encoder, scheduler and models. This is also where the ControlNet model is loaded either from existing weights or randomly initialized from a UNet:
+
+```py
+if args.controlnet_model_name_or_path:
+ logger.info("Loading existing controlnet weights")
+ controlnet = ControlNetModel.from_pretrained(args.controlnet_model_name_or_path)
+else:
+ logger.info("Initializing controlnet weights from unet")
+ controlnet = ControlNetModel.from_unet(unet)
+```
+
+The [optimizer](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L871) is set up to update the ControlNet parameters:
+
+```py
+params_to_optimize = controlnet.parameters()
+optimizer = optimizer_class(
+ params_to_optimize,
+ lr=args.learning_rate,
+ betas=(args.adam_beta1, args.adam_beta2),
+ weight_decay=args.adam_weight_decay,
+ eps=args.adam_epsilon,
+)
+```
+
+Finally, in the [training loop](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L943), the conditioning text embeddings and image are passed to the down and mid-blocks of the ControlNet model:
+
+```py
+encoder_hidden_states = text_encoder(batch["input_ids"])[0]
+controlnet_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype)
+
+down_block_res_samples, mid_block_res_sample = controlnet(
+ noisy_latents,
+ timesteps,
+ encoder_hidden_states=encoder_hidden_states,
+ controlnet_cond=controlnet_image,
+ return_dict=False,
+)
+```
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Now you're ready to launch the training script! ๐
+
+This guide uses the [fusing/fill50k](https://huggingface.co/datasets/fusing/fill50k) dataset, but remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide).
+
+Set the environment variable `MODEL_DIR` to a model id on the Hub or a path to a local model and `OUTPUT_DIR` to where you want to save the model.
+
+Download the following images to condition your training with:
+
+```bash
+wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
+wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png
+```
+
+One more thing before you launch the script! Depending on the GPU you have, you may need to enable certain optimizations to train a ControlNet. The default configuration in this script requires ~38GB of vRAM. If you're training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command.
+
+
+
+
+On a 16GB GPU, you can use bitsandbytes 8-bit optimizer and gradient checkpointing to optimize your training run. Install bitsandbytes:
+
+```bash
+pip install bitsandbytes
+```
+
+Then, add the following parameter to your training command:
+
+```bash
+accelerate launch train_controlnet.py \
+ --gradient_checkpointing \
+ --use_8bit_adam \
+```
+
+
+
+
+On a 12GB GPU, you'll need the bitsandbytes 8-bit optimizer, gradient checkpointing, xFormers, and to set the gradients to `None` instead of zero to reduce your memory usage.
+
+```bash
+accelerate launch train_controlnet.py \
+ --use_8bit_adam \
+ --gradient_checkpointing \
+ --enable_xformers_memory_efficient_attention \
+ --set_grads_to_none \
+```
+
+
+
+
+On an 8GB GPU, you'll need to use [DeepSpeed](https://www.deepspeed.ai/) to offload some of the tensors from the vRAM to either the CPU or NVMe to allow training with less GPU memory.
+
+Run the following command to configure your ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+During configuration, confirm that you want to use DeepSpeed stage 2. Now it should be possible to train on under 8GB vRAM by combining DeepSpeed stage 2, fp16 mixed precision, and offloading the model parameters and the optimizer state to the CPU. The drawback is that this requires more system RAM (~25 GB). See the [DeepSpeed documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more configuration options. Your configuration file should look something like:
+
+```bash
+compute_environment: LOCAL_MACHINE
+deepspeed_config:
+ gradient_accumulation_steps: 4
+ offload_optimizer_device: cpu
+ offload_param_device: cpu
+ zero3_init_flag: false
+ zero_stage: 2
+distributed_type: DEEPSPEED
+```
+
+You should also change the default Adam optimizer to DeepSpeed's optimized version of Adam [`deepspeed.ops.adam.DeepSpeedCPUAdam`](https://deepspeed.readthedocs.io/en/latest/optimizers.html#adam-cpu) for a substantial speedup. Enabling `DeepSpeedCPUAdam` requires your system's CUDA toolchain version to be the same as the one installed with PyTorch.
+
+bitsandbytes 8-bit optimizers don't seem to be compatible with DeepSpeed at the moment.
+
+That's it! You don't need to add any additional parameters to your training command.
+
+
+
+
+
+
+
+```bash
+export MODEL_DIR="runwayml/stable-diffusion-v1-5"
+export OUTPUT_DIR="path/to/save/model"
+
+accelerate launch train_controlnet.py \
+ --pretrained_model_name_or_path=$MODEL_DIR \
+ --output_dir=$OUTPUT_DIR \
+ --dataset_name=fusing/fill50k \
+ --resolution=512 \
+ --learning_rate=1e-5 \
+ --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
+ --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
+ --train_batch_size=1 \
+ --gradient_accumulation_steps=4 \
+ --push_to_hub
+```
+
+
+
+
+With Flax, you can [profile your code](https://jax.readthedocs.io/en/latest/profiling.html) by adding the `--profile_steps=5` parameter to your training command. Install the TensorBoard profile plugin:
+
+```bash
+pip install tensorflow tensorboard-plugin-profile
+tensorboard --logdir runs/fill-circle-100steps-20230411_165612/
+```
+
+Then you can inspect the profile at [http://localhost:6006/#profile](http://localhost:6006/#profile).
+
+
+
+If you run into version conflicts with the plugin, try uninstalling and reinstalling all versions of TensorFlow and Tensorboard. The debugging functionality of the profile plugin is still experimental, and not all views are fully functional. The `trace_viewer` cuts off events after 1M, which can result in all your device traces getting lost if for example, you profile the compilation step by accident.
+
+
+
+```bash
+python3 train_controlnet_flax.py \
+ --pretrained_model_name_or_path=$MODEL_DIR \
+ --output_dir=$OUTPUT_DIR \
+ --dataset_name=fusing/fill50k \
+ --resolution=512 \
+ --learning_rate=1e-5 \
+ --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
+ --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
+ --validation_steps=1000 \
+ --train_batch_size=2 \
+ --revision="non-ema" \
+ --from_pt \
+ --report_to="wandb" \
+ --tracker_project_name=$HUB_MODEL_ID \
+ --num_train_epochs=11 \
+ --push_to_hub \
+ --hub_model_id=$HUB_MODEL_ID
+```
+
+
+
+
+Once training is complete, you can use your newly trained model for inference!
+
+```py
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+from diffusers.utils import load_image
+import torch
+
+controlnet = ControlNetModel.from_pretrained("path/to/controlnet", torch_dtype=torch.float16)
+pipeline = StableDiffusionControlNetPipeline.from_pretrained(
+ "path/to/base/model", controlnet=controlnet, torch_dtype=torch.float16
+).to("cuda")
+
+control_image = load_image("./conditioning_image_1.png")
+prompt = "pale golden rod circle with old lace background"
+
+generator = torch.manual_seed(0)
+image = pipeline(prompt, num_inference_steps=20, generator=generator, image=control_image).images[0]
+image.save("./output.png")
+```
+
+## Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text-encoder to its architecture. Use the [`train_controlnet_sdxl.py`](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_sdxl.py) script to train a ControlNet adapter for the SDXL model.
+
+The SDXL training script is discussed in more detail in the [SDXL training](sdxl) guide.
+
+## Next steps
+
+Congratulations on training your own ControlNet! To learn more about how to use your new model, the following guides may be helpful:
+
+- Learn how to [use a ControlNet](../using-diffusers/controlnet) for inference on a variety of tasks.
\ No newline at end of file
diff --git a/docs/source/en/training/create_dataset.md b/docs/source/en/training/create_dataset.md
new file mode 100644
index 0000000..f215d3e
--- /dev/null
+++ b/docs/source/en/training/create_dataset.md
@@ -0,0 +1,90 @@
+# Create a dataset for training
+
+There are many datasets on the [Hub](https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=downloads) to train a model on, but if you can't find one you're interested in or want to use your own, you can create a dataset with the ๐ค [Datasets](https://huggingface.co/docs/datasets) library. The dataset structure depends on the task you want to train your model on. The most basic dataset structure is a directory of images for tasks like unconditional image generation. Another dataset structure may be a directory of images and a text file containing their corresponding text captions for tasks like text-to-image generation.
+
+This guide will show you two ways to create a dataset to finetune on:
+
+- provide a folder of images to the `--train_data_dir` argument
+- upload a dataset to the Hub and pass the dataset repository id to the `--dataset_name` argument
+
+
+
+๐ก Learn more about how to create an image dataset for training in the [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset) guide.
+
+
+
+## Provide a dataset as a folder
+
+For unconditional generation, you can provide your own dataset as a folder of images. The training script uses the [`ImageFolder`](https://huggingface.co/docs/datasets/en/image_dataset#imagefolder) builder from ๐ค Datasets to automatically build a dataset from the folder. Your directory structure should look like:
+
+```bash
+data_dir/xxx.png
+data_dir/xxy.png
+data_dir/[...]/xxz.png
+```
+
+Pass the path to the dataset directory to the `--train_data_dir` argument, and then you can start training:
+
+```bash
+accelerate launch train_unconditional.py \
+ --train_data_dir <path-to-train-directory> \
+ <other-arguments>
+```
+
+## Upload your data to the Hub
+
+
+
+๐ก For more details and context about creating and uploading a dataset to the Hub, take a look at the [Image search with ๐ค Datasets](https://huggingface.co/blog/image-search-datasets) post.
+
+
+
+Start by creating a dataset with the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature, which creates an `image` column containing the PIL-encoded images.
+
+You can use the `data_dir` or `data_files` parameters to specify the location of the dataset. The `data_files` parameter supports mapping specific files to dataset splits like `train` or `test`:
+
+```python
+from datasets import load_dataset
+
+# example 1: local folder
+dataset = load_dataset("imagefolder", data_dir="path_to_your_folder")
+
+# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd)
+dataset = load_dataset("imagefolder", data_files="path_to_zip_file")
+
+# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd)
+dataset = load_dataset(
+ "imagefolder",
+ data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip",
+)
+
+# example 4: providing several splits
+dataset = load_dataset(
+ "imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]}
+)
+```
+
+Then use the [`~datasets.Dataset.push_to_hub`] method to upload the dataset to the Hub:
+
+```python
+# assuming you have run the huggingface-cli login command in a terminal
+dataset.push_to_hub("name_of_your_dataset")
+
+# if you want to push to a private repo, simply pass private=True:
+dataset.push_to_hub("name_of_your_dataset", private=True)
+```
+
+Now the dataset is available for training by passing the dataset name to the `--dataset_name` argument:
+
+```bash
+accelerate launch --mixed_precision="fp16" train_text_to_image.py \
+ --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
+ --dataset_name="name_of_your_dataset" \
+ <other-arguments>
+```
+
+## Next steps
+
+Now that you've created a dataset, you can plug it into the `train_data_dir` (if your dataset is local) or `dataset_name` (if your dataset is on the Hub) arguments of a training script.
+
+For your next steps, feel free to try and use your dataset to train a model for [unconditional generation](unconditional_training) or [text-to-image generation](text2image)!
\ No newline at end of file
diff --git a/docs/source/en/training/custom_diffusion.md b/docs/source/en/training/custom_diffusion.md
new file mode 100644
index 0000000..4e63498
--- /dev/null
+++ b/docs/source/en/training/custom_diffusion.md
@@ -0,0 +1,363 @@
+
+
+# Custom Diffusion
+
+[Custom Diffusion](https://huggingface.co/papers/2212.04488) is a training technique for personalizing image generation models. Like Textual Inversion, DreamBooth, and LoRA, Custom Diffusion only requires a few (~4-5) example images. This technique works by only training weights in the cross-attention layers, and it uses a special word to represent the newly learned concept. Custom Diffusion is unique because it can also learn multiple concepts at the same time.
+
+If you're training on a GPU with limited vRAM, you should try enabling xFormers with `--enable_xformers_memory_efficient_attention` for faster training with lower vRAM requirements (16GB). To save even more memory, add `--set_grads_to_none` in the training argument to set the gradients to `None` instead of zero (this option can cause some issues, so if you experience any, try removing this parameter).
+
+This guide will explore the [train_custom_diffusion.py](https://github.com/huggingface/diffusers/blob/main/examples/custom_diffusion/train_custom_diffusion.py) script to help you become more familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Navigate to the example folder with the training script and install the required dependencies:
+
+```bash
+cd examples/custom_diffusion
+pip install -r requirements.txt
+pip install clip-retrieval
+```
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/custom_diffusion/train_custom_diffusion.py) and let us know if you have any questions or concerns.
+
+
+
+## Script parameters
+
+The training script contains all the parameters to help you customize your training run. These are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L319) function. The function comes with default values, but you can also set your own values in the training command if you'd like.
+
+For example, to change the resolution of the input image:
+
+```bash
+accelerate launch train_custom_diffusion.py \
+ --resolution=256
+```
+
+Many of the basic parameters are described in the [DreamBooth](dreambooth#script-parameters) training guide, so this guide focuses on the parameters unique to Custom Diffusion:
+
+- `--freeze_model`: freezes the key and value parameters in the cross-attention layer; the default is `crossattn_kv`, but you can set it to `crossattn` to train all the parameters in the cross-attention layer
+- `--concepts_list`: to learn multiple concepts, provide a path to a JSON file containing the concepts
+- `--modifier_token`: a special word used to represent the learned concept
+- `--initializer_token`: a token whose embedding is used to initialize the `modifier_token` embeddings
+
+### Prior preservation loss
+
+Prior preservation loss is a method that uses a model's own generated samples to help it learn how to generate more diverse images. Because these generated sample images belong to the same class as the images you provided, they help the model retain what it has learned about the class and how it can use what it already knows about the class to make new compositions.
+
+Many of the parameters for prior preservation loss are described in the [DreamBooth](dreambooth#prior-preservation-loss) training guide.
+
+### Regularization
+
+Custom Diffusion trains on the target images together with a small set of real images to prevent overfitting. As you can imagine, overfitting is easy when you're only training on a few images! Download 200 real images with `clip_retrieval`. The `class_prompt` should be the same category as the target images. These images are stored in `class_data_dir`.
+
+```bash
+python retrieve.py --class_prompt cat --class_data_dir real_reg/samples_cat --num_class_images 200
+```
+
+To enable regularization, add the following parameters:
+
+- `--with_prior_preservation`: whether to use prior preservation loss
+- `--prior_loss_weight`: controls the influence of the prior preservation loss on the model
+- `--real_prior`: whether to use a small set of real images to prevent overfitting
+
+```bash
+accelerate launch train_custom_diffusion.py \
+ --with_prior_preservation \
+ --prior_loss_weight=1.0 \
+ --class_data_dir="./real_reg/samples_cat" \
+ --class_prompt="cat" \
+ --real_prior=True \
+```
+
+## Training script
+
+
+
+A lot of the code in the Custom Diffusion training script is similar to the [DreamBooth](dreambooth#training-script) script. This guide instead focuses on the code that is relevant to Custom Diffusion.
+
+
+
+The Custom Diffusion training script has two dataset classes:
+
+- [`CustomDiffusionDataset`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L165): preprocesses the images, class images, and prompts for training
+- [`PromptDataset`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L148): prepares the prompts for generating class images
+
+Next, the `modifier_token` is [added to the tokenizer](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L811), converted to token ids, and the token embeddings are resized to account for the new `modifier_token`. Then the `modifier_token` embeddings are initialized with the embeddings of the `initializer_token`. All parameters in the text encoder are frozen, except for the token embeddings since this is what the model is trying to learn to associate with the concepts.
+
+```py
+params_to_freeze = itertools.chain(
+ text_encoder.text_model.encoder.parameters(),
+ text_encoder.text_model.final_layer_norm.parameters(),
+ text_encoder.text_model.embeddings.position_embedding.parameters(),
+)
+freeze_params(params_to_freeze)
+```
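+
+The earlier steps from that paragraph (adding the `modifier_token`, resizing the embedding matrix, and copying the initializer embedding into the new slot) look roughly like the following simplified sketch; the token strings are illustrative and the actual script also handles multiple concepts:
+
+```py
+# simplified sketch, not the exact script code
+tokenizer.add_tokens(["<new1>"])
+modifier_token_id = tokenizer.convert_tokens_to_ids("<new1>")
+initializer_token_id = tokenizer.encode("cat", add_special_tokens=False)[0]  # "cat" is an illustrative initializer token
+
+# make room for the new token and initialize its embedding from the initializer token
+text_encoder.resize_token_embeddings(len(tokenizer))
+token_embeds = text_encoder.get_input_embeddings().weight.data
+token_embeds[modifier_token_id] = token_embeds[initializer_token_id]
+```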
+
+Now you'll need to add the [Custom Diffusion weights](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L911C3-L911C3) to the attention layers. This is a really important step for getting the shape and size of the attention weights correct, and for setting the appropriate number of attention processors in each UNet block.
+
+```py
+st = unet.state_dict()
+for name, _ in unet.attn_processors.items():
+    cross_attention_dim = None if name.endswith("attn1.processor") else unet.config.cross_attention_dim
+    if name.startswith("mid_block"):
+        hidden_size = unet.config.block_out_channels[-1]
+    elif name.startswith("up_blocks"):
+        block_id = int(name[len("up_blocks.")])
+        hidden_size = list(reversed(unet.config.block_out_channels))[block_id]
+    elif name.startswith("down_blocks"):
+        block_id = int(name[len("down_blocks.")])
+        hidden_size = unet.config.block_out_channels[block_id]
+    layer_name = name.split(".processor")[0]
+    weights = {
+        "to_k_custom_diffusion.weight": st[layer_name + ".to_k.weight"],
+        "to_v_custom_diffusion.weight": st[layer_name + ".to_v.weight"],
+    }
+    if train_q_out:
+        weights["to_q_custom_diffusion.weight"] = st[layer_name + ".to_q.weight"]
+        weights["to_out_custom_diffusion.0.weight"] = st[layer_name + ".to_out.0.weight"]
+        weights["to_out_custom_diffusion.0.bias"] = st[layer_name + ".to_out.0.bias"]
+    if cross_attention_dim is not None:
+        custom_diffusion_attn_procs[name] = attention_class(
+            train_kv=train_kv,
+            train_q_out=train_q_out,
+            hidden_size=hidden_size,
+            cross_attention_dim=cross_attention_dim,
+        ).to(unet.device)
+        custom_diffusion_attn_procs[name].load_state_dict(weights)
+    else:
+        custom_diffusion_attn_procs[name] = attention_class(
+            train_kv=False,
+            train_q_out=False,
+            hidden_size=hidden_size,
+            cross_attention_dim=cross_attention_dim,
+        )
+del st
+unet.set_attn_processor(custom_diffusion_attn_procs)
+custom_diffusion_layers = AttnProcsLayers(unet.attn_processors)
+```
+
+The [optimizer](https://github.com/huggingface/diffusers/blob/84cd9e8d01adb47f046b1ee449fc76a0c32dc4e2/examples/custom_diffusion/train_custom_diffusion.py#L982) is initialized to update the cross-attention layer parameters:
+
+```py
+optimizer = optimizer_class(
+ itertools.chain(text_encoder.get_input_embeddings().parameters(), custom_diffusion_layers.parameters())
+ if args.modifier_token is not None
+ else custom_diffusion_layers.parameters(),
+ lr=args.learning_rate,
+ betas=(args.adam_beta1, args.adam_beta2),
+ weight_decay=args.adam_weight_decay,
+ eps=args.adam_epsilon,
+)
+```
+
+In the [training loop](https://github.com/huggingface/diffusers/blob/84cd9e8d01adb47f046b1ee449fc76a0c32dc4e2/examples/custom_diffusion/train_custom_diffusion.py#L1048), it is important to only update the embeddings for the concept you're trying to learn. This means setting the gradients of all the other token embeddings to zero:
+
+```py
+if args.modifier_token is not None:
+    if accelerator.num_processes > 1:
+        grads_text_encoder = text_encoder.module.get_input_embeddings().weight.grad
+    else:
+        grads_text_encoder = text_encoder.get_input_embeddings().weight.grad
+    index_grads_to_zero = torch.arange(len(tokenizer)) != modifier_token_id[0]
+    for i in range(len(modifier_token_id[1:])):
+        index_grads_to_zero = index_grads_to_zero & (
+            torch.arange(len(tokenizer)) != modifier_token_id[i]
+        )
+    grads_text_encoder.data[index_grads_to_zero, :] = grads_text_encoder.data[
+        index_grads_to_zero, :
+    ].fill_(0)
+```
+
+## Launch the script
+
+Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! ๐
+
+In this guide, you'll download and use these example [cat images](https://www.cs.cmu.edu/~custom-diffusion/assets/data.zip). You can also create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide).
+
+Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, `INSTANCE_DIR` to the path where you just downloaded the cat images to, and `OUTPUT_DIR` to where you want to save the model. You'll use `<new1>` as the special word to tie the newly learned embeddings to. The script creates and saves model checkpoints and a pytorch_custom_diffusion_weights.bin file to your repository.
+
+To monitor training progress with Weights and Biases, add the `--report_to=wandb` parameter to the training command and specify a validation prompt with `--validation_prompt`. This is useful for debugging and saving intermediate results.
+
+
+
+If you're training on human faces, the Custom Diffusion team has found the following parameters to work well:
+
+- `--learning_rate=5e-6`
+- `--max_train_steps` can be anywhere between 1000 and 2000
+- `--freeze_model=crossattn`
+- use at least 15-20 images to train with
+
+
+
+
+
+
+```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export OUTPUT_DIR="path-to-save-model"
+export INSTANCE_DIR="./data/cat"
+
+accelerate launch train_custom_diffusion.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --instance_data_dir=$INSTANCE_DIR \
+ --output_dir=$OUTPUT_DIR \
+ --class_data_dir=./real_reg/samples_cat/ \
+ --with_prior_preservation \
+ --real_prior \
+ --prior_loss_weight=1.0 \
+ --class_prompt="cat" \
+ --num_class_images=200 \
+ --instance_prompt="photo of a <new1> cat" \
+ --resolution=512 \
+ --train_batch_size=2 \
+ --learning_rate=1e-5 \
+ --lr_warmup_steps=0 \
+ --max_train_steps=250 \
+ --scale_lr \
+ --hflip \
+ --modifier_token "<new1>" \
+ --validation_prompt="<new1> cat sitting in a bucket" \
+ --report_to="wandb" \
+ --push_to_hub
+```
+
+
+
+
+Custom Diffusion can also learn multiple concepts if you provide a [JSON](https://github.com/adobe-research/custom-diffusion/blob/main/assets/concept_list.json) file with some details about each concept it should learn.
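+
+For reference, here is a minimal sketch of how you could assemble such a file in Python. The two concepts, prompts, and directory names are illustrative placeholders, and the field names mirror the linked example file, so double-check them against that file before training.
+
+```py
+import json
+
+# Illustrative concept list: each entry pairs an instance prompt/dataset with a
+# class prompt/regularization dataset. Paths and prompts are placeholders.
+concepts_list = [
+    {
+        "instance_prompt": "photo of a <new1> cat",
+        "class_prompt": "cat",
+        "instance_data_dir": "./data/cat",
+        "class_data_dir": "./real_reg/samples_cat",
+    },
+    {
+        "instance_prompt": "photo of a <new2> wooden pot",
+        "class_prompt": "wooden pot",
+        "instance_data_dir": "./data/wooden_pot",
+        "class_data_dir": "./real_reg/samples_wooden_pot",
+    },
+]
+
+with open("concept_list.json", "w") as f:
+    json.dump(concepts_list, f, indent=4)
+```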
+
+Run clip-retrieval to collect some real images to use for regularization:
+
+```bash
+pip install clip-retrieval
+python retrieve.py --class_prompt {} --class_data_dir {} --num_class_images 200
+```
+
+Then you can launch the script:
+
+```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export OUTPUT_DIR="path-to-save-model"
+
+accelerate launch train_custom_diffusion.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --output_dir=$OUTPUT_DIR \
+ --concepts_list=./concept_list.json \
+ --with_prior_preservation \
+ --real_prior \
+ --prior_loss_weight=1.0 \
+ --resolution=512 \
+ --train_batch_size=2 \
+ --learning_rate=1e-5 \
+ --lr_warmup_steps=0 \
+ --max_train_steps=500 \
+ --num_class_images=200 \
+ --scale_lr \
+ --hflip \
+ --modifier_token "<new1>+<new2>" \
+ --push_to_hub
+```
+
+
+
+
+Once training is finished, you can use your new Custom Diffusion model for inference.
+
+
+
+
+```py
+import torch
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16,
+).to("cuda")
+pipeline.unet.load_attn_procs("path-to-save-model", weight_name="pytorch_custom_diffusion_weights.bin")
+pipeline.load_textual_inversion("path-to-save-model", weight_name="<new1>.bin")
+
+image = pipeline(
+ " cat sitting in a bucket",
+ num_inference_steps=100,
+ guidance_scale=6.0,
+ eta=1.0,
+).images[0]
+image.save("cat.png")
+```
+
+
+
+
+```py
+import torch
+from huggingface_hub.repocard import RepoCard
+from diffusers import DiffusionPipeline
+
+model_id = "sayakpaul/custom-diffusion-cat-wooden-pot"
+card = RepoCard.load(model_id)
+base_model_id = card.data.to_dict()["base_model"]
+
+pipeline = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda")
+pipeline.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
+pipeline.load_textual_inversion(model_id, weight_name="<new1>.bin")
+pipeline.load_textual_inversion(model_id, weight_name="<new2>.bin")
+
+image = pipeline(
+ "the <new1> cat sculpture in the style of a <new2> wooden pot",
+ num_inference_steps=100,
+ guidance_scale=6.0,
+ eta=1.0,
+).images[0]
+image.save("multi-subject.png")
+```
+
+
+
+
+## Next steps
+
+Congratulations on training a model with Custom Diffusion! 🎉 To learn more:
+
+- Read the [Multi-Concept Customization of Text-to-Image Diffusion](https://www.cs.cmu.edu/~custom-diffusion/) blog post to learn more details about the experimental results from the Custom Diffusion team.
\ No newline at end of file
diff --git a/docs/source/en/training/ddpo.md b/docs/source/en/training/ddpo.md
new file mode 100644
index 0000000..a4538fe
--- /dev/null
+++ b/docs/source/en/training/ddpo.md
@@ -0,0 +1,17 @@
+
+
+# Reinforcement learning training with DDPO
+
+You can fine-tune Stable Diffusion on a reward function via reinforcement learning with the ๐ค TRL library and ๐ค Diffusers. This is done with the Denoising Diffusion Policy Optimization (DDPO) algorithm introduced by Black et al. in [Training Diffusion Models with Reinforcement Learning](https://arxiv.org/abs/2305.13301), which is implemented in ๐ค TRL with the [`~trl.DDPOTrainer`].
+
+For more information, check out the [`~trl.DDPOTrainer`] API reference and the [Finetune Stable Diffusion Models with DDPO via TRL](https://huggingface.co/blog/trl-ddpo) blog post.
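+
+As a rough sketch of what this looks like in code: the names below (`DDPOConfig`, `DDPOTrainer`, `DefaultDDPOStableDiffusionPipeline`) come from ๐ค TRL, but the configuration values and the reward and prompt functions are illustrative assumptions, so check the [`~trl.DDPOTrainer`] API reference for the exact signatures before running it.
+
+```py
+import torch
+from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline
+
+def prompt_fn():
+    # return a (prompt, metadata) pair for each sample; metadata can be empty
+    return "a photo of a cute corgi", {}
+
+def reward_fn(images, prompts, metadata):
+    # score each generated image; a constant placeholder reward is used here,
+    # a real run would plug in an aesthetic or preference model instead
+    return torch.ones(len(images)), {}
+
+config = DDPOConfig(num_epochs=10, sample_batch_size=4, train_batch_size=2)
+pipeline = DefaultDDPOStableDiffusionPipeline("runwayml/stable-diffusion-v1-5")
+
+trainer = DDPOTrainer(config, reward_fn, prompt_fn, pipeline)
+trainer.train()
+```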
\ No newline at end of file
diff --git a/docs/source/en/training/distributed_inference.md b/docs/source/en/training/distributed_inference.md
new file mode 100644
index 0000000..008dc30
--- /dev/null
+++ b/docs/source/en/training/distributed_inference.md
@@ -0,0 +1,108 @@
+
+
+# Distributed inference with multiple GPUs
+
+On distributed setups, you can run inference across multiple GPUs with ๐ค [Accelerate](https://huggingface.co/docs/accelerate/index) or [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html), which is useful for generating with multiple prompts in parallel.
+
+This guide will show you how to use ๐ค Accelerate and PyTorch Distributed for distributed inference.
+
+## ๐ค Accelerate
+
+๐ค [Accelerate](https://huggingface.co/docs/accelerate/index) is a library designed to make it easy to train or run inference across distributed setups. It simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code.
+
+To begin, create a Python file and initialize an [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `distributed_state.device` to assign a GPU to each process.
+
+Now use the [`~accelerate.PartialState.split_between_processes`] utility as a context manager to automatically distribute the prompts between the number of processes.
+
+```py
+import torch
+from accelerate import PartialState
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+)
+distributed_state = PartialState()
+pipeline.to(distributed_state.device)
+
+with distributed_state.split_between_processes(["a dog", "a cat"]) as prompt:
+ result = pipeline(prompt).images[0]
+ result.save(f"result_{distributed_state.process_index}.png")
+```
+
+Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script:
+
+```bash
+accelerate launch --num_processes=2 run_distributed.py
+```
+
+
+
+To learn more, take a look at the [Distributed Inference with ๐ค Accelerate](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide.
+
+
+
+## PyTorch Distributed
+
+PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables data parallelism.
+
+To start, create a Python file and import `torch.distributed` and `torch.multiprocessing` to set up the distributed process group and to spawn the processes for inference on each GPU. You should also initialize a [`DiffusionPipeline`]:
+
+```py
+import torch
+import torch.distributed as dist
+import torch.multiprocessing as mp
+
+from diffusers import DiffusionPipeline
+
+sd = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+)
+```
+
+You'll want to create a function to run inference; [`init_process_group`](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) handles creating a distributed environment with the type of backend to use, the `rank` of the current process, and the `world_size` or the number of processes participating. If you're running inference in parallel over 2 GPUs, then the `world_size` is 2.
+
+Move the [`DiffusionPipeline`] to `rank` and use `get_rank` to assign a GPU to each process, where each process handles a different prompt:
+
+```py
+def run_inference(rank, world_size):
+ dist.init_process_group("nccl", rank=rank, world_size=world_size)
+
+ sd.to(rank)
+
+ if torch.distributed.get_rank() == 0:
+ prompt = "a dog"
+ elif torch.distributed.get_rank() == 1:
+ prompt = "a cat"
+
+ image = sd(prompt).images[0]
+ image.save(f"./{prompt.replace(' ', '_')}.png")
+```
+
+To run the distributed inference, call [`mp.spawn`](https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn) to run the `run_inference` function on the number of GPUs defined in `world_size`:
+
+```py
+def main():
+ world_size = 2
+ mp.spawn(run_inference, args=(world_size,), nprocs=world_size, join=True)
+
+
+if __name__ == "__main__":
+ main()
+```
+
+Once you've completed the inference script, use the `--nproc_per_node` argument to specify the number of GPUs to use and call `torchrun` to run the script:
+
+```bash
+torchrun --nproc_per_node=2 run_distributed.py
+```
diff --git a/docs/source/en/training/dreambooth.md b/docs/source/en/training/dreambooth.md
new file mode 100644
index 0000000..7573296
--- /dev/null
+++ b/docs/source/en/training/dreambooth.md
@@ -0,0 +1,447 @@
+
+
+# DreamBooth
+
+[DreamBooth](https://huggingface.co/papers/2208.12242) is a training technique that updates the entire diffusion model by training on just a few images of a subject or style. It works by associating a special word in the prompt with the example images.
+
+If you're training on a GPU with limited vRAM, you should try enabling the `gradient_checkpointing` and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing or xFormers. You should have a GPU with >30GB of memory if you want to train faster with Flax.
+
+This guide will explore the [train_dreambooth.py](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) script to help you become more familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Navigate to the example folder with the training script and install the required dependencies for the script you're using:
+
+
+
+
+```bash
+cd examples/dreambooth
+pip install -r requirements.txt
+```
+
+
+
+
+```bash
+cd examples/dreambooth
+pip install -r requirements_flax.txt
+```
+
+
+
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) and let us know if you have any questions or concerns.
+
+
+
+## Script parameters
+
+
+
+DreamBooth is very sensitive to training hyperparameters, and it is easy to overfit. Read the [Training Stable Diffusion with Dreambooth using ๐งจ Diffusers](https://huggingface.co/blog/dreambooth) blog post for recommended settings for different subjects to help you choose the appropriate hyperparameters.
+
+
+
+The training script offers many parameters for customizing your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L228) function. The parameters are set with default values that should work pretty well out-of-the-box, but you can also set your own values in the training command if you'd like.
+
+For example, to train in the bf16 format:
+
+```bash
+accelerate launch train_dreambooth.py \
+ --mixed_precision="bf16"
+```
+
+Some basic and important parameters to know and specify are:
+
+- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
+- `--instance_data_dir`: path to a folder containing the training dataset (example images)
+- `--instance_prompt`: the text prompt that contains the special word for the example images
+- `--train_text_encoder`: whether to also train the text encoder
+- `--output_dir`: where to save the trained model
+- `--push_to_hub`: whether to push the trained model to the Hub
+- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful because if training is interrupted for some reason, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command
+
+### Min-SNR weighting
+
+The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.
+
+Add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
+
+```bash
+accelerate launch train_dreambooth.py \
+ --snr_gamma=5.0
+```
+
+### Prior preservation loss
+
+Prior preservation loss is a method that uses a model's own generated samples to help it learn how to generate more diverse images. Because these generated sample images belong to the same class as the images you provided, they help the model retain what it has learned about the class and how it can use what it already knows about the class to make new compositions.
+
+- `--with_prior_preservation`: whether to use prior preservation loss
+- `--prior_loss_weight`: controls the influence of the prior preservation loss on the model
+- `--class_data_dir`: path to a folder containing the generated class sample images
+- `--class_prompt`: the text prompt describing the class of the generated sample images
+
+```bash
+accelerate launch train_dreambooth.py \
+ --with_prior_preservation \
+ --prior_loss_weight=1.0 \
+ --class_data_dir="path/to/class/images" \
+ --class_prompt="text prompt describing class"
+```
+
+### Train text encoder
+
+To improve the quality of the generated outputs, you can also train the text encoder in addition to the UNet. This requires additional memory and you'll need a GPU with at least 24GB of vRAM. If you have the necessary hardware, then training the text encoder produces better results, especially when generating images of faces. Enable this option by:
+
+```bash
+accelerate launch train_dreambooth.py \
+ --train_text_encoder
+```
+
+## Training script
+
+DreamBooth comes with its own dataset classes:
+
+- [`DreamBoothDataset`](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L604): preprocesses the images and class images, and tokenizes the prompts for training
+- [`PromptDataset`](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L738): generates the prompt embeddings to generate the class images
+
+If you enabled [prior preservation loss](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L842), the class images are generated here:
+
+```py
+sample_dataset = PromptDataset(args.class_prompt, num_new_images)
+sample_dataloader = torch.utils.data.DataLoader(sample_dataset, batch_size=args.sample_batch_size)
+
+sample_dataloader = accelerator.prepare(sample_dataloader)
+pipeline.to(accelerator.device)
+
+for example in tqdm(
+ sample_dataloader, desc="Generating class images", disable=not accelerator.is_local_main_process
+):
+ images = pipeline(example["prompt"]).images
+```
+
+Next is the [`main()`](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L799) function which handles setting up the dataset for training and the training loop itself. The script loads the [tokenizer](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L898), [scheduler and models](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L912C1-L912C1):
+
+```py
+# Load the tokenizer
+if args.tokenizer_name:
+ tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, revision=args.revision, use_fast=False)
+elif args.pretrained_model_name_or_path:
+ tokenizer = AutoTokenizer.from_pretrained(
+ args.pretrained_model_name_or_path,
+ subfolder="tokenizer",
+ revision=args.revision,
+ use_fast=False,
+ )
+
+# Load scheduler and models
+noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
+text_encoder = text_encoder_cls.from_pretrained(
+ args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision
+)
+
+if model_has_vae(args):
+ vae = AutoencoderKL.from_pretrained(
+ args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision
+ )
+else:
+ vae = None
+
+unet = UNet2DConditionModel.from_pretrained(
+ args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision
+)
+```
+
+Then, it's time to [create the training dataset](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L1073) and DataLoader from `DreamBoothDataset`:
+
+```py
+train_dataset = DreamBoothDataset(
+ instance_data_root=args.instance_data_dir,
+ instance_prompt=args.instance_prompt,
+ class_data_root=args.class_data_dir if args.with_prior_preservation else None,
+ class_prompt=args.class_prompt,
+ class_num=args.num_class_images,
+ tokenizer=tokenizer,
+ size=args.resolution,
+ center_crop=args.center_crop,
+ encoder_hidden_states=pre_computed_encoder_hidden_states,
+ class_prompt_encoder_hidden_states=pre_computed_class_prompt_encoder_hidden_states,
+ tokenizer_max_length=args.tokenizer_max_length,
+)
+
+train_dataloader = torch.utils.data.DataLoader(
+ train_dataset,
+ batch_size=args.train_batch_size,
+ shuffle=True,
+ collate_fn=lambda examples: collate_fn(examples, args.with_prior_preservation),
+ num_workers=args.dataloader_num_workers,
+)
+```
+
+Lastly, the [training loop](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L1151) takes care of the remaining steps such as converting images to latent space, adding noise to the input, predicting the noise residual, and calculating the loss.
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
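+
+As a rough illustration of that pattern, a single training step boils down to something like the sketch below. The variable names are simplified stand-ins for objects the script sets up earlier (`vae`, `unet`, `text_encoder`, `noise_scheduler`, `optimizer`, `accelerator`, `batch`), not the script's exact code.
+
+```py
+import torch
+import torch.nn.functional as F
+
+# encode the images into latent space and scale them
+latents = vae.encode(batch["pixel_values"].to(dtype=weight_dtype)).latent_dist.sample()
+latents = latents * vae.config.scaling_factor
+
+# add noise at a randomly sampled timestep
+noise = torch.randn_like(latents)
+timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device)
+noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
+
+# predict the noise residual conditioned on the prompt embeddings
+encoder_hidden_states = text_encoder(batch["input_ids"])[0]
+model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
+
+# compute the loss and take an optimizer step
+loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")
+accelerator.backward(loss)
+optimizer.step()
+optimizer.zero_grad()
+```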
+
+## Launch the script
+
+You're now ready to launch the training script! 🚀
+
+For this guide, you'll download some images of a [dog](https://huggingface.co/datasets/diffusers/dog-example) and store them in a directory. But remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide).
+
+```py
+from huggingface_hub import snapshot_download
+
+local_dir = "./dog"
+snapshot_download(
+ "diffusers/dog-example",
+ local_dir=local_dir,
+ repo_type="dataset",
+ ignore_patterns=".gitattributes",
+)
+```
+
+Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, `INSTANCE_DIR` to the path where you just downloaded the dog images to, and `OUTPUT_DIR` to where you want to save the model. You'll use `sks` as the special word to tie the training to.
+
+If you're interested in following along with the training process, you can periodically save generated images as training progresses. Add the following parameters to the training command:
+
+```bash
+--validation_prompt="a photo of a sks dog"
+--num_validation_images=4
+--validation_steps=100
+```
+
+One more thing before you launch the script! Depending on the GPU you have, you may need to enable certain optimizations to train DreamBooth.
+
+
+
+
+On a 16GB GPU, you can use bitsandbytes 8-bit optimizer and gradient checkpointing to help you train a DreamBooth model. Install bitsandbytes:
+
+```bash
+pip install bitsandbytes
+```
+
+Then, add the following parameters to your training command:
+
+```bash
+accelerate launch train_dreambooth.py \
+ --gradient_checkpointing \
+ --use_8bit_adam \
+```
+
+
+
+
+On a 12GB GPU, you'll need the bitsandbytes 8-bit optimizer, gradient checkpointing, xFormers, and you should set the gradients to `None` instead of zero to reduce your memory usage.
+
+```bash
+accelerate launch train_dreambooth.py \
+ --use_8bit_adam \
+ --gradient_checkpointing \
+ --enable_xformers_memory_efficient_attention \
+ --set_grads_to_none \
+```
+
+
+
+
+On an 8GB GPU, you'll need [DeepSpeed](https://www.deepspeed.ai/) to offload some of the tensors from the vRAM to either the CPU or NVMe to allow training with less GPU memory.
+
+Run the following command to configure your ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+During configuration, confirm that you want to use DeepSpeed. Now it should be possible to train on under 8GB vRAM by combining DeepSpeed stage 2, fp16 mixed precision, and offloading the model parameters and the optimizer state to the CPU. The drawback is that this requires more system RAM (~25 GB). See the [DeepSpeed documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more configuration options.
+
+You should also change the default Adam optimizer to DeepSpeed's optimized version of Adam [`deepspeed.ops.adam.DeepSpeedCPUAdam`](https://deepspeed.readthedocs.io/en/latest/optimizers.html#adam-cpu) for a substantial speedup. Enabling `DeepSpeedCPUAdam` requires your system's CUDA toolchain version to be the same as the one installed with PyTorch.
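+
+A minimal sketch of what that swap could look like inside the training script is shown below. `params_to_optimize` stands in for the parameter list the script already builds; there is no ready-made flag for this, so treat it as an assumption about how you'd wire it in yourself.
+
+```py
+# assumes DeepSpeed is installed and the accelerate config enables CPU offload
+from deepspeed.ops.adam import DeepSpeedCPUAdam
+
+optimizer = DeepSpeedCPUAdam(
+    params_to_optimize,  # the same parameters the script normally hands to Adam
+    lr=args.learning_rate,
+    betas=(args.adam_beta1, args.adam_beta2),
+    weight_decay=args.adam_weight_decay,
+    eps=args.adam_epsilon,
+)
+```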
+
+bitsandbytes 8-bit optimizers don't seem to be compatible with DeepSpeed at the moment.
+
+That's it! You don't need to add any additional parameters to your training command.
+
+
+
+
+
+
+
+```bash
+export MODEL_NAME="runwayml/stable-diffusion-v1-5"
+export INSTANCE_DIR="./dog"
+export OUTPUT_DIR="path_to_saved_model"
+
+accelerate launch train_dreambooth.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --instance_data_dir=$INSTANCE_DIR \
+ --output_dir=$OUTPUT_DIR \
+ --instance_prompt="a photo of sks dog" \
+ --resolution=512 \
+ --train_batch_size=1 \
+ --gradient_accumulation_steps=1 \
+ --learning_rate=5e-6 \
+ --lr_scheduler="constant" \
+ --lr_warmup_steps=0 \
+ --max_train_steps=400 \
+ --push_to_hub
+```
+
+
+
+
+```bash
+export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
+export INSTANCE_DIR="./dog"
+export OUTPUT_DIR="path-to-save-model"
+
+python train_dreambooth_flax.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --instance_data_dir=$INSTANCE_DIR \
+ --output_dir=$OUTPUT_DIR \
+ --instance_prompt="a photo of sks dog" \
+ --resolution=512 \
+ --train_batch_size=1 \
+ --learning_rate=5e-6 \
+ --max_train_steps=400 \
+ --push_to_hub
+```
+
+
+
+
+Once training is complete, you can use your newly trained model for inference!
+
+
+
+Can't wait to try your model for inference before training is complete? 🤭 Make sure you have the latest version of ๐ค Accelerate installed.
+
+```py
+from diffusers import DiffusionPipeline, UNet2DConditionModel
+from transformers import CLIPTextModel
+import torch
+
+unet = UNet2DConditionModel.from_pretrained("path/to/model/checkpoint-100/unet")
+
+# if you have trained with `--train_text_encoder` make sure to also load the text encoder
+text_encoder = CLIPTextModel.from_pretrained("path/to/model/checkpoint-100/text_encoder")
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", unet=unet, text_encoder=text_encoder, dtype=torch.float16,
+).to("cuda")
+
+image = pipeline("A photo of sks dog in a bucket", num_inference_steps=50, guidance_scale=7.5).images[0]
+image.save("dog-bucket.png")
+```
+
+
+
+
+
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained("path_to_saved_model", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+image = pipeline("A photo of sks dog in a bucket", num_inference_steps=50, guidance_scale=7.5).images[0]
+image.save("dog-bucket.png")
+```
+
+
+
+
+```py
+import jax
+import numpy as np
+from flax.jax_utils import replicate
+from flax.training.common_utils import shard
+from diffusers import FlaxStableDiffusionPipeline
+
+pipeline, params = FlaxStableDiffusionPipeline.from_pretrained("path-to-your-trained-model", dtype=jax.numpy.bfloat16)
+
+prompt = "A photo of sks dog in a bucket"
+prng_seed = jax.random.PRNGKey(0)
+num_inference_steps = 50
+
+num_samples = jax.device_count()
+prompt = num_samples * [prompt]
+prompt_ids = pipeline.prepare_inputs(prompt)
+
+# shard inputs and rng
+params = replicate(params)
+prng_seed = jax.random.split(prng_seed, jax.device_count())
+prompt_ids = shard(prompt_ids)
+
+images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
+images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
+images[0].save("dog-bucket.png")
+```
+
+
+
+
+## LoRA
+
+LoRA is a training technique for significantly reducing the number of trainable parameters. As a result, training is faster and it is easier to store the resulting weights because they are a lot smaller (~100MB). Use the [train_dreambooth_lora.py](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py) script to train with LoRA.
+
+The LoRA training script is discussed in more detail in the [LoRA training](lora) guide.
+
+## Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text-encoder to its architecture. Use the [train_dreambooth_lora_sdxl.py](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sdxl.py) script to train an SDXL model with LoRA.
+
+The SDXL training script is discussed in more detail in the [SDXL training](sdxl) guide.
+
+## Next steps
+
+Congratulations on training your DreamBooth model! To learn more about how to use your new model, the following guide may be helpful:
+
+- Learn how to [load a DreamBooth](../using-diffusers/loading_adapters) model for inference if you trained your model with LoRA.
\ No newline at end of file
diff --git a/docs/source/en/training/instructpix2pix.md b/docs/source/en/training/instructpix2pix.md
new file mode 100644
index 0000000..14f9bb3
--- /dev/null
+++ b/docs/source/en/training/instructpix2pix.md
@@ -0,0 +1,252 @@
+
+
+# InstructPix2Pix
+
+[InstructPix2Pix](https://hf.co/papers/2211.09800) is a Stable Diffusion model trained to edit images from human-provided instructions. For example, your prompt can be "turn the clouds rainy" and the model will edit the input image accordingly. This model is conditioned on the text prompt (or editing instruction) and the input image.
+
+This guide will explore the [train_instruct_pix2pix.py](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py) training script to help you become familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:
+
+```bash
+cd examples/instruct_pix2pix
+pip install -r requirements.txt
+```
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py) and let us know if you have any questions or concerns.
+
+
+
+## Script parameters
+
+The training script has many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L65) function. Default values are provided for most parameters that work pretty well, but you can also set your own values in the training command if you'd like.
+
+For example, to increase the resolution of the input image:
+
+```bash
+accelerate launch train_instruct_pix2pix.py \
+ --resolution=512 \
+```
+
+Many of the basic and important parameters are described in the [Text-to-image](text2image#script-parameters) training guide, so this guide just focuses on the relevant parameters for InstructPix2Pix:
+
+- `--original_image_column`: the original image before the edits are made
+- `--edited_image_column`: the image after the edits are made
+- `--edit_prompt_column`: the instructions to edit the image
+- `--conditioning_dropout_prob`: the dropout probability for the edited image and edit prompts during training which enables classifier-free guidance (CFG) for one or both conditioning inputs
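+
+If you're not sure what to pass for the column parameters, you can quickly inspect your dataset first. The snippet below uses the example dataset from later in this guide; the column names given in the comment are only what you'd expect, so verify them against your own dataset.
+
+```py
+from datasets import load_dataset
+
+dataset = load_dataset("fusing/instructpix2pix-1000-samples", split="train")
+# expect something like ['input_image', 'edit_prompt', 'edited_image'];
+# pass these names to --original_image_column, --edit_prompt_column, and --edited_image_column
+print(dataset.column_names)
+```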
+
+## Training script
+
+The dataset preprocessing code and training loop are found in the [`main()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L374) function. This is where you'll make your changes to the training script to adapt it for your own use-case.
+
+As with the script parameters, a general walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. This guide instead focuses on the parts of the script that are relevant to InstructPix2Pix.
+
+The script begins by modifying the [number of input channels](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L445) in the first convolutional layer of the UNet to account for InstructPix2Pix's additional conditioning image:
+
+```py
+in_channels = 8
+out_channels = unet.conv_in.out_channels
+unet.register_to_config(in_channels=in_channels)
+
+with torch.no_grad():
+ new_conv_in = nn.Conv2d(
+ in_channels, out_channels, unet.conv_in.kernel_size, unet.conv_in.stride, unet.conv_in.padding
+ )
+ new_conv_in.weight.zero_()
+ new_conv_in.weight[:, :4, :, :].copy_(unet.conv_in.weight)
+ unet.conv_in = new_conv_in
+```
+
+These UNet parameters are [updated](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L545C1-L551C6) by the optimizer:
+
+```py
+optimizer = optimizer_cls(
+ unet.parameters(),
+ lr=args.learning_rate,
+ betas=(args.adam_beta1, args.adam_beta2),
+ weight_decay=args.adam_weight_decay,
+ eps=args.adam_epsilon,
+)
+```
+
+Next, the edited images and edit instructions are [preprocessed](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L624) and [tokenized](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L610C24-L610C24). It is important that the same image transformations are applied to the original and edited images.
+
+```py
+def preprocess_train(examples):
+ preprocessed_images = preprocess_images(examples)
+
+ original_images, edited_images = preprocessed_images.chunk(2)
+ original_images = original_images.reshape(-1, 3, args.resolution, args.resolution)
+ edited_images = edited_images.reshape(-1, 3, args.resolution, args.resolution)
+
+ examples["original_pixel_values"] = original_images
+ examples["edited_pixel_values"] = edited_images
+
+ captions = list(examples[edit_prompt_column])
+ examples["input_ids"] = tokenize_captions(captions)
+ return examples
+```
+
+Finally, in the [training loop](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L730), it starts by encoding the edited images into latent space:
+
+```py
+latents = vae.encode(batch["edited_pixel_values"].to(weight_dtype)).latent_dist.sample()
+latents = latents * vae.config.scaling_factor
+```
+
+Then, the script applies dropout to the original image and edit instruction embeddings to support CFG. This is what enables the model to modulate the influence of the edit instruction and original image on the edited image.
+
+```py
+encoder_hidden_states = text_encoder(batch["input_ids"])[0]
+original_image_embeds = vae.encode(batch["original_pixel_values"].to(weight_dtype)).latent_dist.mode()
+
+if args.conditioning_dropout_prob is not None:
+ random_p = torch.rand(bsz, device=latents.device, generator=generator)
+ prompt_mask = random_p < 2 * args.conditioning_dropout_prob
+ prompt_mask = prompt_mask.reshape(bsz, 1, 1)
+ null_conditioning = text_encoder(tokenize_captions([""]).to(accelerator.device))[0]
+ encoder_hidden_states = torch.where(prompt_mask, null_conditioning, encoder_hidden_states)
+
+ image_mask_dtype = original_image_embeds.dtype
+ image_mask = 1 - (
+ (random_p >= args.conditioning_dropout_prob).to(image_mask_dtype)
+ * (random_p < 3 * args.conditioning_dropout_prob).to(image_mask_dtype)
+ )
+ image_mask = image_mask.reshape(bsz, 1, 1, 1)
+ original_image_embeds = image_mask * original_image_embeds
+```
+
+That's pretty much it! Aside from the differences described here, the rest of the script is very similar to the [Text-to-image](text2image#training-script) training script, so feel free to check it out for more details. If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Once you're happy with the changes to your script or if you're okay with the default configuration, you're ready to launch the training script! 🚀
+
+This guide uses the [fusing/instructpix2pix-1000-samples](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) dataset, which is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered). You can also create and use your own dataset if you'd like (see the [Create a dataset for training](create_dataset) guide).
+
+Set the `MODEL_NAME` environment variable to the name of the model (can be a model id on the Hub or a path to a local model), and the `DATASET_ID` to the name of the dataset on the Hub. The script creates and saves all the components (feature extractor, scheduler, text encoder, UNet, etc.) to a subfolder in your repository.
+
+
+
+For better results, try longer training runs with a larger dataset. We've only tested this training script on a smaller-scale dataset.
+
+
+
+To monitor training progress with Weights and Biases, add the `--report_to=wandb` parameter to the training command and specify a validation image with `--val_image_url` and a validation prompt with `--validation_prompt`. This can be really useful for debugging the model.
+
+
+
+If you're training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command.
+
+```bash
+accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --dataset_name=$DATASET_ID \
+ --enable_xformers_memory_efficient_attention \
+ --resolution=256 \
+ --random_flip \
+ --train_batch_size=4 \
+ --gradient_accumulation_steps=4 \
+ --gradient_checkpointing \
+ --max_train_steps=15000 \
+ --checkpointing_steps=5000 \
+ --checkpoints_total_limit=1 \
+ --learning_rate=5e-05 \
+ --max_grad_norm=1 \
+ --lr_warmup_steps=0 \
+ --conditioning_dropout_prob=0.05 \
+ --mixed_precision=fp16 \
+ --seed=42 \
+ --push_to_hub
+```
+
+After training is finished, you can use your new InstructPix2Pix for inference:
+
+```py
+import PIL
+import requests
+import torch
+from diffusers import StableDiffusionInstructPix2PixPipeline
+from diffusers.utils import load_image
+
+pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained("your_cool_model", torch_dtype=torch.float16).to("cuda")
+generator = torch.Generator("cuda").manual_seed(0)
+
+image = load_image("https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png")
+prompt = "add some ducks to the lake"
+num_inference_steps = 20
+image_guidance_scale = 1.5
+guidance_scale = 10
+
+edited_image = pipeline(
+ prompt,
+ image=image,
+ num_inference_steps=num_inference_steps,
+ image_guidance_scale=image_guidance_scale,
+ guidance_scale=guidance_scale,
+ generator=generator,
+).images[0]
+edited_image.save("edited_image.png")
+```
+
+You should experiment with different `num_inference_steps`, `image_guidance_scale`, and `guidance_scale` values to see how they affect inference speed and quality. The guidance scale parameters are especially impactful because they control how much the original image and edit instructions affect the edited image.
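+
+One low-effort way to run that experiment is a small grid sweep over the two guidance scales, reusing the `pipeline`, `prompt`, and `image` from the snippet above. The specific values below are arbitrary starting points, not recommendations.
+
+```py
+import itertools
+
+for image_guidance_scale, guidance_scale in itertools.product([1.2, 1.5, 2.0], [5.0, 7.5, 10.0]):
+    edited_image = pipeline(
+        prompt,
+        image=image,
+        num_inference_steps=num_inference_steps,
+        image_guidance_scale=image_guidance_scale,
+        guidance_scale=guidance_scale,
+        generator=torch.Generator("cuda").manual_seed(0),  # fixed seed for a fair comparison
+    ).images[0]
+    edited_image.save(f"edited_igs-{image_guidance_scale}_gs-{guidance_scale}.png")
+```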
+
+## Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text-encoder to its architecture. Use the [`train_instruct_pix2pix_sdxl.py`](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix_sdxl.py) script to train an SDXL model to follow image editing instructions.
+
+The SDXL training script is discussed in more detail in the [SDXL training](sdxl) guide.
+
+## Next steps
+
+Congratulations on training your own InstructPix2Pix model! 🥳 To learn more about the model, it may be helpful to:
+
+- Read the [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd) blog post to learn more about some experiments we've done with InstructPix2Pix, dataset preparation, and results for different instructions.
\ No newline at end of file
diff --git a/docs/source/en/training/kandinsky.md b/docs/source/en/training/kandinsky.md
new file mode 100644
index 0000000..38cfaa4
--- /dev/null
+++ b/docs/source/en/training/kandinsky.md
@@ -0,0 +1,327 @@
+
+
+# Kandinsky 2.2
+
+
+
+This script is experimental, and it's easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset.
+
+
+
+Kandinsky 2.2 is a multilingual text-to-image model capable of producing more photorealistic images. The model includes an image prior model for creating image embeddings from text prompts, and a decoder model that generates images based on the prior model's embeddings. That's why you'll find two separate scripts in Diffusers for Kandinsky 2.2, one for training the prior model and one for training the decoder model. You can train both models separately, but to get the best results, you should train both the prior and decoder models.
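+
+To make that split concrete, here is a minimal inference sketch using the publicly released Kandinsky 2.2 checkpoints (not the models you're about to train): the prior maps a prompt to image embeddings, and the decoder turns those embeddings into pixels.
+
+```py
+import torch
+from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
+
+prior = KandinskyV22PriorPipeline.from_pretrained(
+    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
+).to("cuda")
+decoder = KandinskyV22Pipeline.from_pretrained(
+    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
+).to("cuda")
+
+# the prior produces CLIP image embeddings from the text prompt
+image_embeds, negative_image_embeds = prior("A robot pokemon, 4k photo").to_tuple()
+
+# the decoder generates the image from those embeddings
+image = decoder(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
+```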
+
+Depending on your GPU, you may need to enable `gradient_checkpointing` (⚠️ not supported for the prior model!), `mixed_precision`, and `gradient_accumulation_steps` to help fit the model into memory and to speed up training. You can reduce your memory usage even more by enabling memory-efficient attention with [xFormers](../optimization/xformers) (version [v0.0.16](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212) fails for training on some GPUs so you may need to install a development version instead).
+
+This guide explores the [train_text_to_image_prior.py](https://github.com/huggingface/diffusers/blob/main/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py) and the [train_text_to_image_decoder.py](https://github.com/huggingface/diffusers/blob/main/examples/kandinsky2_2/text_to_image/train_text_to_image_decoder.py) scripts to help you become more familiar with them, and how you can adapt them for your own use-case.
+
+Before running the scripts, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:
+
+```bash
+cd examples/kandinsky2_2/text_to_image
+pip install -r requirements.txt
+```
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+
+
+The following sections highlight parts of the training scripts that are important for understanding how to modify them, but they don't cover every aspect of the scripts in detail. If you're interested in learning more, feel free to read through the scripts and let us know if you have any questions or concerns.
+
+
+
+## Script parameters
+
+The training scripts provide many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py#L190) function. The training scripts provide default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+
+For example, to speed up training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command:
+
+```bash
+accelerate launch train_text_to_image_prior.py \
+ --mixed_precision="fp16"
+```
+
+Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so let's get straight to a walkthrough of the Kandinsky training scripts!
+
+### Min-SNR weighting
+
+The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.
+
+Add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
+
+```bash
+accelerate launch train_text_to_image_prior.py \
+ --snr_gamma=5.0
+```
+
+## Training script
+
+The training script is also similar to the [Text-to-image](text2image#training-script) training guide, but it's been modified to support training the prior and decoder models. This guide focuses on the code that is unique to the Kandinsky 2.2 training scripts.
+
+
+
+
+The [`main()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py#L441) function contains the code for preparing the dataset and training the model.
+
+One of the main differences you'll notice right away is that the training script also loads a [`~transformers.CLIPImageProcessor`] - in addition to a scheduler and tokenizer - for preprocessing images and a [`~transformers.CLIPVisionModelWithProjection`] model for encoding the images:
+
+```py
+noise_scheduler = DDPMScheduler(beta_schedule="squaredcos_cap_v2", prediction_type="sample")
+image_processor = CLIPImageProcessor.from_pretrained(
+ args.pretrained_prior_model_name_or_path, subfolder="image_processor"
+)
+tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_prior_model_name_or_path, subfolder="tokenizer")
+
+with ContextManagers(deepspeed_zero_init_disabled_context_manager()):
+ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+ args.pretrained_prior_model_name_or_path, subfolder="image_encoder", torch_dtype=weight_dtype
+ ).eval()
+ text_encoder = CLIPTextModelWithProjection.from_pretrained(
+ args.pretrained_prior_model_name_or_path, subfolder="text_encoder", torch_dtype=weight_dtype
+ ).eval()
+```
+
+Kandinsky uses a [`PriorTransformer`] to generate the image embeddings, so you'll want to set up the optimizer to learn the prior model's parameters.
+
+```py
+prior = PriorTransformer.from_pretrained(args.pretrained_prior_model_name_or_path, subfolder="prior")
+prior.train()
+optimizer = optimizer_cls(
+ prior.parameters(),
+ lr=args.learning_rate,
+ betas=(args.adam_beta1, args.adam_beta2),
+ weight_decay=args.adam_weight_decay,
+ eps=args.adam_epsilon,
+)
+```
+
+Next, the input captions are tokenized, and images are [preprocessed](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py#L632) by the [`~transformers.CLIPImageProcessor`]:
+
+```py
+def preprocess_train(examples):
+ images = [image.convert("RGB") for image in examples[image_column]]
+ examples["clip_pixel_values"] = image_processor(images, return_tensors="pt").pixel_values
+ examples["text_input_ids"], examples["text_mask"] = tokenize_captions(examples)
+ return examples
+```
+
+Finally, the [training loop](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py#L718) converts the input images into latents, adds noise to the image embeddings, and makes a prediction:
+
+```py
+model_pred = prior(
+ noisy_latents,
+ timestep=timesteps,
+ proj_embedding=prompt_embeds,
+ encoder_hidden_states=text_encoder_hidden_states,
+ attention_mask=text_mask,
+).predicted_image_embedding
+```
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+
+
+
+The [`main()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_decoder.py#L440) function contains the code for preparing the dataset and training the model.
+
+Unlike the prior model, the decoder initializes a [`VQModel`] to decode the latents into images and it uses a [`UNet2DConditionModel`]:
+
+```py
+with ContextManagers(deepspeed_zero_init_disabled_context_manager()):
+ vae = VQModel.from_pretrained(
+ args.pretrained_decoder_model_name_or_path, subfolder="movq", torch_dtype=weight_dtype
+ ).eval()
+ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+ args.pretrained_prior_model_name_or_path, subfolder="image_encoder", torch_dtype=weight_dtype
+ ).eval()
+unet = UNet2DConditionModel.from_pretrained(args.pretrained_decoder_model_name_or_path, subfolder="unet")
+```
+
+Next, the script includes several image transforms and a [preprocessing](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_decoder.py#L622) function for applying the transforms to the images and returning the pixel values:
+
+```py
+def preprocess_train(examples):
+ images = [image.convert("RGB") for image in examples[image_column]]
+ examples["pixel_values"] = [train_transforms(image) for image in images]
+ examples["clip_pixel_values"] = image_processor(images, return_tensors="pt").pixel_values
+ return examples
+```
+
+Lastly, the [training loop](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_decoder.py#L706) handles converting the images to latents, adding noise, and predicting the noise residual.
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+```py
+model_pred = unet(noisy_latents, timesteps, None, added_cond_kwargs=added_cond_kwargs).sample[:, :4]
+```
+
+
+
+
+## Launch the script
+
+Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! 🚀
+
+You'll train on the [Pokémon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate your own Pokémon, but you can also create and train on your own dataset by following the [Create a dataset for training](create_dataset) guide. Set the environment variable `DATASET_NAME` to the name of the dataset on the Hub or if you're training on your own files, set the environment variable `TRAIN_DIR` to a path to your dataset.
+
+If you're training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command.
+
+
+
+To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. You'll also need to add the `--validation_prompt` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results.
+
+
+
+
+
+
+```bash
+export DATASET_NAME="lambdalabs/pokemon-blip-captions"
+
+accelerate launch --mixed_precision="fp16" train_text_to_image_prior.py \
+ --dataset_name=$DATASET_NAME \
+ --resolution=768 \
+ --train_batch_size=1 \
+ --gradient_accumulation_steps=4 \
+ --max_train_steps=15000 \
+ --learning_rate=1e-05 \
+ --max_grad_norm=1 \
+ --checkpoints_total_limit=3 \
+ --lr_scheduler="constant" \
+ --lr_warmup_steps=0 \
+ --validation_prompts="A robot pokemon, 4k photo" \
+ --report_to="wandb" \
+ --push_to_hub \
+ --output_dir="kandi2-prior-pokemon-model"
+```
+
+
+
+
+```bash
+export DATASET_NAME="lambdalabs/pokemon-blip-captions"
+
+accelerate launch --mixed_precision="fp16" train_text_to_image_decoder.py \
+ --dataset_name=$DATASET_NAME \
+ --resolution=768 \
+ --train_batch_size=1 \
+ --gradient_accumulation_steps=4 \
+ --gradient_checkpointing \
+ --max_train_steps=15000 \
+ --learning_rate=1e-05 \
+ --max_grad_norm=1 \
+ --checkpoints_total_limit=3 \
+ --lr_scheduler="constant" \
+ --lr_warmup_steps=0 \
+ --validation_prompts="A robot pokemon, 4k photo" \
+ --report_to="wandb" \
+ --push_to_hub \
+ --output_dir="kandi2-decoder-pokemon-model"
+```
+
+
+
+
+Once training is finished, you can use your newly trained model for inference!
+
+
+
+
+```py
+from diffusers import AutoPipelineForText2Image, DiffusionPipeline
+import torch
+
+prior_pipeline = DiffusionPipeline.from_pretrained("kandi2-prior-pokemon-model", torch_dtype=torch.float16)
+prior_components = {"prior_" + k: v for k, v in prior_pipeline.components.items()}
+pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", **prior_components, torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+prompt = "A robot pokemon, 4k photo"
+negative_prompt = "low quality, bad quality"
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt).images[0]
+```
+
+
+
+Feel free to replace `kandinsky-community/kandinsky-2-2-decoder` with your own trained decoder checkpoint!
+
+
+
+
+
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("path/to/saved/model", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+prompt="A robot pokemon, 4k photo"
+image = pipeline(prompt=prompt).images[0]
+```
+
+For the decoder model, you can also perform inference from a saved checkpoint which can be useful for viewing intermediate results. In this case, load the checkpoint into the UNet:
+
+```py
+from diffusers import AutoPipelineForText2Image, UNet2DConditionModel
+import torch
+
+unet = UNet2DConditionModel.from_pretrained("path/to/saved/model" + "/checkpoint-<N>/unet")  # replace <N> with a checkpoint step
+
+pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", unet=unet, torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+image = pipeline(prompt="A robot pokemon, 4k photo").images[0]
+```
+
+
+
+
+## Next steps
+
+Congratulations on training a Kandinsky 2.2 model! To learn more about how to use your new model, the following guides may be helpful:
+
+- Read the [Kandinsky](../using-diffusers/kandinsky) guide to learn how to use it for a variety of different tasks (text-to-image, image-to-image, inpainting, interpolation), and how it can be combined with a ControlNet.
+- Check out the [DreamBooth](dreambooth) and [LoRA](lora) training guides to learn how to train a personalized Kandinsky model with just a few example images. These two training techniques can even be combined!
diff --git a/docs/source/en/training/lcm_distill.md b/docs/source/en/training/lcm_distill.md
new file mode 100644
index 0000000..8804c8b
--- /dev/null
+++ b/docs/source/en/training/lcm_distill.md
@@ -0,0 +1,255 @@
+
+
+# Latent Consistency Distillation
+
+[Latent Consistency Models (LCMs)](https://hf.co/papers/2310.04378) are able to generate high-quality images in just a few steps, representing a big leap forward because many pipelines require at least 25+ steps. LCMs are produced by applying the latent consistency distillation method to any Stable Diffusion model. This method works by applying *one-stage guided distillation* to the latent space, and incorporating a *skipping-step* method to consistently skip timesteps to accelerate the distillation process (refer to section 4.1, 4.2, and 4.3 of the paper for more details).
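+
+To see what this buys you at inference time, here is a short sketch using an already-distilled LCM checkpoint from the Hub together with [`LCMScheduler`]; the checkpoint name is just an example of a publicly available LCM, not the output of this training script.
+
+```py
+import torch
+from diffusers import DiffusionPipeline, LCMScheduler
+
+pipeline = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16).to("cuda")
+# swap in the LCM scheduler so the model can sample in very few steps
+pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
+
+image = pipeline("a photo of an astronaut riding a horse", num_inference_steps=4, guidance_scale=8.0).images[0]
+image.save("lcm.png")
+```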
+
+If you're training on a GPU with limited VRAM, try enabling `gradient_checkpointing`, `gradient_accumulation_steps`, and `mixed_precision` to reduce memory usage and speed up training. You can reduce your memory usage even more by enabling memory-efficient attention with [xFormers](../optimization/xformers) and [bitsandbytes'](https://github.com/TimDettmers/bitsandbytes) 8-bit optimizer.
+
+This guide will explore the [train_lcm_distill_sd_wds.py](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_sd_wds.py) script to help you become more familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:
+
+```bash
+cd examples/consistency_distillation
+pip install -r requirements.txt
+```
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment (try enabling `torch.compile` to significantly speed up training):
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+## Script parameters
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_sd_wds.py) and let us know if you have any questions or concerns.
+
+
+
+The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L419) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+
+For example, to speed up training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command:
+
+```bash
+accelerate launch train_lcm_distill_sd_wds.py \
+ --mixed_precision="fp16"
+```
+
+Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so this guide focuses on the parameters that are relevant to latent consistency distillation.
+
+- `--pretrained_teacher_model`: the path to a pretrained latent diffusion model to use as the teacher model
+- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify an alternative VAE (like this [VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) by madebyollin which works in fp16)
+- `--w_min` and `--w_max`: the minimum and maximum guidance scale values for guidance scale sampling
+- `--num_ddim_timesteps`: the number of timesteps for DDIM sampling
+- `--loss_type`: the type of loss (L2 or Huber) to calculate for latent consistency distillation; Huber loss is generally preferred because it's more robust to outliers
+- `--huber_c`: the Huber loss parameter
+
+## Training script
+
+The training script starts by creating a dataset class - [`Text2ImageDataset`](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L141) - for preprocessing the images and creating a training dataset.
+
+```py
+def transform(example):
+ image = example["image"]
+ image = TF.resize(image, resolution, interpolation=transforms.InterpolationMode.BILINEAR)
+
+ c_top, c_left, _, _ = transforms.RandomCrop.get_params(image, output_size=(resolution, resolution))
+ image = TF.crop(image, c_top, c_left, resolution, resolution)
+ image = TF.to_tensor(image)
+ image = TF.normalize(image, [0.5], [0.5])
+
+ example["image"] = image
+ return example
+```
+
+For improved performance on reading and writing large datasets stored in the cloud, this script uses the [WebDataset](https://github.com/webdataset/webdataset) format to create a preprocessing pipeline to apply transforms and create a dataset and dataloader for training. Images are processed and fed to the training loop without having to download the full dataset first.
+
+```py
+processing_pipeline = [
+ wds.decode("pil", handler=wds.ignore_and_continue),
+ wds.rename(image="jpg;png;jpeg;webp", text="text;txt;caption", handler=wds.warn_and_continue),
+ wds.map(filter_keys({"image", "text"})),
+ wds.map(transform),
+ wds.to_tuple("image", "text"),
+]
+```
+
+In the [`main()`](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L768) function, all the necessary components like the noise scheduler, tokenizers, text encoders, and VAE are loaded. The teacher UNet is also loaded here and then you can create a student UNet from the teacher UNet. The student UNet is updated by the optimizer during training.
+
+```py
+teacher_unet = UNet2DConditionModel.from_pretrained(
+ args.pretrained_teacher_model, subfolder="unet", revision=args.teacher_revision
+)
+
+unet = UNet2DConditionModel(**teacher_unet.config)
+unet.load_state_dict(teacher_unet.state_dict(), strict=False)
+unet.train()
+```
+
+Now you can create the [optimizer](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L979) to update the UNet parameters:
+
+```py
+optimizer = optimizer_class(
+ unet.parameters(),
+ lr=args.learning_rate,
+ betas=(args.adam_beta1, args.adam_beta2),
+ weight_decay=args.adam_weight_decay,
+ eps=args.adam_epsilon,
+)
+```
+
+Create the [dataset](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L994):
+
+```py
+dataset = Text2ImageDataset(
+ train_shards_path_or_url=args.train_shards_path_or_url,
+ num_train_examples=args.max_train_samples,
+ per_gpu_batch_size=args.train_batch_size,
+ global_batch_size=args.train_batch_size * accelerator.num_processes,
+ num_workers=args.dataloader_num_workers,
+ resolution=args.resolution,
+ shuffle_buffer_size=1000,
+ pin_memory=True,
+ persistent_workers=True,
+)
+train_dataloader = dataset.train_dataloader
+```
+
+Next, you're ready to set up the [training loop](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L1049) and implement the latent consistency distillation method (see Algorithm 1 in the paper for more details). This section of the script takes care of adding noise to the latents, sampling and creating a guidance scale embedding, and predicting the original image from the noise.
+
+```py
+pred_x_0 = predicted_origin(
+ noise_pred,
+ start_timesteps,
+ noisy_model_input,
+ noise_scheduler.config.prediction_type,
+ alpha_schedule,
+ sigma_schedule,
+)
+
+model_pred = c_skip_start * noisy_model_input + c_out_start * pred_x_0
+```
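+
+The `c_skip_start` and `c_out_start` terms above are the consistency model boundary-condition scalings. As a rough, illustrative sketch of how they relate to the starting timestep (assuming the common formulation with `sigma_data=0.5` and a timestep scaling factor of 10; the script's own helper may use different defaults):
+
+```py
+import torch
+
+def scalings_for_boundary_conditions(timestep, sigma_data=0.5, timestep_scaling=10.0):
+    # Enforce the consistency boundary condition: as timestep -> 0, c_skip -> 1 and c_out -> 0,
+    # so the model acts as the identity on (nearly) clean samples.
+    scaled_timestep = timestep_scaling * timestep
+    c_skip = sigma_data**2 / (scaled_timestep**2 + sigma_data**2)
+    c_out = scaled_timestep / (scaled_timestep**2 + sigma_data**2) ** 0.5
+    return c_skip, c_out
+
+c_skip_start, c_out_start = scalings_for_boundary_conditions(torch.tensor([20.0]))
+```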
+
+It gets the [teacher model predictions](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L1172) and the [LCM predictions](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L1209) next, calculates the loss, and then backpropagates it to the LCM.
+
+```py
+if args.loss_type == "l2":
+ loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
+elif args.loss_type == "huber":
+ loss = torch.mean(
+ torch.sqrt((model_pred.float() - target.float()) ** 2 + args.huber_c**2) - args.huber_c
+ )
+```
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers tutorial](../using-diffusers/write_own_pipeline) which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Now you're ready to launch the training script and start distilling!
+
+For this guide, you'll use the `--train_shards_path_or_url` parameter to specify the path to the [Conceptual Captions 12M](https://github.com/google-research-datasets/conceptual-12m) dataset stored on the Hub [here](https://huggingface.co/datasets/laion/conceptual-captions-12m-webdataset). Set the `MODEL_DIR` environment variable to the name of the teacher model and `OUTPUT_DIR` to where you want to save the model.
+
+```bash
+export MODEL_DIR="runwayml/stable-diffusion-v1-5"
+export OUTPUT_DIR="path/to/saved/model"
+
+accelerate launch train_lcm_distill_sd_wds.py \
+ --pretrained_teacher_model=$MODEL_DIR \
+ --output_dir=$OUTPUT_DIR \
+ --mixed_precision=fp16 \
+ --resolution=512 \
+ --learning_rate=1e-6 --loss_type="huber" --ema_decay=0.95 --adam_weight_decay=0.0 \
+ --max_train_steps=1000 \
+ --max_train_samples=4000000 \
+ --dataloader_num_workers=8 \
+ --train_shards_path_or_url="pipe:curl -L -s https://huggingface.co/datasets/laion/conceptual-captions-12m-webdataset/resolve/main/data/{00000..01099}.tar?download=true" \
+ --validation_steps=200 \
+ --checkpointing_steps=200 --checkpoints_total_limit=10 \
+ --train_batch_size=12 \
+ --gradient_checkpointing --enable_xformers_memory_efficient_attention \
+ --gradient_accumulation_steps=1 \
+ --use_8bit_adam \
+ --resume_from_checkpoint=latest \
+ --report_to=wandb \
+ --seed=453645634 \
+ --push_to_hub
+```
+
+Once training is complete, you can use your new LCM for inference.
+
+```py
+from diffusers import UNet2DConditionModel, DiffusionPipeline, LCMScheduler
+import torch
+
+unet = UNet2DConditionModel.from_pretrained("your-username/your-model", torch_dtype=torch.float16, variant="fp16")
+pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", unet=unet, torch_dtype=torch.float16, variant="fp16")
+
+pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
+pipeline.to("cuda")
+
+prompt = "sushi rolls in the form of panda heads, sushi platter"
+
+image = pipeline(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]
+```
+
+## LoRA
+
+LoRA is a training technique for significantly reducing the number of trainable parameters. As a result, training is faster and it is easier to store the resulting weights because they are a lot smaller (~100MB). Use the [train_lcm_distill_lora_sd_wds.py](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sd_wds.py) or [train_lcm_distill_lora_sdxl_wds.py](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py) script to train with LoRA.
+
+The LoRA training script is discussed in more detail in the [LoRA training](lora) guide.
+
+## Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text-encoder to its architecture. Use the [train_lcm_distill_sdxl_wds.py](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_sdxl_wds.py) script to distill an SDXL model.
+
+The SDXL training script is discussed in more detail in the [SDXL training](sdxl) guide.
+
+## Next steps
+
+Congratulations on distilling an LCM! To learn more about LCMs, the following may be helpful:
+
+- Learn how to use [LCMs for inference](../using-diffusers/lcm) for text-to-image, image-to-image, and with LoRA checkpoints.
+- Read the [SDXL in 4 steps with Latent Consistency LoRAs](https://huggingface.co/blog/lcm_lora) blog post to learn more about SDXL LCM-LoRAs for super fast inference, quality comparisons, benchmarks, and more.
\ No newline at end of file
diff --git a/docs/source/en/training/lora.md b/docs/source/en/training/lora.md
new file mode 100644
index 0000000..78ac8a1
--- /dev/null
+++ b/docs/source/en/training/lora.md
@@ -0,0 +1,231 @@
+
+
+# LoRA
+
+
+
+This is experimental and the API may change in the future.
+
+
+
+[LoRA (Low-Rank Adaptation of Large Language Models)](https://hf.co/papers/2106.09685) is a popular and lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster and more memory-efficient, and it produces smaller model weights (a few hundred MBs) that are easier to store and share. LoRA can also be combined with other training techniques like DreamBooth to speed up training.
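+
+Conceptually, LoRA freezes a pretrained weight matrix and learns a small low-rank update on top of it. Below is a minimal, illustrative sketch of that idea (not the PEFT implementation the training script actually uses); the `rank` and `alpha` hyperparameters mirror the knobs you'll see later in the script:
+
+```py
+import torch
+import torch.nn as nn
+
+class LoRALinear(nn.Module):
+    """A frozen linear layer plus a trainable low-rank update scaling * B(A(x))."""
+
+    def __init__(self, base: nn.Linear, rank: int = 4, alpha: int = 4):
+        super().__init__()
+        self.base = base
+        self.base.requires_grad_(False)  # the pretrained weight stays frozen
+        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # trainable down-projection
+        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # trainable up-projection
+        nn.init.zeros_(self.lora_b.weight)  # start as a no-op so training begins from the base model
+        self.scaling = alpha / rank
+
+    def forward(self, x):
+        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
+
+layer = LoRALinear(nn.Linear(320, 320), rank=4)
+print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2560 - only the low-rank factors train
+```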
+
+
+
+LoRA is very versatile and supported for [DreamBooth](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py), [Kandinsky 2.2](https://github.com/huggingface/diffusers/blob/main/examples/kandinsky2_2/text_to_image/train_text_to_image_lora_decoder.py), [Stable Diffusion XL](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora_sdxl.py), [text-to-image](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py), and [Wuerstchen](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/train_text_to_image_lora_prior.py).
+
+
+
+This guide will explore the [train_text_to_image_lora.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) script to help you become more familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Navigate to the example folder with the training script and install the required dependencies for the script you're using:
+
+
+
+
+```bash
+cd examples/text_to_image
+pip install -r requirements.txt
+```
+
+
+
+
+```bash
+cd examples/text_to_image
+pip install -r requirements_flax.txt
+```
+
+
+
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) and let us know if you have any questions or concerns.
+
+
+
+## Script parameters
+
+The training script has many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L85) function. Default values are provided for most parameters that work pretty well, but you can also set your own values in the training command if you'd like.
+
+For example, to increase the number of epochs to train:
+
+```bash
+accelerate launch train_text_to_image_lora.py \
+ --num_train_epochs=150 \
+```
+
+Many of the basic and important parameters are described in the [Text-to-image](text2image#script-parameters) training guide, so this guide just focuses on the LoRA relevant parameters:
+
+- `--rank`: the inner dimension of the low-rank matrices to train; a higher rank means more trainable parameters
+- `--learning_rate`: the default learning rate is 1e-4, but with LoRA, you can use a higher learning rate
+
+## Training script
+
+The dataset preprocessing code and training loop are found in the [`main()`](https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L371) function, and if you need to adapt the training script, this is where you'll make your changes.
+
+As with the script parameters, a walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. Instead, this guide takes a look at the LoRA relevant parts of the script.
+
+
+
+
+Diffusers uses [`~peft.LoraConfig`] from the [PEFT](https://hf.co/docs/peft) library to set up the parameters of the LoRA adapter such as the rank, alpha, and which modules to insert the LoRA weights into. The adapter is added to the UNet, and only the LoRA layers are filtered for optimization in `lora_layers`.
+
+```py
+unet_lora_config = LoraConfig(
+ r=args.rank,
+ lora_alpha=args.rank,
+ init_lora_weights="gaussian",
+ target_modules=["to_k", "to_q", "to_v", "to_out.0"],
+)
+
+unet.add_adapter(unet_lora_config)
+lora_layers = filter(lambda p: p.requires_grad, unet.parameters())
+```
+
+
+
+
+Diffusers also supports finetuning the text encoder with LoRA from the [PEFT](https://hf.co/docs/peft) library when necessary such as finetuning Stable Diffusion XL (SDXL). The [`~peft.LoraConfig`] is used to configure the parameters of the LoRA adapter which are then added to the text encoder, and only the LoRA layers are filtered for training.
+
+```py
+text_lora_config = LoraConfig(
+ r=args.rank,
+ lora_alpha=args.rank,
+ init_lora_weights="gaussian",
+ target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
+)
+
+text_encoder_one.add_adapter(text_lora_config)
+text_encoder_two.add_adapter(text_lora_config)
+text_lora_parameters_one = list(filter(lambda p: p.requires_grad, text_encoder_one.parameters()))
+text_lora_parameters_two = list(filter(lambda p: p.requires_grad, text_encoder_two.parameters()))
+```
+
+
+
+
+The [optimizer](https://github.com/huggingface/diffusers/blob/e4b8f173b97731686e290b2eb98e7f5df2b1b322/examples/text_to_image/train_text_to_image_lora.py#L529) is initialized with the `lora_layers` because these are the only weights that'll be optimized:
+
+```py
+optimizer = optimizer_cls(
+ lora_layers,
+ lr=args.learning_rate,
+ betas=(args.adam_beta1, args.adam_beta2),
+ weight_decay=args.adam_weight_decay,
+ eps=args.adam_epsilon,
+)
+```
+
+Aside from setting up the LoRA layers, the training script is more or less the same as `train_text_to_image.py`!
+
+## Launch the script
+
+Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! ๐
+
+Let's train on the [Pokรฉmon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate our own Pokรฉmon. Set the environment variables `MODEL_NAME` and `DATASET_NAME` to the model and dataset respectively. You should also specify where to save the model in `OUTPUT_DIR`, and the name of the model to save to on the Hub with `HUB_MODEL_ID`. The script creates and saves the following files to your repository:
+
+- saved model checkpoints
+- `pytorch_lora_weights.safetensors` (the trained LoRA weights)
+
+If you're training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command.
+
+
+
+A full training run takes ~5 hours on a 2080 Ti GPU with 11GB of VRAM.
+
+
+
+```bash
+export MODEL_NAME="runwayml/stable-diffusion-v1-5"
+export OUTPUT_DIR="/sddata/finetune/lora/pokemon"
+export HUB_MODEL_ID="pokemon-lora"
+export DATASET_NAME="lambdalabs/pokemon-blip-captions"
+
+accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --dataset_name=$DATASET_NAME \
+ --dataloader_num_workers=8 \
+ --resolution=512 \
+ --center_crop \
+ --random_flip \
+ --train_batch_size=1 \
+ --gradient_accumulation_steps=4 \
+ --max_train_steps=15000 \
+ --learning_rate=1e-04 \
+ --max_grad_norm=1 \
+ --lr_scheduler="cosine" \
+ --lr_warmup_steps=0 \
+ --output_dir=${OUTPUT_DIR} \
+ --push_to_hub \
+ --hub_model_id=${HUB_MODEL_ID} \
+ --report_to=wandb \
+ --checkpointing_steps=500 \
+ --validation_prompt="A pokemon with blue eyes." \
+ --seed=1337
+```
+
+Once training has been completed, you can use your model for inference:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
+pipeline.load_lora_weights("path/to/lora/model", weight_name="pytorch_lora_weights.safetensors")
+image = pipeline("A pokemon with blue eyes").images[0]
+```
+
+## Next steps
+
+Congratulations on training a new model with LoRA! To learn more about how to use your new model, the following guides may be helpful:
+
+- Learn how to [load different LoRA formats](../using-diffusers/loading_adapters#LoRA) trained using community trainers like Kohya and TheLastBen.
+- Learn how to use and [combine multiple LoRA's](../tutorials/using_peft_for_inference) with PEFT for inference.
diff --git a/docs/source/en/training/overview.md b/docs/source/en/training/overview.md
new file mode 100644
index 0000000..5396afc
--- /dev/null
+++ b/docs/source/en/training/overview.md
@@ -0,0 +1,63 @@
+
+
+# Overview
+
+๐ค Diffusers provides a collection of training scripts for you to train your own diffusion models. You can find all of our training scripts in [diffusers/examples](https://github.com/huggingface/diffusers/tree/main/examples).
+
+Each training script is:
+
+- **Self-contained**: the training script does not depend on any local files, and all packages required to run the script are installed from the `requirements.txt` file.
+- **Easy-to-tweak**: the training scripts are an example of how to train a diffusion model for a specific task and won't work out-of-the-box for every training scenario. You'll likely need to adapt the training script for your specific use-case. To help you with that, we've fully exposed the data preprocessing code and the training loop so you can modify it for your own use.
+- **Beginner-friendly**: the training scripts are designed to be beginner-friendly and easy to understand, rather than including the latest state-of-the-art methods to get the best and most competitive results. Any training methods we consider too complex are purposefully left out.
+- **Single-purpose**: each training script is expressly designed for only one task to keep it readable and understandable.
+
+Our current collection of training scripts include:
+
+| Training | SDXL-support | LoRA-support | Flax-support |
+|---|---|---|---|
+| [unconditional image generation](https://github.com/huggingface/diffusers/tree/main/examples/unconditional_image_generation) [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) | | | |
+| [text-to-image](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image) | ๐ | ๐ | ๐ |
+| [textual inversion](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion) [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb) | | | ๐ |
+| [DreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_dreambooth_training.ipynb) | ๐ | ๐ | ๐ |
+| [ControlNet](https://github.com/huggingface/diffusers/tree/main/examples/controlnet) | ๐ | | ๐ |
+| [InstructPix2Pix](https://github.com/huggingface/diffusers/tree/main/examples/instruct_pix2pix) | ๐ | | |
+| [Custom Diffusion](https://github.com/huggingface/diffusers/tree/main/examples/custom_diffusion) | | | |
+| [T2I-Adapters](https://github.com/huggingface/diffusers/tree/main/examples/t2i_adapter) | ๐ | | |
+| [Kandinsky 2.2](https://github.com/huggingface/diffusers/tree/main/examples/kandinsky2_2/text_to_image) | | ๐ | |
+| [Wuerstchen](https://github.com/huggingface/diffusers/tree/main/examples/wuerstchen/text_to_image) | | ๐ | |
+
+These examples are **actively** maintained, so please feel free to open an issue if they aren't working as expected. If you feel like another training example should be included, you're more than welcome to start a [Feature Request](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=) to discuss your feature idea with us and whether it meets our criteria of being self-contained, easy-to-tweak, beginner-friendly, and single-purpose.
+
+## Install
+
+Make sure you can successfully run the latest versions of the example scripts by installing the library from source in a new virtual environment:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the folder of the training script (for example, [DreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth)) and install the `requirements.txt` file. Some training scripts have a specific requirement file for SDXL, LoRA or Flax. If you're using one of these scripts, make sure you install its corresponding requirements file.
+
+```bash
+cd examples/dreambooth
+pip install -r requirements.txt
+# to train SDXL with DreamBooth
+pip install -r requirements_sdxl.txt
+```
+
+To speed up training and reduce memory usage, we recommend:
+
+- using PyTorch 2.0 or higher to automatically use [scaled dot product attention](../optimization/torch2.0#scaled-dot-product-attention) during training (you don't need to make any changes to the training code)
+- installing [xFormers](../optimization/xformers) to enable memory-efficient attention
\ No newline at end of file
diff --git a/docs/source/en/training/sdxl.md b/docs/source/en/training/sdxl.md
new file mode 100644
index 0000000..5467982
--- /dev/null
+++ b/docs/source/en/training/sdxl.md
@@ -0,0 +1,266 @@
+
+
+# Stable Diffusion XL
+
+
+
+This script is experimental, and it's easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset.
+
+
+
+[Stable Diffusion XL (SDXL)](https://hf.co/papers/2307.01952) is a larger and more powerful iteration of the Stable Diffusion model, capable of producing higher resolution images.
+
+SDXL's UNet is 3x larger and the model adds a second text encoder to the architecture. Depending on the hardware available to you, this can be very computationally intensive and it may not run on a consumer GPU like a Tesla T4. To help fit this larger model into memory and to speed up training, try enabling `gradient_checkpointing`, `mixed_precision`, and `gradient_accumulation_steps`. You can reduce your memory usage even more by enabling memory-efficient attention with [xFormers](../optimization/xformers) and using [bitsandbytes'](https://github.com/TimDettmers/bitsandbytes) 8-bit optimizer.
+
+This guide will explore the [train_text_to_image_sdxl.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_sdxl.py) training script to help you become more familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:
+
+```bash
+cd examples/text_to_image
+pip install -r requirements_sdxl.txt
+```
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+## Script parameters
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_sdxl.py) and let us know if you have any questions or concerns.
+
+
+
+The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L129) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+
+For example, to speed up training with mixed precision using the bf16 format, add the `--mixed_precision` parameter to the training command:
+
+```bash
+accelerate launch train_text_to_image_sdxl.py \
+ --mixed_precision="bf16"
+```
+
+Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so this guide focuses on the parameters that are relevant to training SDXL.
+
+- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify a better [VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)
+- `--proportion_empty_prompts`: the proportion of image prompts to replace with empty strings
+- `--timestep_bias_strategy`: where (earlier vs. later) in the timestep to apply a bias, which can encourage the model to either learn low or high frequency details
+- `--timestep_bias_multiplier`: the weight of the bias to apply to the timestep
+- `--timestep_bias_begin`: the timestep to begin applying the bias
+- `--timestep_bias_end`: the timestep to end applying the bias
+- `--timestep_bias_portion`: the proportion of timesteps to apply the bias to
+
+### Min-SNR weighting
+
+The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting either `epsilon` (noise) or `v_prediction`, and Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.
+
+Add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
+
+```bash
+accelerate launch train_text_to_image_sdxl.py \
+ --snr_gamma=5.0
+```
+
+## Training script
+
+The training script is also similar to the [Text-to-image](text2image#training-script) training guide, but it's been modified to support SDXL training. This guide will focus on the code that is unique to the SDXL training script.
+
+It starts by creating functions to [tokenize the prompts](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L478) to calculate the prompt embeddings, and to compute the image embeddings with the [VAE](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L519). Next, you'll use a function to [generate the timestep weights](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L531) depending on the number of timesteps and the timestep bias strategy to apply.
+
+Within the [`main()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L572) function, in addition to loading a tokenizer, the script loads a second tokenizer and text encoder because the SDXL architecture uses two of each:
+
+```py
+tokenizer_one = AutoTokenizer.from_pretrained(
+ args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision, use_fast=False
+)
+tokenizer_two = AutoTokenizer.from_pretrained(
+ args.pretrained_model_name_or_path, subfolder="tokenizer_2", revision=args.revision, use_fast=False
+)
+
+text_encoder_cls_one = import_model_class_from_model_name_or_path(
+ args.pretrained_model_name_or_path, args.revision
+)
+text_encoder_cls_two = import_model_class_from_model_name_or_path(
+ args.pretrained_model_name_or_path, args.revision, subfolder="text_encoder_2"
+)
+```
+
+The [prompt and image embeddings](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L857) are computed first and kept in memory, which isn't typically an issue for a smaller dataset, but for larger datasets it can lead to memory problems. If this is the case, you should save the pre-computed embeddings to disk separately and load them into memory during the training process (see this [PR](https://github.com/huggingface/diffusers/pull/4505) for more discussion about this topic).
+
+```py
+text_encoders = [text_encoder_one, text_encoder_two]
+tokenizers = [tokenizer_one, tokenizer_two]
+compute_embeddings_fn = functools.partial(
+ encode_prompt,
+ text_encoders=text_encoders,
+ tokenizers=tokenizers,
+ proportion_empty_prompts=args.proportion_empty_prompts,
+ caption_column=args.caption_column,
+)
+
+train_dataset = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint)
+train_dataset = train_dataset.map(
+ compute_vae_encodings_fn,
+ batched=True,
+ batch_size=args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps,
+ new_fingerprint=new_fingerprint_for_vae,
+)
+```
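+
+If memory does become a problem, one hedged workaround (not implemented in the script) is to serialize each shard of precomputed embeddings to disk and load it lazily during training; the tensor shapes and file name below are purely illustrative:
+
+```py
+import torch
+
+# Stand-ins for a shard of precomputed SDXL prompt embeddings (a batch of 16 captions).
+prompt_embeds = torch.randn(16, 77, 2048)
+pooled_prompt_embeds = torch.randn(16, 1280)
+torch.save(
+    {"prompt_embeds": prompt_embeds, "pooled_prompt_embeds": pooled_prompt_embeds},
+    "embeddings_shard_00000.pt",
+)
+
+# Later, inside the training loop, load only the shard you need instead of keeping everything in memory.
+shard = torch.load("embeddings_shard_00000.pt", map_location="cpu")
+```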
+
+After calculating the embeddings, the text encoder, VAE, and tokenizer are deleted to free up some memory:
+
+```py
+del text_encoders, tokenizers, vae
+gc.collect()
+torch.cuda.empty_cache()
+```
+
+Finally, the [training loop](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L943) takes care of the rest. If you chose to apply a timestep bias strategy, you'll see the timestep weights are calculated and used to sample the timesteps at which noise is added:
+
+```py
+weights = generate_timestep_weights(args, noise_scheduler.config.num_train_timesteps).to(
+    model_input.device
+)
+timesteps = torch.multinomial(weights, bsz, replacement=True).long()
+
+noisy_model_input = noise_scheduler.add_noise(model_input, noise, timesteps)
+```
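+
+As a rough illustration of what a "later" timestep bias does (a hedged sketch, not the script's `generate_timestep_weights` implementation; the values are made up), the sampling weights for the biased portion of timesteps are simply scaled up before `torch.multinomial` draws from them:
+
+```py
+import torch
+
+num_train_timesteps = 1000
+bias_multiplier = 2.0  # illustrative --timestep_bias_multiplier
+bias_portion = 0.25    # illustrative --timestep_bias_portion
+
+weights = torch.ones(num_train_timesteps)
+begin = int((1 - bias_portion) * num_train_timesteps)
+weights[begin:] *= bias_multiplier  # upweight the last 25% of (noisier) timesteps
+weights /= weights.sum()
+
+timesteps = torch.multinomial(weights, num_samples=4, replacement=True).long()
+```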
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Once youโve made all your changes or youโre okay with the default configuration, youโre ready to launch the training script! ๐
+
+Letโs train on the [Pokรฉmon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate your own Pokรฉmon. Set the environment variables `MODEL_NAME` and `DATASET_NAME` to the model and the dataset (either from the Hub or a local path). You should also specify a VAE other than the SDXL VAE (either from the Hub or a local path) with `VAE_NAME` to avoid numerical instabilities.
+
+
+
+To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. Youโll also need to add the `--validation_prompt` and `--validation_epochs` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results.
+
+
+
+```bash
+export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
+export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
+export DATASET_NAME="lambdalabs/pokemon-blip-captions"
+
+accelerate launch train_text_to_image_sdxl.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --pretrained_vae_model_name_or_path=$VAE_NAME \
+ --dataset_name=$DATASET_NAME \
+ --enable_xformers_memory_efficient_attention \
+ --resolution=512 \
+ --center_crop \
+ --random_flip \
+ --proportion_empty_prompts=0.2 \
+ --train_batch_size=1 \
+ --gradient_accumulation_steps=4 \
+ --gradient_checkpointing \
+ --max_train_steps=10000 \
+ --use_8bit_adam \
+ --learning_rate=1e-06 \
+ --lr_scheduler="constant" \
+ --lr_warmup_steps=0 \
+ --mixed_precision="fp16" \
+ --report_to="wandb" \
+ --validation_prompt="a cute Sundar Pichai creature" \
+ --validation_epochs 5 \
+ --checkpointing_steps=5000 \
+ --output_dir="sdxl-pokemon-model" \
+ --push_to_hub
+```
+
+After you've finished training, you can use your newly trained SDXL model for inference!
+
+
+
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained("path/to/your/model", torch_dtype=torch.float16).to("cuda")
+
+prompt = "A pokemon with green eyes and red legs."
+image = pipeline(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
+image.save("pokemon.png")
+```
+
+
+
+
+[PyTorch XLA](https://pytorch.org/xla) allows you to run PyTorch on XLA devices such as TPUs, which can be faster. The initial warmup step takes longer because the model needs to be compiled and optimized. However, subsequent calls to the pipeline on an input **with the same length** as the original prompt are much faster because it can reuse the optimized graph.
+
+```py
+from diffusers import DiffusionPipeline
+from time import time
+import torch
+import torch_xla.core.xla_model as xm
+
+device = xm.xla_device()
+pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to(device)
+
+prompt = "A pokemon with green eyes and red legs."
+inference_steps = 30  # illustrative number of inference steps
+
+start = time()
+image = pipeline(prompt, num_inference_steps=inference_steps).images[0]
+print(f'Compilation time is {time()-start} sec')
+image.save("pokemon.png")
+
+start = time()
+image = pipeline(prompt, num_inference_steps=inference_steps).images[0]
+print(f'Inference time is {time()-start} sec after compilation')
+```
+
+
+
+
+## Next steps
+
+Congratulations on training an SDXL model! To learn more about how to use your new model, the following guides may be helpful:
+
+- Read the [Stable Diffusion XL](../using-diffusers/sdxl) guide to learn how to use it for a variety of different tasks (text-to-image, image-to-image, inpainting), how to use its refiner model, and the different types of micro-conditionings.
+- Check out the [DreamBooth](dreambooth) and [LoRA](lora) training guides to learn how to train a personalized SDXL model with just a few example images. These two training techniques can even be combined!
\ No newline at end of file
diff --git a/docs/source/en/training/t2i_adapters.md b/docs/source/en/training/t2i_adapters.md
new file mode 100644
index 0000000..4e19dac
--- /dev/null
+++ b/docs/source/en/training/t2i_adapters.md
@@ -0,0 +1,227 @@
+
+
+# T2I-Adapter
+
+[T2I-Adapter](https://hf.co/papers/2302.08453) is a lightweight adapter model that provides an additional conditioning input image (line art, canny, sketch, depth, pose) to better control image generation. It is similar to a ControlNet, but it is a lot smaller (~77M parameters and ~300MB file size) because it only inserts weights into the UNet instead of copying and training it.
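+
+For instance, a canny conditioning image can be produced from any source image with OpenCV (a hedged, illustrative snippet; the file names and thresholds are placeholders and OpenCV is not a dependency of the training script):
+
+```py
+import cv2
+import numpy as np
+from PIL import Image
+
+image = np.array(Image.open("input.png").convert("RGB"))
+gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
+edges = cv2.Canny(gray, 100, 200)  # low/high thresholds are illustrative
+conditioning_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 3-channel conditioning image
+conditioning_image.save("conditioning_image.png")
+```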
+
+The T2I-Adapter is only available for training with the Stable Diffusion XL (SDXL) model.
+
+This guide will explore the [train_t2i_adapter_sdxl.py](https://github.com/huggingface/diffusers/blob/main/examples/t2i_adapter/train_t2i_adapter_sdxl.py) training script to help you become familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:
+
+```bash
+cd examples/t2i_adapter
+pip install -r requirements.txt
+```
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/t2i_adapter/train_t2i_adapter_sdxl.py) and let us know if you have any questions or concerns.
+
+
+
+## Script parameters
+
+The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L233) function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+
+For example, to activate gradient accumulation, add the `--gradient_accumulation_steps` parameter to the training command:
+
+```bash
+accelerate launch train_t2i_adapter_sdxl.py \
+  --gradient_accumulation_steps=4
+```
+
+Many of the basic and important parameters are described in the [Text-to-image](text2image#script-parameters) training guide, so this guide just focuses on the relevant T2I-Adapter parameters:
+
+- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify a better [VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)
+- `--crops_coords_top_left_h` and `--crops_coords_top_left_w`: height and width coordinates to include in SDXL's crop coordinate embeddings
+- `--conditioning_image_column`: the column of the conditioning images in the dataset
+- `--proportion_empty_prompts`: the proportion of image prompts to replace with empty strings
+
+## Training script
+
+As with the script parameters, a walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. Instead, this guide takes a look at the T2I-Adapter relevant parts of the script.
+
+The training script begins by preparing the dataset. This includes [tokenizing](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L674) the prompt and [applying transforms](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L714) to the images and conditioning images.
+
+```py
+conditioning_image_transforms = transforms.Compose(
+ [
+ transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
+ transforms.CenterCrop(args.resolution),
+ transforms.ToTensor(),
+ ]
+)
+```
+
+Within the [`main()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L770) function, the T2I-Adapter is either loaded from a pretrained adapter or it is randomly initialized:
+
+```py
+if args.adapter_model_name_or_path:
+ logger.info("Loading existing adapter weights.")
+ t2iadapter = T2IAdapter.from_pretrained(args.adapter_model_name_or_path)
+else:
+ logger.info("Initializing t2iadapter weights.")
+ t2iadapter = T2IAdapter(
+ in_channels=3,
+ channels=(320, 640, 1280, 1280),
+ num_res_blocks=2,
+ downscale_factor=16,
+ adapter_type="full_adapter_xl",
+ )
+```
+
+The [optimizer](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L952) is initialized for the T2I-Adapter parameters:
+
+```py
+params_to_optimize = t2iadapter.parameters()
+optimizer = optimizer_class(
+ params_to_optimize,
+ lr=args.learning_rate,
+ betas=(args.adam_beta1, args.adam_beta2),
+ weight_decay=args.adam_weight_decay,
+ eps=args.adam_epsilon,
+)
+```
+
+Lastly, in the [training loop](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L1086), the adapter conditioning image and the text embeddings are passed to the UNet to predict the noise residual:
+
+```py
+t2iadapter_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype)
+down_block_additional_residuals = t2iadapter(t2iadapter_image)
+down_block_additional_residuals = [
+ sample.to(dtype=weight_dtype) for sample in down_block_additional_residuals
+]
+
+model_pred = unet(
+ inp_noisy_latents,
+ timesteps,
+ encoder_hidden_states=batch["prompt_ids"],
+ added_cond_kwargs=batch["unet_added_conditions"],
+ down_block_additional_residuals=down_block_additional_residuals,
+).sample
+```
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Now youโre ready to launch the training script! ๐
+
+For this example training, you'll use the [fusing/fill50k](https://huggingface.co/datasets/fusing/fill50k) dataset. You can also create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide).
+
+Set the environment variable `MODEL_DIR` to a model id on the Hub or a path to a local model and `OUTPUT_DIR` to where you want to save the model.
+
+Download the following images to condition your training with:
+
+```bash
+wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
+wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png
+```
+
+
+
+To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. You'll also need to add the `--validation_image`, `--validation_prompt`, and `--validation_steps` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results.
+
+
+
+```bash
+export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
+export OUTPUT_DIR="path to save model"
+
+accelerate launch train_t2i_adapter_sdxl.py \
+ --pretrained_model_name_or_path=$MODEL_DIR \
+ --output_dir=$OUTPUT_DIR \
+ --dataset_name=fusing/fill50k \
+ --mixed_precision="fp16" \
+ --resolution=1024 \
+ --learning_rate=1e-5 \
+ --max_train_steps=15000 \
+ --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
+ --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
+ --validation_steps=100 \
+ --train_batch_size=1 \
+ --gradient_accumulation_steps=4 \
+ --report_to="wandb" \
+ --seed=42 \
+ --push_to_hub
+```
+
+Once training is complete, you can use your T2I-Adapter for inference:
+
+```py
+from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, EulerAncestralDiscreteScheduler
+from diffusers.utils import load_image
+import torch
+
+adapter = T2IAdapter.from_pretrained("path/to/adapter", torch_dtype=torch.float16)
+pipeline = StableDiffusionXLAdapterPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", adapter=adapter, torch_dtype=torch.float16
+)
+
+pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config)
+pipeline.enable_xformers_memory_efficient_attention()
+pipeline.enable_model_cpu_offload()
+
+control_image = load_image("./conditioning_image_1.png")
+prompt = "pale golden rod circle with old lace background"
+
+generator = torch.manual_seed(0)
+image = pipeline(
+ prompt, image=control_image, generator=generator
+).images[0]
+image.save("./output.png")
+```
+
+## Next steps
+
+Congratulations on training a T2I-Adapter model! ๐ To learn more:
+
+- Read the [Efficient Controllable Generation for SDXL with T2I-Adapters](https://huggingface.co/blog/t2i-sdxl-adapters) blog post to learn more details about the experimental results from the T2I-Adapter team.
diff --git a/docs/source/en/training/text2image.md b/docs/source/en/training/text2image.md
new file mode 100644
index 0000000..8795e97
--- /dev/null
+++ b/docs/source/en/training/text2image.md
@@ -0,0 +1,275 @@
+
+
+# Text-to-image
+
+
+
+The text-to-image script is experimental, and it's easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset.
+
+
+
+Text-to-image models like Stable Diffusion are conditioned to generate images given a text prompt.
+
+Training a model can be taxing on your hardware, but if you enable `gradient_checkpointing` and `mixed_precision`, it is possible to train a model on a single 24GB GPU. If you're training with larger batch sizes or want to train faster, it's better to use GPUs with more than 30GB of memory. You can reduce your memory footprint by enabling memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing, gradient accumulation or xFormers. A GPU with at least 30GB of memory or a TPU v3 is recommended for training with Flax.
+
+This guide will explore the [train_text_to_image.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) training script to help you become familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:
+
+
+
+```bash
+cd examples/text_to_image
+pip install -r requirements.txt
+```
+
+
+```bash
+cd examples/text_to_image
+pip install -r requirements_flax.txt
+```
+
+
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+## Script parameters
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) and let us know if you have any questions or concerns.
+
+
+
+The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L193) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+
+For example, to speed up training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command:
+
+```bash
+accelerate launch train_text_to_image.py \
+ --mixed_precision="fp16"
+```
+
+Some basic and important parameters include:
+
+- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
+- `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on
+- `--image_column`: the name of the image column in the dataset to train on
+- `--caption_column`: the name of the text column in the dataset to train on
+- `--output_dir`: where to save the trained model
+- `--push_to_hub`: whether to push the trained model to the Hub
+- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful because if training is interrupted for any reason, you can continue from that checkpoint by adding `--resume_from_checkpoint` to your training command
+
+### Min-SNR weighting
+
+The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, and Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.
+
+Add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
+
+```bash
+accelerate launch train_text_to_image.py \
+ --snr_gamma=5.0
+```
+
+You can compare the loss surfaces for different `snr_gamma` values in this [Weights and Biases](https://wandb.ai/sayakpaul/text2image-finetune-minsnr) report. For smaller datasets, the effects of Min-SNR may not be as obvious compared to larger datasets.
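+
+For intuition, here is a minimal sketch of the weighting (assuming `epsilon` prediction; this is not the script's exact `compute_snr`-based code): each sampled timestep's loss is scaled by `min(SNR, gamma) / SNR`, which caps the influence of very high-SNR timesteps.
+
+```py
+import torch
+
+def min_snr_weights(snr: torch.Tensor, snr_gamma: float = 5.0) -> torch.Tensor:
+    # Cap the signal-to-noise ratio at gamma before normalizing by the true SNR.
+    return torch.clamp(snr, max=snr_gamma) / snr
+
+snr = torch.tensor([0.5, 1.0, 5.0, 20.0, 100.0])  # illustrative SNR values at sampled timesteps
+print(min_snr_weights(snr))  # tensor([1.0000, 1.0000, 1.0000, 0.2500, 0.0500])
+```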
+
+## Training script
+
+The dataset preprocessing code and training loop are found in the [`main()`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L490) function. If you need to adapt the training script, this is where you'll need to make your changes.
+
+The `train_text_to_image` script starts by [loading a scheduler](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L543) and tokenizer. You can choose to use a different scheduler here if you want:
+
+```py
+noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
+tokenizer = CLIPTokenizer.from_pretrained(
+ args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision
+)
+```
+
+Then the script [loads the UNet](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L619) model:
+
+```py
+load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet")
+model.register_to_config(**load_model.config)
+
+model.load_state_dict(load_model.state_dict())
+```
+
+Next, the text and image columns of the dataset need to be preprocessed. The [`tokenize_captions`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L724) function handles tokenizing the inputs, and the [`train_transforms`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L742) function specifies the type of transforms to apply to the image. Both of these functions are bundled into `preprocess_train`:
+
+```py
+def preprocess_train(examples):
+ images = [image.convert("RGB") for image in examples[image_column]]
+ examples["pixel_values"] = [train_transforms(image) for image in images]
+ examples["input_ids"] = tokenize_captions(examples)
+ return examples
+```
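+
+The `tokenize_captions` function itself isn't shown above. As a rough sketch (assuming `tokenizer` and `caption_column` are defined; the real function also handles captions stored as lists), it boils down to a standard CLIP tokenizer call:
+
+```py
+def tokenize_captions(examples):
+    captions = list(examples[caption_column])
+    inputs = tokenizer(
+        captions,
+        max_length=tokenizer.model_max_length,
+        padding="max_length",
+        truncation=True,
+        return_tensors="pt",
+    )
+    return inputs.input_ids
+```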
+
+Lastly, the [training loop](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L878) handles everything else. It encodes images into latent space, adds noise to the latents, computes the text embeddings to condition on, updates the model parameters, and saves and pushes the model to the Hub. If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
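+
+To make that concrete, a heavily simplified, hypothetical version of a single training step looks roughly like this (a sketch, not the script verbatim, assuming `vae`, `unet`, `text_encoder`, `noise_scheduler`, and a `batch` from the dataloader are already prepared):
+
+```py
+import torch
+import torch.nn.functional as F
+
+# encode the images into latent space and scale the latents
+latents = vae.encode(batch["pixel_values"]).latent_dist.sample() * vae.config.scaling_factor
+
+# add noise to the latents at random timesteps (forward diffusion)
+noise = torch.randn_like(latents)
+timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device)
+noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
+
+# condition on the text embeddings and predict the noise residual
+encoder_hidden_states = text_encoder(batch["input_ids"])[0]
+model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
+loss = F.mse_loss(model_pred.float(), noise.float())
+```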
+
+## Launch the script
+
+Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! ๐
+
+
+
+
+Let's train on the [Pokรฉmon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate your own Pokรฉmon. Set the environment variables `MODEL_NAME` and `dataset_name` to the model and the dataset (either from the Hub or a local path). If you're training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command.
+
+
+
+To train on a local dataset, set the `TRAIN_DIR` and `OUTPUT_DIR` environment variables to the path of the dataset and where to save the model to.
+
+
+
+```bash
+export MODEL_NAME="runwayml/stable-diffusion-v1-5"
+export dataset_name="lambdalabs/pokemon-blip-captions"
+
+accelerate launch --mixed_precision="fp16" train_text_to_image.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --dataset_name=$dataset_name \
+ --use_ema \
+ --resolution=512 --center_crop --random_flip \
+ --train_batch_size=1 \
+ --gradient_accumulation_steps=4 \
+ --gradient_checkpointing \
+ --max_train_steps=15000 \
+ --learning_rate=1e-05 \
+ --max_grad_norm=1 \
+  --enable_xformers_memory_efficient_attention \
+ --lr_scheduler="constant" --lr_warmup_steps=0 \
+ --output_dir="sd-pokemon-model" \
+ --push_to_hub
+```
+
+
+
+
+Training with Flax can be faster on TPUs and GPUs thanks to [@duongna21](https://github.com/duongna21). Flax is more efficient on a TPU, but GPU performance is also great.
+
+Set the environment variables `MODEL_NAME` and `dataset_name` to the model and the dataset (either from the Hub or a local path).
+
+
+
+To train on a local dataset, set the `TRAIN_DIR` and `OUTPUT_DIR` environment variables to the path of the dataset and where to save the model to.
+
+
+
+```bash
+export MODEL_NAME="runwayml/stable-diffusion-v1-5"
+export dataset_name="lambdalabs/pokemon-blip-captions"
+
+python train_text_to_image_flax.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --dataset_name=$dataset_name \
+ --resolution=512 --center_crop --random_flip \
+ --train_batch_size=1 \
+ --max_train_steps=15000 \
+ --learning_rate=1e-05 \
+ --max_grad_norm=1 \
+ --output_dir="sd-pokemon-model" \
+ --push_to_hub
+```
+
+
+
+
+Once training is complete, you can use your newly trained model for inference:
+
+
+
+
+```py
+from diffusers import StableDiffusionPipeline
+import torch
+
+pipeline = StableDiffusionPipeline.from_pretrained("path/to/saved_model", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+
+image = pipeline(prompt="yoda").images[0]
+image.save("yoda-pokemon.png")
+```
+
+
+
+
+```py
+import jax
+import numpy as np
+from flax.jax_utils import replicate
+from flax.training.common_utils import shard
+from diffusers import FlaxStableDiffusionPipeline
+
+pipeline, params = FlaxStableDiffusionPipeline.from_pretrained("path/to/saved_model", dtype=jax.numpy.bfloat16)
+
+prompt = "yoda pokemon"
+prng_seed = jax.random.PRNGKey(0)
+num_inference_steps = 50
+
+num_samples = jax.device_count()
+prompt = num_samples * [prompt]
+prompt_ids = pipeline.prepare_inputs(prompt)
+
+# shard inputs and rng
+params = replicate(params)
+prng_seed = jax.random.split(prng_seed, jax.device_count())
+prompt_ids = shard(prompt_ids)
+
+images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
+images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
+images[0].save("yoda-pokemon.png")
+```
+
+
+
+
+## Next steps
+
+Congratulations on training your own text-to-image model! To learn more about how to use your new model, the following guides may be helpful:
+
+- Learn how to [load LoRA weights](../using-diffusers/loading_adapters#LoRA) for inference if you trained your model with LoRA.
+- Learn more about how certain parameters like guidance scale or techniques such as prompt weighting can help you control inference in the [Text-to-image](../using-diffusers/conditional_image_generation) task guide.
diff --git a/docs/source/en/training/text_inversion.md b/docs/source/en/training/text_inversion.md
new file mode 100644
index 0000000..49219c0
--- /dev/null
+++ b/docs/source/en/training/text_inversion.md
@@ -0,0 +1,298 @@
+
+
+# Textual Inversion
+
+[Textual Inversion](https://hf.co/papers/2208.01618) is a training technique for personalizing image generation models with just a few example images of what you want it to learn. This technique works by learning and updating the text embeddings (the new embeddings are tied to a special word you must use in the prompt) to match the example images you provide.
+
+If you're training on a GPU with limited vRAM, you should try enabling the `gradient_checkpointing` and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing or xFormers. With the same configuration and setup as PyTorch, the Flax training script should be at least ~70% faster!
+
+This guide will explore the [textual_inversion.py](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py) script to help you become more familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Navigate to the example folder with the training script and install the required dependencies for the script you're using:
+
+
+
+
+```bash
+cd examples/textual_inversion
+pip install -r requirements.txt
+```
+
+
+
+
+```bash
+cd examples/textual_inversion
+pip install -r requirements_flax.txt
+```
+
+
+
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py) and let us know if you have any questions or concerns.
+
+
+
+## Script parameters
+
+The training script has many parameters to help you tailor the training run to your needs. All of the parameters and their descriptions are listed in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/839c2a5ece0af4e75530cb520d77bc7ed8acf474/examples/textual_inversion/textual_inversion.py#L176) function. Where applicable, Diffusers provides default values for each parameter such as the training batch size and learning rate, but feel free to change these values in the training command if you'd like.
+
+For example, to increase the number of gradient accumulation steps above the default value of 1:
+
+```bash
+accelerate launch textual_inversion.py \
+ --gradient_accumulation_steps=4
+```
+
+Some other basic and important parameters to specify include:
+
+- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
+- `--train_data_dir`: path to a folder containing the training dataset (example images)
+- `--output_dir`: where to save the trained model
+- `--push_to_hub`: whether to push the trained model to the Hub
+- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful because if training is interrupted for some reason, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command
+- `--num_vectors`: the number of vectors to learn the embeddings with; increasing this parameter helps the model learn better but it comes with increased training costs
+- `--placeholder_token`: the special word to tie the learned embeddings to (you must use the word in your prompt for inference)
+- `--initializer_token`: a single word that roughly describes the object or style you're trying to train on
+- `--learnable_property`: whether you're training the model to learn a new "style" (for example, Van Gogh's painting style) or "object" (for example, your dog)
+
+## Training script
+
+Unlike some of the other training scripts, textual_inversion.py has a custom dataset class, [`TextualInversionDataset`](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L487), for creating a dataset. You can customize the image size, placeholder token, interpolation method, whether to crop the image, and more. If you need to change how the dataset is created, you can modify `TextualInversionDataset`.
+
+Next, you'll find the dataset preprocessing code and training loop in the [`main()`](https://github.com/huggingface/diffusers/blob/839c2a5ece0af4e75530cb520d77bc7ed8acf474/examples/textual_inversion/textual_inversion.py#L573) function.
+
+The script starts by loading the [tokenizer](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L616), [scheduler and model](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L622):
+
+```py
+# Load tokenizer
+if args.tokenizer_name:
+ tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_name)
+elif args.pretrained_model_name_or_path:
+ tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer")
+
+# Load scheduler and models
+noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
+text_encoder = CLIPTextModel.from_pretrained(
+ args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision
+)
+vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision)
+unet = UNet2DConditionModel.from_pretrained(
+ args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision
+)
+```
+
+The special [placeholder token](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L632) is then added to the tokenizer, and the embedding is resized to account for the new token.
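+
+As a rough illustration (a sketch assuming `tokenizer` and `text_encoder` are already loaded and using the `<cat-toy>`/`toy` values from the example below; not the script verbatim), this step looks like:
+
+```py
+# register the placeholder token and grow the text encoder's embedding matrix
+tokenizer.add_tokens(["<cat-toy>"])
+placeholder_id = tokenizer.convert_tokens_to_ids("<cat-toy>")
+initializer_id = tokenizer.encode("toy", add_special_tokens=False)[0]
+
+text_encoder.resize_token_embeddings(len(tokenizer))
+
+# initialize the new embedding from the initializer token's embedding
+token_embeds = text_encoder.get_input_embeddings().weight.data
+token_embeds[placeholder_id] = token_embeds[initializer_id]
+```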
+
+Then, the script [creates a dataset](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L716) from the `TextualInversionDataset`:
+
+```py
+train_dataset = TextualInversionDataset(
+ data_root=args.train_data_dir,
+ tokenizer=tokenizer,
+ size=args.resolution,
+ placeholder_token=(" ".join(tokenizer.convert_ids_to_tokens(placeholder_token_ids))),
+ repeats=args.repeats,
+ learnable_property=args.learnable_property,
+ center_crop=args.center_crop,
+ set="train",
+)
+train_dataloader = torch.utils.data.DataLoader(
+ train_dataset, batch_size=args.train_batch_size, shuffle=True, num_workers=args.dataloader_num_workers
+)
+```
+
+Finally, the [training loop](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L784) handles everything else from predicting the noisy residual to updating the embedding weights of the special placeholder token.
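+
+The trick that keeps every other token frozen is, roughly, to restore all non-placeholder rows of the embedding matrix from a saved copy after each optimizer step. A simplified sketch (assuming `tokenizer`, `text_encoder`, `placeholder_token_ids`, and a frozen copy `orig_embeds_params` of the original embeddings are defined):
+
+```py
+import torch
+
+# mask selecting every token except the learned placeholder token(s)
+index_no_updates = torch.ones((len(tokenizer),), dtype=torch.bool)
+index_no_updates[min(placeholder_token_ids) : max(placeholder_token_ids) + 1] = False
+
+# after optimizer.step(), overwrite the frozen rows with their original values
+with torch.no_grad():
+    text_encoder.get_input_embeddings().weight[index_no_updates] = orig_embeds_params[index_no_updates]
+```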
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! ๐
+
+For this guide, you'll download some images of a [cat toy](https://huggingface.co/datasets/diffusers/cat_toy_example) and store them in a directory. But remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide).
+
+```py
+from huggingface_hub import snapshot_download
+
+local_dir = "./cat"
+snapshot_download(
+ "diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes"
+)
+```
+
+Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, and `DATA_DIR` to the path where you just downloaded the cat images to. The script creates and saves the following files to your repository:
+
+- `learned_embeds.bin`: the learned embedding vectors corresponding to your example images
+- `token_identifier.txt`: the special placeholder token
+- `type_of_concept.txt`: the type of concept you're training on (either "object" or "style")
+
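+If you're curious about the result, `learned_embeds.bin` is simply a dictionary mapping the placeholder token to its trained embedding vector(s), so you can inspect it with `torch.load` (an optional, illustrative check; the path assumes the `--output_dir` used in the command below):
+
+```py
+import torch
+
+learned_embeds = torch.load("textual_inversion_cat/learned_embeds.bin")
+for token, embedding in learned_embeds.items():
+    print(token, tuple(embedding.shape))  # the learned vector(s) for the placeholder token (hidden size 768 for Stable Diffusion v1-5)
+```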
+
+
+A full training run takes ~1 hour on a single V100 GPU.
+
+
+
+One more thing before you launch the script. If you're interested in following along with the training process, you can periodically save generated images as training progresses. Add the following parameters to the training command:
+
+```bash
+--validation_prompt="A <cat-toy> train"
+--num_validation_images=4
+--validation_steps=100
+```
+
+
+
+
+```bash
+export MODEL_NAME="runwayml/stable-diffusion-v1-5"
+export DATA_DIR="./cat"
+
+accelerate launch textual_inversion.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --train_data_dir=$DATA_DIR \
+ --learnable_property="object" \
+  --placeholder_token="<cat-toy>" \
+ --initializer_token="toy" \
+ --resolution=512 \
+ --train_batch_size=1 \
+ --gradient_accumulation_steps=4 \
+ --max_train_steps=3000 \
+ --learning_rate=5.0e-04 \
+ --scale_lr \
+ --lr_scheduler="constant" \
+ --lr_warmup_steps=0 \
+ --output_dir="textual_inversion_cat" \
+ --push_to_hub
+```
+
+
+
+
+```bash
+export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
+export DATA_DIR="./cat"
+
+python textual_inversion_flax.py \
+ --pretrained_model_name_or_path=$MODEL_NAME \
+ --train_data_dir=$DATA_DIR \
+ --learnable_property="object" \
+  --placeholder_token="<cat-toy>" \
+ --initializer_token="toy" \
+ --resolution=512 \
+ --train_batch_size=1 \
+ --max_train_steps=3000 \
+ --learning_rate=5.0e-04 \
+ --scale_lr \
+ --output_dir="textual_inversion_cat" \
+ --push_to_hub
+```
+
+
+
+
+After training is complete, you can use your newly trained model for inference:
+
+
+
+
+```py
+from diffusers import StableDiffusionPipeline
+import torch
+
+pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
+pipeline.load_textual_inversion("sd-concepts-library/cat-toy")
+image = pipeline("A <cat-toy> train", num_inference_steps=50).images[0]
+image.save("cat-train.png")
+```
+
+
+
+
+Flax doesn't support the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] method, but the textual_inversion_flax.py script [saves](https://github.com/huggingface/diffusers/blob/c0f058265161178f2a88849e92b37ffdc81f1dcc/examples/textual_inversion/textual_inversion_flax.py#L636C2-L636C2) the learned embeddings as a part of the model after training. This means you can use the model for inference like any other Flax model:
+
+```py
+import jax
+import numpy as np
+from flax.jax_utils import replicate
+from flax.training.common_utils import shard
+from diffusers import FlaxStableDiffusionPipeline
+
+model_path = "path-to-your-trained-model"
+pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16)
+
+prompt = "A <cat-toy> train"
+prng_seed = jax.random.PRNGKey(0)
+num_inference_steps = 50
+
+num_samples = jax.device_count()
+prompt = num_samples * [prompt]
+prompt_ids = pipeline.prepare_inputs(prompt)
+
+# shard inputs and rng
+params = replicate(params)
+prng_seed = jax.random.split(prng_seed, jax.device_count())
+prompt_ids = shard(prompt_ids)
+
+images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
+images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
+images[0].save("cat-train.png")
+```
+
+
+
+
+## Next steps
+
+Congratulations on training your own Textual Inversion model! ๐ To learn more about how to use your new model, the following guides may be helpful:
+
+- Learn how to [load Textual Inversion embeddings](../using-diffusers/loading_adapters) and also use them as negative embeddings.
+- Learn how to use [Textual Inversion](textual_inversion_inference) for inference with Stable Diffusion 1/2 and Stable Diffusion XL.
\ No newline at end of file
diff --git a/docs/source/en/training/unconditional_training.md b/docs/source/en/training/unconditional_training.md
new file mode 100644
index 0000000..3967422
--- /dev/null
+++ b/docs/source/en/training/unconditional_training.md
@@ -0,0 +1,207 @@
+
+
+# Unconditional image generation
+
+Unconditional image generation models are not conditioned on text or images during training. Instead, they only generate images that resemble the distribution of their training data.
+
+This guide will explore the [train_unconditional.py](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py) training script to help you become familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the example folder containing the training script and install the required dependencies:
+
+```bash
+cd examples/unconditional_image_generation
+pip install -r requirements.txt
+```
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+## Script parameters
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py) and let us know if you have any questions or concerns.
+
+
+
+The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L55) function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+
+For example, to speed up training with mixed precision using the bf16 format, add the `--mixed_precision` parameter to the training command:
+
+```bash
+accelerate launch train_unconditional.py \
+ --mixed_precision="bf16"
+```
+
+Some basic and important parameters to specify include:
+
+- `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on
+- `--output_dir`: where to save the trained model
+- `--push_to_hub`: whether to push the trained model to the Hub
+- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful because if training is interrupted, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command
+
+Bring your dataset, and let the training script handle everything else!
+
+## Training script
+
+The code for preprocessing the dataset and the training loop is found in the [`main()`](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L275) function. If you need to adapt the training script, this is where you'll need to make your changes.
+
+The `train_unconditional` script [initializes a `UNet2DModel`](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L356) if you don't provide a model configuration. You can configure the UNet here if you'd like:
+
+```py
+model = UNet2DModel(
+ sample_size=args.resolution,
+ in_channels=3,
+ out_channels=3,
+ layers_per_block=2,
+ block_out_channels=(128, 128, 256, 256, 512, 512),
+ down_block_types=(
+ "DownBlock2D",
+ "DownBlock2D",
+ "DownBlock2D",
+ "DownBlock2D",
+ "AttnDownBlock2D",
+ "DownBlock2D",
+ ),
+ up_block_types=(
+ "UpBlock2D",
+ "AttnUpBlock2D",
+ "UpBlock2D",
+ "UpBlock2D",
+ "UpBlock2D",
+ "UpBlock2D",
+ ),
+)
+```
+
+Next, the script initializes a [scheduler](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L418) and [optimizer](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L429):
+
+```py
+# Initialize the scheduler
+accepts_prediction_type = "prediction_type" in set(inspect.signature(DDPMScheduler.__init__).parameters.keys())
+if accepts_prediction_type:
+ noise_scheduler = DDPMScheduler(
+ num_train_timesteps=args.ddpm_num_steps,
+ beta_schedule=args.ddpm_beta_schedule,
+ prediction_type=args.prediction_type,
+ )
+else:
+ noise_scheduler = DDPMScheduler(num_train_timesteps=args.ddpm_num_steps, beta_schedule=args.ddpm_beta_schedule)
+
+# Initialize the optimizer
+optimizer = torch.optim.AdamW(
+ model.parameters(),
+ lr=args.learning_rate,
+ betas=(args.adam_beta1, args.adam_beta2),
+ weight_decay=args.adam_weight_decay,
+ eps=args.adam_epsilon,
+)
+```
+
+Then it [loads a dataset](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L451) and you can specify how to [preprocess](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L455) it:
+
+```py
+dataset = load_dataset("imagefolder", data_dir=args.train_data_dir, cache_dir=args.cache_dir, split="train")
+
+augmentations = transforms.Compose(
+ [
+ transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
+ transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
+ transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
+ transforms.ToTensor(),
+ transforms.Normalize([0.5], [0.5]),
+ ]
+)
+```
+
+Finally, the [training loop](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L540) handles everything else such as adding noise to the images, predicting the noise residual, calculating the loss, saving checkpoints at specified steps, and saving and pushing the model to the Hub. If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! ๐
+
+
+
+A full training run takes 2 hours on 4xV100 GPUs.
+
+
+
+
+
+
+```bash
+accelerate launch train_unconditional.py \
+ --dataset_name="huggan/flowers-102-categories" \
+ --output_dir="ddpm-ema-flowers-64" \
+ --mixed_precision="fp16" \
+ --push_to_hub
+```
+
+
+
+
+If you're training with more than one GPU, add the `--multi_gpu` parameter to the training command:
+
+```bash
+accelerate launch --multi_gpu train_unconditional.py \
+ --dataset_name="huggan/flowers-102-categories" \
+ --output_dir="ddpm-ema-flowers-64" \
+ --mixed_precision="fp16" \
+ --push_to_hub
+```
+
+
+
+
+The training script creates and saves a checkpoint file in your repository. Now you can load and use your trained model for inference:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128").to("cuda")
+image = pipeline().images[0]
+```
diff --git a/docs/source/en/training/wuerstchen.md b/docs/source/en/training/wuerstchen.md
new file mode 100644
index 0000000..ecd816c
--- /dev/null
+++ b/docs/source/en/training/wuerstchen.md
@@ -0,0 +1,189 @@
+
+
+# Wuerstchen
+
+The [Wuerstchen](https://hf.co/papers/2306.00637) model drastically reduces computational costs by compressing the latent space by 42x, which accelerates inference without compromising image quality. During training, Wuerstchen uses two models (VQGAN + autoencoder) to compress the latents, and then a third model (a text-conditioned latent diffusion model) is conditioned on this highly compressed space to generate an image.
+
+To fit the prior model into GPU memory and to speed up training, try enabling the `gradient_accumulation_steps`, `gradient_checkpointing`, and `mixed_precision` parameters.
+
+This guide explores the [train_text_to_image_prior.py](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/train_text_to_image_prior.py) script to help you become more familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install .
+```
+
+Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:
+
+```bash
+cd examples/wuerstchen/text_to_image
+pip install -r requirements.txt
+```
+
+
+
+๐ค Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the ๐ค Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.
+
+
+
+Initialize an ๐ค Accelerate environment:
+
+```bash
+accelerate config
+```
+
+To set up a default ๐ค Accelerate environment without choosing any configurations:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
+
+```py
+from accelerate.utils import write_basic_config
+
+write_basic_config()
+```
+
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
+
+
+
+The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the [script](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/train_text_to_image_prior.py) in detail. If you're interested in learning more, feel free to read through the script and let us know if you have any questions or concerns.
+
+
+
+## Script parameters
+
+The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L192) function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+
+For example, to speed up training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command:
+
+```bash
+accelerate launch train_text_to_image_prior.py \
+ --mixed_precision="fp16"
+```
+
+Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so let's dive right into the Wuerstchen training script!
+
+## Training script
+
+The training script is also similar to the [Text-to-image](text2image#training-script) training guide, but it's been modified to support Wuerstchen. This guide focuses on the code that is unique to the Wuerstchen training script.
+
+The [`main()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L441) function starts by initializing the image encoder - an [EfficientNet](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/modeling_efficient_net_encoder.py) - in addition to the usual scheduler and tokenizer.
+
+```py
+with ContextManagers(deepspeed_zero_init_disabled_context_manager()):
+ pretrained_checkpoint_file = hf_hub_download("dome272/wuerstchen", filename="model_v2_stage_b.pt")
+ state_dict = torch.load(pretrained_checkpoint_file, map_location="cpu")
+ image_encoder = EfficientNetEncoder()
+ image_encoder.load_state_dict(state_dict["effnet_state_dict"])
+ image_encoder.eval()
+```
+
+You'll also load the [`WuerstchenPrior`] model for optimization.
+
+```py
+prior = WuerstchenPrior.from_pretrained(args.pretrained_prior_model_name_or_path, subfolder="prior")
+
+optimizer = optimizer_cls(
+ prior.parameters(),
+ lr=args.learning_rate,
+ betas=(args.adam_beta1, args.adam_beta2),
+ weight_decay=args.adam_weight_decay,
+ eps=args.adam_epsilon,
+)
+```
+
+Next, you'll apply some [transforms](https://github.com/huggingface/diffusers/blob/65ef7a0c5c594b4f84092e328fbdd73183613b30/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L656) to the images and [tokenize](https://github.com/huggingface/diffusers/blob/65ef7a0c5c594b4f84092e328fbdd73183613b30/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L637) the captions:
+
+```py
+def preprocess_train(examples):
+ images = [image.convert("RGB") for image in examples[image_column]]
+ examples["effnet_pixel_values"] = [effnet_transforms(image) for image in images]
+ examples["text_input_ids"], examples["text_mask"] = tokenize_captions(examples)
+ return examples
+```
+
+Finally, the [training loop](https://github.com/huggingface/diffusers/blob/65ef7a0c5c594b4f84092e328fbdd73183613b30/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L656) handles compressing the images to latent space with the `EfficientNetEncoder`, adding noise to the latents, and predicting the noise residual with the [`WuerstchenPrior`] model.
+
+```py
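+# `noisy_latents` are the EfficientNet-compressed image latents with scheduler noise added, and
+# `prompt_embeds` are the CLIP text embeddings the prior is conditioned on to predict that noise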
+pred_noise = prior(noisy_latents, timesteps, prompt_embeds)
+```
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Once youโve made all your changes or youโre okay with the default configuration, youโre ready to launch the training script! ๐
+
+Set the `DATASET_NAME` environment variable to the dataset name from the Hub. This guide uses the [Pokรฉmon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset, but you can create and train on your own datasets as well (see the [Create a dataset for training](create_dataset) guide).
+
+
+
+To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. Youโll also need to add the `--validation_prompt` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results.
+
+
+
+```bash
+export DATASET_NAME="lambdalabs/pokemon-blip-captions"
+
+accelerate launch train_text_to_image_prior.py \
+ --mixed_precision="fp16" \
+ --dataset_name=$DATASET_NAME \
+ --resolution=768 \
+ --train_batch_size=4 \
+ --gradient_accumulation_steps=4 \
+ --gradient_checkpointing \
+ --dataloader_num_workers=4 \
+ --max_train_steps=15000 \
+ --learning_rate=1e-05 \
+ --max_grad_norm=1 \
+ --checkpoints_total_limit=3 \
+ --lr_scheduler="constant" \
+ --lr_warmup_steps=0 \
+ --validation_prompts="A robot pokemon, 4k photo" \
+ --report_to="wandb" \
+ --push_to_hub \
+ --output_dir="wuerstchen-prior-pokemon-model"
+```
+
+Once training is complete, you can use your newly trained model for inference!
+
+```py
+import torch
+from diffusers import AutoPipelineForText2Image
+from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS
+
+pipeline = AutoPipelineForText2Image.from_pretrained("path/to/saved/model", torch_dtype=torch.float16).to("cuda")
+
+caption = "A cute bird pokemon holding a shield"
+images = pipeline(
+ caption,
+ width=1024,
+ height=1536,
+ prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
+ prior_guidance_scale=4.0,
+ num_images_per_prompt=2,
+).images
+```
+
+## Next steps
+
+Congratulations on training a Wuerstchen model! To learn more about how to use your new model, the following may be helpful:
+
+- Take a look at the [Wuerstchen](../api/pipelines/wuerstchen#text-to-image-generation) API documentation to learn more about how to use the pipeline for text-to-image generation and its limitations.
diff --git a/docs/source/en/tutorials/autopipeline.md b/docs/source/en/tutorials/autopipeline.md
new file mode 100644
index 0000000..2150d88
--- /dev/null
+++ b/docs/source/en/tutorials/autopipeline.md
@@ -0,0 +1,170 @@
+
+
+# AutoPipeline
+
+๐ค Diffusers is able to complete many different tasks, and you can often reuse the same pretrained weights for multiple tasks such as text-to-image, image-to-image, and inpainting. If you're new to the library and diffusion models though, it may be difficult to know which pipeline to use for a task. For example, if you're using the [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint for text-to-image, you might not know that you could also use it for image-to-image and inpainting by loading the checkpoint with the [`StableDiffusionImg2ImgPipeline`] and [`StableDiffusionInpaintPipeline`] classes respectively.
+
+The `AutoPipeline` class is designed to simplify the variety of pipelines in ๐ค Diffusers. It is a generic, *task-first* pipeline that lets you focus on the task. The `AutoPipeline` automatically detects the correct pipeline class to use, which makes it easier to load a checkpoint for a task without knowing the specific pipeline class name.
+
+
+
+Take a look at the [AutoPipeline](../api/pipelines/auto_pipeline) reference to see which tasks are supported. Currently, it supports text-to-image, image-to-image, and inpainting.
+
+
+
+This tutorial shows you how to use an `AutoPipeline` to automatically infer the pipeline class to load for a specific task, given the pretrained weights.
+
+## Choose an AutoPipeline for your task
+
+Start by picking a checkpoint. For example, if you're interested in text-to-image with the [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint, use [`AutoPipelineForText2Image`]:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+).to("cuda")
+prompt = "peasant and dragon combat, wood cutting style, viking era, bevel with rune"
+
+image = pipeline(prompt, num_inference_steps=25).images[0]
+image
+```
+
+
+
+
+
+Under the hood, [`AutoPipelineForText2Image`]:
+
+1. automatically detects a `"stable-diffusion"` class from the [`model_index.json`](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json) file
+2. loads the corresponding text-to-image [`StableDiffusionPipeline`] based on the `"stable-diffusion"` class name
+
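+If you're curious how this detection works, you can peek at a checkpoint's `model_index.json` yourself (an optional, illustrative check; nothing you need to do to use `AutoPipeline`):
+
+```py
+import json
+from huggingface_hub import hf_hub_download
+
+config_file = hf_hub_download("runwayml/stable-diffusion-v1-5", "model_index.json")
+with open(config_file) as f:
+    print(json.load(f)["_class_name"])  # "StableDiffusionPipeline"
+```
+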
+Likewise, for image-to-image, [`AutoPipelineForImage2Image`] detects a `"stable-diffusion"` checkpoint from the `model_index.json` file and it'll load the corresponding [`StableDiffusionImg2ImgPipeline`] behind the scenes. You can also pass any additional arguments specific to the pipeline class such as `strength`, which determines the amount of noise or variation added to an input image:
+
+```py
+from diffusers import AutoPipelineForImage2Image
+import torch
+import requests
+from PIL import Image
+from io import BytesIO
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+).to("cuda")
+prompt = "a portrait of a dog wearing a pearl earring"
+
+url = "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/1665_Girl_with_a_Pearl_Earring.jpg/800px-1665_Girl_with_a_Pearl_Earring.jpg"
+
+response = requests.get(url)
+image = Image.open(BytesIO(response.content)).convert("RGB")
+image.thumbnail((768, 768))
+
+image = pipeline(prompt, image, num_inference_steps=200, strength=0.75, guidance_scale=10.5).images[0]
+image
+```
+
+
+
+
+
+And if you want to do inpainting, then [`AutoPipelineForInpainting`] loads the underlying [`StableDiffusionInpaintPipeline`] class in the same way:
+
+```py
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image
+import torch
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True
+).to("cuda")
+
+img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+
+init_image = load_image(img_url).convert("RGB")
+mask_image = load_image(mask_url).convert("RGB")
+
+prompt = "A majestic tiger sitting on a bench"
+image = pipeline(prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0]
+image
+```
+
+
+
+
+
+If you try to load an unsupported checkpoint, it'll throw an error:
+
+```py
+from diffusers import AutoPipelineForImage2Image
+import torch
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "openai/shap-e-img2img", torch_dtype=torch.float16, use_safetensors=True
+)
+"ValueError: AutoPipeline can't find a pipeline linked to ShapEImg2ImgPipeline for None"
+```
+
+## Use multiple pipelines
+
+For some workflows or if you're loading many pipelines, it is more memory-efficient to reuse the same components from a checkpoint instead of reloading them which would unnecessarily consume additional memory. For example, if you're using a checkpoint for text-to-image and you want to use it again for image-to-image, use the [`~AutoPipelineForImage2Image.from_pipe`] method. This method creates a new pipeline from the components of a previously loaded pipeline at no additional memory cost.
+
+The [`~AutoPipelineForImage2Image.from_pipe`] method detects the original pipeline class and maps it to the new pipeline class corresponding to the task you want to do. For example, if you load a `"stable-diffusion"` class pipeline for text-to-image:
+
+```py
+from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image
+import torch
+
+pipeline_text2img = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+)
+print(type(pipeline_text2img))
+"<class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'>"
+```
+
+Then [`~AutoPipelineForImage2Image.from_pipe`] maps the original `"stable-diffusion"` pipeline class to [`StableDiffusionImg2ImgPipeline`]:
+
+```py
+pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline_text2img)
+print(type(pipeline_img2img))
+"<class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.StableDiffusionImg2ImgPipeline'>"
+```
+
+If you passed an optional argument - like disabling the safety checker - to the original pipeline, this argument is also passed on to the new pipeline:
+
+```py
+from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image
+import torch
+
+pipeline_text2img = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ requires_safety_checker=False,
+).to("cuda")
+
+pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline_text2img)
+print(pipeline_img2img.config.requires_safety_checker)
+"False"
+```
+
+You can overwrite any of the arguments and even configuration from the original pipeline if you want to change the behavior of the new pipeline. For example, to turn the safety checker back on and add the `strength` argument:
+
+```py
+pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline_text2img, requires_safety_checker=True, strength=0.3)
+print(pipeline_img2img.config.requires_safety_checker)
+"True"
+```
diff --git a/docs/source/en/tutorials/basic_training.md b/docs/source/en/tutorials/basic_training.md
new file mode 100644
index 0000000..c97ae2d
--- /dev/null
+++ b/docs/source/en/tutorials/basic_training.md
@@ -0,0 +1,403 @@
+
+
+[[open-in-colab]]
+
+# Train a diffusion model
+
+Unconditional image generation is a popular application of diffusion models that generates images that look like those in the dataset used for training. Typically, the best results are obtained from finetuning a pretrained model on a specific dataset. You can find many of these checkpoints on the [Hub](https://huggingface.co/search/full-text?q=unconditional-image-generation&type=model), but if you can't find one you like, you can always train your own!
+
+This tutorial will teach you how to train a [`UNet2DModel`] from scratch on a subset of the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset to generate your own ๐ฆ butterflies ๐ฆ.
+
+
+
+๐ก This training tutorial is based on the [Training with ๐งจ Diffusers](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) notebook. For additional details and context about diffusion models like how they work, check out the notebook!
+
+
+
+Before you begin, make sure you have ๐ค Datasets installed to load and preprocess image datasets, and ๐ค Accelerate, to simplify training on any number of GPUs. The following command will also install [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize training metrics (you can also use [Weights & Biases](https://docs.wandb.ai/) to track your training).
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install diffusers[training]
+```
+
+We encourage you to share your model with the community, and in order to do that, you'll need to log in to your Hugging Face account (create one [here](https://hf.co/join) if you don't already have one!). You can log in from a notebook and enter your token when prompted. Make sure your token has the write role.
+
+```py
+>>> from huggingface_hub import notebook_login
+
+>>> notebook_login()
+```
+
+Or log in from the terminal:
+
+```bash
+huggingface-cli login
+```
+
+Since the model checkpoints are quite large, install [Git-LFS](https://git-lfs.com/) to version these large files:
+
+```bash
+!sudo apt -qq install git-lfs
+!git config --global credential.helper store
+```
+
+## Training configuration
+
+For convenience, create a `TrainingConfig` class containing the training hyperparameters (feel free to adjust them):
+
+```py
+>>> from dataclasses import dataclass
+
+>>> @dataclass
+... class TrainingConfig:
+... image_size = 128 # the generated image resolution
+... train_batch_size = 16
+... eval_batch_size = 16 # how many images to sample during evaluation
+... num_epochs = 50
+... gradient_accumulation_steps = 1
+... learning_rate = 1e-4
+... lr_warmup_steps = 500
+... save_image_epochs = 10
+... save_model_epochs = 30
+... mixed_precision = "fp16" # `no` for float32, `fp16` for automatic mixed precision
+... output_dir = "ddpm-butterflies-128" # the model name locally and on the HF Hub
+
+... push_to_hub = True # whether to upload the saved model to the HF Hub
+...     hub_model_id = "<your-username>/<my-awesome-model>"  # the name of the repository to create on the HF Hub
+... hub_private_repo = False
+... overwrite_output_dir = True # overwrite the old model when re-running the notebook
+... seed = 0
+
+
+>>> config = TrainingConfig()
+```
+
+## Load the dataset
+
+You can easily load the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset with the ๐ค Datasets library:
+
+```py
+>>> from datasets import load_dataset
+
+>>> config.dataset_name = "huggan/smithsonian_butterflies_subset"
+>>> dataset = load_dataset(config.dataset_name, split="train")
+```
+
+
+
+๐ก You can find additional datasets from the [HugGan Community Event](https://huggingface.co/huggan) or you can use your own dataset by creating a local [`ImageFolder`](https://huggingface.co/docs/datasets/image_dataset#imagefolder). Set `config.dataset_name` to the repository id of the dataset if it is from the HugGan Community Event, or `imagefolder` if you're using your own images.
+
+
+
+๐ค Datasets uses the [`~datasets.Image`] feature to automatically decode the image data and load it as a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html) which we can visualize:
+
+```py
+>>> import matplotlib.pyplot as plt
+
+>>> fig, axs = plt.subplots(1, 4, figsize=(16, 4))
+>>> for i, image in enumerate(dataset[:4]["image"]):
+... axs[i].imshow(image)
+... axs[i].set_axis_off()
+>>> fig.show()
+```
+
+
+
+
+
+The images are all different sizes though, so you'll need to preprocess them first:
+
+* `Resize` changes the image size to the one defined in `config.image_size`.
+* `RandomHorizontalFlip` augments the dataset by randomly mirroring the images.
+* `Normalize` is important to rescale the pixel values into a [-1, 1] range, which is what the model expects.
+
+```py
+>>> from torchvision import transforms
+
+>>> preprocess = transforms.Compose(
+... [
+... transforms.Resize((config.image_size, config.image_size)),
+... transforms.RandomHorizontalFlip(),
+... transforms.ToTensor(),
+... transforms.Normalize([0.5], [0.5]),
+... ]
+... )
+```
+
+Use ๐ค Datasets' [`~datasets.Dataset.set_transform`] method to apply the `preprocess` function on the fly during training:
+
+```py
+>>> def transform(examples):
+... images = [preprocess(image.convert("RGB")) for image in examples["image"]]
+... return {"images": images}
+
+
+>>> dataset.set_transform(transform)
+```
+
+Feel free to visualize the images again to confirm that they've been resized. Now you're ready to wrap the dataset in a [DataLoader](https://pytorch.org/docs/stable/data#torch.utils.data.DataLoader) for training!
+
+```py
+>>> import torch
+
+>>> train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=config.train_batch_size, shuffle=True)
+```
+
+## Create a UNet2DModel
+
+Pretrained models in ๐งจ Diffusers are easily created from their model class with the parameters you want. For example, to create a [`UNet2DModel`]:
+
+```py
+>>> from diffusers import UNet2DModel
+
+>>> model = UNet2DModel(
+... sample_size=config.image_size, # the target image resolution
+... in_channels=3, # the number of input channels, 3 for RGB images
+... out_channels=3, # the number of output channels
+... layers_per_block=2, # how many ResNet layers to use per UNet block
+... block_out_channels=(128, 128, 256, 256, 512, 512), # the number of output channels for each UNet block
+... down_block_types=(
+... "DownBlock2D", # a regular ResNet downsampling block
+... "DownBlock2D",
+... "DownBlock2D",
+... "DownBlock2D",
+... "AttnDownBlock2D", # a ResNet downsampling block with spatial self-attention
+... "DownBlock2D",
+... ),
+... up_block_types=(
+... "UpBlock2D", # a regular ResNet upsampling block
+... "AttnUpBlock2D", # a ResNet upsampling block with spatial self-attention
+... "UpBlock2D",
+... "UpBlock2D",
+... "UpBlock2D",
+... "UpBlock2D",
+... ),
+... )
+```
+
+It is often a good idea to quickly check that the sample image shape matches the model output shape:
+
+```py
+>>> sample_image = dataset[0]["images"].unsqueeze(0)
+>>> print("Input shape:", sample_image.shape)
+Input shape: torch.Size([1, 3, 128, 128])
+
+>>> print("Output shape:", model(sample_image, timestep=0).sample.shape)
+Output shape: torch.Size([1, 3, 128, 128])
+```
+
+Great! Next, you'll need a scheduler to add some noise to the image.
+
+## Create a scheduler
+
+The scheduler behaves differently depending on whether you're using the model for training or inference. During inference, the scheduler generates an image from the noise. During training, the scheduler takes a model output - or a sample - from a specific point in the diffusion process and applies noise to the image according to a *noise schedule* and an *update rule*.
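+
+For reference, the forward process implemented by the `add_noise` method mixes the clean sample and Gaussian noise according to the cumulative noise schedule (standard DDPM notation, shown here only for intuition):
+
+$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$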
+
+Let's take a look at the [`DDPMScheduler`] and use the `add_noise` method to add some random noise to the `sample_image` from before:
+
+```py
+>>> import torch
+>>> from PIL import Image
+>>> from diffusers import DDPMScheduler
+
+>>> noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
+>>> noise = torch.randn(sample_image.shape)
+>>> timesteps = torch.LongTensor([50])
+>>> noisy_image = noise_scheduler.add_noise(sample_image, noise, timesteps)
+
+>>> Image.fromarray(((noisy_image.permute(0, 2, 3, 1) + 1.0) * 127.5).type(torch.uint8).numpy()[0])
+```
+
+
+
+
+
+The training objective of the model is to predict the noise added to the image. The loss at this step can be calculated by:
+
+```py
+>>> import torch.nn.functional as F
+
+>>> noise_pred = model(noisy_image, timesteps).sample
+>>> loss = F.mse_loss(noise_pred, noise)
+```
+
+## Train the model
+
+By now, you have most of the pieces to start training the model and all that's left is putting everything together.
+
+First, you'll need an optimizer and a learning rate scheduler:
+
+```py
+>>> from diffusers.optimization import get_cosine_schedule_with_warmup
+
+>>> optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
+>>> lr_scheduler = get_cosine_schedule_with_warmup(
+... optimizer=optimizer,
+... num_warmup_steps=config.lr_warmup_steps,
+... num_training_steps=(len(train_dataloader) * config.num_epochs),
+... )
+```
+
+Then, you'll need a way to evaluate the model. For evaluation, you can use the [`DDPMPipeline`] to generate a batch of sample images and save it as a grid:
+
+```py
+>>> from diffusers import DDPMPipeline
+>>> from diffusers.utils import make_image_grid
+>>> import os
+
+>>> def evaluate(config, epoch, pipeline):
+... # Sample some images from random noise (this is the backward diffusion process).
+... # The default pipeline output type is `List[PIL.Image]`
+... images = pipeline(
+... batch_size=config.eval_batch_size,
+... generator=torch.manual_seed(config.seed),
+... ).images
+
+... # Make a grid out of the images
+... image_grid = make_image_grid(images, rows=4, cols=4)
+
+... # Save the images
+... test_dir = os.path.join(config.output_dir, "samples")
+... os.makedirs(test_dir, exist_ok=True)
+... image_grid.save(f"{test_dir}/{epoch:04d}.png")
+```
+
+Now you can wrap all these components together in a training loop with ๐ค Accelerate for easy TensorBoard logging, gradient accumulation, and mixed precision training. To upload the model to the Hub, write a function to get your repository name and information and then push it to the Hub.
+
+
+
+๐ก The training loop below may look intimidating and long, but it'll be worth it later when you launch your training in just one line of code! If you can't wait and want to start generating images, feel free to copy and run the code below. You can always come back and examine the training loop more closely later, like when you're waiting for your model to finish training. ๐ค
+
+
+
+```py
+>>> from accelerate import Accelerator
+>>> from huggingface_hub import create_repo, upload_folder
+>>> from tqdm.auto import tqdm
+>>> from pathlib import Path
+>>> import os
+
+>>> def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler):
+... # Initialize accelerator and tensorboard logging
+... accelerator = Accelerator(
+... mixed_precision=config.mixed_precision,
+... gradient_accumulation_steps=config.gradient_accumulation_steps,
+... log_with="tensorboard",
+... project_dir=os.path.join(config.output_dir, "logs"),
+... )
+... if accelerator.is_main_process:
+... if config.output_dir is not None:
+... os.makedirs(config.output_dir, exist_ok=True)
+... if config.push_to_hub:
+... repo_id = create_repo(
+... repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True
+... ).repo_id
+... accelerator.init_trackers("train_example")
+
+... # Prepare everything
+... # There is no specific order to remember, you just need to unpack the
+... # objects in the same order you gave them to the prepare method.
+... model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+... model, optimizer, train_dataloader, lr_scheduler
+... )
+
+... global_step = 0
+
+... # Now you train the model
+... for epoch in range(config.num_epochs):
+... progress_bar = tqdm(total=len(train_dataloader), disable=not accelerator.is_local_main_process)
+... progress_bar.set_description(f"Epoch {epoch}")
+
+... for step, batch in enumerate(train_dataloader):
+... clean_images = batch["images"]
+... # Sample noise to add to the images
+... noise = torch.randn(clean_images.shape, device=clean_images.device)
+... bs = clean_images.shape[0]
+
+... # Sample a random timestep for each image
+... timesteps = torch.randint(
+... 0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device,
+... dtype=torch.int64
+... )
+
+... # Add noise to the clean images according to the noise magnitude at each timestep
+... # (this is the forward diffusion process)
+... noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
+
+... with accelerator.accumulate(model):
+... # Predict the noise residual
+... noise_pred = model(noisy_images, timesteps, return_dict=False)[0]
+... loss = F.mse_loss(noise_pred, noise)
+... accelerator.backward(loss)
+
+... accelerator.clip_grad_norm_(model.parameters(), 1.0)
+... optimizer.step()
+... lr_scheduler.step()
+... optimizer.zero_grad()
+
+... progress_bar.update(1)
+... logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0], "step": global_step}
+... progress_bar.set_postfix(**logs)
+... accelerator.log(logs, step=global_step)
+... global_step += 1
+
+... # After each epoch you optionally sample some demo images with evaluate() and save the model
+... if accelerator.is_main_process:
+... pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)
+
+... if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
+... evaluate(config, epoch, pipeline)
+
+... if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
+... if config.push_to_hub:
+... upload_folder(
+... repo_id=repo_id,
+... folder_path=config.output_dir,
+... commit_message=f"Epoch {epoch}",
+... ignore_patterns=["step_*", "epoch_*"],
+... )
+... else:
+... pipeline.save_pretrained(config.output_dir)
+```
+
+Phew, that was quite a bit of code! But you're finally ready to launch the training with ๐ค Accelerate's [`~accelerate.notebook_launcher`] function. Pass the function the training loop, all the training arguments, and the number of processes (you can change this value to the number of GPUs available to you) to use for training:
+
+```py
+>>> from accelerate import notebook_launcher
+
+>>> args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler)
+
+>>> notebook_launcher(train_loop, args, num_processes=1)
+```
+
+Once training is complete, take a look at the final images generated by your diffusion model!
+
+```py
+>>> import glob
+
+>>> sample_images = sorted(glob.glob(f"{config.output_dir}/samples/*.png"))
+>>> Image.open(sample_images[-1])
+```
+
+
+
+
+
+## Next steps
+
+Unconditional image generation is one example of a task that can be trained. You can explore other tasks and training techniques by visiting the [๐งจ Diffusers Training Examples](../training/overview) page. Here are some examples of what you can learn:
+
+* [Textual Inversion](../training/text_inversion), an algorithm that teaches a model a specific visual concept and integrates it into the generated image.
+* [DreamBooth](../training/dreambooth), a technique for generating personalized images of a subject given several input images of the subject.
+* [Guide](../training/text2image) to finetuning a Stable Diffusion model on your own dataset.
+* [Guide](../training/lora) to using LoRA, a memory-efficient technique for finetuning really large models faster.
diff --git a/docs/source/en/tutorials/fast_diffusion.md b/docs/source/en/tutorials/fast_diffusion.md
new file mode 100644
index 0000000..f827d11
--- /dev/null
+++ b/docs/source/en/tutorials/fast_diffusion.md
@@ -0,0 +1,322 @@
+
+
+# Accelerate inference of text-to-image diffusion models
+
+Diffusion models are slower than their GAN counterparts because of the iterative and sequential reverse diffusion process. There are several techniques that can address this limitation such as progressive timestep distillation ([LCM LoRA](../using-diffusers/inference_with_lcm_lora)), model compression ([SSD-1B](https://huggingface.co/segmind/SSD-1B)), and reusing adjacent features of the denoiser ([DeepCache](../optimization/deepcache)).
+
+However, you don't necessarily need to use these techniques to speed up inference. With PyTorch 2 alone, you can accelerate the inference latency of text-to-image diffusion pipelines by up to 3x. This tutorial will show you how to progressively apply the optimizations found in PyTorch 2 to reduce inference latency. You'll use the [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) pipeline in this tutorial, but these techniques are applicable to other text-to-image diffusion pipelines too.
+
+Make sure you're using the latest version of Diffusers:
+
+```bash
+pip install -U diffusers
+```
+
+Then upgrade the other required libraries too:
+
+```bash
+pip install -U transformers accelerate peft
+```
+
+Install [PyTorch nightly](https://pytorch.org/) to benefit from the latest and fastest kernels:
+
+```bash
+pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
+```
+
+
+
+The results reported below are from an 80GB, 400W A100 with its clock rate set to the maximum.
+
+If you're interested in the full benchmarking code, take a look at [huggingface/diffusion-fast](https://github.com/huggingface/diffusion-fast).
+
+
+
+## Baseline
+
+Let's start with a baseline. Disable reduced precision and the [`scaled_dot_product_attention` (SDPA)](../optimization/torch2.0#scaled-dot-product-attention) function, which is automatically used by Diffusers:
+
+```python
+from diffusers import StableDiffusionXLPipeline
+
+# Load the pipeline in full-precision and place its model components on CUDA.
+pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0"
+).to("cuda")
+
+# Run the attention ops without SDPA.
+pipe.unet.set_default_attn_processor()
+pipe.vae.set_default_attn_processor()
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipe(prompt, num_inference_steps=30).images[0]
+```
+
+This default setup takes 7.36 seconds.
+
+
+
+
+
+## bfloat16
+
+Enable the first optimization: reduced precision, or more specifically, bfloat16. There are several benefits to using reduced precision:
+
+* Using a reduced numerical precision (such as float16 or bfloat16) for inference doesn't affect the generation quality but significantly improves latency.
+* The benefits of using bfloat16 compared to float16 are hardware dependent, but modern GPUs tend to favor bfloat16.
+* bfloat16 is much more resilient than float16 when used with quantization, although the more recent versions of the quantization library ([torchao](https://github.com/pytorch-labs/ao)) we used don't have numerical issues with float16.
+
+```python
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
+).to("cuda")
+
+# Run the attention ops without SDPA.
+pipe.unet.set_default_attn_processor()
+pipe.vae.set_default_attn_processor()
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipe(prompt, num_inference_steps=30).images[0]
+```
+
+bfloat16 reduces the latency from 7.36 seconds to 4.63 seconds.
+
+
+
+
+
+
+
+In our later experiments with float16, we found that recent versions of torchao do not incur numerical problems from float16.
+
+
+
+Take a look at the [Speed up inference](../optimization/fp16) guide to learn more about running inference with reduced precision.
+
+## SDPA
+
+Attention blocks are expensive to run, but PyTorch's [`scaled_dot_product_attention`](../optimization/torch2.0#scaled-dot-product-attention) (SDPA) function makes them a lot more efficient. This function is used by default in Diffusers, so you don't need to make any changes to the code.
+
+```python
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipe(prompt, num_inference_steps=30).images[0]
+```
+
+Scaled dot product attention improves the latency from 4.63 seconds to 3.31 seconds.
+
+
+
+
+
+## torch.compile
+
+PyTorch 2 includes `torch.compile`, which JIT-compiles models into fast and optimized kernels. In Diffusers, the UNet and VAE are usually compiled because these are the most compute-intensive modules. First, configure a few compiler flags (refer to the [full list](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py) for more options):
+
+```python
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+torch._inductor.config.conv_1x1_as_mm = True
+torch._inductor.config.coordinate_descent_tuning = True
+torch._inductor.config.epilogue_fusion = False
+torch._inductor.config.coordinate_descent_check_all_directions = True
+```
+
+It is also important to change the UNet and VAE's memory layout to "channels_last" when compiling them to ensure maximum speed.
+
+```python
+pipe.unet.to(memory_format=torch.channels_last)
+pipe.vae.to(memory_format=torch.channels_last)
+```
+
+Now compile and perform inference:
+
+```python
+# Compile the UNet and VAE.
+pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)
+pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# First call to `pipe` is slow, subsequent ones are faster.
+image = pipe(prompt, num_inference_steps=30).images[0]
+```
+
+`torch.compile` offers different backends and modes. For maximum inference speed, use "max-autotune" with the inductor backend. "max-autotune" uses CUDA graphs and optimizes the compilation graph specifically for latency. CUDA graphs greatly reduce the overhead of launching GPU operations by launching multiple GPU operations through a single CPU operation.
+
+Using SDPA attention and compiling both the UNet and VAE cuts the latency from 3.31 seconds to 2.54 seconds.
+
+
+
+
+
+### Prevent graph breaks
+
+Specifying `fullgraph=True` ensures there are no graph breaks in the underlying model to take full advantage of `torch.compile` without any performance degradation. For the UNet and VAE, this means changing how you access the return variables.
+
+```diff
+- latents = unet(
+- latents, timestep=timestep, encoder_hidden_states=prompt_embeds
+-).sample
+
++ latents = unet(
++ latents, timestep=timestep, encoder_hidden_states=prompt_embeds, return_dict=False
++)[0]
+```
+
+### Remove GPU sync after compilation
+
+During the iterative reverse diffusion process, the `step()` function is [called](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1228) on the scheduler each time after the denoiser predicts the less noisy latent embeddings. Inside `step()`, the `sigmas` variable is [indexed](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/schedulers/scheduling_euler_discrete.py#L476), which, when the tensor is placed on the GPU, causes a communication sync between the CPU and GPU. This introduces latency, and it becomes more noticeable when the denoiser has already been compiled.
+
+But if the `sigmas` array always [stays on the CPU](https://github.com/huggingface/diffusers/blob/35a969d297cba69110d175ee79c59312b9f49e1e/src/diffusers/schedulers/scheduling_euler_discrete.py#L240), the CPU and GPU sync doesn't occur and you avoid the added latency. In general, any CPU-GPU communication syncs should be avoided or kept to a bare minimum because they can impact inference latency.
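+
+As a toy illustration (this is not the scheduler's actual code), copying a value from a GPU tensor back to the CPU forces a synchronization on every read, while keeping the schedule on the CPU avoids it:
+
+```python
+import torch
+
+# Hypothetical sigma schedule kept on the GPU: each read back to the CPU is a sync point.
+sigmas_gpu = torch.linspace(1.0, 0.0, 50, device="cuda")
+sigma = sigmas_gpu[10].item()  # device-to-host copy -> CPU/GPU sync
+
+# The same schedule kept on the CPU: indexing never touches the GPU, so there is no sync.
+sigmas_cpu = torch.linspace(1.0, 0.0, 50)
+sigma = sigmas_cpu[10].item()
+```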
+
+## Combine the attention block's projection matrices
+
+The UNet and VAE in SDXL use Transformer-like blocks which consist of attention blocks and feed-forward blocks.
+
+In an attention block, the input is projected into three sub-spaces using three different projection matrices: Q, K, and V. These projections are normally performed separately on the input, but we can horizontally combine the projection matrices into a single matrix and perform the projection in one step. This increases the size of the matrix multiplication of the input projections and improves the impact of quantization.
+
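+To make the idea concrete, here is a small, self-contained sketch (the sizes are illustrative and unrelated to SDXL's actual layers) showing that one fused projection reproduces the three separate ones:
+
+```python
+import torch
+
+# Illustrative sizes only; SDXL's real attention blocks use different dimensions.
+hidden_dim = 640
+x = torch.randn(2, 77, hidden_dim)
+
+# Three separate projections ...
+w_q = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
+w_k = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
+w_v = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
+q, k, v = w_q(x), w_k(x), w_v(x)
+
+# ... versus one fused projection whose weight is the concatenation of the three.
+w_qkv = torch.nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
+with torch.no_grad():
+    w_qkv.weight.copy_(torch.cat([w_q.weight, w_k.weight, w_v.weight], dim=0))
+q_fused, k_fused, v_fused = w_qkv(x).chunk(3, dim=-1)
+
+assert torch.allclose(q, q_fused, atol=1e-4)  # same result, one larger matmul
+```
+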
+You can combine the projection matrices with just a single line of code:
+
+```python
+pipe.fuse_qkv_projections()
+```
+
+This provides a minor improvement from 2.54 seconds to 2.52 seconds.
+
+
+
+
+
+
+
+Support for [`~StableDiffusionXLPipeline.fuse_qkv_projections`] is limited and experimental. It's not available for many non-Stable Diffusion pipelines such as [Kandinsky](../using-diffusers/kandinsky). You can refer to this [PR](https://github.com/huggingface/diffusers/pull/6179) to get an idea about how to enable this for the other pipelines.
+
+
+
+## Dynamic quantization
+
+You can also use the ultra-lightweight PyTorch quantization library, [torchao](https://github.com/pytorch-labs/ao) (commit SHA `54bcd5a10d0abbe7b0c045052029257099f83fd9`), to apply [dynamic int8 quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) to the UNet and VAE. Quantization adds additional conversion overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization). If the matmuls are too small, these techniques may degrade performance.
+
+First, configure all the compiler flags:
+
+```python
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+# Notice the two new flags at the end.
+torch._inductor.config.conv_1x1_as_mm = True
+torch._inductor.config.coordinate_descent_tuning = True
+torch._inductor.config.epilogue_fusion = False
+torch._inductor.config.coordinate_descent_check_all_directions = True
+torch._inductor.config.force_fuse_int_mm_with_mul = True
+torch._inductor.config.use_mixed_mm = True
+```
+
+Certain linear layers in the UNet and VAE don't benefit from dynamic int8 quantization. You can filter out those layers with the [`dynamic_quant_filter_fn`](https://github.com/huggingface/diffusion-fast/blob/0f169640b1db106fe6a479f78c1ed3bfaeba3386/utils/pipeline_utils.py#L16) shown below.
+
+```python
+def dynamic_quant_filter_fn(mod, *args):
+ return (
+ isinstance(mod, torch.nn.Linear)
+ and mod.in_features > 16
+ and (mod.in_features, mod.out_features)
+ not in [
+ (1280, 640),
+ (1920, 1280),
+ (1920, 640),
+ (2048, 1280),
+ (2048, 2560),
+ (2560, 1280),
+ (256, 128),
+ (2816, 1280),
+ (320, 640),
+ (512, 1536),
+ (512, 256),
+ (512, 512),
+ (640, 1280),
+ (640, 1920),
+ (640, 320),
+ (640, 5120),
+ (640, 640),
+ (960, 320),
+ (960, 640),
+ ]
+ )
+
+
+def conv_filter_fn(mod, *args):
+ return (
+ isinstance(mod, torch.nn.Conv2d) and mod.kernel_size == (1, 1) and 128 in [mod.in_channels, mod.out_channels]
+ )
+```
+
+Finally, apply all the optimizations discussed so far:
+
+```python
+# SDPA + bfloat16.
+pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
+).to("cuda")
+
+# Combine attention projection matrices.
+pipe.fuse_qkv_projections()
+
+# Change the memory layout.
+pipe.unet.to(memory_format=torch.channels_last)
+pipe.vae.to(memory_format=torch.channels_last)
+```
+
+Since dynamic quantization is limited to linear layers, convert the appropriate pointwise (1x1) convolution layers into linear layers to maximize its benefit.
+
+```python
+from torchao import swap_conv2d_1x1_to_linear
+
+swap_conv2d_1x1_to_linear(pipe.unet, conv_filter_fn)
+swap_conv2d_1x1_to_linear(pipe.vae, conv_filter_fn)
+```
+
+Apply dynamic quantization:
+
+```python
+from torchao import apply_dynamic_quant
+
+apply_dynamic_quant(pipe.unet, dynamic_quant_filter_fn)
+apply_dynamic_quant(pipe.vae, dynamic_quant_filter_fn)
+```
+
+Finally, compile and perform inference:
+
+```python
+pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)
+pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipe(prompt, num_inference_steps=30).images[0]
+```
+
+Applying dynamic quantization improves the latency from 2.52 seconds to 2.43 seconds.
+
+
+
+
diff --git a/docs/source/en/tutorials/tutorial_overview.md b/docs/source/en/tutorials/tutorial_overview.md
new file mode 100644
index 0000000..bb9cc3d
--- /dev/null
+++ b/docs/source/en/tutorials/tutorial_overview.md
@@ -0,0 +1,23 @@
+
+
+# Overview
+
+Welcome to ๐งจ Diffusers! If you're new to diffusion models and generative AI, and want to learn more, then you've come to the right place. These beginner-friendly tutorials are designed to provide a gentle introduction to diffusion models and help you understand the library fundamentals - the core components and how ๐งจ Diffusers is meant to be used.
+
+You'll learn how to use a pipeline for inference to rapidly generate things, and then deconstruct that pipeline to really understand how to use the library as a modular toolbox for building your own diffusion systems. In the next lesson, you'll learn how to train your own diffusion model to generate what you want.
+
+After completing the tutorials, you'll have gained the necessary skills to start exploring the library on your own and see how to use it for your own projects and applications.
+
+Feel free to join our community on [Discord](https://discord.com/invite/JfAtkvEtRb) or the [forums](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) to connect and collaborate with other users and developers!
+
+Let's start diffusing! ๐งจ
diff --git a/docs/source/en/tutorials/using_peft_for_inference.md b/docs/source/en/tutorials/using_peft_for_inference.md
new file mode 100644
index 0000000..1e12c2a
--- /dev/null
+++ b/docs/source/en/tutorials/using_peft_for_inference.md
@@ -0,0 +1,151 @@
+
+
+[[open-in-colab]]
+
+# Load LoRAs for inference
+
+There are many adapter types (with [LoRAs](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) being the most popular) trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images.
+
+In this tutorial, you'll learn how to easily load and manage adapters for inference with the ๐ค [PEFT](https://huggingface.co/docs/peft/index) integration in ๐ค Diffusers. You'll use LoRA as the main adapter technique, so you'll see the terms LoRA and adapter used interchangeably.
+
+Let's first install all the required libraries.
+
+```bash
+!pip install -q transformers accelerate peft diffusers
+```
+
+Now, load a pipeline with a [Stable Diffusion XL (SDXL)](../api/pipelines/stable_diffusion/stable_diffusion_xl) checkpoint:
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
+pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")
+```
+
+Next, load a [CiroN2022/toy-face](https://huggingface.co/CiroN2022/toy-face) adapter with the [`~diffusers.loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] method. With the ๐ค PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which lets you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`.
+
+```python
+pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
+```
+
+Make sure to include the token `toy_face` in the prompt and then you can perform inference:
+
+```python
+prompt = "toy_face of a hacker with a hoodie"
+
+lora_scale = 0.9
+image = pipe(
+ prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0)
+).images[0]
+image
+```
+
+
+
+With the `adapter_name` parameter, it is really easy to use another adapter for inference! Load the [nerijs/pixel-art-xl](https://huggingface.co/nerijs/pixel-art-xl) adapter that has been fine-tuned to generate pixel art images and call it `"pixel"`.
+
+The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter, but you can activate the `"pixel"` adapter with the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method:
+
+```python
+pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
+pipe.set_adapters("pixel")
+```
+
+Make sure you include the token `pixel art` in your prompt to generate a pixel art image:
+
+```python
+prompt = "a hacker with a hoodie, pixel art"
+image = pipe(
+ prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0)
+).images[0]
+image
+```
+
+
+
+## Merge adapters
+
+You can also merge different adapter checkpoints for inference to blend their styles together.
+
+Once again, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method to activate the `pixel` and `toy` adapters and specify the weights for how they should be merged.
+
+```python
+pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0])
+```
+
+
+
+LoRA checkpoints in the diffusion community are almost always obtained with [DreamBooth](https://huggingface.co/docs/diffusers/main/en/training/dreambooth). DreamBooth training often relies on "trigger" words in the input text prompts in order for the generation results to look as expected. When you combine multiple LoRA checkpoints, it's important to ensure the trigger words for the corresponding LoRA checkpoints are present in the input text prompts.
+
+
+
+Remember to use the trigger words for [CiroN2022/toy-face](https://hf.co/CiroN2022/toy-face) and [nerijs/pixel-art-xl](https://hf.co/nerijs/pixel-art-xl) (these are found in their repositories) in the prompt to generate an image.
+
+```python
+prompt = "toy_face of a hacker with a hoodie, pixel art"
+image = pipe(
+ prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0)
+).images[0]
+image
+```
+
+
+
+Impressive! As you can see, the model generated an image that mixed the characteristics of both adapters.
+
+> [!TIP]
+> Through its PEFT integration, Diffusers also offers more efficient merging methods which you can learn about in the [Merge LoRAs](../using-diffusers/merge_loras) guide!
+
+To return to only using one adapter, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method to activate the `"toy"` adapter:
+
+```python
+pipe.set_adapters("toy")
+
+prompt = "toy_face of a hacker with a hoodie"
+lora_scale = 0.9
+image = pipe(
+ prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0)
+).images[0]
+image
+```
+
+Or to disable all adapters entirely, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.disable_lora`] method to return to the base model.
+
+```python
+pipe.disable_lora()
+
+prompt = "toy_face of a hacker with a hoodie"
+lora_scale = 0.9
+image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]
+image
+```
+
+## Manage active adapters
+
+You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, use the [`~diffusers.loaders.LoraLoaderMixin.get_active_adapters`] method to check the list of active adapters:
+
+```py
+active_adapters = pipe.get_active_adapters()
+active_adapters
+["toy", "pixel"]
+```
+
+You can also get the active adapters of each pipeline component with [`~diffusers.loaders.LoraLoaderMixin.get_list_adapters`]:
+
+```py
+list_adapters_component_wise = pipe.get_list_adapters()
+list_adapters_component_wise
+{"text_encoder": ["toy", "pixel"], "unet": ["toy", "pixel"], "text_encoder_2": ["toy", "pixel"]}
+```
diff --git a/docs/source/en/using-diffusers/callback.md b/docs/source/en/using-diffusers/callback.md
new file mode 100644
index 0000000..296245c
--- /dev/null
+++ b/docs/source/en/using-diffusers/callback.md
@@ -0,0 +1,181 @@
+
+
+# Pipeline callbacks
+
+The denoising loop of a pipeline can be modified with custom-defined functions using the `callback_on_step_end` parameter. The callback function is executed at the end of each step and can modify the pipeline attributes and variables for the next step. This is really useful for *dynamically* adjusting certain pipeline attributes or modifying tensor variables. This versatility allows for interesting use cases such as changing the prompt embeddings at each timestep, assigning different weights to the prompt embeddings, and editing the guidance scale. With callbacks, you can implement new features without modifying the underlying code!
+
+> [!TIP]
+> ๐ค Diffusers currently only supports `callback_on_step_end`, but feel free to open a [feature request](https://github.com/huggingface/diffusers/issues/new/choose) if you have a cool use-case and require a callback function with a different execution point!
+
+This guide will demonstrate how callbacks work by walking through a few features you can implement with them.
+
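+At a high level, every callback has the same shape. A minimal sketch (the argument names are illustrative) looks like this:
+
+```python
+def my_callback(pipeline, step_index, timestep, callback_kwargs):
+    # Inspect or modify tensors such as callback_kwargs["latents"] here,
+    # then return the (possibly modified) dict for the next step.
+    return callback_kwargs
+
+# Passed to a pipeline call as: pipeline(prompt, callback_on_step_end=my_callback)
+```
+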
+## Dynamic classifier-free guidance
+
+Dynamic classifier-free guidance (CFG) is a feature that allows you to disable CFG after a certain number of inference steps, which can help you save compute with minimal cost to performance. The callback function for this should have the following arguments:
+
+* `pipeline` (or the pipeline instance) provides access to important properties such as `num_timesteps` and `guidance_scale`. You can modify these properties by updating the underlying attributes. For this example, you'll disable CFG by setting `pipeline._guidance_scale=0.0`.
+* `step_index` and `timestep` tell you where you are in the denoising loop. Use `step_index` to turn off CFG after reaching 40% of `num_timesteps`.
+* `callback_kwargs` is a dict that contains tensor variables you can modify during the denoising loop. It only includes variables specified in the `callback_on_step_end_tensor_inputs` argument, which is passed to the pipeline's `__call__` method. Different pipelines may use different sets of variables, so please check a pipeline's `_callback_tensor_inputs` attribute for the list of variables you can modify. Some common variables include `latents` and `prompt_embeds`. For this function, change the batch size of `prompt_embeds` after setting `guidance_scale=0.0` in order for it to work properly.
+
+Your callback function should look something like this:
+
+```python
+def callback_dynamic_cfg(pipe, step_index, timestep, callback_kwargs):
+ # adjust the batch_size of prompt_embeds according to guidance_scale
+ if step_index == int(pipeline.num_timesteps * 0.4):
+ prompt_embeds = callback_kwargs["prompt_embeds"]
+ prompt_embeds = prompt_embeds.chunk(2)[-1]
+
+ # update guidance_scale and prompt_embeds
+ pipeline._guidance_scale = 0.0
+ callback_kwargs["prompt_embeds"] = prompt_embeds
+ return callback_kwargs
+```
+
+Now, you can pass the callback function to the `callback_on_step_end` parameter and the `prompt_embeds` to `callback_on_step_end_tensor_inputs`.
+
+```py
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
+pipeline = pipeline.to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+
+generator = torch.Generator(device="cuda").manual_seed(1)
+out = pipeline(
+ prompt,
+ generator=generator,
+ callback_on_step_end=callback_dynamic_cfg,
+ callback_on_step_end_tensor_inputs=['prompt_embeds']
+)
+
+out.images[0].save("out_custom_cfg.png")
+```
+
+## Interrupt the diffusion process
+
+> [!TIP]
+> The interruption callback is supported for text-to-image, image-to-image, and inpainting for the [StableDiffusionPipeline](../api/pipelines/stable_diffusion/overview) and [StableDiffusionXLPipeline](../api/pipelines/stable_diffusion/stable_diffusion_xl).
+
+Stopping the diffusion process early is useful when building UIs that work with Diffusers because it allows users to stop the generation process if they're unhappy with the intermediate results. You can incorporate this into your pipeline with a callback.
+
+This callback function should take the following arguments: `pipeline`, `i`, `t`, and `callback_kwargs` (this must be returned). Set the pipeline's `_interrupt` attribute to `True` to stop the diffusion process after a certain number of steps. You are also free to implement your own custom stopping logic inside the callback.
+
+In this example, the diffusion process is stopped after 10 steps even though `num_inference_steps` is set to 50.
+
+```python
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+pipeline.enable_model_cpu_offload()
+num_inference_steps = 50
+
+def interrupt_callback(pipeline, i, t, callback_kwargs):
+ stop_idx = 10
+ if i == stop_idx:
+ pipeline._interrupt = True
+
+ return callback_kwargs
+
+pipeline(
+ "A photo of a cat",
+ num_inference_steps=num_inference_steps,
+ callback_on_step_end=interrupt_callback,
+)
+```
+
+## Display image after each generation step
+
+> [!TIP]
+> This tip was contributed by [asomoza](https://github.com/asomoza).
+
+Display an image after each generation step by accessing the latents after each step and converting them into an image. The latent space is compressed to 128x128, so the images are also 128x128, which is useful for a quick preview.
+
+1. Use the function below to convert the SDXL latents (4 channels) to RGB tensors (3 channels) as explained in the [Explaining the SDXL latent space](https://huggingface.co/blog/TimothyAlexisVass/explaining-the-sdxl-latent-space) blog post.
+
+```py
+def latents_to_rgb(latents):
+ weights = (
+ (60, -60, 25, -70),
+ (60, -5, 15, -50),
+ (60, 10, -5, -35)
+ )
+
+ weights_tensor = torch.t(torch.tensor(weights, dtype=latents.dtype).to(latents.device))
+ biases_tensor = torch.tensor((150, 140, 130), dtype=latents.dtype).to(latents.device)
+ rgb_tensor = torch.einsum("...lxy,lr -> ...rxy", latents, weights_tensor) + biases_tensor.unsqueeze(-1).unsqueeze(-1)
+ image_array = rgb_tensor.clamp(0, 255)[0].byte().cpu().numpy()
+ image_array = image_array.transpose(1, 2, 0)
+
+ return Image.fromarray(image_array)
+```
+
+2. Create a function to decode and save the latents into an image.
+
+```py
+def decode_tensors(pipe, step, timestep, callback_kwargs):
+ latents = callback_kwargs["latents"]
+
+ image = latents_to_rgb(latents)
+ image.save(f"{step}.png")
+
+ return callback_kwargs
+```
+
+3. Pass the `decode_tensors` function to the `callback_on_step_end` parameter to decode the tensors after each step. You also need to specify what you want to modify in the `callback_on_step_end_tensor_inputs` parameter, which in this case are the latents.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+from PIL import Image
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16,
+ variant="fp16",
+ use_safetensors=True
+).to("cuda")
+
+image = pipeline(
+    prompt="A croissant shaped like a cute bear.",
+    negative_prompt="Deformed, ugly, bad anatomy",
+    callback_on_step_end=decode_tensors,
+    callback_on_step_end_tensor_inputs=["latents"],
+).images[0]
+```
+
+
+*Latent previews decoded at steps 0, 19, 29, 39, and 49.*
+
diff --git a/docs/source/en/using-diffusers/conditional_image_generation.md b/docs/source/en/using-diffusers/conditional_image_generation.md
new file mode 100644
index 0000000..379fc05
--- /dev/null
+++ b/docs/source/en/using-diffusers/conditional_image_generation.md
@@ -0,0 +1,316 @@
+
+
+# Text-to-image
+
+[[open-in-colab]]
+
+When you think of diffusion models, text-to-image is usually one of the first things that come to mind. Text-to-image generates an image from a text description (for example, "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k") which is also known as a *prompt*.
+
+From a very high level, a diffusion model takes a prompt and some random initial noise, and iteratively removes the noise to construct an image. The *denoising* process is guided by the prompt, and once the denoising process ends after a predetermined number of time steps, the image representation is decoded into an image.
+
+
+
+Read the [How does Stable Diffusion work?](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) blog post to learn more about how a latent diffusion model works.
+
+
+
+You can generate images from a prompt in ๐ค Diffusers in two steps:
+
+1. Load a checkpoint into the [`AutoPipelineForText2Image`] class, which automatically detects the appropriate pipeline class to use based on the checkpoint:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+```
+
+2. Pass a prompt to the pipeline to generate an image:
+
+```py
+image = pipeline(
+ "stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k"
+).images[0]
+image
+```
+
+
+
+
+
+## Popular models
+
+The most common text-to-image models are [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and [Kandinsky 2.2](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder). There are also ControlNet models or adapters that can be used with text-to-image models for more direct control in generating images. The results from each model are slightly different because of their architecture and training process, but no matter which model you choose, their usage is more or less the same. Let's use the same prompt for each model and compare their results.
+
+### Stable Diffusion v1.5
+
+[Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) is a latent diffusion model initialized from [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4), and finetuned for 595K steps on 512x512 images from the LAION-Aesthetics V2 dataset. You can use this model like:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+generator = torch.Generator("cuda").manual_seed(31)
+image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
+image
+```
+
+### Stable Diffusion XL
+
+SDXL is a much larger version of the previous Stable Diffusion models, and involves a two-stage model process that adds even more details to an image. It also includes some additional *micro-conditionings* to generate high-quality images with centered subjects. Take a look at the more comprehensive [SDXL](sdxl) guide to learn more about how to use it. In general, you can use SDXL like:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+generator = torch.Generator("cuda").manual_seed(31)
+image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
+image
+```
+
+### Kandinsky 2.2
+
+The Kandinsky model is a bit different from the Stable Diffusion models because it also uses an image prior model to create embeddings that are used to better align text and images in the diffusion model.
+
+The easiest way to use Kandinsky 2.2 is:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
+).to("cuda")
+generator = torch.Generator("cuda").manual_seed(31)
+image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
+image
+```
+
+### ControlNet
+
+ControlNet models are auxiliary models or adapters that are finetuned on top of text-to-image models, such as [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5). Using ControlNet models in combination with text-to-image models offers diverse options for more explicit control over how to generate an image. With ControlNet, you add an additional conditioning input image to the model. For example, if you provide an image of a human pose (usually represented as multiple keypoints that are connected into a skeleton) as a conditioning input, the model generates an image that follows the pose of the image. Check out the more in-depth [ControlNet](controlnet) guide to learn more about other conditioning inputs and how to use them.
+
+In this example, let's condition the ControlNet with a human pose estimation image. Load the ControlNet model pretrained on human pose estimations:
+
+```py
+from diffusers import ControlNetModel, AutoPipelineForText2Image
+from diffusers.utils import load_image
+import torch
+
+controlnet = ControlNetModel.from_pretrained(
+ "lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+pose_image = load_image("https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png")
+```
+
+Pass the `controlnet` to the [`AutoPipelineForText2Image`], and provide the prompt and pose estimation image:
+
+```py
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+generator = torch.Generator("cuda").manual_seed(31)
+image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=pose_image, generator=generator).images[0]
+image
+```
+
+
+*Example results: Stable Diffusion v1.5, Stable Diffusion XL, Kandinsky 2.2, and ControlNet (pose conditioning).*
+
+
+## Configure pipeline parameters
+
+There are a number of parameters that can be configured in the pipeline that affect how an image is generated. You can change the image's output size, specify a negative prompt to improve image quality, and more. This section dives deeper into how to use these parameters.
+
+### Height and width
+
+The `height` and `width` parameters control the height and width (in pixels) of the generated image. By default, the Stable Diffusion v1.5 model outputs 512x512 images, but you can change this to any size that is a multiple of 8. For example, to create a rectangular image:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+image = pipeline(
+ "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", height=768, width=512
+).images[0]
+image
+```
+
+
+
+
+
+
+
+Other models may have different default image sizes depending on the image sizes in the training dataset. For example, SDXL's default image size is 1024x1024 and using lower `height` and `width` values may result in lower quality images. Make sure you check the model's API reference first!
+
+
+
+### Guidance scale
+
+The `guidance_scale` parameter affects how much the prompt influences image generation. A lower value gives the model "creativity" to generate images that are more loosely related to the prompt. Higher `guidance_scale` values push the model to follow the prompt more closely, and if this value is too high, you may observe some artifacts in the generated image.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+image = pipeline(
+ "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", guidance_scale=3.5
+).images[0]
+image
+```
+
+
+
+*Example results with guidance_scale = 2.5, 7.5, and 10.5.*
+
+
+
+### Negative prompt
+
+Just like how a prompt guides generation, a *negative prompt* steers the model away from things you don't want the model to generate. This is commonly used to improve overall image quality by removing poor or bad image features such as "low resolution" or "bad details". You can also use a negative prompt to remove or modify the content and style of an image.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+image = pipeline(
+ prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
+ negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy",
+).images[0]
+image
+```
+
+
+
+### Generator
+
+A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html#generator) object enables reproducibility in a pipeline by setting a manual seed. You can use a `Generator` to generate batches of images and iteratively improve on an image generated from a seed as detailed in the [Improve image quality with deterministic generation](reusing_seeds) guide.
+
+You can set a seed and `Generator` as shown below. Creating an image with a `Generator` should return the same result each time instead of randomly generating a new image.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+generator = torch.Generator(device="cuda").manual_seed(30)
+image = pipeline(
+ "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
+ generator=generator,
+).images[0]
+image
+```
+
+## Control image generation
+
+There are several ways to exert more control over how an image is generated outside of configuring a pipeline's parameters, such as prompt weighting and ControlNet models.
+
+### Prompt weighting
+
+Prompt weighting is a technique for increasing or decreasing the importance of concepts in a prompt to emphasize or minimize certain features in an image. We recommend using the [Compel](https://github.com/damian0815/compel) library to help you generate the weighted prompt embeddings.
+
+
+
+Learn how to create the prompt embeddings in the [Prompt weighting](weighted_prompts) guide. This example focuses on how to use the prompt embeddings in the pipeline.
+
+
+
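+As a rough sketch, generating weighted embeddings with Compel might look something like this (the prompt and weighting syntax are only illustrative; refer to the Compel documentation for the exact syntax):
+
+```py
+from compel import Compel
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+# Compel parses the +/- weighting syntax and returns embeddings ready for the pipeline.
+compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
+prompt_embeds = compel("Astronaut in a jungle++, cold color palette, muted colors, detailed, 8k")
+negative_prompt_embeds = compel("ugly, deformed, disfigured--")
+```
+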
+Once you've created the embeddings, you can pass them to the `prompt_embeds` (and `negative_prompt_embeds` if you're using a negative prompt) parameter in the pipeline.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+image = pipeline(
+ prompt_embeds=prompt_embeds, # generated from Compel
+ negative_prompt_embeds=negative_prompt_embeds, # generated from Compel
+).images[0]
+```
+
+### ControlNet
+
+As you saw in the [ControlNet](#controlnet) section, these models offer a more flexible and accurate way to generate images by incorporating an additional conditioning image input. Each ControlNet model is pretrained on a particular type of conditioning image to generate new images that resemble it. For example, if you take a ControlNet model pretrained on depth maps, you can give the model a depth map as a conditioning input and it'll generate an image that preserves the spatial information in it. This is quicker and easier than specifying the depth information in a prompt. You can even combine multiple conditioning inputs with a [MultiControlNet](controlnet#multicontrolnet)!
+
+There are many types of conditioning inputs you can use, and ๐ค Diffusers supports ControlNet for Stable Diffusion and SDXL models. Take a look at the more comprehensive [ControlNet](controlnet) guide to learn how you can use these models.
+
+## Optimize
+
+Diffusion models are large, and the iterative nature of denoising an image is computationally expensive and intensive. But this doesn't mean you need access to powerful - or even many - GPUs to use them. There are many optimization techniques for running diffusion models on consumer and free-tier resources. For example, you can load the model weights in half-precision to save GPU memory and increase speed, or offload the model to the CPU to save even more memory, as sketched below.
+
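+For example, a minimal sketch of half-precision loading combined with model offloading might look like this:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
+)
+# Keeps components on the CPU and moves each one to the GPU only while it is needed.
+pipeline.enable_model_cpu_offload()
+
+image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
+```
+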
+PyTorch 2.0 also supports a more memory-efficient attention mechanism called [*scaled dot product attention*](../optimization/torch2.0#scaled-dot-product-attention) that is automatically enabled if you're using PyTorch 2.0. You can combine this with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) to speed your code up even more:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16").to("cuda")
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+For more tips on how to optimize your code to save memory and speed up inference, read the [Memory and speed](../optimization/fp16) and [Torch 2.0](../optimization/torch2.0) guides.
diff --git a/docs/source/en/using-diffusers/contribute_pipeline.md b/docs/source/en/using-diffusers/contribute_pipeline.md
new file mode 100644
index 0000000..e9cf1ed
--- /dev/null
+++ b/docs/source/en/using-diffusers/contribute_pipeline.md
@@ -0,0 +1,184 @@
+
+
+# Contribute a community pipeline
+
+
+
+๐ก Take a look at GitHub Issue [#841](https://github.com/huggingface/diffusers/issues/841) for more context about why we're adding community pipelines to help everyone easily share their work without being slowed down.
+
+
+
+Community pipelines allow you to add any additional features you'd like on top of the [`DiffusionPipeline`]. The main benefit of building on top of the `DiffusionPipeline` is that anyone can load and use your pipeline by only adding one more argument, making it super easy for the community to access.
+
+This guide will show you how to create a community pipeline and explain how they work. To keep things simple, you'll create a "one-step" pipeline where the `UNet` does a single forward pass and calls the scheduler once.
+
+## Initialize the pipeline
+
+You should start by creating a `one_step_unet.py` file for your community pipeline. In this file, create a pipeline class that inherits from the [`DiffusionPipeline`] to be able to load model weights and the scheduler configuration from the Hub. The one-step pipeline needs a `UNet` and a scheduler, so you'll need to add these as arguments to the `__init__` function:
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
+ def __init__(self, unet, scheduler):
+ super().__init__()
+```
+
+To ensure your pipeline and its components (`unet` and `scheduler`) can be saved with [`~DiffusionPipeline.save_pretrained`], add them to the `register_modules` function:
+
+```diff
+ from diffusers import DiffusionPipeline
+ import torch
+
+ class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
+ def __init__(self, unet, scheduler):
+ super().__init__()
+
++ self.register_modules(unet=unet, scheduler=scheduler)
+```
+
+Cool, the `__init__` step is done and you can move to the forward pass now! ๐ฅ
+
+## Define the forward pass
+
+In the forward pass, which we recommend defining as `__call__`, you have complete creative freedom to add whatever feature you'd like. For our amazing one-step pipeline, create a random image and only call the `unet` and `scheduler` once by setting `timestep=1`:
+
+```diff
+ from diffusers import DiffusionPipeline
+ import torch
+
+ class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
+ def __init__(self, unet, scheduler):
+ super().__init__()
+
+ self.register_modules(unet=unet, scheduler=scheduler)
+
++ def __call__(self):
++ image = torch.randn(
++ (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
++ )
++ timestep = 1
+
++ model_output = self.unet(image, timestep).sample
++ scheduler_output = self.scheduler.step(model_output, timestep, image).prev_sample
+
++ return scheduler_output
+```
+
+That's it! ๐ You can now run this pipeline by passing a `unet` and `scheduler` to it:
+
+```python
+from diffusers import DDPMScheduler, UNet2DModel
+
+scheduler = DDPMScheduler()
+unet = UNet2DModel()
+
+pipeline = UnetSchedulerOneForwardPipeline(unet=unet, scheduler=scheduler)
+
+output = pipeline()
+```
+
+But what's even better is that you can load pre-existing weights into the pipeline if the pipeline structure is identical. For example, you can load the [`google/ddpm-cifar10-32`](https://huggingface.co/google/ddpm-cifar10-32) weights into the one-step pipeline:
+
+```python
+pipeline = UnetSchedulerOneForwardPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
+
+output = pipeline()
+```
+
+## Share your pipeline
+
+Open a Pull Request on the ๐งจ Diffusers [repository](https://github.com/huggingface/diffusers) to add your awesome pipeline in `one_step_unet.py` to the [examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) subfolder.
+
+Once it is merged, anyone with `diffusers >= 0.4.0` installed can use this pipeline magically ๐ช by specifying it in the `custom_pipeline` argument:
+
+```python
+from diffusers import DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained(
+ "google/ddpm-cifar10-32", custom_pipeline="one_step_unet", use_safetensors=True
+)
+pipe()
+```
+
+Another way to share your community pipeline is to upload the `one_step_unet.py` file directly to your preferred [model repository](https://huggingface.co/docs/hub/models-uploading) on the Hub. Instead of specifying the `one_step_unet.py` file, pass the model repository id to the `custom_pipeline` argument:
+
+```python
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "google/ddpm-cifar10-32", custom_pipeline="stevhliu/one_step_unet", use_safetensors=True
+)
+```
+
+Take a look at the following table to compare the two sharing workflows to help you decide the best option for you:
+
+| | GitHub community pipeline | HF Hub community pipeline |
+|----------------|------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
+| usage | same | same |
+| review process | open a Pull Request on GitHub and undergo a review process from the Diffusers team before merging; may be slower | upload directly to a Hub repository without any review; this is the fastest workflow |
+| visibility | included in the official Diffusers repository and documentation | included on your HF Hub profile and relies on your own usage/promotion to gain visibility |
+
+
+
+๐ก You can use whatever package you want in your community pipeline file - as long as the user has it installed, everything will work fine. Make sure you have one and only one pipeline class that inherits from `DiffusionPipeline` because this is automatically detected.
+
+
+
+## How do community pipelines work?
+
+A community pipeline is a class that inherits from [`DiffusionPipeline`] which means:
+
+- It can be loaded with the [`custom_pipeline`] argument.
+- The model weights and scheduler configuration are loaded from [`pretrained_model_name_or_path`].
+- The code that implements a feature in the community pipeline is defined in a `pipeline.py` file.
+
+Sometimes you can't load all of the pipeline components' weights from an official repository. In this case, the other components should be passed directly to the pipeline:
+
+```python
+from diffusers import DDIMScheduler, DiffusionPipeline
+from transformers import CLIPImageProcessor, CLIPModel
+import torch
+
+model_id = "CompVis/stable-diffusion-v1-4"
+clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
+
+# Load the components that aren't part of the official checkpoint (the scheduler here is just an example choice).
+scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
+feature_extractor = CLIPImageProcessor.from_pretrained(clip_model_id)
+clip_model = CLIPModel.from_pretrained(clip_model_id, torch_dtype=torch.float16)
+
+pipeline = DiffusionPipeline.from_pretrained(
+ model_id,
+ custom_pipeline="clip_guided_stable_diffusion",
+ clip_model=clip_model,
+ feature_extractor=feature_extractor,
+ scheduler=scheduler,
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+)
+```
+
+The magic behind community pipelines is contained in the following code. It allows the community pipeline to be loaded from GitHub or the Hub, and it'll be available to all ๐งจ Diffusers packages.
+
+```python
+# 2. Load the pipeline class, if using custom module then load it from the Hub
+# if we load from explicit class, let's use it
+if custom_pipeline is not None:
+ pipeline_class = get_class_from_dynamic_module(
+ custom_pipeline, module_file=CUSTOM_PIPELINE_FILE_NAME, cache_dir=custom_pipeline
+ )
+elif cls != DiffusionPipeline:
+ pipeline_class = cls
+else:
+ diffusers_module = importlib.import_module(cls.__module__.split(".")[0])
+ pipeline_class = getattr(diffusers_module, config_dict["_class_name"])
+```
diff --git a/docs/source/en/using-diffusers/control_brightness.md b/docs/source/en/using-diffusers/control_brightness.md
new file mode 100644
index 0000000..5fad664
--- /dev/null
+++ b/docs/source/en/using-diffusers/control_brightness.md
@@ -0,0 +1,58 @@
+
+
+# Control image brightness
+
+The Stable Diffusion pipeline is mediocre at generating images that are either very bright or very dark, as explained in the [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) paper. The solutions proposed in the paper are currently implemented in the [`DDIMScheduler`], which you can use to improve the lighting in your images.
+
+
+
+๐ก Take a look at the paper linked above for more details about the proposed solutions!
+
+
+
+One of the solutions is to train a model with *v prediction* and *v loss*. Add the following flag to the [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) scripts to enable `v_prediction`:
+
+```bash
+--prediction_type="v_prediction"
+```
+
+For example, let's use the [`ptx0/pseudo-journey-v2`](https://huggingface.co/ptx0/pseudo-journey-v2) checkpoint which has been finetuned with `v_prediction`.
+
+Next, configure the following parameters in the [`DDIMScheduler`]:
+
+1. `rescale_betas_zero_snr=True`, rescales the noise schedule to zero terminal signal-to-noise ratio (SNR)
+2. `timestep_spacing="trailing"`, starts sampling from the last timestep
+
+```py
+from diffusers import DiffusionPipeline, DDIMScheduler
+
+pipeline = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", use_safetensors=True)
+
+# switch the scheduler in the pipeline to use the DDIMScheduler
+pipeline.scheduler = DDIMScheduler.from_config(
+ pipeline.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
+)
+pipeline.to("cuda")
+```
+
+Finally, in your call to the pipeline, set `guidance_rescale` to prevent overexposure:
+
+```py
+prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
+image = pipeline(prompt, guidance_rescale=0.7).images[0]
+image
+```
+
+
+
+
diff --git a/docs/source/en/using-diffusers/controlling_generation.md b/docs/source/en/using-diffusers/controlling_generation.md
new file mode 100644
index 0000000..c1320dc
--- /dev/null
+++ b/docs/source/en/using-diffusers/controlling_generation.md
@@ -0,0 +1,217 @@
+
+
+# Controlled generation
+
+Controlling outputs generated by diffusion models has long been pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes in inputs, both images and text prompts, can drastically change outputs. In an ideal world, we want to be able to control how semantics are preserved and changed.
+
+Most examples of preserving semantics reduce to being able to accurately map a change in input to a change in output. For example, adding an adjective to a subject in a prompt should preserve the entire image and only modify that subject. Or, generating a variation of a particular subject should preserve the subject's pose.
+
+Additionally, there are qualities of generated images that we would like to influence beyond semantic preservation. For example, we generally want our outputs to be of good quality, adhere to a particular style, or be realistic.
+
+We will document some of the techniques `diffusers` supports to control generation of diffusion models. Much of this is cutting-edge research and can be quite nuanced. If something needs clarifying or you have a suggestion, don't hesitate to open a discussion on the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or a [GitHub issue](https://github.com/huggingface/diffusers/issues).
+
+We provide a high-level explanation of how generation can be controlled as well as a snippet of the technical details. For more in-depth explanations of the technical details, the original papers linked from each pipeline are always the best resource.
+
+Depending on the use case, one should choose a technique accordingly. In many cases, these techniques can be combined. For example, one can combine Textual Inversion with SEGA to provide more semantic guidance to the outputs generated using Textual Inversion.
+
+Unless otherwise mentioned, these are techniques that work with existing models and don't require their own weights.
+
+1. [InstructPix2Pix](#instruct-pix2pix)
+2. [Pix2Pix Zero](#pix2pix-zero)
+3. [Attend and Excite](#attend-and-excite)
+4. [Semantic Guidance](#semantic-guidance-sega)
+5. [Self-attention Guidance](#self-attention-guidance-sag)
+6. [Depth2Image](#depth2image)
+7. [MultiDiffusion Panorama](#multidiffusion-panorama)
+8. [DreamBooth](#dreambooth)
+9. [Textual Inversion](#textual-inversion)
+10. [ControlNet](#controlnet)
+11. [Prompt Weighting](#prompt-weighting)
+12. [Custom Diffusion](#custom-diffusion)
+13. [Model Editing](#model-editing)
+14. [DiffEdit](#diffedit)
+15. [T2I-Adapter](#t2i-adapter)
+16. [FABRIC](#fabric)
+
+For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training.
+
+| **Method** | **Inference only** | **Requires training / fine-tuning** | **Comments** |
+| :-------------------------------------------------: | :----------------: | :-------------------------------------: | :---------------------------------------------------------------------------------------------: |
+| [InstructPix2Pix](#instruct-pix2pix)                    | ✅                  | ❌                                       | Can additionally be fine-tuned for better performance on specific edit instructions.             |
+| [Pix2Pix Zero](#pix2pix-zero)                            | ✅                  | ❌                                       |                                                                                                   |
+| [Attend and Excite](#attend-and-excite)                  | ✅                  | ❌                                       |                                                                                                   |
+| [Semantic Guidance](#semantic-guidance-sega)             | ✅                  | ❌                                       |                                                                                                   |
+| [Self-attention Guidance](#self-attention-guidance-sag)  | ✅                  | ❌                                       |                                                                                                   |
+| [Depth2Image](#depth2image)                              | ✅                  | ❌                                       |                                                                                                   |
+| [MultiDiffusion Panorama](#multidiffusion-panorama)      | ✅                  | ❌                                       |                                                                                                   |
+| [DreamBooth](#dreambooth)                                | ❌                  | ✅                                       |                                                                                                   |
+| [Textual Inversion](#textual-inversion)                  | ❌                  | ✅                                       |                                                                                                   |
+| [ControlNet](#controlnet)                                | ✅                  | ❌                                       | A ControlNet can be trained/fine-tuned on a custom conditioning.                                  |
+| [Prompt Weighting](#prompt-weighting)                    | ✅                  | ❌                                       |                                                                                                   |
+| [Custom Diffusion](#custom-diffusion)                    | ❌                  | ✅                                       |                                                                                                   |
+| [Model Editing](#model-editing)                          | ✅                  | ❌                                       |                                                                                                   |
+| [DiffEdit](#diffedit)                                    | ✅                  | ❌                                       |                                                                                                   |
+| [T2I-Adapter](#t2i-adapter)                              | ✅                  | ❌                                       |                                                                                                   |
+| [Fabric](#fabric)                                        | ✅                  | ❌                                       |                                                                                                   |
+
+## InstructPix2Pix
+
+[Paper](https://arxiv.org/abs/2211.09800)
+
+[InstructPix2Pix](../api/pipelines/pix2pix) is fine-tuned from Stable Diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image.
+InstructPix2Pix has been explicitly trained to work well with [InstructGPT](https://openai.com/blog/instruction-following/)-like prompts.
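+
+As a minimal sketch of how this looks in code (the checkpoint and input image below are only examples), note that the prompt you pass is an edit instruction rather than a description of the target image:
+
+```py
+import torch
+from diffusers import StableDiffusionInstructPix2PixPipeline
+from diffusers.utils import load_image
+
+pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained(
+    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
+).to("cuda")
+
+image = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
+)
+
+# the prompt is an instruction describing the edit to apply to the input image
+edited_image = pipeline("make it look like a watercolor painting", image=image).images[0]
+```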
+
+## Pix2Pix Zero
+
+[Paper](https://arxiv.org/abs/2302.03027)
+
+[Pix2Pix Zero](../api/pipelines/pix2pix_zero) allows modifying an image so that one concept or subject is translated to another one while preserving general image semantics.
+
+The denoising process is guided from one conceptual embedding towards another conceptual embedding. The intermediate latents are optimized during the denoising process to push the attention maps towards reference attention maps. The reference attention maps are from the denoising process of the input image and are used to encourage semantic preservation.
+
+Pix2Pix Zero can be used both to edit synthetic images as well as real images.
+
+- To edit synthetic images, one first generates an image given a caption.
+ Next, we generate image captions for the concept that shall be edited and for the new target concept. We can use a model like [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) for this purpose. Then, "mean" prompt embeddings for both the source and target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image.
+- To edit a real image, one first generates an image caption using a model like [BLIP](https://huggingface.co/docs/transformers/model_doc/blip). Then one applies DDIM inversion on the prompt and image to generate "inverse" latents. Similar to before, "mean" prompt embeddings for both source and target concepts are created and finally the pix2pix-zero algorithm in combination with the "inverse" latents is used to edit the image.
+
+
+
+Pix2Pix Zero is the first model that allows "zero-shot" image editing. This means that the model
+can edit an image in less than a minute on a consumer GPU as shown [here](../api/pipelines/pix2pix_zero#usage-example).
+
+
+
+As mentioned above, Pix2Pix Zero includes optimizing the latents (and not any of the UNet, VAE, or the text encoder) to steer the generation toward a specific concept. This means that the overall
+pipeline might require more memory than a standard [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).
+
+
+
+An important distinction between methods like InstructPix2Pix and Pix2Pix Zero is that the former
+involves fine-tuning the pre-trained weights while the latter does not. This means that you can
+apply Pix2Pix Zero to any of the available Stable Diffusion models.
+
+
+
+## Attend and Excite
+
+[Paper](https://arxiv.org/abs/2301.13826)
+
+[Attend and Excite](../api/pipelines/attend_and_excite) allows subjects in the prompt to be faithfully represented in the final image.
+
+A set of token indices are given as input, corresponding to the subjects in the prompt that need to be present in the image. During denoising, each token index is guaranteed to have a minimum attention threshold for at least one patch of the image. The intermediate latents are iteratively optimized during the denoising process to strengthen the attention of the most neglected subject token until the attention threshold is passed for all subject tokens.
+
+Like Pix2Pix Zero, Attend and Excite also involves a mini optimization loop (leaving the pre-trained weights untouched) in its pipeline and can require more memory than the usual [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).
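+
+A rough sketch of the usage (the checkpoint, prompt, and token indices below are only examples; the indices point at the subject tokens in the tokenized prompt):
+
+```py
+import torch
+from diffusers import StableDiffusionAttendAndExcitePipeline
+
+pipeline = StableDiffusionAttendAndExcitePipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
+).to("cuda")
+
+prompt = "a cat and a frog"
+# indices of "cat" and "frog" in the tokenized prompt
+image = pipeline(prompt, token_indices=[2, 5], guidance_scale=7.5, num_inference_steps=50).images[0]
+```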
+
+## Semantic Guidance (SEGA)
+
+[Paper](https://arxiv.org/abs/2301.12247)
+
+[SEGA](../api/pipelines/semantic_stable_diffusion) allows applying or removing one or more concepts from an image. The strength of the concept can also be controlled. For example, the smile concept can be used to incrementally increase or decrease the smile of a portrait.
+
+Similar to how classifier-free guidance provides guidance via empty prompt inputs, SEGA provides guidance on conceptual prompts. Multiple of these conceptual prompts can be applied simultaneously. Each conceptual prompt can either add or remove its concept depending on whether the guidance is applied positively or negatively.
+
+Unlike Pix2Pix Zero or Attend and Excite, SEGA directly interacts with the diffusion process instead of performing any explicit gradient-based optimization.
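+
+A minimal sketch with [`SemanticStableDiffusionPipeline`] (the checkpoint, editing prompt, and guidance values below are only examples):
+
+```py
+import torch
+from diffusers import SemanticStableDiffusionPipeline
+
+pipeline = SemanticStableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+image = pipeline(
+    "a photo of the face of a woman",
+    editing_prompt=["smiling, smile"],   # concept to apply
+    reverse_editing_direction=[False],   # set to True to remove the concept instead
+    edit_guidance_scale=[5.0],           # strength of the concept guidance
+    edit_warmup_steps=[10],              # steps before the concept guidance kicks in
+).images[0]
+```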
+
+## Self-attention Guidance (SAG)
+
+[Paper](https://arxiv.org/abs/2210.00939)
+
+[Self-attention Guidance](../api/pipelines/self_attention_guidance) improves the general quality of images.
+
+SAG provides guidance from predictions not conditioned on high-frequency details to fully conditioned images. The high frequency details are extracted out of the UNet self-attention maps.
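+
+A minimal sketch with [`StableDiffusionSAGPipeline`] (the checkpoint and `sag_scale` value below are only examples):
+
+```py
+import torch
+from diffusers import StableDiffusionSAGPipeline
+
+pipeline = StableDiffusionSAGPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+# sag_scale controls the strength of the self-attention guidance
+image = pipeline("a photo of an astronaut riding a horse on mars", sag_scale=0.75).images[0]
+```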
+
+## Depth2Image
+
+[Project](https://huggingface.co/stabilityai/stable-diffusion-2-depth)
+
+[Depth2Image](../api/pipelines/stable_diffusion/depth2img) is fine-tuned from Stable Diffusion to better preserve semantics for text guided image variation.
+
+It conditions on a monocular depth estimate of the original image.
+
+## MultiDiffusion Panorama
+
+[Paper](https://arxiv.org/abs/2302.08113)
+
+[MultiDiffusion Panorama](../api/pipelines/panorama) defines a new generation process over a pre-trained diffusion model. This process binds together multiple diffusion generation methods that can be readily applied to generate high quality and diverse images. Results adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.
+MultiDiffusion Panorama makes it possible to generate high-quality images at arbitrary aspect ratios (e.g., panoramas).
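+
+A minimal sketch with [`StableDiffusionPanoramaPipeline`] (the checkpoint and width below are only examples):
+
+```py
+import torch
+from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler
+
+model_id = "stabilityai/stable-diffusion-2-base"
+scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
+pipeline = StableDiffusionPanoramaPipeline.from_pretrained(
+    model_id, scheduler=scheduler, torch_dtype=torch.float16
+).to("cuda")
+
+# generate a wide panorama by setting a large width
+image = pipeline("a photo of the dolomites", width=2048).images[0]
+```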
+
+## Fine-tuning your own models
+
+In addition to pre-trained models, Diffusers has training scripts for fine-tuning models on user-provided data.
+
+## DreamBooth
+
+[Project](https://dreambooth.github.io/)
+
+[DreamBooth](../training/dreambooth) fine-tunes a model to teach it about a new subject. For example, a few pictures of a person can be used to generate images of that person in different styles.
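+
+A rough sketch of a DreamBooth run with the example training script (the paths, instance prompt, and hyperparameters below are only placeholders; see the linked training guide for the full set of arguments):
+
+```bash
+accelerate launch train_dreambooth.py \
+  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
+  --instance_data_dir="./dog_images" \
+  --instance_prompt="a photo of sks dog" \
+  --resolution=512 \
+  --train_batch_size=1 \
+  --learning_rate=5e-6 \
+  --max_train_steps=400 \
+  --output_dir="dreambooth-sks-dog"
+```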
+
+## Textual Inversion
+
+[Paper](https://arxiv.org/abs/2208.01618)
+
+[Textual Inversion](../training/text_inversion) fine-tunes a model to teach it about a new concept. For example, a few pictures of a style of artwork can be used to generate images in that style.
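+
+A rough sketch of a Textual Inversion run with the example training script (the paths, tokens, and hyperparameters below are only placeholders; see the linked training guide for the full set of arguments):
+
+```bash
+accelerate launch textual_inversion.py \
+  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
+  --train_data_dir="./style_images" \
+  --learnable_property="style" \
+  --placeholder_token="<my-style>" \
+  --initializer_token="painting" \
+  --resolution=512 \
+  --train_batch_size=1 \
+  --max_train_steps=3000 \
+  --output_dir="textual-inversion-my-style"
+```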
+
+## ControlNet
+
+[Paper](https://arxiv.org/abs/2302.05543)
+
+[ControlNet](../api/pipelines/controlnet) is an auxiliary network which adds an extra condition.
+There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles,
+depth maps, and semantic segmentations.
+
+## Prompt Weighting
+
+[Prompt weighting](../using-diffusers/weighted_prompts) is a simple technique that puts more attention weight on certain parts of the text
+input.
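+
+For example, a minimal sketch with the [Compel](https://github.com/damian0815/compel) library, which converts a weighted prompt into `prompt_embeds` (the checkpoint and weighting syntax below are only examples):
+
+```py
+import torch
+from compel import Compel
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+compel_proc = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
+
+# "++" increases the weight of the preceding word
+prompt_embeds = compel_proc("a red cat playing with a ball++")
+image = pipeline(prompt_embeds=prompt_embeds).images[0]
+```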
+
+## Custom Diffusion
+
+[Paper](https://arxiv.org/abs/2212.04488)
+
+[Custom Diffusion](../training/custom_diffusion) only fine-tunes the cross-attention maps of a pre-trained
+text-to-image diffusion model. It also allows for additionally performing Textual Inversion. It supports
+multi-concept training by design. Like DreamBooth and Textual Inversion, Custom Diffusion is also used to
+teach a pre-trained text-to-image diffusion model about new concepts to generate outputs involving the
+concept(s) of interest.
+
+## Model Editing
+
+[Paper](https://arxiv.org/abs/2303.08084)
+
+The [text-to-image model editing pipeline](../api/pipelines/model_editing) helps you mitigate some of the incorrect implicit assumptions a pre-trained text-to-image
+diffusion model might make about the subjects present in the input prompt. For example, if you prompt Stable Diffusion to generate images for "A pack of roses", the roses in the generated images
+are more likely to be red. This pipeline helps you change that assumption.
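+
+A minimal sketch with [`StableDiffusionModelEditingPipeline`] (the checkpoint and prompts below are only examples):
+
+```py
+from diffusers import StableDiffusionModelEditingPipeline
+
+pipeline = StableDiffusionModelEditingPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
+
+# edit the implicit assumption "roses are red" towards blue roses
+pipeline.edit_model("A pack of roses", "A pack of blue roses")
+
+image = pipeline("A field of roses").images[0]
+```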
+
+## DiffEdit
+
+[Paper](https://arxiv.org/abs/2210.11427)
+
+[DiffEdit](../api/pipelines/diffedit) allows for semantic editing of input images along with
+input prompts while preserving the original input images as much as possible.
+
+## T2I-Adapter
+
+[Paper](https://arxiv.org/abs/2302.08453)
+
+[T2I-Adapter](../api/pipelines/stable_diffusion/adapter) is an auxiliary network which adds an extra condition.
+There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch,
+depth maps, and semantic segmentations.
+
+## Fabric
+
+[Paper](https://arxiv.org/abs/2307.10159)
+
+[Fabric](https://github.com/huggingface/diffusers/tree/442017ccc877279bcf24fbe92f92d3d0def191b6/examples/community#stable-diffusion-fabric-pipeline) is a training-free
+approach applicable to a wide range of popular diffusion models, which exploits
+the self-attention layer present in the most widely used architectures to condition
+the diffusion process on a set of feedback images.
diff --git a/docs/source/en/using-diffusers/controlnet.md b/docs/source/en/using-diffusers/controlnet.md
new file mode 100644
index 0000000..2a1295d
--- /dev/null
+++ b/docs/source/en/using-diffusers/controlnet.md
@@ -0,0 +1,587 @@
+
+
+# ControlNet
+
+ControlNet is a type of model for controlling image diffusion models by conditioning the model with an additional input image. There are many types of conditioning inputs (canny edge, user sketching, human pose, depth, and more) you can use to control a diffusion model. This is hugely useful because it affords you greater control over image generation, making it easier to generate specific images without experimenting with different text prompts or denoising values as much.
+
+
+
+Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper v1 for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub.
+
+For Stable Diffusion XL (SDXL) ControlNet models, you can find them on the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, or you can browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) ones on the Hub.
+
+
+
+A ControlNet model has two sets of weights (or blocks) connected by a zero-convolution layer:
+
+- a *locked copy* keeps everything a large pretrained diffusion model has learned
+- a *trainable copy* is trained on the additional conditioning input
+
+Since the locked copy preserves the pretrained model, training and implementing a ControlNet on a new conditioning input is as fast as finetuning any other model because you aren't training the model from scratch.
+
+This guide will show you how to use ControlNet for text-to-image, image-to-image, inpainting, and more! There are many types of ControlNet conditioning inputs to choose from, but in this guide we'll only focus on several of them. Feel free to experiment with other conditioning inputs!
+
+Before you begin, make sure you have the following libraries installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install -q diffusers transformers accelerate opencv-python
+```
+
+## Text-to-image
+
+For text-to-image, you normally pass a text prompt to the model. But with ControlNet, you can specify an additional conditioning input. Let's condition the model with a canny image, a white outline of an image on a black background. This way, the ControlNet can use the canny image as a control to guide the model to generate an image with the same outline.
+
+Load an image and use the [opencv-python](https://github.com/opencv/opencv-python) library to extract the canny image:
+
+```py
+from diffusers.utils import load_image, make_image_grid
+from PIL import Image
+import cv2
+import numpy as np
+
+original_image = load_image(
+ "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
+)
+
+image = np.array(original_image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+```
+
+
+
+
+ original image
+
+
+
+ canny image
+
+
+
+Next, load a ControlNet model conditioned on canny edge detection and pass it to the [`StableDiffusionControlNetPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to speed up inference and reduce memory usage.
+
+```py
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
+import torch
+
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
+pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
+)
+
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+pipe.enable_model_cpu_offload()
+```
+
+Now pass your prompt and canny image to the pipeline:
+
+```py
+output = pipe(
+ "the mona lisa", image=canny_image
+).images[0]
+make_image_grid([original_image, canny_image, output], rows=1, cols=3)
+```
+
+
+
+
+
+## Image-to-image
+
+For image-to-image, you'd typically pass an initial image and a prompt to the pipeline to generate a new image. With ControlNet, you can pass an additional conditioning input to guide the model. Let's condition the model with a depth map, an image which contains spatial information. This way, the ControlNet can use the depth map as a control to guide the model to generate an image that preserves spatial information.
+
+You'll use the [`StableDiffusionControlNetImg2ImgPipeline`] for this task, which is different from the [`StableDiffusionControlNetPipeline`] because it allows you to pass an initial image as the starting point for the image generation process.
+
+Load an image and use the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers to extract the depth map of an image:
+
+```py
+import torch
+import numpy as np
+
+from transformers import pipeline
+from diffusers.utils import load_image, make_image_grid
+
+image = load_image(
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
+)
+
+def get_depth_map(image, depth_estimator):
+ image = depth_estimator(image)["depth"]
+ image = np.array(image)
+ image = image[:, :, None]
+ image = np.concatenate([image, image, image], axis=2)
+ detected_map = torch.from_numpy(image).float() / 255.0
+ depth_map = detected_map.permute(2, 0, 1)
+ return depth_map
+
+depth_estimator = pipeline("depth-estimation")
+depth_map = get_depth_map(image, depth_estimator).unsqueeze(0).half().to("cuda")
+```
+
+Next, load a ControlNet model conditioned on depth maps and pass it to the [`StableDiffusionControlNetImg2ImgPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to speed up inference and reduce memory usage.
+
+```py
+from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
+import torch
+
+controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, use_safetensors=True)
+pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
+)
+
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+pipe.enable_model_cpu_offload()
+```
+
+Now pass your prompt, initial image, and depth map to the pipeline:
+
+```py
+output = pipe(
+ "lego batman and robin", image=image, control_image=depth_map,
+).images[0]
+make_image_grid([image, output], rows=1, cols=2)
+```
+
+
+
+
+ original image
+
+
+
+ generated image
+
+
+
+## Inpainting
+
+For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition a model with. Let's condition the model with an inpainting mask. This way, the ControlNet can use the inpainting mask as a control to guide the model to generate an image within the mask area.
+
+Load an initial image and a mask image:
+
+```py
+from diffusers.utils import load_image, make_image_grid
+
+init_image = load_image(
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"
+)
+init_image = init_image.resize((512, 512))
+
+mask_image = load_image(
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"
+)
+mask_image = mask_image.resize((512, 512))
+make_image_grid([init_image, mask_image], rows=1, cols=2)
+```
+
+Create a function to prepare the control image from the initial and mask images. This'll create a tensor to mark the pixels in `init_image` as masked if the corresponding pixel in `mask_image` is over a certain threshold.
+
+```py
+import numpy as np
+import torch
+
+def make_inpaint_condition(image, image_mask):
+ image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
+ image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0
+
+    assert image.shape[0:2] == image_mask.shape[0:2], "image and image_mask must have the same height and width"
+ image[image_mask > 0.5] = -1.0 # set as masked pixel
+ image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
+ image = torch.from_numpy(image)
+ return image
+
+control_image = make_inpaint_condition(init_image, mask_image)
+```
+
+
+
+
+ original image
+
+
+
+ mask image
+
+
+
+Load a ControlNet model conditioned on inpainting and pass it to the [`StableDiffusionControlNetInpaintPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to speed up inference and reduce memory usage.
+
+```py
+from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
+
+controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True)
+pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
+)
+
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+pipe.enable_model_cpu_offload()
+```
+
+Now pass your prompt, initial image, mask image, and control image to the pipeline:
+
+```py
+output = pipe(
+ "corgi face with large ears, detailed, pixar, animated, disney",
+ num_inference_steps=20,
+ eta=1.0,
+ image=init_image,
+ mask_image=mask_image,
+ control_image=control_image,
+).images[0]
+make_image_grid([init_image, mask_image, output], rows=1, cols=3)
+```
+
+
+
+
+
+## Guess mode
+
+[Guess mode](https://github.com/lllyasviel/ControlNet/discussions/188) does not require supplying a prompt to a ControlNet at all! This forces the ControlNet encoder to do its best to "guess" the contents of the input control map (depth map, pose estimation, canny edge, etc.).
+
+Guess mode adjusts the scale of the output residuals from a ControlNet by a fixed ratio depending on the block depth. The shallowest `DownBlock` corresponds to 0.1, and as the blocks get deeper, the scale increases exponentially such that the scale of the `MidBlock` output becomes 1.0.
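+
+Conceptually (this is an illustration of how the scaling behaves, not the library's exact code), the per-block scales follow a log-spaced ramp from 0.1 up to 1.0:
+
+```py
+import torch
+
+# 12 down-block residuals + 1 mid-block residual, scaled from 0.1 up to 1.0
+scales = torch.logspace(-1, 0, steps=13)
+print(scales[0], scales[-1])  # tensor(0.1000) tensor(1.)
+```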
+
+
+
+Guess mode does not have any impact on prompt conditioning and you can still provide a prompt if you want.
+
+
+
+Set `guess_mode=True` in the pipeline, and it is [recommended](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode) to set the `guidance_scale` value between 3.0 and 5.0.
+
+```py
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+from diffusers.utils import load_image, make_image_grid
+import numpy as np
+import torch
+from PIL import Image
+import cv2
+
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True)
+pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to("cuda")
+
+original_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/bird_512x512.png")
+
+image = np.array(original_image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+
+image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
+make_image_grid([original_image, canny_image, image], rows=1, cols=3)
+```
+
+
+
+
+ regular mode with prompt
+
+
+
+ guess mode without prompt
+
+
+
+## ControlNet with Stable Diffusion XL
+
+There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so they are easier to run on resource-constrained hardware. You can find these checkpoints on the [🤗 Diffusers Hub organization](https://huggingface.co/diffusers)!
+
+Let's use an SDXL ControlNet conditioned on canny images to generate an image. Start by loading an image and preparing the canny image:
+
+```py
+from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
+from diffusers.utils import load_image, make_image_grid
+from PIL import Image
+import cv2
+import numpy as np
+import torch
+
+original_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
+)
+
+image = np.array(original_image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+make_image_grid([original_image, canny_image], rows=1, cols=2)
+```
+
+
+
+
+ original image
+
+
+
+ canny image
+
+
+
+Load an SDXL ControlNet model conditioned on canny edge detection and pass it to the [`StableDiffusionXLControlNetPipeline`]. You can also enable model offloading to reduce memory usage.
+
+```py
+controlnet = ControlNetModel.from_pretrained(
+ "diffusers/controlnet-canny-sdxl-1.0",
+ torch_dtype=torch.float16,
+ use_safetensors=True
+)
+vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
+pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ controlnet=controlnet,
+ vae=vae,
+ torch_dtype=torch.float16,
+ use_safetensors=True
+)
+pipe.enable_model_cpu_offload()
+```
+
+Now pass your prompt (and optionally a negative prompt if you're using one) and canny image to the pipeline:
+
+
+
+The [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet#diffusers.StableDiffusionControlNetPipeline.__call__.controlnet_conditioning_scale) parameter determines how much weight to assign to the conditioning inputs. A value of 0.5 is recommended for good generalization, but feel free to experiment with this number!
+
+
+
+```py
+prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
+negative_prompt = 'low quality, bad quality, sketches'
+
+image = pipe(
+ prompt,
+ negative_prompt=negative_prompt,
+ image=canny_image,
+ controlnet_conditioning_scale=0.5,
+).images[0]
+make_image_grid([original_image, canny_image, image], rows=1, cols=3)
+```
+
+
+
+
+
+You can use [`StableDiffusionXLControlNetPipeline`] in guess mode as well by setting the parameter to `True`:
+
+```py
+from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
+from diffusers.utils import load_image, make_image_grid
+import numpy as np
+import torch
+import cv2
+from PIL import Image
+
+prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
+negative_prompt = "low quality, bad quality, sketches"
+
+original_image = load_image(
+ "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
+)
+
+controlnet = ControlNetModel.from_pretrained(
+ "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
+)
+vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
+pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16, use_safetensors=True
+)
+pipe.enable_model_cpu_offload()
+
+image = np.array(original_image)
+image = cv2.Canny(image, 100, 200)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+
+image = pipe(
+ prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
+).images[0]
+make_image_grid([original_image, canny_image, image], rows=1, cols=3)
+```
+
+
+
+You can use a refiner model with `StableDiffusionXLControlNetPipeline` to improve image quality, just like you can with a regular `StableDiffusionXLPipeline`.
+See the [Refine image quality](./sdxl#refine-image-quality) section to learn how to use the refiner model.
+Make sure to use `StableDiffusionXLControlNetPipeline` and pass `image` and `controlnet_conditioning_scale`.
+
+```py
+base = StableDiffusionXLControlNetPipeline(...)
+image = base(
+ prompt=prompt,
+ controlnet_conditioning_scale=0.5,
+ image=canny_image,
+ num_inference_steps=40,
+ denoising_end=0.8,
+ output_type="latent",
+).images
+# rest exactly as with StableDiffusionXLPipeline
+```
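+
+A sketch of how the refiner step could follow (this assumes `image` holds the latents returned above and `prompt` is the same prompt; see the SDXL guide for details):
+
+```py
+import torch
+from diffusers import StableDiffusionXLImg2ImgPipeline
+
+refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-refiner-1.0",
+    text_encoder_2=base.text_encoder_2,
+    vae=base.vae,
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+).to("cuda")
+
+# continue denoising from where the base pipeline stopped
+refined_image = refiner(
+    prompt=prompt,
+    num_inference_steps=40,
+    denoising_start=0.8,
+    image=image,
+).images[0]
+```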
+
+
+
+## MultiControlNet
+
+
+
+Replace the SDXL model with a model like [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) to use multiple conditioning inputs with Stable Diffusion models.
+
+
+
+You can compose multiple ControlNet conditionings from different image inputs to create a *MultiControlNet*. To get better results, it is often helpful to:
+
+1. mask conditionings such that they don't overlap (for example, mask the area of a canny image where the pose conditioning is located)
+2. experiment with the [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet#diffusers.StableDiffusionControlNetPipeline.__call__.controlnet_conditioning_scale) parameter to determine how much weight to assign to each conditioning input
+
+In this example, you'll combine a canny image and a human pose estimation image to generate a new image.
+
+Prepare the canny image conditioning:
+
+```py
+from diffusers.utils import load_image, make_image_grid
+from PIL import Image
+import numpy as np
+import cv2
+
+original_image = load_image(
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
+)
+image = np.array(original_image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+
+# zero out middle columns of image where pose will be overlaid
+zero_start = image.shape[1] // 4
+zero_end = zero_start + image.shape[1] // 2
+image[:, zero_start:zero_end] = 0
+
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+make_image_grid([original_image, canny_image], rows=1, cols=2)
+```
+
+
+
+
+ original image
+
+
+
+ canny image
+
+
+
+For human pose estimation, install [controlnet_aux](https://github.com/patrickvonplaten/controlnet_aux):
+
+```py
+# uncomment to install the necessary library in Colab
+#!pip install -q controlnet-aux
+```
+
+Prepare the human pose estimation conditioning:
+
+```py
+from controlnet_aux import OpenposeDetector
+
+openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
+original_image = load_image(
+ "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
+)
+openpose_image = openpose(original_image)
+make_image_grid([original_image, openpose_image], rows=1, cols=2)
+```
+
+
+
+
+ original image
+
+
+
+ human pose image
+
+
+
+Load a list of ControlNet models that correspond to each conditioning, and pass them to the [`StableDiffusionXLControlNetPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to reduce memory usage.
+
+```py
+from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL, UniPCMultistepScheduler
+import torch
+
+controlnets = [
+ ControlNetModel.from_pretrained(
+ "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16
+ ),
+ ControlNetModel.from_pretrained(
+ "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
+ ),
+]
+
+vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
+pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16, use_safetensors=True
+)
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+pipe.enable_model_cpu_offload()
+```
+
+Now you can pass your prompt (and optionally a negative prompt if you're using one), canny image, and pose image to the pipeline:
+
+```py
+prompt = "a giant standing in a fantasy landscape, best quality"
+negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
+
+generator = torch.manual_seed(1)
+
+images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]
+
+images = pipe(
+ prompt,
+ image=images,
+ num_inference_steps=25,
+ generator=generator,
+ negative_prompt=negative_prompt,
+ num_images_per_prompt=3,
+ controlnet_conditioning_scale=[1.0, 0.8],
+).images
+make_image_grid([original_image, canny_image, openpose_image,
+ images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))], rows=2, cols=3)
+```
+
+
+
+
diff --git a/docs/source/en/using-diffusers/custom_pipeline_examples.md b/docs/source/en/using-diffusers/custom_pipeline_examples.md
new file mode 100644
index 0000000..203302e
--- /dev/null
+++ b/docs/source/en/using-diffusers/custom_pipeline_examples.md
@@ -0,0 +1,119 @@
+
+
+# Community pipelines
+
+[[open-in-colab]]
+
+
+
+For more context about the design choices behind community pipelines, please have a look at [this issue](https://github.com/huggingface/diffusers/issues/841).
+
+
+
+Community pipelines allow you to get creative and build your own unique pipelines to share with the community. You can find all community pipelines in the [diffusers/examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) folder along with inference and training examples for how to use them. This guide showcases some of the community pipelines and hopefully it'll inspire you to create your own (feel free to open a PR with your own pipeline and we will merge it!).
+
+To load a community pipeline, use the `custom_pipeline` argument in [`DiffusionPipeline`] to specify one of the files in [diffusers/examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community):
+
+```py
+from diffusers import DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder", use_safetensors=True
+)
+```
+
+If a community pipeline doesn't work as expected, please open a GitHub issue and mention the author.
+
+You can learn more about community pipelines in the how to [load community pipelines](custom_pipeline_overview) and how to [contribute a community pipeline](contribute_pipeline) guides.
+
+## Multilingual Stable Diffusion
+
+The multilingual Stable Diffusion pipeline uses a pretrained [XLM-RoBERTa](https://huggingface.co/papluca/xlm-roberta-base-language-detection) to identify a language and the [mBART-large-50](https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt) model to handle the translation. This allows you to generate images from text in 20 languages.
+
+```py
+import torch
+from diffusers import DiffusionPipeline
+from diffusers.utils import make_image_grid
+from transformers import (
+ pipeline,
+ MBart50TokenizerFast,
+ MBartForConditionalGeneration,
+)
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+device_dict = {"cuda": 0, "cpu": -1}
+
+# add language detection pipeline
+language_detection_model_ckpt = "papluca/xlm-roberta-base-language-detection"
+language_detection_pipeline = pipeline("text-classification",
+ model=language_detection_model_ckpt,
+ device=device_dict[device])
+
+# add model for language translation
+translation_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
+translation_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device)
+
+diffuser_pipeline = DiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ custom_pipeline="multilingual_stable_diffusion",
+ detection_pipeline=language_detection_pipeline,
+ translation_model=translation_model,
+ translation_tokenizer=translation_tokenizer,
+ torch_dtype=torch.float16,
+)
+
+diffuser_pipeline.enable_attention_slicing()
+diffuser_pipeline = diffuser_pipeline.to(device)
+
+prompt = ["a photograph of an astronaut riding a horse",
+ "Una casa en la playa",
+ "Ein Hund, der Orange isst",
+ "Un restaurant parisien"]
+
+images = diffuser_pipeline(prompt).images
+make_image_grid(images, rows=2, cols=2)
+```
+
+
+
+
+
+## MagicMix
+
+[MagicMix](https://huggingface.co/papers/2210.16056) is a pipeline that can mix an image and text prompt to generate a new image that preserves the image structure. The `mix_factor` determines how much influence the prompt has on the layout generation, `kmin` controls the number of steps during the content generation process, and `kmax` determines how much information is kept in the layout of the original image.
+
+```py
+from diffusers import DiffusionPipeline, DDIMScheduler
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ custom_pipeline="magic_mix",
+ scheduler=DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler"),
+).to('cuda')
+
+img = load_image("https://user-images.githubusercontent.com/59410571/209578593-141467c7-d831-4792-8b9a-b17dc5e47816.jpg")
+mix_img = pipeline(img, prompt="bed", kmin=0.3, kmax=0.5, mix_factor=0.5)
+make_image_grid([img, mix_img], rows=1, cols=2)
+```
+
+
+
+
+ original image
+
+
+
+ image and text prompt mix
+
+
diff --git a/docs/source/en/using-diffusers/custom_pipeline_overview.md b/docs/source/en/using-diffusers/custom_pipeline_overview.md
new file mode 100644
index 0000000..70283c3
--- /dev/null
+++ b/docs/source/en/using-diffusers/custom_pipeline_overview.md
@@ -0,0 +1,243 @@
+
+
+# Load community pipelines and components
+
+[[open-in-colab]]
+
+## Community pipelines
+
+Community pipelines are any [`DiffusionPipeline`] class that differs from the original implementation specified in its paper (for example, the [`StableDiffusionControlNetPipeline`] corresponds to the [Text-to-Image Generation with ControlNet Conditioning](https://arxiv.org/abs/2302.05543) paper). They provide additional functionality or extend the original implementation of a pipeline.
+
+There are many cool community pipelines like [Speech to Image](https://github.com/huggingface/diffusers/tree/main/examples/community#speech-to-image) or [Composable Stable Diffusion](https://github.com/huggingface/diffusers/tree/main/examples/community#composable-stable-diffusion), and you can find all the official community pipelines [here](https://github.com/huggingface/diffusers/tree/main/examples/community).
+
+To load any community pipeline on the Hub, pass the repository id of the community pipeline to the `custom_pipeline` argument along with the model repository where you'd like to load the pipeline weights and components from. For example, the code below loads a dummy pipeline from [`hf-internal-testing/diffusers-dummy-pipeline`](https://huggingface.co/hf-internal-testing/diffusers-dummy-pipeline/blob/main/pipeline.py) and the pipeline weights and components from [`google/ddpm-cifar10-32`](https://huggingface.co/google/ddpm-cifar10-32):
+
+
+
+🔒 By loading a community pipeline from the Hugging Face Hub, you are trusting that the code you are loading is safe. Make sure to inspect the code online before loading and running it automatically!
+
+
+
+```py
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "google/ddpm-cifar10-32", custom_pipeline="hf-internal-testing/diffusers-dummy-pipeline", use_safetensors=True
+)
+```
+
+Loading an official community pipeline is similar, but you can mix loading weights from an official repository id with passing pipeline components directly. The example below loads the community [CLIP Guided Stable Diffusion](https://github.com/huggingface/diffusers/tree/main/examples/community#clip-guided-stable-diffusion) pipeline, and you can pass the CLIP model components directly to it:
+
+```py
+from diffusers import DiffusionPipeline
+from transformers import CLIPImageProcessor, CLIPModel
+
+clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
+
+feature_extractor = CLIPImageProcessor.from_pretrained(clip_model_id)
+clip_model = CLIPModel.from_pretrained(clip_model_id)
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ custom_pipeline="clip_guided_stable_diffusion",
+ clip_model=clip_model,
+ feature_extractor=feature_extractor,
+ use_safetensors=True,
+)
+```
+
+### Load from a local file
+
+Community pipelines can also be loaded from a local directory if you pass its path to `custom_pipeline` instead. The directory must contain a `pipeline.py` file that defines the pipeline class in order to be loaded successfully.
+
+```py
+pipeline = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ custom_pipeline="./path/to/pipeline_directory/",
+ clip_model=clip_model,
+ feature_extractor=feature_extractor,
+ use_safetensors=True,
+)
+```
+
+### Load from a specific version
+
+By default, community pipelines are loaded from the latest stable version of Diffusers. To load a community pipeline from another version, use the `custom_revision` parameter.
+
+
+
+
+For example, to load from the `main` branch:
+
+```py
+pipeline = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ custom_pipeline="clip_guided_stable_diffusion",
+ custom_revision="main",
+ clip_model=clip_model,
+ feature_extractor=feature_extractor,
+ use_safetensors=True,
+)
+```
+
+
+
+
+For example, to load from a previous version of Diffusers like `v0.25.0`:
+
+```py
+pipeline = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ custom_pipeline="clip_guided_stable_diffusion",
+ custom_revision="v0.25.0",
+ clip_model=clip_model,
+ feature_extractor=feature_extractor,
+ use_safetensors=True,
+)
+```
+
+
+
+
+
+For more information about community pipelines, take a look at the [Community pipelines](custom_pipeline_examples) guide for how to use them, and if you're interested in adding a community pipeline, check out the [How to contribute a community pipeline](contribute_pipeline) guide!
+
+## Community components
+
+Community components allow users to build pipelines that may have customized components that are not a part of Diffusers. If your pipeline has custom components that Diffusers doesn't already support, you need to provide their implementations as Python modules. These customized components could be a VAE, UNet, and scheduler. In most cases, the text encoder is imported from the Transformers library. The pipeline code itself can also be customized.
+
+This section shows how users should use community components to build a community pipeline.
+
+You'll use the [showlab/show-1-base](https://huggingface.co/showlab/show-1-base) pipeline checkpoint as an example. So, let's start loading the components:
+
+1. Import and load the text encoder from Transformers:
+
+```python
+from transformers import T5Tokenizer, T5EncoderModel
+
+pipe_id = "showlab/show-1-base"
+tokenizer = T5Tokenizer.from_pretrained(pipe_id, subfolder="tokenizer")
+text_encoder = T5EncoderModel.from_pretrained(pipe_id, subfolder="text_encoder")
+```
+
+2. Load a scheduler:
+
+```python
+from diffusers import DPMSolverMultistepScheduler
+
+scheduler = DPMSolverMultistepScheduler.from_pretrained(pipe_id, subfolder="scheduler")
+```
+
+3. Load an image processor:
+
+```python
+from transformers import CLIPFeatureExtractor
+
+feature_extractor = CLIPFeatureExtractor.from_pretrained(pipe_id, subfolder="feature_extractor")
+```
+
+
+
+In steps 4 and 5, the custom [UNet](https://github.com/showlab/Show-1/blob/main/showone/models/unet_3d_condition.py) and [pipeline](https://huggingface.co/sayakpaul/show-1-base-with-code/blob/main/pipeline_t2v_base_pixel.py) implementation must match the format shown in their files for this example to work.
+
+
+
+4. Now you'll load a [custom UNet](https://github.com/showlab/Show-1/blob/main/showone/models/unet_3d_condition.py), which in this example has already been implemented in the `showone_unet_3d_condition.py` [script](https://huggingface.co/sayakpaul/show-1-base-with-code/blob/main/unet/showone_unet_3d_condition.py) for your convenience. You'll notice the `UNet3DConditionModel` class name is changed to `ShowOneUNet3DConditionModel` because [`UNet3DConditionModel`] already exists in Diffusers. Any components needed for the `ShowOneUNet3DConditionModel` class should be placed in the `showone_unet_3d_condition.py` script.
+
+Once this is done, you can initialize the UNet:
+
+```python
+from showone_unet_3d_condition import ShowOneUNet3DConditionModel
+
+unet = ShowOneUNet3DConditionModel.from_pretrained(pipe_id, subfolder="unet")
+```
+
+5. Finally, you'll load the custom pipeline code. For this example, it has already been created for you in the `pipeline_t2v_base_pixel.py` [script](https://huggingface.co/sayakpaul/show-1-base-with-code/blob/main/pipeline_t2v_base_pixel.py). This script contains a custom `TextToVideoIFPipeline` class for generating videos from text. Just like the custom UNet, any code needed for the custom pipeline to work should go in the `pipeline_t2v_base_pixel.py` script.
+
+Once everything is in place, you can initialize the `TextToVideoIFPipeline` with the `ShowOneUNet3DConditionModel`:
+
+```python
+from pipeline_t2v_base_pixel import TextToVideoIFPipeline
+import torch
+
+pipeline = TextToVideoIFPipeline(
+ unet=unet,
+ text_encoder=text_encoder,
+ tokenizer=tokenizer,
+ scheduler=scheduler,
+ feature_extractor=feature_extractor
+)
+pipeline = pipeline.to(device="cuda")
+pipeline.torch_dtype = torch.float16
+```
+
+Push the pipeline to the Hub to share with the community!
+
+```python
+pipeline.push_to_hub("custom-t2v-pipeline")
+```
+
+After the pipeline is successfully pushed, you need to make a couple of changes:
+
+1. Change the `_class_name` attribute in [`model_index.json`](https://huggingface.co/sayakpaul/show-1-base-with-code/blob/main/model_index.json#L2) to `"pipeline_t2v_base_pixel"` and `"TextToVideoIFPipeline"`.
+2. Upload `showone_unet_3d_condition.py` to the `unet` [directory](https://huggingface.co/sayakpaul/show-1-base-with-code/blob/main/unet/showone_unet_3d_condition.py).
+3. Upload `pipeline_t2v_base_pixel.py` to the base [directory](https://huggingface.co/sayakpaul/show-1-base-with-code/blob/main/pipeline_t2v_base_pixel.py) of the pipeline repository.
+
+To run inference, simply add the `trust_remote_code` argument while initializing the pipeline to handle all the "magic" behind the scenes.
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "/", trust_remote_code=True, torch_dtype=torch.float16
+).to("cuda")
+
+prompt = "hello"
+
+# Text embeds
+prompt_embeds, negative_embeds = pipeline.encode_prompt(prompt)
+
+# Keyframes generation (8x64x40, 2fps)
+video_frames = pipeline(
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_embeds,
+ num_frames=8,
+ height=40,
+ width=64,
+ num_inference_steps=2,
+ guidance_scale=9.0,
+ output_type="pt"
+).frames
+```
+
+As an additional reference, you can look at the repository structure of [stabilityai/japanese-stable-diffusion-xl](https://huggingface.co/stabilityai/japanese-stable-diffusion-xl/), which makes use of the `trust_remote_code` feature:
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/japanese-stable-diffusion-xl", trust_remote_code=True
+)
+pipeline.to("cuda")
+
+# if using torch < 2.0
+# pipeline.enable_xformers_memory_efficient_attention()
+
+prompt = "ๆด็ฌใใซใฉใใซใขใผใ"
+
+image = pipeline(prompt=prompt).images[0]
+
+```
\ No newline at end of file
diff --git a/docs/source/en/using-diffusers/depth2img.md b/docs/source/en/using-diffusers/depth2img.md
new file mode 100644
index 0000000..c092972
--- /dev/null
+++ b/docs/source/en/using-diffusers/depth2img.md
@@ -0,0 +1,46 @@
+
+
+# Text-guided depth-to-image generation
+
+[[open-in-colab]]
+
+The [`StableDiffusionDepth2ImgPipeline`] lets you pass a text prompt and an initial image to condition the generation of new images. In addition, you can also pass a `depth_map` to preserve the image structure. If no `depth_map` is provided, the pipeline automatically predicts the depth via an integrated [depth-estimation model](https://github.com/isl-org/MiDaS).
+
+Start by creating an instance of the [`StableDiffusionDepth2ImgPipeline`]:
+
+```python
+import torch
+from diffusers import StableDiffusionDepth2ImgPipeline
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = StableDiffusionDepth2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-depth",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+).to("cuda")
+```
+
+Now pass your prompt to the pipeline. You can also pass a `negative_prompt` to prevent certain words from guiding how an image is generated:
+
+```python
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+init_image = load_image(url)
+prompt = "two tigers"
+negative_prompt = "bad, deformed, ugly, bad anatomy"
+image = pipeline(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+| Input | Output |
+|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
+| | |
diff --git a/docs/source/en/using-diffusers/diffedit.md b/docs/source/en/using-diffusers/diffedit.md
new file mode 100644
index 0000000..f7e19fd
--- /dev/null
+++ b/docs/source/en/using-diffusers/diffedit.md
@@ -0,0 +1,285 @@
+
+
+# DiffEdit
+
+[[open-in-colab]]
+
+Image editing typically requires providing a mask of the area to be edited. DiffEdit automatically generates the mask for you based on a text query, making it easier overall to create a mask without image editing software. The DiffEdit algorithm works in three steps:
+
+1. the diffusion model denoises an image conditioned on some query text and reference text which produces different noise estimates for different areas of the image; the difference is used to infer a mask to identify which area of the image needs to be changed to match the query text
+2. the input image is encoded into latent space with DDIM
+3. the latents are decoded with the diffusion model conditioned on the text query, using the mask as a guide such that pixels outside the mask remain the same as in the input image
+
+This guide will show you how to use DiffEdit to edit images without manually creating a mask.
+
+Before you begin, make sure you have the following libraries installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install -q diffusers transformers accelerate
+```
+
+The [`StableDiffusionDiffEditPipeline`] requires an image mask and a set of partially inverted latents. The image mask is generated from the [`~StableDiffusionDiffEditPipeline.generate_mask`] function, and it requires two parameters, `source_prompt` and `target_prompt`. These parameters determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then:
+
+```py
+source_prompt = "a bowl of fruits"
+target_prompt = "a bowl of pears"
+```
+
+The partially inverted latents are generated from the [`~StableDiffusionDiffEditPipeline.invert`] function, and it is generally a good idea to include a `prompt` or *caption* describing the image to help guide the inverse latent sampling process. The caption can often be your `source_prompt`, but feel free to experiment with other text descriptions!
+
+Let's load the pipeline, scheduler, inverse scheduler, and enable some optimizations to reduce memory usage:
+
+```py
+import torch
+from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
+
+pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-1",
+ torch_dtype=torch.float16,
+ safety_checker=None,
+ use_safetensors=True,
+)
+pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
+pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
+pipeline.enable_model_cpu_offload()
+pipeline.enable_vae_slicing()
+```
+
+Load the image to edit:
+
+```py
+from diffusers.utils import load_image, make_image_grid
+
+img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
+raw_image = load_image(img_url).resize((768, 768))
+raw_image
+```
+
+Use the [`~StableDiffusionDiffEditPipeline.generate_mask`] function to generate the image mask. You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image:
+
+```py
+from PIL import Image
+
+source_prompt = "a bowl of fruits"
+target_prompt = "a basket of pears"
+mask_image = pipeline.generate_mask(
+ image=raw_image,
+ source_prompt=source_prompt,
+ target_prompt=target_prompt,
+)
+Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
+```
+
+Next, create the inverted latents and pass it a caption describing the image:
+
+```py
+inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents
+```
+
+Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`:
+
+```py
+output_image = pipeline(
+ prompt=target_prompt,
+ mask_image=mask_image,
+ image_latents=inv_latents,
+ negative_prompt=source_prompt,
+).images[0]
+mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
+make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)
+```
+
+
+
+
+ original image
+
+
+
+ edited image
+
+
+
+## Generate source and target embeddings
+
+The source and target embeddings can be automatically generated with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model instead of creating them manually.
+
+Load the Flan-T5 model and tokenizer from the 🤗 Transformers library:
+
+```py
+import torch
+from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
+model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)
+```
+
+Provide some initial text to prompt the model to generate the source and target prompts.
+
+```py
+source_concept = "bowl"
+target_concept = "basket"
+
+source_text = f"Provide a caption for images containing a {source_concept}. "
+"The captions should be in English and should be no longer than 150 characters."
+
+target_text = f"Provide a caption for images containing a {target_concept}. "
+"The captions should be in English and should be no longer than 150 characters."
+```
+
+Next, create a utility function to generate the prompts:
+
+```py
+@torch.no_grad()
+def generate_prompts(input_prompt):
+ input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
+
+ outputs = model.generate(
+ input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
+ )
+ return tokenizer.batch_decode(outputs, skip_special_tokens=True)
+
+source_prompts = generate_prompts(source_text)
+target_prompts = generate_prompts(target_text)
+print(source_prompts)
+print(target_prompts)
+```
+
+
+
+Check out the [generation strategy](https://huggingface.co/docs/transformers/main/en/generation_strategies) guide if you're interested in learning more about strategies for generating different quality text.
+
+
+
+Load the text encoder model used by the [`StableDiffusionDiffEditPipeline`] to encode the text. You'll use the text encoder to compute the text embeddings:
+
+```py
+import torch
+from diffusers import StableDiffusionDiffEditPipeline
+
+pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+pipeline.enable_vae_slicing()
+
+@torch.no_grad()
+def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"):
+    embeddings = []
+    for sent in sentences:
+        text_inputs = tokenizer(
+            sent,
+            padding="max_length",
+            max_length=tokenizer.model_max_length,
+            truncation=True,
+            return_tensors="pt",
+        )
+        text_input_ids = text_inputs.input_ids
+        prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0]
+        embeddings.append(prompt_embeds)
+    return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0)
+
+source_embeds = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder)
+target_embeds = embed_prompts(target_prompts, pipeline.tokenizer, pipeline.text_encoder)
+```
+
+Finally, pass the embeddings to the [`~StableDiffusionDiffEditPipeline.generate_mask`] and [`~StableDiffusionDiffEditPipeline.invert`] functions, and to the pipeline to generate the image:
+
+```diff
+ from diffusers import DDIMInverseScheduler, DDIMScheduler
+ from diffusers.utils import load_image, make_image_grid
+ from PIL import Image
+
+ pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
+ pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
+
+ img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
+ raw_image = load_image(img_url).resize((768, 768))
+
+ mask_image = pipeline.generate_mask(
+ image=raw_image,
+- source_prompt=source_prompt,
+- target_prompt=target_prompt,
++ source_prompt_embeds=source_embeds,
++ target_prompt_embeds=target_embeds,
+ )
+
+ inv_latents = pipeline.invert(
+- prompt=source_prompt,
++ prompt_embeds=source_embeds,
+ image=raw_image,
+ ).latents
+
+ output_image = pipeline(
+ mask_image=mask_image,
+ image_latents=inv_latents,
+- prompt=target_prompt,
+- negative_prompt=source_prompt,
++ prompt_embeds=target_embeds,
++ negative_prompt_embeds=source_embeds,
+ ).images[0]
+ mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L")
+ make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)
+```
+
+## Generate a caption for inversion
+
+While you can use the `source_prompt` as a caption to help generate the partially inverted latents, you can also use the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model to automatically generate a caption.
+
+Load the BLIP model and processor from the 🤗 Transformers library:
+
+```py
+import torch
+from transformers import BlipForConditionalGeneration, BlipProcessor
+
+processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
+model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16, low_cpu_mem_usage=True)
+```
+
+Create a utility function to generate a caption from the input image:
+
+```py
+@torch.no_grad()
+def generate_caption(images, caption_generator, caption_processor):
+ text = "a photograph of"
+
+ inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype)
+ caption_generator.to("cuda")
+ outputs = caption_generator.generate(**inputs, max_new_tokens=128)
+
+ # offload caption generator
+ caption_generator.to("cpu")
+
+ caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0]
+ return caption
+```
+
+Load an input image and generate a caption for it using the `generate_caption` function:
+
+```py
+from diffusers.utils import load_image
+
+img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
+raw_image = load_image(img_url).resize((768, 768))
+caption = generate_caption(raw_image, model, processor)
+```
+
+*generated caption: "a photograph of a bowl of fruit on a table"*
+
+Now you can drop the caption into the [`~StableDiffusionDiffEditPipeline.invert`] function to generate the partially inverted latents!
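+
+For example, a minimal sketch reusing the `pipeline`, `raw_image`, and `caption` from the sections above:
+
+```py
+# illustrative only: pass the generated BLIP caption as the prompt for inversion
+inv_latents = pipeline.invert(prompt=caption, image=raw_image).latents
+```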
diff --git a/docs/source/en/using-diffusers/distilled_sd.md b/docs/source/en/using-diffusers/distilled_sd.md
new file mode 100644
index 0000000..c4c5f7a
--- /dev/null
+++ b/docs/source/en/using-diffusers/distilled_sd.md
@@ -0,0 +1,133 @@
+
+
+# Distilled Stable Diffusion inference
+
+[[open-in-colab]]
+
+Stable Diffusion inference can be a computationally intensive process because it must iteratively denoise the latents to generate an image. To reduce the computational burden, you can use a *distilled* version of the Stable Diffusion model from [Nota AI](https://huggingface.co/nota-ai). The distilled version of their Stable Diffusion model eliminates some of the residual and attention blocks from the UNet, reducing the model size by 51% and improving latency on CPU/GPU by 43%.
+
+
+
+Read this [blog post](https://huggingface.co/blog/sd_distillation) to learn more about how knowledge distillation training works to produce a faster, smaller, and cheaper generative model.
+
+
+
+Let's load the distilled Stable Diffusion model and compare it against the original Stable Diffusion model:
+
+```py
+from diffusers import StableDiffusionPipeline
+import torch
+
+distilled = StableDiffusionPipeline.from_pretrained(
+ "nota-ai/bk-sdm-small", torch_dtype=torch.float16, use_safetensors=True,
+).to("cuda")
+
+original = StableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16, use_safetensors=True,
+).to("cuda")
+```
+
+Given a prompt, get the inference time for the original model:
+
+```py
+import time
+
+seed = 2023
+generator = torch.manual_seed(seed)
+
+NUM_ITERS_TO_RUN = 3
+NUM_INFERENCE_STEPS = 25
+NUM_IMAGES_PER_PROMPT = 4
+
+prompt = "a golden vase with different flowers"
+
+start = time.time_ns()
+for _ in range(NUM_ITERS_TO_RUN):
+ images = original(
+ prompt,
+ num_inference_steps=NUM_INFERENCE_STEPS,
+ generator=generator,
+ num_images_per_prompt=NUM_IMAGES_PER_PROMPT
+ ).images
+end = time.time_ns()
+original_sd = f"{(end - start) / 1e6:.1f}"
+
+print(f"Execution time -- {original_sd} ms\n")
+"Execution time -- 45781.5 ms"
+```
+
+Time the distilled model inference:
+
+```py
+start = time.time_ns()
+for _ in range(NUM_ITERS_TO_RUN):
+ images = distilled(
+ prompt,
+ num_inference_steps=NUM_INFERENCE_STEPS,
+ generator=generator,
+ num_images_per_prompt=NUM_IMAGES_PER_PROMPT
+ ).images
+end = time.time_ns()
+
+distilled_sd = f"{(end - start) / 1e6:.1f}"
+print(f"Execution time -- {distilled_sd} ms\n")
+"Execution time -- 29884.2 ms"
+```
+
+*original Stable Diffusion (45781.5 ms) | distilled Stable Diffusion (29884.2 ms)*
+
+## Tiny AutoEncoder
+
+To speed inference up even more, use a tiny distilled version of the [Stable Diffusion VAE](https://huggingface.co/sayakpaul/taesdxl-diffusers) to denoise the latents into images. Replace the VAE in the distilled Stable Diffusion model with the tiny VAE:
+
+```py
+from diffusers import AutoencoderTiny
+
+distilled.vae = AutoencoderTiny.from_pretrained(
+ "sayakpaul/taesd-diffusers", torch_dtype=torch.float16, use_safetensors=True,
+).to("cuda")
+```
+
+Time the distilled model and distilled VAE inference:
+
+```py
+start = time.time_ns()
+for _ in range(NUM_ITERS_TO_RUN):
+ images = distilled(
+ prompt,
+ num_inference_steps=NUM_INFERENCE_STEPS,
+ generator=generator,
+ num_images_per_prompt=NUM_IMAGES_PER_PROMPT
+ ).images
+end = time.time_ns()
+
+distilled_tiny_sd = f"{(end - start) / 1e6:.1f}"
+print(f"Execution time -- {distilled_tiny_sd} ms\n")
+"Execution time -- 27165.7 ms"
+```
+
+
diff --git a/docs/source/en/using-diffusers/freeu.md b/docs/source/en/using-diffusers/freeu.md
new file mode 100644
index 0000000..7b1fb90
--- /dev/null
+++ b/docs/source/en/using-diffusers/freeu.md
@@ -0,0 +1,135 @@
+
+
+# Improve generation quality with FreeU
+
+[[open-in-colab]]
+
+The UNet is responsible for denoising during the reverse diffusion process, and there are two distinct features in its architecture:
+
+1. Backbone features primarily contribute to the denoising process
+2. Skip features mainly introduce high-frequency features into the decoder module and can make the network overlook the semantics in the backbone features
+
+However, the skip connection can sometimes introduce unnatural image details. [FreeU](https://hf.co/papers/2309.11497) is a technique for improving image quality by rebalancing the contributions from the UNet's skip connections and backbone feature maps.
+
+FreeU is applied during inference and it does not require any additional training. The technique works for different tasks such as text-to-image, image-to-image, and text-to-video.
+
+In this guide, you will apply FreeU to the [`StableDiffusionPipeline`], [`StableDiffusionXLPipeline`], and [`TextToVideoSDPipeline`]. You need to install Diffusers from source to run the examples below.
+
+## StableDiffusionPipeline
+
+Load the pipeline:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, safety_checker=None
+).to("cuda")
+```
+
+Then enable the FreeU mechanism with the FreeU-specific hyperparameters. These values are scaling factors for the backbone and skip features.
+
+```py
+pipeline.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)
+```
+
+The values above are from the official FreeU [code repository](https://github.com/ChenyangSi/FreeU) where you can also find [reference hyperparameters](https://github.com/ChenyangSi/FreeU#range-for-more-parameters) for different models.
+
+
+
+Disable the FreeU mechanism by calling `disable_freeu()` on a pipeline.
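+
+For example, to turn it back off on the pipeline above:
+
+```py
+# disable FreeU for subsequent generations
+pipeline.disable_freeu()
+```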
+
+
+
+And then run inference:
+
+```py
+prompt = "A squirrel eating a burger"
+seed = 2023
+image = pipeline(prompt, generator=torch.manual_seed(seed)).images[0]
+image
+```
+
+The figure below compares the results without FreeU and with FreeU for the same `prompt` and `seed` used above:
+
+
+
+
+Let's see how Stable Diffusion 2 results are impacted:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, safety_checker=None
+).to("cuda")
+
+prompt = "A squirrel eating a burger"
+seed = 2023
+
+pipeline.enable_freeu(s1=0.9, s2=0.2, b1=1.1, b2=1.2)
+image = pipeline(prompt, generator=torch.manual_seed(seed)).images[0]
+image
+```
+
+
+
+## Stable Diffusion XL
+
+Finally, let's take a look at how FreeU affects Stable Diffusion XL results:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16,
+).to("cuda")
+
+prompt = "A squirrel eating a burger"
+seed = 2023
+
+# Comes from
+# https://wandb.ai/nasirk24/UNET-FreeU-SDXL/reports/FreeU-SDXL-Optimal-Parameters--Vmlldzo1NDg4NTUw
+pipeline.enable_freeu(s1=0.6, s2=0.4, b1=1.1, b2=1.2)
+image = pipeline(prompt, generator=torch.manual_seed(seed)).images[0]
+image
+```
+
+
+
+## Text-to-video generation
+
+FreeU can also be used to improve video quality:
+
+```python
+from diffusers import DiffusionPipeline
+from diffusers.utils import export_to_video
+import torch
+
+model_id = "cerspense/zeroscope_v2_576w"
+pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+
+prompt = "an astronaut riding a horse on mars"
+seed = 2023
+
+# The values come from
+# https://github.com/lyn-rgb/FreeU_Diffusers#video-pipelines
+pipe.enable_freeu(b1=1.2, b2=1.4, s1=0.9, s2=0.2)
+video_frames = pipe(prompt, height=320, width=576, num_frames=30, generator=torch.manual_seed(seed)).frames[0]
+export_to_video(video_frames, "astronaut_rides_horse.mp4")
+```
+
+Thanks to [kadirnar](https://github.com/kadirnar/) for helping to integrate the feature, and to [justindujardin](https://github.com/justindujardin) for the helpful discussions.
diff --git a/docs/source/en/using-diffusers/img2img.md b/docs/source/en/using-diffusers/img2img.md
new file mode 100644
index 0000000..0ebe146
--- /dev/null
+++ b/docs/source/en/using-diffusers/img2img.md
@@ -0,0 +1,605 @@
+
+
+# Image-to-image
+
+[[open-in-colab]]
+
+Image-to-image is similar to [text-to-image](conditional_image_generation), but in addition to a prompt, you can also pass an initial image as a starting point for the diffusion process. The initial image is encoded to latent space and noise is added to it. Then the latent diffusion model takes a prompt and the noisy latent image, predicts the added noise, and removes the predicted noise from the initial latent image to get the new latent image. Lastly, a decoder decodes the new latent image back into an image.
+
+With 🤗 Diffusers, this is as easy as 1-2-3:
+
+1. Load a checkpoint into the [`AutoPipelineForImage2Image`] class; this pipeline automatically handles loading the correct pipeline class based on the checkpoint:
+
+```py
+import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+```
+
+
+
+Throughout the guide, you'll notice we use [`~DiffusionPipeline.enable_model_cpu_offload`] and [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] to save memory and increase inference speed. If you're using PyTorch 2.0, then you don't need to call [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] on your pipeline because it'll already be using PyTorch 2.0's native [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention).
+
+
+
+2. Load an image to pass to the pipeline:
+
+```py
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
+```
+
+3. Pass a prompt and image to the pipeline to generate an image:
+
+```py
+prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
+image = pipeline(prompt, image=init_image).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+*initial image | generated image*
+
+## Popular models
+
+The most popular image-to-image models are [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and [Kandinsky 2.2](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder). The results from the Stable Diffusion and Kandinsky models vary due to their architecture differences and training process; you can generally expect SDXL to produce higher quality images than Stable Diffusion v1.5. Let's take a quick look at how to use each of these models and compare their results.
+
+### Stable Diffusion v1.5
+
+Stable Diffusion v1.5 is a latent diffusion model initialized from an earlier checkpoint, and further finetuned for 595K steps on 512x512 images. To use this pipeline for image-to-image, you'll need to prepare an initial image to pass to the pipeline. Then you can pass a prompt and the image to the pipeline to generate a new image:
+
+```py
+import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+image = pipeline(prompt, image=init_image).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+*initial image | generated image*
+
+### Stable Diffusion XL (SDXL)
+
+SDXL is a more powerful version of the Stable Diffusion model. It uses a larger base model, and an additional refiner model to increase the quality of the base model's output. Read the [SDXL](sdxl) guide for a more detailed walkthrough of how to use this model, and other techniques it uses to produce high quality images.
+
+```py
+import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png"
+init_image = load_image(url)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+image = pipeline(prompt, image=init_image, strength=0.5).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+*initial image | generated image*
+
+### Kandinsky 2.2
+
+The Kandinsky model is different from the Stable Diffusion models because it uses an image prior model to create image embeddings. The embeddings help create a better alignment between text and images, allowing the latent diffusion model to generate better images.
+
+The simplest way to use Kandinsky 2.2 is:
+
+```py
+import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+image = pipeline(prompt, image=init_image).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+*initial image | generated image*
+
+## Configure pipeline parameters
+
+There are several important parameters you can configure in the pipeline that'll affect the image generation process and image quality. Let's take a closer look at what these parameters do and how changing them affects the output.
+
+### Strength
+
+`strength` is one of the most important parameters to consider and it'll have a huge impact on your generated image. It determines how much the generated image resembles the initial image. In other words:
+
+- 📈 a higher `strength` value gives the model more "creativity" to generate an image that's different from the initial image; a `strength` value of 1.0 means the initial image is more or less ignored
+- 📉 a lower `strength` value means the generated image is more similar to the initial image
+
+The `strength` and `num_inference_steps` parameters are related because `strength` determines the number of noise steps to add. For example, if the `num_inference_steps` is 50 and `strength` is 0.8, then this means adding 40 (50 * 0.8) steps of noise to the initial image and then denoising for 40 steps to get the newly generated image.
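+
+As a quick sanity check of that relationship (a minimal illustration, not a pipeline call):
+
+```py
+# the effective number of denoising steps is num_inference_steps * strength
+num_inference_steps = 50
+strength = 0.8
+print(int(num_inference_steps * strength))  # 40
+```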
+
+```py
+import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+image = pipeline(prompt, image=init_image, strength=0.8).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+*strength = 0.4 | strength = 0.6 | strength = 1.0*
+
+### Guidance scale
+
+The `guidance_scale` parameter is used to control how closely aligned the generated image and text prompt are. A higher `guidance_scale` value means your generated image is more aligned with the prompt, while a lower `guidance_scale` value means your generated image has more space to deviate from the prompt.
+
+You can combine `guidance_scale` with `strength` for even more precise control over how expressive the model is. For example, combine a high `strength + guidance_scale` for maximum creativity or use a combination of low `strength` and low `guidance_scale` to generate an image that resembles the initial image but is not as strictly bound to the prompt.
+
+```py
+import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+image = pipeline(prompt, image=init_image, guidance_scale=8.0).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+*guidance_scale = 0.1 | guidance_scale = 5.0 | guidance_scale = 10.0*
+
+### Negative prompt
+
+A negative prompt conditions the model to *not* include things in an image, and it can be used to improve image quality or modify an image. For example, you can improve image quality by including negative prompts like "poor details" or "blurry" to encourage the model to generate a higher quality image. Or you can modify an image by specifying things to exclude from an image.
+
+```py
+import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
+
+# pass prompt and image to pipeline
+image = pipeline(prompt, negative_prompt=negative_prompt, image=init_image).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+
+
+## Chained image-to-image pipelines
+
+There are some other interesting ways you can use an image-to-image pipeline aside from just generating an image (although that is pretty cool too). You can take it a step further and chain it with other pipelines.
+
+### Text-to-image-to-image
+
+Chaining a text-to-image and image-to-image pipeline allows you to generate an image from text and use the generated image as the initial image for the image-to-image pipeline. This is useful if you want to generate an image entirely from scratch. For example, let's chain a Stable Diffusion and a Kandinsky model.
+
+Start by generating an image with the text-to-image pipeline:
+
+```py
+from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image
+import torch
+from diffusers.utils import make_image_grid
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+text2image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
+text2image
+```
+
+Now you can pass this generated image to the image-to-image pipeline:
+
+```py
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+image2image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=text2image).images[0]
+make_image_grid([text2image, image2image], rows=1, cols=2)
+```
+
+### Image-to-image-to-image
+
+You can also chain multiple image-to-image pipelines together to create more interesting images. This can be useful for iteratively performing style transfer on an image, generating short GIFs, restoring color to an image, or restoring missing areas of an image.
+
+Start by generating an image:
+
+```py
+import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+image = pipeline(prompt, image=init_image, output_type="latent").images[0]
+```
+
+
+
+It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in latent space to avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE.
+
+
+
+Pass the latent output from this pipeline to the next pipeline to generate an image in a [comic book art style](https://huggingface.co/ogkalu/Comic-Diffusion):
+
+```py
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "ogkalu/Comic-Diffusion", torch_dtype=torch.float16
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# need to include the token "charliebo artstyle" in the prompt to use this checkpoint
+image = pipeline("Astronaut in a jungle, charliebo artstyle", image=image, output_type="latent").images[0]
+```
+
+Repeat one more time to generate the final image in a [pixel art style](https://huggingface.co/kohbanye/pixel-art-style):
+
+```py
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "kohbanye/pixel-art-style", torch_dtype=torch.float16
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# need to include the token "pixelartstyle" in the prompt to use this checkpoint
+image = pipeline("Astronaut in a jungle, pixelartstyle", image=image).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+### Image-to-upscaler-to-super-resolution
+
+Another way you can chain your image-to-image pipeline is with an upscaler and super-resolution pipeline to really increase the level of detail in an image.
+
+Start with an image-to-image pipeline:
+
+```py
+import torch
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+image_1 = pipeline(prompt, image=init_image, output_type="latent").images[0]
+```
+
+
+
+It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in *latent* space to avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE.
+
+
+
+Chain it to an upscaler pipeline to increase the image resolution:
+
+```py
+from diffusers import StableDiffusionLatentUpscalePipeline
+
+upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
+ "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+upscaler.enable_model_cpu_offload()
+upscaler.enable_xformers_memory_efficient_attention()
+
+image_2 = upscaler(prompt, image=image_1, output_type="latent").images[0]
+```
+
+Finally, chain it to a super-resolution pipeline to further enhance the resolution:
+
+```py
+from diffusers import StableDiffusionUpscalePipeline
+
+super_res = StableDiffusionUpscalePipeline.from_pretrained(
+ "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+super_res.enable_model_cpu_offload()
+super_res.enable_xformers_memory_efficient_attention()
+
+image_3 = super_res(prompt, image=image_2).images[0]
+make_image_grid([init_image, image_3.resize((512, 512))], rows=1, cols=2)
+```
+
+## Control image generation
+
+Trying to generate an image that looks exactly the way you want can be difficult, which is why controlled generation techniques and models are so useful. While you can use the `negative_prompt` to partially control image generation, there are more robust methods like prompt weighting and ControlNets.
+
+### Prompt weighting
+
+Prompt weighting allows you to scale the representation of each concept in a prompt. For example, in a prompt like "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", you can choose to increase or decrease the embeddings of "astronaut" and "jungle". The [Compel](https://github.com/damian0815/compel) library provides a simple syntax for adjusting prompt weights and generating the embeddings. You can learn how to create the embeddings in the [Prompt weighting](weighted_prompts) guide.
+
+[`AutoPipelineForImage2Image`] has a `prompt_embeds` (and `negative_prompt_embeds`, if you're using a negative prompt) parameter where you can pass the embeddings in place of the `prompt` parameter.
+
+```py
+from diffusers import AutoPipelineForImage2Image
+import torch
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+image = pipeline(prompt_embeds=prompt_embeds, # generated from Compel
+ negative_prompt_embeds=negative_prompt_embeds, # generated from Compel
+ image=init_image,
+).images[0]
+```
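+
+For reference, a minimal sketch of how those embeddings might be created with Compel before the call above (the `Compel` class and the `++` weighting syntax come from the Compel library; the prompt weights here are illustrative):
+
+```py
+from compel import Compel
+
+# reuse the tokenizer and text encoder from the pipeline loaded above
+compel_proc = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
+prompt_embeds = compel_proc("Astronaut in a jungle++, cold color palette, muted colors, detailed, 8k")
+negative_prompt_embeds = compel_proc("ugly, deformed, disfigured")
+```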
+
+### ControlNet
+
+ControlNets provide a more flexible and accurate way to control image generation because you can use an additional conditioning image. The conditioning image can be a canny image, depth map, image segmentation, and even scribbles! Whatever type of conditioning image you choose, the ControlNet generates an image that preserves the information in it.
+
+For example, let's condition an image with a depth map to keep the spatial information in the image.
+
+```py
+from diffusers.utils import load_image, make_image_grid
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+init_image = init_image.resize((958, 960)) # resize to depth image dimensions
+depth_image = load_image("https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png")
+make_image_grid([init_image, depth_image], rows=1, cols=2)
+```
+
+Load a ControlNet model conditioned on depth maps and the [`AutoPipelineForImage2Image`]:
+
+```py
+from diffusers import ControlNetModel, AutoPipelineForImage2Image
+import torch
+
+controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+```
+
+Now generate a new image conditioned on the depth map, initial image, and prompt:
+
+```py
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image_control_net = pipeline(prompt, image=init_image, control_image=depth_image).images[0]
+make_image_grid([init_image, depth_image, image_control_net], rows=1, cols=3)
+```
+
+*initial image | depth image | ControlNet image*
+
+Let's apply a new [style](https://huggingface.co/nitrosocke/elden-ring-diffusion) to the image generated from the ControlNet by chaining it with an image-to-image pipeline:
+
+```py
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "nitrosocke/elden-ring-diffusion", torch_dtype=torch.float16,
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+prompt = "elden ring style astronaut in a jungle" # include the token "elden ring style" in the prompt
+negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
+
+image_elden_ring = pipeline(prompt, negative_prompt=negative_prompt, image=image_control_net, strength=0.45, guidance_scale=10.5).images[0]
+make_image_grid([init_image, depth_image, image_control_net, image_elden_ring], rows=2, cols=2)
+```
+
+
+
+
+
+## Optimize
+
+Running diffusion models is computationally expensive and intensive, but with a few optimization tricks, it is entirely possible to run them on consumer and free-tier GPUs. For example, you can use a more memory-efficient form of attention such as PyTorch 2.0's [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention) or [xFormers](../optimization/xformers) (you can use one or the other, but there's no need to use both). You can also offload the model to the GPU while the other pipeline components wait on the CPU.
+
+```diff
++ pipeline.enable_model_cpu_offload()
++ pipeline.enable_xformers_memory_efficient_attention()
+```
+
+With [`torch.compile`](../optimization/torch2.0#torchcompile), you can boost your inference speed even more by wrapping your UNet with it:
+
+```py
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+To learn more, take a look at the [Reduce memory usage](../optimization/memory) and [Torch 2.0](../optimization/torch2.0) guides.
diff --git a/docs/source/en/using-diffusers/inference_with_lcm.md b/docs/source/en/using-diffusers/inference_with_lcm.md
new file mode 100644
index 0000000..798de67
--- /dev/null
+++ b/docs/source/en/using-diffusers/inference_with_lcm.md
@@ -0,0 +1,274 @@
+
+
+# Latent Consistency Model
+
+[[open-in-colab]]
+
+Latent Consistency Models (LCM) enable quality image generation in typically 2-4 steps, making it possible to use diffusion models in almost real-time settings.
+
+From the [official website](https://latent-consistency-models.github.io/):
+
+> LCMs can be distilled from any pre-trained Stable Diffusion (SD) in only 4,000 training steps (~32 A100 GPU Hours) for generating high quality 768 x 768 resolution images in 2~4 steps or even one step, significantly accelerating text-to-image generation. We employ LCM to distill the Dreamshaper-V7 version of SD in just 4,000 training iterations.
+
+For a more technical overview of LCMs, refer to [the paper](https://huggingface.co/papers/2310.04378).
+
+LCM distilled models are available for [stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and the [SSD-1B](https://huggingface.co/segmind/SSD-1B) model. All the checkpoints can be found in this [collection](https://huggingface.co/collections/latent-consistency/latent-consistency-models-weights-654ce61a95edd6dffccef6a8).
+
+This guide shows how to perform inference with LCMs for
+- text-to-image
+- image-to-image
+- combined with style LoRAs
+- ControlNet/T2I-Adapter
+
+## Text-to-image
+
+You'll use the [`StableDiffusionXLPipeline`] with the [`LCMScheduler`] and then load the LCM-distilled UNet weights. Together with the scheduler, the distilled UNet enables a fast inference workflow, overcoming the slow iterative nature of diffusion models.
+
+```python
+from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler
+import torch
+
+unet = UNet2DConditionModel.from_pretrained(
+ "latent-consistency/lcm-sdxl",
+ torch_dtype=torch.float16,
+ variant="fp16",
+)
+pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16",
+).to("cuda")
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
+
+generator = torch.manual_seed(0)
+image = pipe(
+ prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0
+).images[0]
+```
+
+
+
+Notice that we use only 4 steps for generation which is way less than what's typically used for standard SDXL.
+
+Some details to keep in mind:
+
+* To perform classifier-free guidance, batch size is usually doubled inside the pipeline. LCM, however, applies guidance using guidance embeddings, so the batch size does not have to be doubled in this case. This leads to a faster inference time, with the drawback that negative prompts don't have any effect on the denoising process.
+* The UNet was trained using the [3., 13.] guidance scale range, so that is the ideal range for `guidance_scale`. However, disabling guidance by setting `guidance_scale` to 1.0 is also effective in most cases (see the sketch below).
+
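+A minimal sketch that mirrors the example above with guidance disabled (only `guidance_scale` changes):
+
+```python
+# guidance_scale=1.0 effectively disables classifier-free guidance
+generator = torch.manual_seed(0)
+image = pipe(
+    prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0
+).images[0]
+```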
+
+## Image-to-image
+
+LCMs can be applied to image-to-image tasks too. For this example, we'll use the [LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) model, but the same steps can be applied to other LCM models as well.
+
+```python
+import torch
+from diffusers import AutoPipelineForImage2Image, UNet2DConditionModel, LCMScheduler
+from diffusers.utils import make_image_grid, load_image
+
+unet = UNet2DConditionModel.from_pretrained(
+ "SimianLuo/LCM_Dreamshaper_v7",
+ subfolder="unet",
+ torch_dtype=torch.float16,
+)
+
+pipe = AutoPipelineForImage2Image.from_pretrained(
+ "Lykon/dreamshaper-7",
+ unet=unet,
+ torch_dtype=torch.float16,
+ variant="fp16",
+).to("cuda")
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+generator = torch.manual_seed(0)
+image = pipe(
+ prompt,
+ image=init_image,
+ num_inference_steps=4,
+ guidance_scale=7.5,
+ strength=0.5,
+ generator=generator
+).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+
+
+
+
+
+You can get different results based on your prompt and the image you provide. To get the best results, we recommend trying different values for the `num_inference_steps`, `strength`, and `guidance_scale` parameters and choosing the combination that works best.
+
+
+
+
+## Combine with style LoRAs
+
+LCMs can be used with other style LoRAs to generate styled images in very few steps (4-8). In the following example, we'll use the [papercut LoRA](https://huggingface.co/TheLastBen/Papercut_SDXL).
+
+```python
+from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler
+import torch
+
+unet = UNet2DConditionModel.from_pretrained(
+ "latent-consistency/lcm-sdxl",
+ torch_dtype=torch.float16,
+ variant="fp16",
+)
+pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16",
+).to("cuda")
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut")
+
+prompt = "papercut, a cute fox"
+
+generator = torch.manual_seed(0)
+image = pipe(
+ prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0
+).images[0]
+image
+```
+
+
+
+
+## ControlNet/T2I-Adapter
+
+Let's look at how we can perform inference with ControlNet/T2I-Adapter and a LCM.
+
+### ControlNet
+For this example, we'll use the [LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) model with canny ControlNet, but the same steps can be applied to other LCM models as well.
+
+```python
+import torch
+import cv2
+import numpy as np
+from PIL import Image
+
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler
+from diffusers.utils import load_image, make_image_grid
+
+image = load_image(
+ "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
+).resize((512, 512))
+
+image = np.array(image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
+pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "SimianLuo/LCM_Dreamshaper_v7",
+ controlnet=controlnet,
+ torch_dtype=torch.float16,
+ safety_checker=None,
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+generator = torch.manual_seed(0)
+image = pipe(
+ "the mona lisa",
+ image=canny_image,
+ num_inference_steps=4,
+ generator=generator,
+).images[0]
+make_image_grid([canny_image, image], rows=1, cols=2)
+```
+
+
+
+
+
+The inference parameters in this example might not work for all examples, so we recommend trying different values for the `num_inference_steps`, `guidance_scale`, `controlnet_conditioning_scale`, and `cross_attention_kwargs` parameters and choosing the best one.
+
+
+### T2I-Adapter
+
+This example shows how to use `lcm-sdxl` with the [Canny T2I-Adapter](https://huggingface.co/TencentARC/t2i-adapter-canny-sdxl-1.0).
+
+```python
+import torch
+import cv2
+import numpy as np
+from PIL import Image
+
+from diffusers import StableDiffusionXLAdapterPipeline, UNet2DConditionModel, T2IAdapter, LCMScheduler
+from diffusers.utils import load_image, make_image_grid
+
+# Prepare image
+# Detect the canny map in low resolution to avoid high-frequency details
+image = load_image(
+ "https://huggingface.co/Adapter/t2iadapter/resolve/main/figs_SDXLV1.0/org_canny.jpg"
+).resize((384, 384))
+
+image = np.array(image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image).resize((1024, 1216))
+
+# load adapter
+adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda")
+
+unet = UNet2DConditionModel.from_pretrained(
+ "latent-consistency/lcm-sdxl",
+ torch_dtype=torch.float16,
+ variant="fp16",
+)
+pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ unet=unet,
+ adapter=adapter,
+ torch_dtype=torch.float16,
+ variant="fp16",
+).to("cuda")
+
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+prompt = "Mystical fairy in real, magic, 4k picture, high quality"
+negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured"
+
+generator = torch.manual_seed(0)
+image = pipe(
+ prompt=prompt,
+ negative_prompt=negative_prompt,
+ image=canny_image,
+ num_inference_steps=4,
+ guidance_scale=5,
+ adapter_conditioning_scale=0.8,
+ adapter_conditioning_factor=1,
+ generator=generator,
+).images[0]
+grid = make_image_grid([canny_image, image], rows=1, cols=2)
+```
+
+
diff --git a/docs/source/en/using-diffusers/inference_with_lcm_lora.md b/docs/source/en/using-diffusers/inference_with_lcm_lora.md
new file mode 100644
index 0000000..36120a0
--- /dev/null
+++ b/docs/source/en/using-diffusers/inference_with_lcm_lora.md
@@ -0,0 +1,422 @@
+
+
+# Performing inference with LCM-LoRA
+
+[[open-in-colab]]
+
+Latent Consistency Models (LCM) enable quality image generation in typically 2-4 steps, making it possible to use diffusion models in almost real-time settings.
+
+From the [official website](https://latent-consistency-models.github.io/):
+
+> LCMs can be distilled from any pre-trained Stable Diffusion (SD) in only 4,000 training steps (~32 A100 GPU Hours) for generating high quality 768 x 768 resolution images in 2~4 steps or even one step, significantly accelerating text-to-image generation. We employ LCM to distill the Dreamshaper-V7 version of SD in just 4,000 training iterations.
+
+For a more technical overview of LCMs, refer to [the paper](https://huggingface.co/papers/2310.04378).
+
+However, each model needs to be distilled separately for latent consistency distillation. The core idea with LCM-LoRA is to train just a few adapter layers, the adapter being LoRA in this case.
+This way, we don't have to train the full model, and the number of trainable parameters stays manageable. The resulting LoRAs can then be applied to any fine-tuned version of the model without distilling them separately.
+Additionally, the LoRAs can be applied to image-to-image, ControlNet/T2I-Adapter, inpainting, AnimateDiff etc.
+The LCM-LoRA can also be combined with other LoRAs to generate styled images in very few steps (4-8).
+
+LCM-LoRAs are available for [stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and the [SSD-1B](https://huggingface.co/segmind/SSD-1B) model. All the checkpoints can be found in this [collection](https://huggingface.co/collections/latent-consistency/latent-consistency-models-loras-654cdd24e111e16f0865fba6).
+
+For more details about LCM-LoRA, refer to [the technical report](https://huggingface.co/papers/2311.05556).
+
+This guide shows how to perform inference with LCM-LoRAs for
+- text-to-image
+- image-to-image
+- combined with styled LoRAs
+- ControlNet/T2I-Adapter
+- inpainting
+- AnimateDiff
+
+Before going through this guide, we'll take a look at the general workflow for performing inference with LCM-LoRAs.
+LCM-LoRAs are similar to other Stable Diffusion LoRAs so they can be used with any [`DiffusionPipeline`] that supports LoRAs.
+
+- Load the task specific pipeline and model.
+- Set the scheduler to [`LCMScheduler`].
+- Load the LCM-LoRA weights for the model.
+- Reduce the `guidance_scale` to a value between [1.0, 2.0] and set `num_inference_steps` between [4, 8].
+- Perform inference with the pipeline with the usual parameters.
+
+Let's look at how we can perform inference with LCM-LoRAs for different tasks.
+
+First, make sure you have [peft](https://github.com/huggingface/peft) installed, for better LoRA support.
+
+```bash
+pip install -U peft
+```
+
+## Text-to-image
+
+You'll use the [`StableDiffusionXLPipeline`] with the [`LCMScheduler`] and then load the LCM-LoRA. Together with the LCM-LoRA and the scheduler, the pipeline enables a fast inference workflow, overcoming the slow iterative nature of diffusion models.
+
+```python
+import torch
+from diffusers import DiffusionPipeline, LCMScheduler
+
+pipe = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ variant="fp16",
+ torch_dtype=torch.float16
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
+
+prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
+
+generator = torch.manual_seed(42)
+image = pipe(
+ prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0
+).images[0]
+```
+
+
+
+Notice that we use only 4 steps for generation which is way less than what's typically used for standard SDXL.
+
+
+
+You may have noticed that we set `guidance_scale=1.0`, which disables classifier-free guidance. This is because the LCM-LoRA is trained with guidance, so the batch size does not have to be doubled in this case. This leads to a faster inference time, with the drawback that negative prompts don't have any effect on the denoising process.
+
+You can also use guidance with LCM-LoRA, but due to the nature of training, the model is very sensitive to the `guidance_scale` values; high values can lead to artifacts in the generated images. In our experiments, we found that the best values are in the range of [1.0, 2.0].
+
+
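+A minimal sketch of enabling guidance within that range, reusing the `pipe` and `prompt` from above (the negative prompt here is purely illustrative):
+
+```python
+# guidance_scale > 1.0 re-enables classifier-free guidance, so the negative prompt takes effect
+generator = torch.manual_seed(42)
+image = pipe(
+    prompt=prompt,
+    negative_prompt="blurry, low quality",
+    num_inference_steps=4,
+    guidance_scale=1.5,
+    generator=generator,
+).images[0]
+```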
+
+### Inference with a fine-tuned model
+
+As mentioned above, the LCM-LoRA can be applied to any fine-tuned version of the model without having to distill them separately. Let's look at how we can perform inference with a fine-tuned model. In this example, we'll use the [animagine-xl](https://huggingface.co/Linaqruf/animagine-xl) model, which is a fine-tuned version of the SDXL model for generating anime.
+
+```python
+from diffusers import DiffusionPipeline, LCMScheduler
+
+pipe = DiffusionPipeline.from_pretrained(
+ "Linaqruf/animagine-xl",
+ variant="fp16",
+ torch_dtype=torch.float16
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
+
+prompt = "face focus, cute, masterpiece, best quality, 1girl, green hair, sweater, looking at viewer, upper body, beanie, outdoors, night, turtleneck"
+
+generator = torch.manual_seed(0)
+image = pipe(
+ prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0
+).images[0]
+```
+
+
+
+
+## Image-to-image
+
+LCM-LoRA can be applied to image-to-image tasks too. Let's look at how we can perform image-to-image generation with LCMs. For this example, we'll use the [dreamshaper-7](https://huggingface.co/Lykon/dreamshaper-7) model and the LCM-LoRA for `stable-diffusion-v1-5`.
+
+```python
+import torch
+from diffusers import AutoPipelineForImage2Image, LCMScheduler
+from diffusers.utils import make_image_grid, load_image
+
+pipe = AutoPipelineForImage2Image.from_pretrained(
+ "Lykon/dreamshaper-7",
+ torch_dtype=torch.float16,
+ variant="fp16",
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+generator = torch.manual_seed(0)
+image = pipe(
+ prompt,
+ image=init_image,
+ num_inference_steps=4,
+ guidance_scale=1,
+ strength=0.6,
+ generator=generator
+).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+
+
+
+
+
+You can get different results based on your prompt and the image you provide. To get the best results, we recommend trying different values for the `num_inference_steps`, `strength`, and `guidance_scale` parameters and choosing the combination that works best.
+
+
+
+
+## Combine with styled LoRAs
+
+LCM-LoRA can be combined with other LoRAs to generate styled images in very few steps (4-8). In the following example, we'll use the LCM-LoRA with the [papercut LoRA](https://huggingface.co/TheLastBen/Papercut_SDXL).
+To learn more about how to combine LoRAs, refer to [this guide](https://huggingface.co/docs/diffusers/tutorials/using_peft_for_inference#combine-multiple-adapters).
+
+```python
+import torch
+from diffusers import DiffusionPipeline, LCMScheduler
+
+pipe = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ variant="fp16",
+ torch_dtype=torch.float16
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LoRAs
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", adapter_name="lcm")
+pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut")
+
+# Combine LoRAs
+pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.8])
+
+prompt = "papercut, a cute fox"
+generator = torch.manual_seed(0)
+image = pipe(prompt, num_inference_steps=4, guidance_scale=1, generator=generator).images[0]
+image
+```
+
+
+
+
+## ControlNet/T2I-Adapter
+
+Let's look at how we can perform inference with ControlNet/T2I-Adapter and LCM-LoRA.
+
+### ControlNet
+For this example, we'll use the SD-v1-5 model and the LCM-LoRA for SD-v1-5 with canny ControlNet.
+
+```python
+import torch
+import cv2
+import numpy as np
+from PIL import Image
+
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler
+from diffusers.utils import load_image
+
+image = load_image(
+ "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
+).resize((512, 512))
+
+image = np.array(image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
+pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ controlnet=controlnet,
+ torch_dtype=torch.float16,
+ safety_checker=None,
+ variant="fp16"
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
+
+generator = torch.manual_seed(0)
+image = pipe(
+ "the mona lisa",
+ image=canny_image,
+ num_inference_steps=4,
+ guidance_scale=1.5,
+ controlnet_conditioning_scale=0.8,
+ cross_attention_kwargs={"scale": 1},
+ generator=generator,
+).images[0]
+make_image_grid([canny_image, image], rows=1, cols=2)
+```
+
+
+
+
+
+The inference parameters in this example might not work for all examples, so we recommend trying different values for the `num_inference_steps`, `guidance_scale`, `controlnet_conditioning_scale`, and `cross_attention_kwargs` parameters and choosing the one that works best.
+
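+For instance, you can fix the seed and sweep `controlnet_conditioning_scale` to see how strongly the canny edges constrain the output. This is a minimal sketch that reuses `pipe` and `canny_image` from the example above; the values are illustrative only.
+
+```python
+import torch
+from diffusers.utils import make_image_grid
+
+images = [canny_image]
+for scale in (0.6, 0.8, 1.0):
+    generator = torch.manual_seed(0)  # fix the seed so only the conditioning scale changes
+    images.append(
+        pipe(
+            "the mona lisa",
+            image=canny_image,
+            num_inference_steps=4,
+            guidance_scale=1.5,
+            controlnet_conditioning_scale=scale,
+            generator=generator,
+        ).images[0]
+    )
+make_image_grid(images, rows=1, cols=4)
+```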
+
+### T2I-Adapter
+
+This example shows how to use the LCM-LoRA with the [Canny T2I-Adapter](https://huggingface.co/TencentARC/t2i-adapter-canny-sdxl-1.0) and SDXL.
+
+```python
+import torch
+import cv2
+import numpy as np
+from PIL import Image
+
+from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, LCMScheduler
+from diffusers.utils import load_image, make_image_grid
+
+# Prepare image
+# Detect the canny map in low resolution to avoid high-frequency details
+image = load_image(
+ "https://huggingface.co/Adapter/t2iadapter/resolve/main/figs_SDXLV1.0/org_canny.jpg"
+).resize((384, 384))
+
+image = np.array(image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image).resize((1024, 1024))
+
+# load adapter
+adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda")
+
+pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ adapter=adapter,
+ torch_dtype=torch.float16,
+ variant="fp16",
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
+
+prompt = "Mystical fairy in real, magic, 4k picture, high quality"
+negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured"
+
+generator = torch.manual_seed(0)
+image = pipe(
+ prompt=prompt,
+ negative_prompt=negative_prompt,
+ image=canny_image,
+ num_inference_steps=4,
+ guidance_scale=1.5,
+ adapter_conditioning_scale=0.8,
+ adapter_conditioning_factor=1,
+ generator=generator,
+).images[0]
+make_image_grid([canny_image, image], rows=1, cols=2)
+```
+
+
+
+
+## Inpainting
+
+LCM-LoRA can be used for inpainting as well.
+
+```python
+import torch
+from diffusers import AutoPipelineForInpainting, LCMScheduler
+from diffusers.utils import load_image, make_image_grid
+
+pipe = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting",
+ torch_dtype=torch.float16,
+ variant="fp16",
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+generator = torch.manual_seed(0)
+image = pipe(
+ prompt=prompt,
+ image=init_image,
+ mask_image=mask_image,
+ generator=generator,
+ num_inference_steps=4,
+ guidance_scale=4,
+).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+
+
+
+## AnimateDiff
+
+[`AnimateDiff`] allows you to animate images using Stable Diffusion models. To get good results, we need to generate multiple frames (16-24), and doing this with standard SD models can be very slow.
+LCM-LoRA can speed up the process significantly, since you only need 4-8 steps for each frame. Let's look at how to perform animation with LCM-LoRA and AnimateDiff.
+
+```python
+import torch
+from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler, LCMScheduler
+from diffusers.utils import export_to_gif
+
+adapter = MotionAdapter.from_pretrained("diffusers/animatediff-motion-adapter-v1-5")
+pipe = AnimateDiffPipeline.from_pretrained(
+ "frankjoshua/toonyou_beta6",
+ motion_adapter=adapter,
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5", adapter_name="lcm")
+pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-in", weight_name="diffusion_pytorch_model.safetensors", adapter_name="motion-lora")
+
+pipe.set_adapters(["lcm", "motion-lora"], adapter_weights=[0.55, 1.2])
+
+prompt = "best quality, masterpiece, 1girl, looking at viewer, blurry background, upper body, contemporary, dress"
+generator = torch.manual_seed(0)
+frames = pipe(
+ prompt=prompt,
+ num_inference_steps=5,
+ guidance_scale=1.25,
+ cross_attention_kwargs={"scale": 1},
+ num_frames=24,
+ generator=generator
+).frames[0]
+export_to_gif(frames, "animation.gif")
+```
+
+
\ No newline at end of file
diff --git a/docs/source/en/using-diffusers/inpaint.md b/docs/source/en/using-diffusers/inpaint.md
new file mode 100644
index 0000000..193f5a6
--- /dev/null
+++ b/docs/source/en/using-diffusers/inpaint.md
@@ -0,0 +1,804 @@
+
+
+# Inpainting
+
+[[open-in-colab]]
+
+Inpainting replaces or edits specific areas of an image. This makes it a useful tool for image restoration like removing defects and artifacts, or even replacing an image area with something entirely new. Inpainting relies on a mask to determine which regions of an image to fill in; the area to inpaint is represented by white pixels and the area to keep is represented by black pixels. The white pixels are filled in by the prompt.
+
+With 🤗 Diffusers, here is how you can do inpainting:
+
+1. Load an inpainting checkpoint with the [`AutoPipelineForInpainting`] class. This'll automatically detect the appropriate pipeline class to load based on the checkpoint:
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+```
+
+
+
+You'll notice throughout the guide that we use [`~DiffusionPipeline.enable_model_cpu_offload`] and [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] to save memory and increase inference speed. If you're using PyTorch 2.0, it's not necessary to call [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] on your pipeline because it'll already be using PyTorch 2.0's native [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention).
+
+
+
+2. Load the base and mask images:
+
+```py
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+```
+
+3. Create a prompt to inpaint the image with and pass it to the pipeline with the base and mask images:
+
+```py
+prompt = "a black cat with glowing eyes, cute, adorable, disney, pixar, highly detailed, 8k"
+negative_prompt = "bad anatomy, deformed, ugly, disfigured"
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=init_image, mask_image=mask_image).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+
+
+
+ base image
+
+
+
+ mask image
+
+
+
+ generated image
+
+
+
+## Create a mask image
+
+Throughout this guide, the mask image is provided in all of the code examples for convenience. You can inpaint on your own images, but you'll need to create a mask image for it. Use the Space below to easily create a mask image.
+
+Upload a base image to inpaint on and use the sketch tool to draw a mask. Once you're done, click **Run** to generate and download the mask image.
+
+
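+If you'd rather create a mask programmatically, here is a minimal sketch (not part of the original guide) that draws a white rectangle over the region to inpaint on an all-black canvas; the file names are hypothetical placeholders.
+
+```py
+from PIL import Image, ImageDraw
+
+# white (255) marks the area to inpaint, black (0) marks the area to keep
+base = Image.open("my_image.png")  # hypothetical local base image
+mask = Image.new("L", base.size, 0)  # start with an all-black (keep everything) mask
+draw = ImageDraw.Draw(mask)
+draw.rectangle((100, 100, 400, 400), fill=255)  # region to inpaint, in pixel coordinates
+mask.save("my_mask.png")
+```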
+
+### Mask blur
+
+The [`~VaeImageProcessor.blur`] method provides an option for how to blend the original image and inpaint area. The amount of blur is determined by the `blur_factor` parameter. Increasing the `blur_factor` increases the amount of blur applied to the mask edges, softening the transition between the original image and inpaint area. A low or zero `blur_factor` preserves the sharper edges of the mask.
+
+To use this, create a blurred mask with the image processor.
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image
+from PIL import Image
+
+pipeline = AutoPipelineForInpainting.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to('cuda')
+
+mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore_mask.png")
+blurred_mask = pipeline.mask_processor.blur(mask, blur_factor=33)
+blurred_mask
+```
+
+
+
+
+ mask with no blur
+
+
+
+ mask with blur applied
+
+
+
+## Popular models
+
+[Stable Diffusion Inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting), [Stable Diffusion XL (SDXL) Inpainting](https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1), and [Kandinsky 2.2 Inpainting](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder-inpaint) are among the most popular models for inpainting. SDXL typically produces higher resolution images than Stable Diffusion v1.5, and Kandinsky 2.2 is also capable of generating high-quality images.
+
+### Stable Diffusion Inpainting
+
+Stable Diffusion Inpainting is a latent diffusion model finetuned for inpainting on 512x512 images. It is a good starting point because it is relatively fast and generates good quality images. To use this model for inpainting, you'll need to pass a prompt, base and mask image to the pipeline:
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+generator = torch.Generator("cuda").manual_seed(92)
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+### Stable Diffusion XL (SDXL) Inpainting
+
+SDXL is a larger and more powerful version of Stable Diffusion v1.5. This model can follow a two-stage process (though each model can also be used alone); the base model generates an image, and a refiner model takes that image and further enhances its details and quality. Take a look at the [SDXL](sdxl) guide for a more comprehensive overview of how to use SDXL and configure its parameters.
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+generator = torch.Generator("cuda").manual_seed(92)
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+### Kandinsky 2.2 Inpainting
+
+The Kandinsky model family is similar to SDXL because it uses two models as well; the image prior model creates image embeddings, and the diffusion model generates images from them. You can load the image prior and diffusion model separately, but the easiest way to use Kandinsky 2.2 is to load it into the [`AutoPipelineForInpainting`] class which uses the [`KandinskyV22InpaintCombinedPipeline`] under the hood.
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+generator = torch.Generator("cuda").manual_seed(92)
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+
+
+
+ base image
+
+
+
+ Stable Diffusion Inpainting
+
+
+
+ Stable Diffusion XL Inpainting
+
+
+
+ Kandinsky 2.2 Inpainting
+
+
+
+## Non-inpaint specific checkpoints
+
+So far, this guide has used inpaint specific checkpoints such as [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting). But you can also use regular checkpoints like [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5). Let's compare the results of the two checkpoints.
+
+The image on the left is generated from a regular checkpoint, and the image on the right is from an inpaint checkpoint. You'll immediately notice the image on the left is not as clean, and you can still see the outline of the area the model is supposed to inpaint. The image on the right is much cleaner and the inpainted area appears more natural.
+
+
+
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+generator = torch.Generator("cuda").manual_seed(92)
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+
+
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+generator = torch.Generator("cuda").manual_seed(92)
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+
+
+
+
+
+
+ runwayml/stable-diffusion-v1-5
+
+
+
+ runwayml/stable-diffusion-inpainting
+
+
+
+However, for more basic tasks like erasing an object from an image (like the rocks in the road, for example), a regular checkpoint yields pretty good results. There isn't as noticeable a difference between the regular and inpaint checkpoints.
+
+
+
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/road-mask.png")
+
+image = pipeline(prompt="road", image=init_image, mask_image=mask_image).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+
+
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/road-mask.png")
+
+image = pipeline(prompt="road", image=init_image, mask_image=mask_image).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+
+
+
+
+
+
+ runwayml/stable-diffusion-v1-5
+
+
+
+ runwayml/stable-diffusion-inpainting
+
+
+
+The trade-off of using a non-inpaint specific checkpoint is the overall image quality may be lower, but it generally tends to preserve the mask area (that is why you can see the mask outline). The inpaint specific checkpoints are intentionally trained to generate higher quality inpainted images, and that includes creating a more natural transition between the masked and unmasked areas. As a result, these checkpoints are more likely to change your unmasked area.
+
+If preserving the unmasked area is important for your task, you can use the [`VaeImageProcessor.apply_overlay`] method to force the unmasked area of an image to remain the same at the expense of some more unnatural transitions between the masked and unmasked areas.
+
+```py
+import PIL
+import numpy as np
+import torch
+
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+device = "cuda"
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting",
+ torch_dtype=torch.float16,
+)
+pipeline = pipeline.to(device)
+
+img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+
+init_image = load_image(img_url).resize((512, 512))
+mask_image = load_image(mask_url).resize((512, 512))
+
+prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
+repainted_image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
+repainted_image.save("repainted_image.png")
+
+unmasked_unchanged_image = pipeline.image_processor.apply_overlay(mask_image, init_image, repainted_image)
+unmasked_unchanged_image.save("force_unmasked_unchanged.png")
+make_image_grid([init_image, mask_image, repainted_image, unmasked_unchanged_image], rows=2, cols=2)
+```
+
+## Configure pipeline parameters
+
+Image features - like quality and "creativity" - are dependent on pipeline parameters. Knowing what these parameters do is important for getting the results you want. Let's take a look at the most important parameters and see how changing them affects the output.
+
+### Strength
+
+`strength` is a measure of how much noise is added to the base image, which influences how similar the output is to the base image.
+
+* ๐ a high `strength` value means more noise is added to an image and the denoising process takes longer, but you'll get higher quality images that are more different from the base image
+* ๐ a low `strength` value means less noise is added to an image and the denoising process is faster, but the image quality may not be as great and the generated image resembles the base image more
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.6).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+
+
+
+ strength = 0.6
+
+
+
+ strength = 0.8
+
+
+
+ strength = 1.0
+
+
+
+### Guidance scale
+
+`guidance_scale` affects how aligned the text prompt and generated image are.
+
+* ๐ a high `guidance_scale` value means the prompt and generated image are closely aligned, so the output is a stricter interpretation of the prompt
+* ๐ a low `guidance_scale` value means the prompt and generated image are more loosely aligned, so the output may be more varied from the prompt
+
+You can use `strength` and `guidance_scale` together for more control over how expressive the model is. For example, a combination of high `strength` and `guidance_scale` values gives the model the most creative freedom.
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, guidance_scale=2.5).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+
+
+
+ guidance_scale = 2.5
+
+
+
+ guidance_scale = 7.5
+
+
+
+ guidance_scale = 12.5
+
+
+
+### Negative prompt
+
+A negative prompt assumes the opposite role of a prompt; it guides the model away from generating certain things in an image. This is useful for quickly improving image quality and preventing the model from generating things you don't want.
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+negative_prompt = "bad architecture, unstable, poor details, blurry"
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=init_image, mask_image=mask_image).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+
+
+### Padding mask crop
+
+A method for increasing the inpainting image quality is to use the [`padding_mask_crop`](https://huggingface.co/docs/diffusers/v0.25.0/en/api/pipelines/stable_diffusion/inpaint#diffusers.StableDiffusionInpaintPipeline.__call__.padding_mask_crop) parameter. When enabled, this option crops the masked area with some user-specified padding and it'll also crop the same area from the original image. Both the image and mask are upscaled to a higher resolution for inpainting, and then overlaid on the original image. This is a quick and easy way to improve image quality without using a separate pipeline like [`StableDiffusionUpscalePipeline`].
+
+Add the `padding_mask_crop` parameter to the pipeline call and set it to the desired padding value.
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image
+from PIL import Image
+
+generator = torch.Generator(device='cuda').manual_seed(0)
+pipeline = AutoPipelineForInpainting.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to('cuda')
+
+base = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png")
+mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore_mask.png")
+
+image = pipeline("boat", image=base, mask_image=mask, strength=0.75, generator=generator, padding_mask_crop=32).images[0]
+image
+```
+
+
+
+
+ default inpaint image
+
+
+
+ inpaint image with `padding_mask_crop` enabled
+
+
+
+## Chained inpainting pipelines
+
+[`AutoPipelineForInpainting`] can be chained with other 🤗 Diffusers pipelines to edit their outputs. This is often useful for improving the output quality from your other diffusion pipelines, and if you're using multiple pipelines, it can be more memory-efficient to chain them together to keep the outputs in latent space and reuse the same pipeline components.
+
+### Text-to-image-to-inpaint
+
+Chaining a text-to-image and inpainting pipeline allows you to inpaint the generated image, and you don't have to provide a base image to begin with. This makes it convenient to edit your favorite text-to-image outputs without having to generate an entirely new image.
+
+Start with the text-to-image pipeline to create a castle:
+
+```py
+import torch
+from diffusers import AutoPipelineForText2Image, AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+text2image = pipeline("concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k").images[0]
+```
+
+Load the mask image of the output from above:
+
+```py
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_text-chain-mask.png")
+```
+
+And let's inpaint the masked area with a waterfall:
+
+```py
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+prompt = "digital painting of a fantasy waterfall, cloudy"
+image = pipeline(prompt=prompt, image=text2image, mask_image=mask_image).images[0]
+make_image_grid([text2image, mask_image, image], rows=1, cols=3)
+```
+
+
+
+
+ text-to-image
+
+
+
+ inpaint
+
+
+
+### Inpaint-to-image-to-image
+
+You can also chain an inpainting pipeline before another pipeline like image-to-image or an upscaler to improve the quality.
+
+Begin by inpainting an image:
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting, AutoPipelineForImage2Image
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+image_inpainting = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
+
+# resize image to 1024x1024 for SDXL
+image_inpainting = image_inpainting.resize((1024, 1024))
+```
+
+Now let's pass the image to another inpainting pipeline with SDXL's refiner model to enhance the image details and quality:
+
+```py
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+image = pipeline(prompt=prompt, image=image_inpainting, mask_image=mask_image, output_type="latent").images[0]
+```
+
+
+
+It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in latent space and avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE. For example, in the [Text-to-image-to-inpaint](#text-to-image-to-inpaint) section, Kandinsky 2.2 uses a different VAE class than the Stable Diffusion model, so it won't work. But if you use Stable Diffusion v1.5 for both pipelines, then you can keep everything in latent space because they both use [`AutoencoderKL`].
+
+
+
+Finally, you can pass this image to an image-to-image pipeline to put the finishing touches on it. It is more efficient to use the [`~AutoPipelineForImage2Image.from_pipe`] method to reuse the existing pipeline components, and avoid unnecessarily loading all the pipeline components into memory again.
+
+```py
+pipeline = AutoPipelineForImage2Image.from_pipe(pipeline)
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+image = pipeline(prompt=prompt, image=image).images[0]
+make_image_grid([init_image, mask_image, image_inpainting, image], rows=2, cols=2)
+```
+
+
+
+
+ initial image
+
+
+
+ inpaint
+
+
+
+ image-to-image
+
+
+
+Image-to-image and inpainting are actually very similar tasks. Image-to-image generates a new image that resembles the existing provided image. Inpainting does the same thing, but it only transforms the image area defined by the mask while the rest of the image is unchanged. You can think of inpainting as a more precise tool for making specific changes, while image-to-image has a broader scope for making more sweeping changes.
+
+## Control image generation
+
+Getting an image to look exactly the way you want is challenging because the denoising process is random. While you can control certain aspects of generation by configuring parameters like `negative_prompt`, there are better and more efficient methods for controlling image generation.
+
+### Prompt weighting
+
+Prompt weighting provides a quantifiable way to scale the representation of concepts in a prompt. You can use it to increase or decrease the magnitude of the text embedding vector for each concept in the prompt, which subsequently determines how much of each concept is generated. The [Compel](https://github.com/damian0815/compel) library offers an intuitive syntax for scaling the prompt weights and generating the embeddings. Learn how to create the embeddings in the [Prompt weighting](../using-diffusers/weighted_prompts) guide.
+
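+As a rough illustration (not part of the original guide), the embeddings could be generated with Compel like this; the weighting syntax (`++`) and prompts are only examples, and the guide linked above covers the full syntax.
+
+```py
+import torch
+from compel import Compel
+from diffusers import AutoPipelineForInpainting
+
+# load the same inpainting checkpoint so the embeddings match its text encoder
+pipeline = AutoPipelineForInpainting.from_pretrained(
+    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+
+compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
+
+# "++" upweights a concept in Compel's syntax
+prompt_embeds = compel("concept art digital painting of an elven castle++, highly detailed, 8k")
+negative_prompt_embeds = compel("bad architecture, unstable, poor details, blurry")
+```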
+Once you've generated the embeddings, pass them to the `prompt_embeds` (and `negative_prompt_embeds` if you're using a negative prompt) parameter in the [`AutoPipelineForInpainting`]. The embeddings replace the `prompt` parameter:
+
+```py
+import torch
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipeline = AutoPipelineForInpainting.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16,
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+image = pipeline(
+    prompt_embeds=prompt_embeds,  # generated with Compel (see the sketch above)
+    negative_prompt_embeds=negative_prompt_embeds,  # generated with Compel (see the sketch above)
+    image=init_image,
+    mask_image=mask_image,
+).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+### ControlNet
+
+ControlNet models are used with other diffusion models like Stable Diffusion, and they provide an even more flexible and accurate way to control how an image is generated. A ControlNet accepts an additional conditioning image input that guides the diffusion model to preserve the features in it.
+
+For example, let's condition an image with a ControlNet pretrained on inpaint images:
+
+```py
+import torch
+import numpy as np
+from PIL import Image
+from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline
+from diffusers.utils import load_image, make_image_grid
+
+# load ControlNet
+controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, variant="fp16")
+
+# pass ControlNet to the pipeline
+pipeline = StableDiffusionControlNetInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+# prepare control image
+def make_inpaint_condition(init_image, mask_image):
+ init_image = np.array(init_image.convert("RGB")).astype(np.float32) / 255.0
+ mask_image = np.array(mask_image.convert("L")).astype(np.float32) / 255.0
+
+ assert init_image.shape[0:2] == mask_image.shape[0:2], "image and image_mask must have the same image size"
+ init_image[mask_image > 0.5] = -1.0 # set as masked pixel
+ init_image = np.expand_dims(init_image, 0).transpose(0, 3, 1, 2)
+ init_image = torch.from_numpy(init_image)
+ return init_image
+
+control_image = make_inpaint_condition(init_image, mask_image)
+```
+
+Now generate an image from the base, mask and control images. You'll notice features of the base image are strongly preserved in the generated image.
+
+```py
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, control_image=control_image).images[0]
+make_image_grid([init_image, mask_image, Image.fromarray(np.uint8(control_image[0][0])).convert('RGB'), image], rows=2, cols=2)
+```
+
+You can take this a step further and chain it with an image-to-image pipeline to apply a new [style](https://huggingface.co/nitrosocke/elden-ring-diffusion):
+
+```py
+from diffusers import AutoPipelineForImage2Image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+ "nitrosocke/elden-ring-diffusion", torch_dtype=torch.float16,
+)
+pipeline.enable_model_cpu_offload()
+# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
+pipeline.enable_xformers_memory_efficient_attention()
+
+prompt = "elden ring style castle" # include the token "elden ring style" in the prompt
+negative_prompt = "bad architecture, deformed, disfigured, poor details"
+
+image_elden_ring = pipeline(prompt, negative_prompt=negative_prompt, image=image).images[0]
+make_image_grid([init_image, mask_image, image, image_elden_ring], rows=2, cols=2)
+```
+
+
+
+
+ initial image
+
+
+
+ ControlNet inpaint
+
+
+
+ image-to-image
+
+
+
+## Optimize
+
+It can be difficult and slow to run diffusion models if you're resource constrained, but it doesn't have to be with a few optimization tricks. One of the biggest (and easiest) optimizations you can enable is switching to memory-efficient attention. If you're using PyTorch 2.0, [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention) is automatically enabled and you don't need to do anything else. For non-PyTorch 2.0 users, you can install and use [xFormers](../optimization/xformers)'s implementation of memory-efficient attention. Both options reduce memory usage and accelerate inference.
+
+You can also offload the model to the CPU to save even more memory:
+
+```diff
++ pipeline.enable_xformers_memory_efficient_attention()
++ pipeline.enable_model_cpu_offload()
+```
+
+To speed up your inference code even more, use [`torch.compile`](../optimization/torch2.0#torchcompile). You should wrap `torch.compile` around the most intensive component in the pipeline, which is typically the UNet:
+
+```py
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+Learn more in the [Reduce memory usage](../optimization/memory) and [Torch 2.0](../optimization/torch2.0) guides.
diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md
new file mode 100644
index 0000000..4ae4035
--- /dev/null
+++ b/docs/source/en/using-diffusers/ip_adapter.md
@@ -0,0 +1,594 @@
+
+
+# IP-Adapter
+
+[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like [ControlNet](../using-diffusers/controlnet). The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features.
+
+> [!TIP]
+> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide, and make sure you check out the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section which requires manually loading the image encoder.
+
+This guide will walk you through using IP-Adapter for various tasks and use cases.
+
+## General tasks
+
+Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. We also encourage you to try out other pipelines such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff!
+
+In all the following examples, you'll see the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
+
+> [!TIP]
+> In the examples below, try adding `low_cpu_mem_usage=True` to the [`~loaders.IPAdapterMixin.load_ip_adapter`] method to speed up the loading time, as in the sketch below.
+
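+For example, a hedged illustration (assuming the SDXL text-to-image pipeline from the next section):
+
+```py
+# low_cpu_mem_usage=True can speed up loading the IP-Adapter weights, as noted in the tip above
+pipeline.load_ip_adapter(
+    "h94/IP-Adapter",
+    subfolder="sdxl_models",
+    weight_name="ip-adapter_sdxl.bin",
+    low_cpu_mem_usage=True,
+)
+```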
+
+
+
+Crafting the precise text prompt to generate the image you want can be difficult because it may not always capture what you'd like to express. Adding an image alongside the text prompt helps the model better understand what it should generate and can lead to more accurate results.
+
+Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights.
+
+```py
+from diffusers import AutoPipelineForText2Image
+from diffusers.utils import load_image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+pipeline.set_ip_adapter_scale(0.6)
+```
+
+Create a text prompt and load an image prompt before passing them to the pipeline to generate an image.
+
+```py
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
+generator = torch.Generator(device="cpu").manual_seed(0)
+images = pipeline(
+ prompt="a polar bear sitting in a chair drinking a milkshake",
+ ip_adapter_image=image,
+ negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
+ num_inference_steps=100,
+ generator=generator,
+).images
+images[0]
+```
+
+
+
+
+ IP-Adapter image
+
+
+
+ generated image
+
+
+
+
+
+
+IP-Adapter can also help with image-to-image by guiding the model to generate an image that resembles the original image and the image prompt.
+
+Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights.
+
+```py
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import load_image
+import torch
+
+pipeline = AutoPipelineForImage2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+pipeline.set_ip_adapter_scale(0.6)
+```
+
+Pass the original image and the IP-Adapter image prompt to the pipeline to generate an image. Providing a text prompt to the pipeline is optional, but in this example, a text prompt is used to increase image quality.
+
+```py
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png")
+ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_2.png")
+
+generator = torch.Generator(device="cpu").manual_seed(4)
+images = pipeline(
+ prompt="best quality, high quality",
+ image=image,
+ ip_adapter_image=ip_image,
+ generator=generator,
+ strength=0.6,
+).images
+images[0]
+```
+
+
+
+
+ original image
+
+
+
+ IP-Adapter image
+
+
+
+ generated image
+
+
+
+
+
+
+IP-Adapter is also useful for inpainting because the image prompt allows you to be much more specific about what you'd like to generate.
+
+Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights.
+
+```py
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image
+import torch
+
+pipeline = AutoPipelineForInpainting.from_pretrained("diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16).to("cuda")
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+pipeline.set_ip_adapter_scale(0.6)
+```
+
+Pass a prompt, the original image, mask image, and the IP-Adapter image prompt to the pipeline to generate an image.
+
+```py
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_mask.png")
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png")
+ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png")
+
+generator = torch.Generator(device="cpu").manual_seed(4)
+images = pipeline(
+ prompt="a cute gummy bear waving",
+ image=image,
+ mask_image=mask_image,
+ ip_adapter_image=ip_image,
+ generator=generator,
+ num_inference_steps=100,
+).images
+images[0]
+```
+
+
+
+
+ original image
+
+
+
+ IP-Adapter image
+
+
+
+ generated image
+
+
+
+
+
+
+IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with its motion adapter and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method.
+
+> [!WARNING]
+> If you're planning on offloading the model to the CPU, make sure you run it after you've loaded the IP-Adapter. When you call [`~DiffusionPipeline.enable_model_cpu_offload`] before loading the IP-Adapter, it offloads the image encoder module to the CPU and it'll return an error when you try to run the pipeline.
+
+```py
+import torch
+from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
+from diffusers.utils import export_to_gif
+from diffusers.utils import load_image
+
+adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
+pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16)
+scheduler = DDIMScheduler.from_pretrained(
+ "emilianJR/epiCRealism",
+ subfolder="scheduler",
+ clip_sample=False,
+ timestep_spacing="linspace",
+ beta_schedule="linear",
+ steps_offset=1,
+)
+pipeline.scheduler = scheduler
+pipeline.enable_vae_slicing()
+
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+pipeline.enable_model_cpu_offload()
+```
+
+Pass a prompt and an image prompt to the pipeline to generate a short video.
+
+```py
+ip_adapter_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_inpaint.png")
+
+output = pipeline(
+ prompt="A cute gummy bear waving",
+ negative_prompt="bad quality, worse quality, low resolution",
+ ip_adapter_image=ip_adapter_image,
+ num_frames=16,
+ guidance_scale=7.5,
+ num_inference_steps=50,
+ generator=torch.Generator(device="cpu").manual_seed(0),
+)
+frames = output.frames[0]
+export_to_gif(frames, "gummy_bear.gif")
+```
+
+
+
+
+ IP-Adapter image
+
+
+
+ generated video
+
+
+
+
+
+
+## Configure parameters
+
+There are a couple of IP-Adapter parameters that are useful to know about and can help you with your image generation tasks. These parameters can make your workflow more efficient or give you more control over image generation.
+
+### Image embeddings
+
+IP-Adapter enabled pipelines provide the `ip_adapter_image_embeds` parameter to accept precomputed image embeddings. This is particularly useful in scenarios where you need to run the IP-Adapter pipeline multiple times because you have more than one image. For example, [multi IP-Adapter](#multi-ip-adapter) is a specific use case where you provide multiple styling images to generate a specific image in a specific style. Loading and encoding multiple images each time you use the pipeline would be inefficient. Instead, you can precompute and save the image embeddings to disk (which can save a lot of space if you're using high-quality images) and load them when you need them.
+
+> [!TIP]
+> This parameter also gives you the flexibility to load embeddings from other sources. For example, ComfyUI image embeddings for IP-Adapters are compatible with Diffusers and should work out of the box!
+
+Call the [`~StableDiffusionPipeline.prepare_ip_adapter_image_embeds`] method to encode and generate the image embeddings. Then you can save them to disk with `torch.save`.
+
+> [!TIP]
+> If you're using IP-Adapter with `ip_adapter_image_embeds` instead of `ip_adapter_image`, you can set `load_ip_adapter(image_encoder_folder=None,...)` because you don't need to load an encoder to generate the image embeddings.
+
+```py
+image_embeds = pipeline.prepare_ip_adapter_image_embeds(
+ ip_adapter_image=image,
+ ip_adapter_image_embeds=None,
+ device="cuda",
+ num_images_per_prompt=1,
+ do_classifier_free_guidance=True,
+)
+
+torch.save(image_embeds, "image_embeds.ipadpt")
+```
+
+Now load the image embeddings by passing them to the `ip_adapter_image_embeds` parameter.
+
+```py
+image_embeds = torch.load("image_embeds.ipadpt")
+images = pipeline(
+ prompt="a polar bear sitting in a chair drinking a milkshake",
+ ip_adapter_image_embeds=image_embeds,
+ negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
+ num_inference_steps=100,
+ generator=generator,
+).images
+```
+
+### IP-Adapter masking
+
+Binary masks specify which portion of the output image should be assigned to an IP-Adapter. This is useful for composing more than one IP-Adapter image. For each input IP-Adapter image, you must provide a binary mask and an IP-Adapter.
+
+To start, preprocess the input IP-Adapter images with the [`~image_processor.IPAdapterMaskProcessor.preprocess()`] method to generate their masks. For optimal results, provide the output height and width to [`~image_processor.IPAdapterMaskProcessor.preprocess()`]. This ensures masks with different aspect ratios are appropriately stretched. If the input masks already match the aspect ratio of the generated image, you don't have to set the `height` and `width`.
+
+```py
+from diffusers.image_processor import IPAdapterMaskProcessor
+
+mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
+mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")
+
+output_height = 1024
+output_width = 1024
+
+processor = IPAdapterMaskProcessor()
+masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)
+```
+
+
+
+
+ mask one
+
+
+
+ mask two
+
+
+
+When there is more than one input IP-Adapter image, load them as a list to ensure each image is assigned to a different IP-Adapter. Each of the input IP-Adapter images here corresponds to one of the masks generated above. The masks and images are then passed to the pipeline together, as in the sketch below.
+
+```py
+face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
+face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")
+
+ip_images = [[face_image1], [face_image2]]
+```
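+
+Finally, pass the images and their masks to the pipeline. The following is a minimal sketch, not part of the original guide: it assumes an SDXL pipeline with two IP-Adapters already loaded (for example, as in the [Multi IP-Adapter](#multi-ip-adapter) section), and that your Diffusers version accepts the masks through `cross_attention_kwargs` under the `ip_adapter_masks` key.
+
+```py
+# assumed setup: `pipeline` has two IP-Adapters loaded, one per input image
+pipeline.set_ip_adapter_scale([0.6, 0.6])
+
+generator = torch.Generator(device="cpu").manual_seed(0)
+image = pipeline(
+    prompt="2 girls",
+    ip_adapter_image=ip_images,
+    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
+    num_inference_steps=20,
+    generator=generator,
+    cross_attention_kwargs={"ip_adapter_masks": masks},  # assumption: mask hand-off for this Diffusers version
+).images[0]
+image
+```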
+
+
+
+## Specific use cases
+
+IP-Adapter's image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases. This section covers some of the more popular applications of IP-Adapter, and we can't wait to see what you come up with!
+
+### Face model
+
+Generating accurate faces is challenging because they are complex and nuanced. Diffusers supports two IP-Adapter checkpoints specifically trained to generate faces:
+
+* [ip-adapter-full-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-full-face_sd15.safetensors) is conditioned with images of cropped faces and removed backgrounds
+* [ip-adapter-plus-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-plus-face_sd15.safetensors) uses patch embeddings and is conditioned with images of cropped faces
+
+> [!TIP]
+>
+> [IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) is a face-specific IP-Adapter trained with face ID embeddings instead of CLIP image embeddings, allowing you to generate more consistent faces in different contexts and styles. Try out this popular [community pipeline](https://github.com/huggingface/diffusers/tree/main/examples/community#ip-adapter-face-id) and see how it compares to the other face IP-Adapters.
+
+For face models, use the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) checkpoint. It is also recommended to use [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models.
+
+```py
+import torch
+from diffusers import StableDiffusionPipeline, DDIMScheduler
+from diffusers.utils import load_image
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16,
+).to("cuda")
+pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.bin")
+
+pipeline.set_ip_adapter_scale(0.5)
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein_base.png")
+generator = torch.Generator(device="cpu").manual_seed(26)
+
+image = pipeline(
+ prompt="A photo of Einstein as a chef, wearing an apron, cooking in a French restaurant",
+ ip_adapter_image=image,
+ negative_prompt="lowres, bad anatomy, worst quality, low quality",
+ num_inference_steps=100,
+ generator=generator,
+).images[0]
+image
+```
+
+*Figures: IP-Adapter image (left) and generated image (right)*
+
+### Multi IP-Adapter
+
+More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter-Face to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style.
+
+> [!TIP]
+> Read the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section to learn why you need to manually load the image encoder.
+
+Load the image encoder with [`~transformers.CLIPVisionModelWithProjection`].
+
+```py
+import torch
+from diffusers import AutoPipelineForText2Image, DDIMScheduler
+from transformers import CLIPVisionModelWithProjection
+from diffusers.utils import load_image
+
+image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+ "h94/IP-Adapter",
+ subfolder="models/image_encoder",
+ torch_dtype=torch.float16,
+)
+```
+
+Next, you'll load a base model, scheduler, and the IP-Adapters. The IP-Adapters to use are passed as a list to the `weight_name` parameter:
+
+* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder
+* [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) has the same architecture but it is conditioned with images of cropped faces
+
+```py
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16,
+ image_encoder=image_encoder,
+)
+pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
+pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"]
+)
+pipeline.set_ip_adapter_scale([0.7, 0.3])
+pipeline.enable_model_cpu_offload()
+```
+
+Load an image prompt and a folder containing images of a certain style you want to use.
+
+```py
+face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")
+style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy"
+style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]
+```
+
+*Figures: IP-Adapter image of face and IP-Adapter style images*
+
+Pass the image prompt and style images as a list to the `ip_adapter_image` parameter, and run the pipeline!
+
+```py
+generator = torch.Generator(device="cpu").manual_seed(0)
+
+image = pipeline(
+ prompt="wonderwoman",
+ ip_adapter_image=[style_images, face_image],
+ negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
+ num_inference_steps=50, num_images_per_prompt=1,
+ generator=generator,
+).images[0]
+image
+```
+
+
+
+
+### Instant generation
+
+[Latent Consistency Models (LCM)](../using-diffusers/inference_with_lcm_lora) are diffusion models that can generate images in as little as 4 steps, compared to other diffusion models like SDXL that typically require many more steps. This is why image generation with an LCM feels "instantaneous". IP-Adapters can be plugged into an LCM-LoRA model to instantly generate images with an image prompt.
+
+The IP-Adapter weights need to be loaded first, then you can use [`~StableDiffusionPipeline.load_lora_weights`] to load the LoRA style and weight you want to apply to your image.
+
+```py
+from diffusers import DiffusionPipeline, LCMScheduler
+import torch
+from diffusers.utils import load_image
+
+model_id = "sd-dreambooth-library/herge-style"
+lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"
+
+pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
+
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+pipeline.load_lora_weights(lcm_lora_id)
+pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
+pipeline.enable_model_cpu_offload()
+```
+
+Try using a lower IP-Adapter scale to condition image generation more on the [herge_style](https://huggingface.co/sd-dreambooth-library/herge-style) checkpoint, and remember to use the special token `herge_style` in your prompt to trigger and apply the style.
+
+```py
+pipeline.set_ip_adapter_scale(0.4)
+
+prompt = "herge_style woman in armor, best quality, high quality"
+generator = torch.Generator(device="cpu").manual_seed(0)
+
+ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
+image = pipeline(
+ prompt=prompt,
+ ip_adapter_image=ip_adapter_image,
+ num_inference_steps=4,
+ guidance_scale=1,
+ generator=generator,
+).images[0]
+image
+```
+
+
+
+
+### Structural control
+
+To control image generation to an even greater degree, you can combine IP-Adapter with a model like [ControlNet](../using-diffusers/controlnet). A ControlNet is also an adapter that can be inserted into a diffusion model to allow for conditioning on an additional control image. The control image can be a depth map, edge map, pose estimation, and more.
+
+Load a [`ControlNetModel`] checkpoint conditioned on depth maps, insert it into a diffusion model, and load the IP-Adapter.
+
+```py
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+import torch
+from diffusers.utils import load_image
+
+controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth"
+controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16)
+
+pipeline = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
+pipeline.to("cuda")
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+```
+
+Now load the IP-Adapter image and depth map.
+
+```py
+ip_adapter_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
+depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")
+```
+
+*Figures: IP-Adapter image (left) and depth map (right)*
+
+Pass the depth map and IP-Adapter image to the pipeline to generate an image.
+
+```py
+generator = torch.Generator(device="cpu").manual_seed(33)
+image = pipeline(
+ prompt="best quality, high quality",
+ image=depth_map,
+ ip_adapter_image=ip_adapter_image,
+ negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
+ num_inference_steps=50,
+ generator=generator,
+).images[0]
+image
+```
+
+
+
diff --git a/docs/source/en/using-diffusers/kandinsky.md b/docs/source/en/using-diffusers/kandinsky.md
new file mode 100644
index 0000000..e4f4778
--- /dev/null
+++ b/docs/source/en/using-diffusers/kandinsky.md
@@ -0,0 +1,768 @@
+
+
+# Kandinsky
+
+[[open-in-colab]]
+
+The Kandinsky models are a series of multilingual text-to-image generation models. The Kandinsky 2.0 model uses two multilingual text encoders and concatenates those results for the UNet.
+
+[Kandinsky 2.1](../api/pipelines/kandinsky) changes the architecture to include an image prior model ([`CLIP`](https://huggingface.co/docs/transformers/model_doc/clip)) to generate a mapping between text and image embeddings. The mapping provides better text-image alignment and it is used with the text embeddings during training, leading to higher quality results. Finally, Kandinsky 2.1 uses a [Modulating Quantized Vectors (MoVQ)](https://huggingface.co/papers/2209.09002) decoder - which adds a spatial conditional normalization layer to increase photorealism - to decode the latents into images.
+
+[Kandinsky 2.2](../api/pipelines/kandinsky_v22) improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images and different image sizes.
+
+[Kandinsky 3](../api/pipelines/kandinsky3) simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model. Instead, Kandinsky 3 uses [Flan-UL2](https://huggingface.co/google/flan-ul2) to encode text, a UNet with [BigGan-deep](https://hf.co/papers/1809.11096) blocks, and [Sber-MoVQGAN](https://github.com/ai-forever/MoVQGAN) to decode the latents into images. Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet.
+
+This guide will show you how to use the Kandinsky models for text-to-image, image-to-image, inpainting, interpolation, and more.
+
+Before you begin, make sure you have the following libraries installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install -q diffusers transformers accelerate
+```
+
+
+
+Kandinsky 2.1 and 2.2 usage is very similar! The only difference is Kandinsky 2.2 doesn't accept `prompt` as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts `image_embeds` during decoding.
+
+
+
+Kandinsky 3 has a more concise architecture and it doesn't require a prior model. This means its usage is identical to other diffusion models like [Stable Diffusion XL](sdxl).
+
+
+
+## Text-to-image
+
+To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The prior pipeline also generates `negative_image_embeds` that correspond to the negative prompt `""`. For better results, you can pass an actual `negative_prompt` to the prior pipeline, but this'll increase the effective batch size of the prior pipeline by 2x.
+
+
+
+
+```py
+from diffusers import KandinskyPriorPipeline, KandinskyPipeline
+import torch
+
+prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda")
+pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda")
+
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better
+image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
+```
+
+Now pass all the prompts and embeddings to the [`KandinskyPipeline`] to generate an image:
+
+```py
+image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
+image
+```
+
+
+
+
+
+
+
+
+```py
+from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
+import torch
+
+prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda")
+pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")
+
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better
+image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple()
+```
+
+Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipeline`] to generate an image:
+
+```py
+image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
+image
+```
+
+
+
+
+
+
+
+
+Kandinsky 3 doesn't require a prior model so you can directly load the [`Kandinsky3Pipeline`] and pass a prompt to generate an image:
+
+```py
+from diffusers import Kandinsky3Pipeline
+import torch
+
+pipeline = Kandinsky3Pipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+image = pipeline(prompt).images[0]
+image
+```
+
+
+
+
+๐ค Diffusers also provides an end-to-end API with the [`KandinskyCombinedPipeline`] and [`KandinskyV22CombinedPipeline`], meaning you don't have to separately load the prior and text-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want.
+
+Use the [`AutoPipelineForText2Image`] to automatically call the combined pipelines under the hood:
+
+
+
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+negative_prompt = "low quality, bad quality"
+
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
+image
+```
+
+
+
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+negative_prompt = "low quality, bad quality"
+
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
+image
+```
+
+
+
+
+## Image-to-image
+
+For image-to-image, pass the initial image and a text prompt to the pipeline to condition the generated image. Start by loading the prior pipeline:
+
+
+
+
+```py
+import torch
+from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline
+
+prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+pipeline = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+```
+
+
+
+
+```py
+import torch
+from diffusers import KandinskyV22Img2ImgPipeline, KandinskyPriorPipeline
+
+prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+```
+
+
+
+
+Kandinsky 3 doesn't require a prior model so you can directly load the image-to-image pipeline:
+
+```py
+from diffusers import Kandinsky3Img2ImgPipeline
+from diffusers.utils import load_image
+import torch
+
+pipeline = Kandinsky3Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+```
+
+
+
+
+Download an image to condition on:
+
+```py
+from diffusers.utils import load_image
+
+# download image
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+original_image = load_image(url)
+original_image = original_image.resize((768, 512))
+```
+
+
+
+
+
+Generate the `image_embeds` and `negative_image_embeds` with the prior pipeline:
+
+```py
+prompt = "A fantasy landscape, Cinematic lighting"
+negative_prompt = "low quality, bad quality"
+
+image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple()
+```
+
+Now pass the original image, and all the prompts and embeddings to the pipeline to generate an image:
+
+
+
+
+```py
+from diffusers.utils import make_image_grid
+
+image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
+```
+
+
+
+
+
+
+```py
+image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, strength=0.75, num_inference_steps=25).images[0]
+image
+```
+
+
+
+
+๐ค Diffusers also provides an end-to-end API with the [`KandinskyImg2ImgCombinedPipeline`] and [`KandinskyV22Img2ImgCombinedPipeline`], meaning you don't have to separately load the prior and image-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want.
+
+Use the [`AutoPipelineForImage2Image`] to automatically call the combined pipelines under the hood:
+
+
+
+
+```py
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+import torch
+
+pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True)
+pipeline.enable_model_cpu_offload()
+
+prompt = "A fantasy landscape, Cinematic lighting"
+negative_prompt = "low quality, bad quality"
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+original_image = load_image(url)
+
+original_image.thumbnail((768, 768))
+
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
+```
+
+
+
+
+```py
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import make_image_grid, load_image
+import torch
+
+pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+prompt = "A fantasy landscape, Cinematic lighting"
+negative_prompt = "low quality, bad quality"
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+original_image = load_image(url)
+
+original_image.thumbnail((768, 768))
+
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
+make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
+```
+
+
+
+
+## Inpainting
+
+
+
+⚠️ The Kandinsky models now use ⬜️ **white pixels** to represent the masked area instead of black pixels. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels:
+
+```py
+# For PIL input
+import PIL.ImageOps
+mask = PIL.ImageOps.invert(mask)
+
+# For PyTorch and NumPy input
+mask = 1 - mask
+```
+
+
+
+For inpainting, you'll need the original image, a mask of the area to replace in the original image, and a text prompt of what to inpaint. Load the prior pipeline:
+
+
+
+
+```py
+from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
+from diffusers.utils import load_image, make_image_grid
+import torch
+import numpy as np
+from PIL import Image
+
+prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+```
+
+
+
+
+```py
+from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline
+from diffusers.utils import load_image, make_image_grid
+import torch
+import numpy as np
+from PIL import Image
+
+prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+```
+
+
+
+
+Load an initial image and create a mask:
+
+```py
+init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
+mask = np.zeros((768, 768), dtype=np.float32)
+# mask area above cat's head
+mask[:250, 250:-250] = 1
+```
+
+Generate the embeddings with the prior pipeline:
+
+```py
+prompt = "a hat"
+prior_output = prior_pipeline(prompt)
+```
+
+Now pass the initial image, mask, and prompt and embeddings to the pipeline to generate an image:
+
+
+
+
+```py
+output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
+```
+
+
+
+
+
+
+You can also use the end-to-end [`KandinskyInpaintCombinedPipeline`] and [`KandinskyV22InpaintCombinedPipeline`] to call the prior and decoder pipelines together under the hood. Use the [`AutoPipelineForInpainting`] for this:
+
+
+
+
+```py
+import torch
+import numpy as np
+from PIL import Image
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+
+init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
+mask = np.zeros((768, 768), dtype=np.float32)
+# mask area above cat's head
+mask[:250, 250:-250] = 1
+prompt = "a hat"
+
+output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
+```
+
+
+
+
+```py
+import torch
+import numpy as np
+from PIL import Image
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+
+init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
+mask = np.zeros((768, 768), dtype=np.float32)
+# mask area above cat's head
+mask[:250, 250:-250] = 1
+prompt = "a hat"
+
+output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
+mask = Image.fromarray((mask*255).astype('uint8'), 'L')
+make_image_grid([init_image, mask, output_image], rows=1, cols=3)
+```
+
+
+
+
+## Interpolation
+
+Interpolation allows you to explore the latent space between the image and text embeddings, which is a cool way to see some of the prior model's intermediate outputs. Load the prior pipeline and two images you'd like to interpolate:
+
+
+
+
+```py
+from diffusers import KandinskyPriorPipeline, KandinskyPipeline
+from diffusers.utils import load_image, make_image_grid
+import torch
+
+prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
+img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
+make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
+```
+
+
+
+
+```py
+from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
+from diffusers.utils import load_image, make_image_grid
+import torch
+
+prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
+img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
+make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
+```
+
+
+
+
+*Figures: a cat (left) and Van Gogh's Starry Night painting (right)*
+
+Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation!
+
+```py
+images_texts = ["a cat", img_1, img_2]
+weights = [0.3, 0.3, 0.4]
+```
+
+Call the `interpolate` function to generate the embeddings, and then pass them to the pipeline to generate the image:
+
+
+
+
+```py
+# prompt can be left empty
+prompt = ""
+prior_out = prior_pipeline.interpolate(images_texts, weights)
+
+pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+
+image = pipeline(prompt, **prior_out, height=768, width=768).images[0]
+image
+```
+
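+For Kandinsky 2.2, the interpolated embeddings can be decoded the same way. The snippet below is a minimal sketch: it assumes `prior_out` was produced by the Kandinsky 2.2 prior pipeline loaded above, and unlike 2.1, the decoder takes only the image embeddings and no text prompt.
+
+```py
+# minimal sketch: decode the interpolated embeddings with the Kandinsky 2.2 decoder
+pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+
+# `prior_out` unpacks into `image_embeds` and `negative_image_embeds`
+image = pipeline(**prior_out, height=768, width=768).images[0]
+image
+```
+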
+
+
+
+
+
+## ControlNet
+
+
+
+⚠️ ControlNet is only supported for Kandinsky 2.2!
+
+
+
+ControlNet enables conditioning large pretrained diffusion models with additional inputs such as a depth map or edge detection. For example, you can condition Kandinsky 2.2 with a depth map so the model understands and preserves the structure of the depth image.
+
+Let's load an image and extract its depth map:
+
+```py
+from diffusers.utils import load_image
+
+img = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
+).resize((768, 768))
+img
+```
+
+
+
+
+
+Then you can use the `depth-estimation` [`~transformers.Pipeline`] from ๐ค Transformers to process the image and retrieve the depth map:
+
+```py
+import torch
+import numpy as np
+
+from transformers import pipeline
+
+def make_hint(image, depth_estimator):
+ image = depth_estimator(image)["depth"]
+ image = np.array(image)
+ image = image[:, :, None]
+ image = np.concatenate([image, image, image], axis=2)
+ detected_map = torch.from_numpy(image).float() / 255.0
+ hint = detected_map.permute(2, 0, 1)
+ return hint
+
+depth_estimator = pipeline("depth-estimation")
+hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
+```
+
+### Text-to-image [[controlnet-text-to-image]]
+
+Load the prior pipeline and the [`KandinskyV22ControlnetPipeline`]:
+
+```py
+from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline
+
+prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
+).to("cuda")
+
+pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
+).to("cuda")
+```
+
+Generate the image embeddings from a prompt and negative prompt:
+
+```py
+prompt = "A robot, 4k photo"
+negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
+
+generator = torch.Generator(device="cuda").manual_seed(43)
+
+image_emb, zero_image_emb = prior_pipeline(
+ prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
+).to_tuple()
+```
+
+Finally, pass the image embeddings and the depth image to the [`KandinskyV22ControlnetPipeline`] to generate an image:
+
+```py
+image = pipeline(image_embeds=image_emb, negative_image_embeds=zero_image_emb, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
+image
+```
+
+
+
+
+
+### Image-to-image [[controlnet-image-to-image]]
+
+For image-to-image with ControlNet, you'll need:
+
+- [`KandinskyV22PriorEmb2EmbPipeline`] to generate the image embeddings from a text prompt and an image
+- [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings
+
+Process and extract a depth map of an initial image of a cat with the `depth-estimation` [`~transformers.Pipeline`] from ๐ค Transformers:
+
+```py
+import torch
+import numpy as np
+
+from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
+from diffusers.utils import load_image, make_image_grid
+from transformers import pipeline
+
+img = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
+).resize((768, 768))
+
+def make_hint(image, depth_estimator):
+ image = depth_estimator(image)["depth"]
+ image = np.array(image)
+ image = image[:, :, None]
+ image = np.concatenate([image, image, image], axis=2)
+ detected_map = torch.from_numpy(image).float() / 255.0
+ hint = detected_map.permute(2, 0, 1)
+ return hint
+
+depth_estimator = pipeline("depth-estimation")
+hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
+```
+
+Load the prior pipeline and the [`KandinskyV22ControlnetImg2ImgPipeline`]:
+
+```py
+prior_pipeline = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
+).to("cuda")
+
+pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
+).to("cuda")
+```
+
+Pass a text prompt and the initial image to the prior pipeline to generate the image embeddings:
+
+```py
+prompt = "A robot, 4k photo"
+negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
+
+generator = torch.Generator(device="cuda").manual_seed(43)
+
+img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator)
+negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
+```
+
+Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings:
+
+```py
+image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
+make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
+```
+
+
+
+
+
+## Optimizations
+
+Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to improve Kandinsky's inference performance.
+
+1. Enable [xFormers](../optimization/xformers) if you're using PyTorch < 2.0:
+
+```diff
+ from diffusers import DiffusionPipeline
+ import torch
+
+ pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
++ pipe.enable_xformers_memory_efficient_attention()
+```
+
+2. Enable `torch.compile` if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA):
+
+```diff
+ pipe.unet.to(memory_format=torch.channels_last)
++ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]:
+
+```py
+from diffusers.models.attention_processor import AttnAddedKVProcessor2_0
+
+pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
+```
+
+3. Offload the model to the CPU with [`~KandinskyPriorPipeline.enable_model_cpu_offload`] to avoid out-of-memory errors:
+
+```diff
+ from diffusers import DiffusionPipeline
+ import torch
+
+ pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
++ pipe.enable_model_cpu_offload()
+```
+
+4. By default, the text-to-image pipeline uses the [`DDIMScheduler`] but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality:
+
+```py
+from diffusers import DDPMScheduler
+from diffusers import DiffusionPipeline
+
+scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
+pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+```
diff --git a/docs/source/en/using-diffusers/loading.md b/docs/source/en/using-diffusers/loading.md
new file mode 100644
index 0000000..de87e53
--- /dev/null
+++ b/docs/source/en/using-diffusers/loading.md
@@ -0,0 +1,485 @@
+
+
+# Load pipelines, models, and schedulers
+
+[[open-in-colab]]
+
+Having an easy way to use a diffusion system for inference is essential to ๐งจ Diffusers. Diffusion systems often consist of multiple components like parameterized models, tokenizers, and schedulers that interact in complex ways. That is why we designed the [`DiffusionPipeline`] to wrap the complexity of the entire diffusion system into an easy-to-use API, while remaining flexible enough to be adapted for other use cases, such as loading each component individually as building blocks to assemble your own diffusion system.
+
+Everything you need for inference or training is accessible with the `from_pretrained()` method.
+
+This guide will show you how to load:
+
+- pipelines from the Hub and locally
+- different components into a pipeline
+- checkpoint variants such as different floating point types or non-exponential moving averaged (non-EMA) weights
+- models and schedulers
+
+## Diffusion Pipeline
+
+
+
+๐ก Skip to the [DiffusionPipeline explained](#diffusionpipeline-explained) section if you are interested in learning in more detail about how the [`DiffusionPipeline`] class works.
+
+
+
+The [`DiffusionPipeline`] class is the simplest and most generic way to load the latest trending diffusion model from the [Hub](https://huggingface.co/models?library=diffusers&sort=trending). The [`DiffusionPipeline.from_pretrained`] method automatically detects the correct pipeline class from the checkpoint, downloads, and caches all the required configuration and weight files, and returns a pipeline instance ready for inference.
+
+```python
+from diffusers import DiffusionPipeline
+
+repo_id = "runwayml/stable-diffusion-v1-5"
+pipe = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
+```
+
+You can also load a checkpoint with its specific pipeline class. The example above loaded a Stable Diffusion model; to get the same result, use the [`StableDiffusionPipeline`] class:
+
+```python
+from diffusers import StableDiffusionPipeline
+
+repo_id = "runwayml/stable-diffusion-v1-5"
+pipe = StableDiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
+```
+
+A checkpoint (such as [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) or [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)) may also be used for more than one task, like text-to-image or image-to-image. To differentiate what task you want to use the checkpoint for, you have to load it directly with its corresponding task-specific pipeline class:
+
+```python
+from diffusers import StableDiffusionImg2ImgPipeline
+
+repo_id = "runwayml/stable-diffusion-v1-5"
+pipe = StableDiffusionImg2ImgPipeline.from_pretrained(repo_id)
+```
+
+### Local pipeline
+
+To load a diffusion pipeline locally, use [`git-lfs`](https://git-lfs.github.com/) to manually download the checkpoint (in this case, [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)) to your local disk. This creates a local folder, `./stable-diffusion-v1-5`, on your disk:
+
+```bash
+git-lfs install
+git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
+```
+
+Then pass the local path to [`~DiffusionPipeline.from_pretrained`]:
+
+```python
+from diffusers import DiffusionPipeline
+
+repo_id = "./stable-diffusion-v1-5"
+stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
+```
+
+The [`~DiffusionPipeline.from_pretrained`] method won't download any files from the Hub when it detects a local path, but this also means it won't download and cache the latest changes to a checkpoint.
+
+### Swap components in a pipeline
+
+You can customize the default components of any pipeline with another compatible component. Customization is important because:
+
+- Changing the scheduler is important for exploring the trade-off between generation speed and quality.
+- Different components of a model are typically trained independently and you can swap out a component with a better-performing one.
+- During finetuning, usually only some components - like the UNet or text encoder - are trained.
+
+To find out which schedulers are compatible for customization, you can use the `compatibles` method:
+
+```py
+from diffusers import DiffusionPipeline
+
+repo_id = "runwayml/stable-diffusion-v1-5"
+stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
+stable_diffusion.scheduler.compatibles
+```
+
+Let's use the [`SchedulerMixin.from_pretrained`] method to replace the default [`PNDMScheduler`] with a more performant scheduler, [`EulerDiscreteScheduler`]. The `subfolder="scheduler"` argument is required to load the scheduler configuration from the correct [subfolder](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/scheduler) of the pipeline repository.
+
+Then you can pass the new [`EulerDiscreteScheduler`] instance to the `scheduler` argument in [`DiffusionPipeline`]:
+
+```python
+from diffusers import DiffusionPipeline, EulerDiscreteScheduler
+
+repo_id = "runwayml/stable-diffusion-v1-5"
+scheduler = EulerDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")
+stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, scheduler=scheduler, use_safetensors=True)
+```
+
+### Safety checker
+
+Diffusion models like Stable Diffusion can generate harmful content, which is why ๐งจ Diffusers has a [safety checker](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) to check generated outputs against known hardcoded NSFW content. If you'd like to disable the safety checker for whatever reason, pass `None` to the `safety_checker` argument:
+
+```python
+from diffusers import DiffusionPipeline
+
+repo_id = "runwayml/stable-diffusion-v1-5"
+stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, safety_checker=None, use_safetensors=True)
+"""
+You have disabled the safety checker for by passing `safety_checker=None`. Ensure that you abide by the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend keeping the safety filter enabled in all public-facing circumstances, disabling it only for use cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
+"""
+```
+
+### Reuse components across pipelines
+
+You can also reuse the same components in multiple pipelines to avoid loading the weights into RAM twice. Use the [`~DiffusionPipeline.components`] method to save the components:
+
+```python
+from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True)
+
+components = stable_diffusion_txt2img.components
+```
+
+Then you can pass the `components` to another pipeline without reloading the weights into RAM:
+
+```py
+stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(**components)
+```
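+
+Because the modules are passed by reference, both pipelines now share the exact same model instances. As a quick sanity check (a sketch, not part of the original workflow), you can verify that no weights were duplicated in memory:
+
+```py
+# both pipelines point at the very same module objects, so the weights exist in RAM only once
+assert stable_diffusion_img2img.unet is stable_diffusion_txt2img.unet
+assert stable_diffusion_img2img.vae is stable_diffusion_txt2img.vae
+```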
+
+You can also pass the components individually to the pipeline if you want more flexibility over which components to reuse or disable. For example, to reuse the same components in the text-to-image pipeline, except for the safety checker and feature extractor, in the image-to-image pipeline:
+
+```py
+from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True)
+stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(
+ vae=stable_diffusion_txt2img.vae,
+ text_encoder=stable_diffusion_txt2img.text_encoder,
+ tokenizer=stable_diffusion_txt2img.tokenizer,
+ unet=stable_diffusion_txt2img.unet,
+ scheduler=stable_diffusion_txt2img.scheduler,
+ safety_checker=None,
+ feature_extractor=None,
+ requires_safety_checker=False,
+)
+```
+
+## Checkpoint variants
+
+A checkpoint variant is usually a checkpoint whose weights are:
+
+- Stored in a different floating point type for lower precision and lower storage, such as [`torch.float16`](https://pytorch.org/docs/stable/tensors.html#data-types), because it only requires half the bandwidth and storage to download. You can't use this variant if you're continuing training or using a CPU.
+- Non-exponential moving averaged (non-EMA) weights, which shouldn't be used for inference. You should use these to continue fine-tuning a model.
+
+
+
+๐ก When the checkpoints have identical model structures, but they were trained on different datasets and with a different training setup, they should be stored in separate repositories instead of variations (for example, [`stable-diffusion-v1-4`] and [`stable-diffusion-v1-5`]).
+
+
+
+Otherwise, a variant is **identical** to the original checkpoint. They have exactly the same serialization format (like [Safetensors](./using_safetensors)), model structure, and weights that have identical tensor shapes.
+
+| **checkpoint type** | **weight name** | **argument for loading weights** |
+|---------------------|-------------------------------------|----------------------------------|
+| original | diffusion_pytorch_model.bin | |
+| floating point | diffusion_pytorch_model.fp16.bin | `variant`, `torch_dtype` |
+| non-EMA | diffusion_pytorch_model.non_ema.bin | `variant` |
+
+There are two important arguments to know for loading variants:
+
+- `torch_dtype` defines the floating point precision of the loaded checkpoints. For example, if you want to save bandwidth by loading a `fp16` variant, you should specify `torch_dtype=torch.float16` to *convert the weights* to `fp16`. Otherwise, the `fp16` weights are converted to the default `fp32` precision. You can also load the original checkpoint without defining the `variant` argument, and convert it to `fp16` with `torch_dtype=torch.float16`. In this case, the default `fp32` weights are downloaded first, and then they're converted to `fp16` after loading.
+
+- `variant` defines which files should be loaded from the repository. For example, if you want to load a `non_ema` variant from the [`diffusers/stable-diffusion-variants`](https://huggingface.co/diffusers/stable-diffusion-variants/tree/main/unet) repository, you should specify `variant="non_ema"` to download the `non_ema` files.
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+# load fp16 variant
+stable_diffusion = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, use_safetensors=True
+)
+# load non_ema variant
+stable_diffusion = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", variant="non_ema", use_safetensors=True
+)
+```
+
+To save a checkpoint stored in a different floating-point type or as a non-EMA variant, use the [`DiffusionPipeline.save_pretrained`] method and specify the `variant` argument. You should try and save a variant to the same folder as the original checkpoint, so you can load both from the same folder:
+
+```python
+from diffusers import DiffusionPipeline
+
+# save as fp16 variant
+stable_diffusion.save_pretrained("runwayml/stable-diffusion-v1-5", variant="fp16")
+# save as non-ema variant
+stable_diffusion.save_pretrained("runwayml/stable-diffusion-v1-5", variant="non_ema")
+```
+
+If you don't save the variant to an existing folder, you must specify the `variant` argument otherwise it'll throw an `Exception` because it can't find the original checkpoint:
+
+```python
+# 👎 this won't work
+stable_diffusion = DiffusionPipeline.from_pretrained(
+ "./stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+)
+# 👍 this works
+stable_diffusion = DiffusionPipeline.from_pretrained(
+ "./stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, use_safetensors=True
+)
+```
+
+
+
+## Models
+
+Models are loaded from the [`ModelMixin.from_pretrained`] method, which downloads and caches the latest version of the model weights and configurations. If the latest files are available in the local cache, [`~ModelMixin.from_pretrained`] reuses files in the cache instead of re-downloading them.
+
+Models can be loaded from a subfolder with the `subfolder` argument. For example, the model weights for `runwayml/stable-diffusion-v1-5` are stored in the [`unet`](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/unet) subfolder:
+
+```python
+from diffusers import UNet2DConditionModel
+
+repo_id = "runwayml/stable-diffusion-v1-5"
+model = UNet2DConditionModel.from_pretrained(repo_id, subfolder="unet", use_safetensors=True)
+```
+
+Or directly from a repository's [directory](https://huggingface.co/google/ddpm-cifar10-32/tree/main):
+
+```python
+from diffusers import UNet2DModel
+
+repo_id = "google/ddpm-cifar10-32"
+model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True)
+```
+
+You can also load and save model variants by specifying the `variant` argument in [`ModelMixin.from_pretrained`] and [`ModelMixin.save_pretrained`]:
+
+```python
+from diffusers import UNet2DConditionModel
+
+model = UNet2DConditionModel.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", subfolder="unet", variant="non_ema", use_safetensors=True
+)
+model.save_pretrained("./local-unet", variant="non_ema")
+```
+
+## Schedulers
+
+Schedulers are loaded from the [`SchedulerMixin.from_pretrained`] method, and unlike models, schedulers are **not parameterized** or **trained**; they are defined by a configuration file.
+
+Loading schedulers does not consume any significant amount of memory and the same configuration file can be used for a variety of different schedulers.
+For example, the following schedulers are compatible with [`StableDiffusionPipeline`], which means you can load the same scheduler configuration file in any of these classes:
+
+```python
+from diffusers import StableDiffusionPipeline
+from diffusers import (
+ DDPMScheduler,
+ DDIMScheduler,
+ PNDMScheduler,
+ LMSDiscreteScheduler,
+ EulerAncestralDiscreteScheduler,
+ EulerDiscreteScheduler,
+ DPMSolverMultistepScheduler,
+)
+
+repo_id = "runwayml/stable-diffusion-v1-5"
+
+ddpm = DDPMScheduler.from_pretrained(repo_id, subfolder="scheduler")
+ddim = DDIMScheduler.from_pretrained(repo_id, subfolder="scheduler")
+pndm = PNDMScheduler.from_pretrained(repo_id, subfolder="scheduler")
+lms = LMSDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")
+euler_anc = EulerAncestralDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")
+euler = EulerDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")
+dpm = DPMSolverMultistepScheduler.from_pretrained(repo_id, subfolder="scheduler")
+
+# replace `dpm` with any of `ddpm`, `ddim`, `pndm`, `lms`, `euler_anc`, `euler`
+pipeline = StableDiffusionPipeline.from_pretrained(repo_id, scheduler=dpm, use_safetensors=True)
+```
+
+## DiffusionPipeline explained
+
+As a class method, [`DiffusionPipeline.from_pretrained`] is responsible for two things:
+
+- Download the latest version of the folder structure required for inference and cache it. If the latest folder structure is available in the local cache, [`DiffusionPipeline.from_pretrained`] reuses the cache and won't redownload the files.
+- Load the cached weights into the correct pipeline [class](../api/pipelines/overview#diffusers-summary) - retrieved from the `model_index.json` file - and return an instance of it.
+
+The pipelines' underlying folder structure corresponds directly with their class instances. For example, the [`StableDiffusionPipeline`] corresponds to the folder structure in [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5).
+
+```python
+from diffusers import DiffusionPipeline
+
+repo_id = "runwayml/stable-diffusion-v1-5"
+pipeline = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
+print(pipeline)
+```
+
+You'll see pipeline is an instance of [`StableDiffusionPipeline`], which consists of seven components:
+
+- `"feature_extractor"`: a [`~transformers.CLIPImageProcessor`] from ๐ค Transformers.
+- `"safety_checker"`: a [component](https://github.com/huggingface/diffusers/blob/e55687e1e15407f60f32242027b7bb8170e58266/src/diffusers/pipelines/stable_diffusion/safety_checker.py#L32) for screening against harmful content.
+- `"scheduler"`: an instance of [`PNDMScheduler`].
+- `"text_encoder"`: a [`~transformers.CLIPTextModel`] from ๐ค Transformers.
+- `"tokenizer"`: a [`~transformers.CLIPTokenizer`] from ๐ค Transformers.
+- `"unet"`: an instance of [`UNet2DConditionModel`].
+- `"vae"`: an instance of [`AutoencoderKL`].
+
+```json
+StableDiffusionPipeline {
+ "feature_extractor": [
+ "transformers",
+ "CLIPImageProcessor"
+ ],
+ "safety_checker": [
+ "stable_diffusion",
+ "StableDiffusionSafetyChecker"
+ ],
+ "scheduler": [
+ "diffusers",
+ "PNDMScheduler"
+ ],
+ "text_encoder": [
+ "transformers",
+ "CLIPTextModel"
+ ],
+ "tokenizer": [
+ "transformers",
+ "CLIPTokenizer"
+ ],
+ "unet": [
+ "diffusers",
+ "UNet2DConditionModel"
+ ],
+ "vae": [
+ "diffusers",
+ "AutoencoderKL"
+ ]
+}
+```
+
+Compare the components of the pipeline instance to the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main) folder structure, and you'll see there is a separate folder for each of the components in the repository:
+
+```
+.
+├── feature_extractor
+│   └── preprocessor_config.json
+├── model_index.json
+├── safety_checker
+│   ├── config.json
+│   ├── model.fp16.safetensors
+│   ├── model.safetensors
+│   ├── pytorch_model.bin
+│   └── pytorch_model.fp16.bin
+├── scheduler
+│   └── scheduler_config.json
+├── text_encoder
+│   ├── config.json
+│   ├── model.fp16.safetensors
+│   ├── model.safetensors
+│   ├── pytorch_model.bin
+│   └── pytorch_model.fp16.bin
+├── tokenizer
+│   ├── merges.txt
+│   ├── special_tokens_map.json
+│   ├── tokenizer_config.json
+│   └── vocab.json
+├── unet
+│   ├── config.json
+│   ├── diffusion_pytorch_model.bin
+│   ├── diffusion_pytorch_model.fp16.bin
+│   ├── diffusion_pytorch_model.fp16.safetensors
+│   ├── diffusion_pytorch_model.non_ema.bin
+│   ├── diffusion_pytorch_model.non_ema.safetensors
+│   └── diffusion_pytorch_model.safetensors
+└── vae
+    ├── config.json
+    ├── diffusion_pytorch_model.bin
+    ├── diffusion_pytorch_model.fp16.bin
+    ├── diffusion_pytorch_model.fp16.safetensors
+    └── diffusion_pytorch_model.safetensors
+```
+
+You can access each of the components of the pipeline as an attribute to view its configuration:
+
+```py
+pipeline.tokenizer
+CLIPTokenizer(
+ name_or_path="/root/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/39593d5650112b4cc580433f6b0435385882d819/tokenizer",
+ vocab_size=49408,
+ model_max_length=77,
+ is_fast=False,
+ padding_side="right",
+ truncation_side="right",
+ special_tokens={
+ "bos_token": AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
+ "eos_token": AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
+ "unk_token": AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
+ "pad_token": "<|endoftext|>",
+ },
+ clean_up_tokenization_spaces=True
+)
+```
+
+Every pipeline expects a [`model_index.json`](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json) file that tells the [`DiffusionPipeline`]:
+
+- which pipeline class to load from `_class_name`
+- which version of ๐งจ Diffusers was used to create the model in `_diffusers_version`
+- what components from which library are stored in the subfolders (`name` corresponds to the component and subfolder name, `library` corresponds to the name of the library to load the class from, and `class` corresponds to the class name)
+
+```json
+{
+ "_class_name": "StableDiffusionPipeline",
+ "_diffusers_version": "0.6.0",
+ "feature_extractor": [
+ "transformers",
+ "CLIPImageProcessor"
+ ],
+ "safety_checker": [
+ "stable_diffusion",
+ "StableDiffusionSafetyChecker"
+ ],
+ "scheduler": [
+ "diffusers",
+ "PNDMScheduler"
+ ],
+ "text_encoder": [
+ "transformers",
+ "CLIPTextModel"
+ ],
+ "tokenizer": [
+ "transformers",
+ "CLIPTokenizer"
+ ],
+ "unet": [
+ "diffusers",
+ "UNet2DConditionModel"
+ ],
+ "vae": [
+ "diffusers",
+ "AutoencoderKL"
+ ]
+}
+```
diff --git a/docs/source/en/using-diffusers/loading_adapters.md b/docs/source/en/using-diffusers/loading_adapters.md
new file mode 100644
index 0000000..b59b46a
--- /dev/null
+++ b/docs/source/en/using-diffusers/loading_adapters.md
@@ -0,0 +1,297 @@
+
+
+# Load adapters
+
+[[open-in-colab]]
+
+There are several [training](../training/overview) techniques for personalizing diffusion models to generate images of a specific subject or images in certain styles. Each of these training methods produces a different type of adapter. Some of the adapters generate an entirely new model, while other adapters only modify a smaller set of embeddings or weights. This means the loading process for each adapter is also different.
+
+This guide will show you how to load DreamBooth, textual inversion, and LoRA weights.
+
+
+
+Feel free to browse the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer), [LoRA the Explorer](https://huggingface.co/spaces/multimodalart/LoraTheExplorer), and the [Diffusers Models Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) for checkpoints and embeddings to use.
+
+
+
+## DreamBooth
+
+[DreamBooth](https://dreambooth.github.io/) finetunes an *entire diffusion model* on just several images of a subject to generate images of that subject in new styles and settings. This method works by using a special word in the prompt that the model learns to associate with the subject image. Of all the training methods, DreamBooth produces the largest file size (usually a few GBs) because it is a full checkpoint model.
+
+Let's load the [herge_style](https://huggingface.co/sd-dreambooth-library/herge-style) checkpoint, which is trained on just 10 images drawn by Hergé, to generate images in that style. For it to work, you need to include the special word `herge_style` in your prompt to trigger the checkpoint:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("sd-dreambooth-library/herge-style", torch_dtype=torch.float16).to("cuda")
+prompt = "A cute herge_style brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration"
+image = pipeline(prompt).images[0]
+image
+```
+
+
+
+
+
+## Textual inversion
+
+[Textual inversion](https://textual-inversion.github.io/) is very similar to DreamBooth and it can also personalize a diffusion model to generate certain concepts (styles, objects) from just a few images. This method works by training and finding new embeddings that represent the images you provide with a special word in the prompt. As a result, the diffusion model weights stay the same and the training process produces a relatively tiny (a few KBs) file.
+
+Because textual inversion creates embeddings, it cannot be used on its own like DreamBooth and requires another model.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
+```
+
+Now you can load the textual inversion embeddings with the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] method and generate some images. Let's load the [sd-concepts-library/gta5-artwork](https://huggingface.co/sd-concepts-library/gta5-artwork) embeddings; you'll need to include the special word `<gta5-artwork>` in your prompt to trigger it:
+
+```py
+pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork")
+prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, style"
+image = pipeline(prompt).images[0]
+image
+```
+
+
+
+
+
+Textual inversion can also be trained on undesirable things to create *negative embeddings* that discourage a model from generating images with those undesirable traits, such as blurry images or extra fingers on a hand. This can be an easy way to quickly improve your prompt. You'll also load the embeddings with [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`], but this time, you'll need two more parameters:
+
+- `weight_name`: specifies the weight file to load if the file was saved in the ๐ค Diffusers format with a specific name or if the file is stored in the A1111 format
+- `token`: specifies the special word to use in the prompt to trigger the embeddings
+
+Let's load the [sayakpaul/EasyNegative-test](https://huggingface.co/sayakpaul/EasyNegative-test) embeddings:
+
+```py
+pipeline.load_textual_inversion(
+ "sayakpaul/EasyNegative-test", weight_name="EasyNegative.safetensors", token="EasyNegative"
+)
+```
+
+Now you can use the `token` to generate an image with the negative embeddings:
+
+```py
+prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, EasyNegative"
+negative_prompt = "EasyNegative"
+
+image = pipeline(prompt, negative_prompt=negative_prompt, num_inference_steps=50).images[0]
+image
+```
+
+
+
+
+
+## LoRA
+
+[Low-Rank Adaptation (LoRA)](https://huggingface.co/papers/2106.09685) is a popular training technique because it is fast and generates smaller file sizes (a couple hundred MBs). Like the other methods in this guide, LoRA can train a model to learn new styles from just a few images. It works by inserting new weights into the diffusion model and then only the new weights are trained instead of the entire model. This makes LoRAs faster to train and easier to store.
+
+
+
+LoRA is a very general training technique that can be used with other training methods. For example, it is common to train a model with DreamBooth and LoRA. It is also increasingly common to load and merge multiple LoRAs to create new and unique images. You can learn more about it in the in-depth [Merge LoRAs](merge_loras) guide since merging is outside the scope of this loading guide.
+
+
+
+LoRAs also need to be used with another model:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+```
+
+Then use the [`~loaders.LoraLoaderMixin.load_lora_weights`] method to load the [ostris/super-cereal-sdxl-lora](https://huggingface.co/ostris/super-cereal-sdxl-lora) weights and specify the weights filename from the repository:
+
+```py
+pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora", weight_name="cereal_box_sdxl_v1.safetensors")
+prompt = "bears, pizza bites"
+image = pipeline(prompt).images[0]
+image
+```
+
+
+
+
+
+The [`~loaders.LoraLoaderMixin.load_lora_weights`] method loads LoRA weights into both the UNet and text encoder. It is the preferred way to load LoRAs because it can handle cases where:
+
+- the LoRA weights don't have separate identifiers for the UNet and text encoder
+- the LoRA weights have separate identifiers for the UNet and text encoder
+
+But if you only need to load LoRA weights into the UNet, then you can use the [`~loaders.UNet2DConditionLoadersMixin.load_attn_procs`] method. Let's load the [jbilcke-hf/sdxl-cinematic-1](https://huggingface.co/jbilcke-hf/sdxl-cinematic-1) LoRA:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+pipeline.unet.load_attn_procs("jbilcke-hf/sdxl-cinematic-1", weight_name="pytorch_lora_weights.safetensors")
+
+# use cnmt in the prompt to trigger the LoRA
+prompt = "A cute cnmt eating a slice of pizza, stunning color scheme, masterpiece, illustration"
+image = pipeline(prompt).images[0]
+image
+```
+
+
+
+
+
+
+
+For both [`~loaders.LoraLoaderMixin.load_lora_weights`] and [`~loaders.UNet2DConditionLoadersMixin.load_attn_procs`], you can pass the `cross_attention_kwargs={"scale": 0.5}` parameter to adjust how much of the LoRA weights to use. A value of `0` is the same as only using the base model weights, and a value of `1` is equivalent to using the fully finetuned LoRA.
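+
+For example, here is a minimal sketch (reusing the pipeline and the `cnmt` trigger word loaded above) that applies only half of the LoRA weights:
+
+```py
+# scale=0.5 blends the base model weights and the LoRA weights equally
+image = pipeline(
+    "A cute cnmt eating a slice of pizza, stunning color scheme, masterpiece, illustration",
+    cross_attention_kwargs={"scale": 0.5},
+).images[0]
+```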
+
+
+
+To unload the LoRA weights, use the [`~loaders.LoraLoaderMixin.unload_lora_weights`] method to discard the LoRA weights and restore the model to its original weights:
+
+```py
+pipeline.unload_lora_weights()
+```
+
+### Kohya and TheLastBen
+
+Other popular LoRA trainers from the community include those by [Kohya](https://github.com/kohya-ss/sd-scripts/) and [TheLastBen](https://github.com/TheLastBen/fast-stable-diffusion). These trainers create different LoRA checkpoints than those trained by ๐ค Diffusers, but they can still be loaded in the same way.
+
+
+
+
+To load a Kohya LoRA, let's download the [Blueprintify SD XL 1.0](https://civitai.com/models/150986/blueprintify-sd-xl-10) checkpoint from [Civitai](https://civitai.com/) as an example:
+
+```sh
+!wget https://civitai.com/api/download/models/168776 -O blueprintify-sd-xl-10.safetensors
+```
+
+Load the LoRA checkpoint with the [`~loaders.LoraLoaderMixin.load_lora_weights`] method, and specify the filename in the `weight_name` parameter:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+pipeline.load_lora_weights("path/to/weights", weight_name="blueprintify-sd-xl-10.safetensors")
+```
+
+Generate an image:
+
+```py
+# use bl3uprint in the prompt to trigger the LoRA
+prompt = "bl3uprint, a highly detailed blueprint of the eiffel tower, explaining how to build all parts, many txt, blueprint grid backdrop"
+image = pipeline(prompt).images[0]
+image
+```
+
+
+
+Some limitations of using Kohya LoRAs with ๐ค Diffusers include:
+
+- Images may not look like those generated by UIs - like ComfyUI - for multiple reasons, which are explained [here](https://github.com/huggingface/diffusers/pull/4287/#issuecomment-1655110736).
+- [LyCORIS checkpoints](https://github.com/KohakuBlueleaf/LyCORIS) aren't fully supported. The [`~loaders.LoraLoaderMixin.load_lora_weights`] method loads LyCORIS checkpoints with LoRA and LoCon modules, but Hada and LoKR are not supported.
+
+
+
+
+
+
+Loading a checkpoint from TheLastBen is very similar. For example, to load the [TheLastBen/William_Eggleston_Style_SDXL](https://huggingface.co/TheLastBen/William_Eggleston_Style_SDXL) checkpoint:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+pipeline.load_lora_weights("TheLastBen/William_Eggleston_Style_SDXL", weight_name="wegg.safetensors")
+
+# use by william eggleston in the prompt to trigger the LoRA
+prompt = "a house by william eggleston, sunrays, beautiful, sunlight, sunrays, beautiful"
+image = pipeline(prompt=prompt).images[0]
+image
+```
+
+
+
+
+## IP-Adapter
+
+[IP-Adapter](https://ip-adapter.github.io/) is a lightweight adapter that enables image prompting for any diffusion model. This adapter works by decoupling the cross-attention layers of the image and text features. All the other model components are frozen and only the embedded image features in the UNet are trained. As a result, IP-Adapter files are typically only ~100MBs.
+
+You can learn more about how to use IP-Adapter for different tasks and specific use cases in the [IP-Adapter](../using-diffusers/ip_adapter) guide.
+
+> [!TIP]
+> Diffusers currently only supports IP-Adapter for some of the most popular pipelines. Feel free to open a feature request if you have a cool use case and want to integrate IP-Adapter with an unsupported pipeline!
+> Official IP-Adapter checkpoints are available from [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter).
+
+To start, load a Stable Diffusion checkpoint.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+from diffusers.utils import load_image
+
+pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
+```
+
+Then load the IP-Adapter weights and add it to the pipeline with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method.
+
+```py
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+```
+
+Once loaded, you can use the pipeline with an image and text prompt to guide the image generation process.
+
+```py
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load_neg_embed.png")
+generator = torch.Generator(device="cpu").manual_seed(33)
+images = pipeline(
+    prompt="best quality, high quality, wearing sunglasses",
+    ip_adapter_image=image,
+    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
+    num_inference_steps=50,
+    generator=generator,
+).images[0]
+images
+```
+
+
+
+
+
+### IP-Adapter Plus
+
+IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains an `image_encoder` subfolder, the image encoder is automatically loaded and registered to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline.
+
+This is the case for *IP-Adapter Plus* checkpoints which use the ViT-H image encoder.
+
+```py
+from diffusers import AutoPipelineForText2Image
+from transformers import CLIPVisionModelWithProjection
+import torch
+
+image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+ "h94/IP-Adapter",
+ subfolder="models/image_encoder",
+ torch_dtype=torch.float16
+)
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ torch_dtype=torch.float16
+).to("cuda")
+
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
+```
diff --git a/docs/source/en/using-diffusers/loading_overview.md b/docs/source/en/using-diffusers/loading_overview.md
new file mode 100644
index 0000000..fb3163b
--- /dev/null
+++ b/docs/source/en/using-diffusers/loading_overview.md
@@ -0,0 +1,17 @@
+
+
+# Overview
+
+๐งจ Diffusers offers many pipelines, models, and schedulers for generative tasks. To make loading these components as simple as possible, we provide a single and unified method - `from_pretrained()` - that loads any of these components from either the Hugging Face [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) or your local machine. Whenever you load a pipeline or model, the latest files are automatically downloaded and cached so you can quickly reuse them next time without redownloading the files.
+
+This section will show you everything you need to know about loading pipelines, how to load different components in a pipeline, how to load checkpoint variants, and how to load community pipelines. You'll also learn how to load schedulers and compare the speed and quality trade-offs of using different schedulers. Finally, you'll see how to convert and load KerasCV checkpoints so you can use them in PyTorch with ๐งจ Diffusers.
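+
+The snippet below is a minimal sketch of that single entry point, showing that the same `from_pretrained()` call loads a whole pipeline, an individual model, or a scheduler (using the Stable Diffusion v1.5 checkpoint as an example):
+
+```py
+from diffusers import DiffusionPipeline, UNet2DConditionModel, PNDMScheduler
+
+# a full pipeline, a single model, and a scheduler, all loaded with from_pretrained()
+pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
+scheduler = PNDMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
+```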
diff --git a/docs/source/en/using-diffusers/merge_loras.md b/docs/source/en/using-diffusers/merge_loras.md
new file mode 100644
index 0000000..87a588c
--- /dev/null
+++ b/docs/source/en/using-diffusers/merge_loras.md
@@ -0,0 +1,266 @@
+
+
+# Merge LoRAs
+
+It can be fun and creative to use multiple [LoRAs](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) together to generate something entirely new and unique. This works by merging multiple LoRA weights together to produce images that are a blend of different styles. Diffusers provides a few methods to merge LoRAs depending on *how* you want to merge their weights, which can affect image quality.
+
+This guide will show you how to merge LoRAs using the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] and [`~peft.LoraModel.add_weighted_adapter`] methods. To improve inference speed and reduce the memory usage of merged LoRAs, you'll also see how to use the [`~loaders.LoraLoaderMixin.fuse_lora`] method to fuse the LoRA weights with the original weights of the underlying model.
+
+For this guide, load a Stable Diffusion XL (SDXL) checkpoint and the [ostris/ikea-instructions-lora-sdxl](https://huggingface.co/ostris/ikea-instructions-lora-sdxl) and [lordjia/by-feng-zikai](https://huggingface.co/lordjia/by-feng-zikai) LoRAs with the [`~loaders.LoraLoaderMixin.load_lora_weights`] method. You'll need to assign each LoRA an `adapter_name` to combine them later.
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
+pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
+```
+
+## set_adapters
+
+The [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method merges LoRA adapters by concatenating their weighted matrices. Use the adapter name to specify which LoRAs to merge, and the `adapter_weights` parameter to control the scaling for each LoRA. For example, if `adapter_weights=[0.5, 0.5]`, then the merged LoRA output is an average of both LoRAs. Try adjusting the adapter weights to see how it affects the generated image!
+
+```py
+pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
+
+generator = torch.manual_seed(0)
+prompt = "A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai"
+image = pipeline(prompt, generator=generator, cross_attention_kwargs={"scale": 1.0}).images[0]
+image
+```
+
+
+
+
+
+## add_weighted_adapter
+
+> [!WARNING]
+> This is an experimental method that adds PEFT's [`~peft.LoraModel.add_weighted_adapter`] method to Diffusers to enable more efficient merging methods. Check out this [issue](https://github.com/huggingface/diffusers/issues/6892) if you're interested in learning more about the motivation and design behind this integration.
+
+The [`~peft.LoraModel.add_weighted_adapter`] method provides access to more efficient merging methods such as [TIES and DARE](https://huggingface.co/docs/peft/developer_guides/model_merging). To use these merging methods, make sure you have the latest stable versions of Diffusers and PEFT installed.
+
+```bash
+pip install -U diffusers peft
+```
+
+There are three steps to merge LoRAs with the [`~peft.LoraModel.add_weighted_adapter`] method:
+
+1. Create a [`~peft.PeftModel`] from the underlying model and LoRA checkpoint.
+2. Load a base UNet model and the LoRA adapters.
+3. Merge the adapters using the [`~peft.LoraModel.add_weighted_adapter`] method and the merging method of your choice.
+
+Let's dive deeper into what these steps entail.
+
+1. Load a UNet that corresponds to the UNet in the LoRA checkpoint. In this case, both LoRAs use the SDXL UNet as their base model.
+
+```python
+from diffusers import UNet2DConditionModel
+import torch
+
+unet = UNet2DConditionModel.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16",
+ subfolder="unet",
+).to("cuda")
+```
+
+Load the SDXL pipeline and the LoRA checkpoints, starting with the [ostris/ikea-instructions-lora-sdxl](https://huggingface.co/ostris/ikea-instructions-lora-sdxl) LoRA.
+
+```python
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ variant="fp16",
+ torch_dtype=torch.float16,
+ unet=unet
+).to("cuda")
+pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
+```
+
+Now you'll create a [`~peft.PeftModel`] from the loaded LoRA checkpoint by combining the SDXL UNet and the LoRA UNet from the pipeline.
+
+```python
+from peft import get_peft_model, LoraConfig
+import copy
+
+sdxl_unet = copy.deepcopy(unet)
+ikea_peft_model = get_peft_model(
+ sdxl_unet,
+ pipeline.unet.peft_config["ikea"],
+ adapter_name="ikea"
+)
+
+original_state_dict = {f"base_model.model.{k}": v for k, v in pipeline.unet.state_dict().items()}
+ikea_peft_model.load_state_dict(original_state_dict, strict=True)
+```
+
+> [!TIP]
+> You can optionally push the ikea_peft_model to the Hub by calling `ikea_peft_model.push_to_hub("ikea_peft_model", token=TOKEN)`.
+
+Repeat this process to create a [`~peft.PeftModel`] from the [lordjia/by-feng-zikai](https://huggingface.co/lordjia/by-feng-zikai) LoRA.
+
+```python
+pipeline.delete_adapters("ikea")
+sdxl_unet.delete_adapters("ikea")
+
+pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
+pipeline.set_adapters(adapter_names="feng")
+
+feng_peft_model = get_peft_model(
+ sdxl_unet,
+ pipeline.unet.peft_config["feng"],
+ adapter_name="feng"
+)
+
+original_state_dict = {f"base_model.model.{k}": v for k, v in pipe.unet.state_dict().items()}
+feng_peft_model.load_state_dict(original_state_dict, strict=True)
+```
+
+2. Load a base UNet model and then load the adapters onto it.
+
+```python
+from peft import PeftModel
+
+base_unet = UNet2DConditionModel.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16",
+ subfolder="unet",
+).to("cuda")
+
+model = PeftModel.from_pretrained(base_unet, "stevhliu/ikea_peft_model", use_safetensors=True, subfolder="ikea", adapter_name="ikea")
+model.load_adapter("stevhliu/feng_peft_model", use_safetensors=True, subfolder="feng", adapter_name="feng")
+```
+
+3. Merge the adapters using the [`~peft.LoraModel.add_weighted_adapter`] method and the merging method of your choice (learn more about other merging methods in this [blog post](https://huggingface.co/blog/peft_merging)). For this example, let's use the `"dare_linear"` method to merge the LoRAs.
+
+> [!WARNING]
+> Keep in mind the LoRAs need to have the same rank to be merged!
+
+```python
+model.add_weighted_adapter(
+ adapters=["ikea", "feng"],
+ weights=[1.0, 1.0],
+ combination_type="dare_linear",
+ adapter_name="ikea-feng"
+)
+model.set_adapters("ikea-feng")
+```
+
+Now you can generate an image with the merged LoRA.
+
+```python
+model = model.to(dtype=torch.float16, device="cuda")
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", unet=model, variant="fp16", torch_dtype=torch.float16,
+).to("cuda")
+
+image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0]
+image
+```
+
+
+
+
+
+## fuse_lora
+
+Both the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] and [`~peft.LoraModel.add_weighted_adapter`] methods require loading the base model and the LoRA adapters separately, which incurs some overhead. The [`~loaders.LoraLoaderMixin.fuse_lora`] method allows you to fuse the LoRA weights directly with the original weights of the underlying model. This way, you're only loading the model once, which can increase inference speed and lower memory usage.
+
+You can use PEFT to easily fuse/unfuse multiple adapters directly into the model weights (both UNet and text encoder) using the [`~loaders.LoraLoaderMixin.fuse_lora`] method, which can lead to a speed-up in inference and lower VRAM usage.
+
+For example, if you have a base model and adapters loaded and set as active with the following adapter weights:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
+pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
+
+pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
+```
+
+Fuse these LoRAs into the UNet with the [`~loaders.LoraLoaderMixin.fuse_lora`] method. The `lora_scale` parameter controls how much the output is scaled by the LoRA weights. It is important to make the `lora_scale` adjustments in the [`~loaders.LoraLoaderMixin.fuse_lora`] method because it won't work if you try to pass `scale` to the `cross_attention_kwargs` in the pipeline.
+
+```py
+pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0)
+```
+
+Then you should use [`~loaders.LoraLoaderMixin.unload_lora_weights`] to unload the LoRA weights since they've already been fused with the underlying base model. Finally, call [`~DiffusionPipeline.save_pretrained`] to save the fused pipeline locally or you could call [`~DiffusionPipeline.push_to_hub`] to push the fused pipeline to the Hub.
+
+```py
+pipeline.unload_lora_weights()
+# save locally
+pipeline.save_pretrained("path/to/fused-pipeline")
+# save to the Hub
+pipeline.push_to_hub("fused-ikea-feng")
+```
+
+Now you can quickly load the fused pipeline and use it for inference without needing to separately load the LoRA adapters.
+
+```py
+pipeline = DiffusionPipeline.from_pretrained(
+ "username/fused-ikea-feng", torch_dtype=torch.float16,
+).to("cuda")
+
+image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0]
+image
+```
+
+You can call [`~loaders.LoraLoaderMixin.unfuse_lora`] to restore the original model's weights (for example, if you want to use a different `lora_scale` value). However, this only works if you fused a single LoRA adapter into the original model. If you fused multiple LoRAs, you'll need to reload the model.
+
+```py
+pipeline.unfuse_lora()
+```
+
+### torch.compile
+
+[torch.compile](../optimization/torch2.0#torchcompile) can speed up your pipeline even more, but the LoRA weights must be fused first and then unloaded. Typically, the UNet is compiled because it is such a computationally intensive component of the pipeline.
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+# load base model and LoRAs
+pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
+pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
+pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
+
+# activate both LoRAs and set adapter weights
+pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
+
+# fuse LoRAs and unload weights
+pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0)
+pipeline.unload_lora_weights()
+
+# torch.compile
+pipeline.unet.to(memory_format=torch.channels_last)
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+
+image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0]
+```
+
+Learn more about torch.compile in the [Accelerate inference of text-to-image diffusion models](../tutorials/fast_diffusion#torchcompile) guide.
+
+## Next steps
+
+For more conceptual details about how each merging method works, take a look at the [๐ค PEFT welcomes new merging methods](https://huggingface.co/blog/peft_merging#concatenation-cat) blog post!
diff --git a/docs/source/en/using-diffusers/other-formats.md b/docs/source/en/using-diffusers/other-formats.md
new file mode 100644
index 0000000..13efe78
--- /dev/null
+++ b/docs/source/en/using-diffusers/other-formats.md
@@ -0,0 +1,176 @@
+
+
+# Load different Stable Diffusion formats
+
+[[open-in-colab]]
+
+Stable Diffusion models are available in different formats depending on the framework they're trained and saved with, and where you download them from. Converting these formats for use in ๐ค Diffusers allows you to use all the features supported by the library, such as [using different schedulers](schedulers) for inference, [building your custom pipeline](write_own_pipeline), and a variety of techniques and methods for [optimizing inference speed](../optimization/opt_overview).
+
+
+
+We highly recommend using the `.safetensors` format because it is more secure than traditional pickled files which are vulnerable and can be exploited to execute any code on your machine (learn more in the [Load safetensors](using_safetensors) guide).
+
+
+
+This guide will show you how to convert other Stable Diffusion formats to be compatible with ๐ค Diffusers.
+
+## PyTorch .ckpt
+
+The checkpoint - or `.ckpt` - format is commonly used to store and save models. The `.ckpt` file contains the entire model and is typically several GBs in size. While you can load and use a `.ckpt` file directly with the [`~StableDiffusionPipeline.from_single_file`] method, it is generally better to convert the `.ckpt` file to ๐ค Diffusers so both formats are available.
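+
+As a point of reference, loading a single-file checkpoint directly looks like this; the path below is a hypothetical local `.ckpt` file:
+
+```py
+from diffusers import StableDiffusionPipeline
+
+# load a single-file checkpoint without converting it first
+pipeline = StableDiffusionPipeline.from_single_file("path/to/model.ckpt")
+```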
+
+There are two options for converting a `.ckpt` file: use a Space to convert the checkpoint or convert the `.ckpt` file with a script.
+
+### Convert with a Space
+
+The easiest and most convenient way to convert a `.ckpt` file is to use the [SD to Diffusers](https://huggingface.co/spaces/diffusers/sd-to-diffusers) Space. You can follow the instructions on the Space to convert the `.ckpt` file.
+
+This approach works well for basic models, but it may struggle with more customized models. You'll know the Space failed if it returns an empty pull request or error. In this case, you can try converting the `.ckpt` file with a script.
+
+### Convert with a script
+
+๐ค Diffusers provides a [conversion script](https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py) for converting `.ckpt` files. This approach is more reliable than the Space above.
+
+Before you start, make sure you have a local clone of ๐ค Diffusers to run the script and log in to your Hugging Face account so you can open pull requests and push your converted model to the Hub.
+
+```bash
+huggingface-cli login
+```
+
+To use the script:
+
+1. Git clone the repository containing the `.ckpt` file you want to convert. For this example, let's convert this [TemporalNet](https://huggingface.co/CiaraRowles/TemporalNet) `.ckpt` file:
+
+```bash
+git lfs install
+git clone https://huggingface.co/CiaraRowles/TemporalNet
+```
+
+2. Open a pull request on the repository where you're converting the checkpoint from:
+
+```bash
+cd TemporalNet && git fetch origin refs/pr/13:pr/13
+git checkout pr/13
+```
+
+3. There are several input arguments to configure in the conversion script, but the most important ones are:
+
+ - `checkpoint_path`: the path to the `.ckpt` file to convert.
+ - `original_config_file`: a YAML file defining the configuration of the original architecture. If you can't find this file, try searching for the YAML file in the GitHub repository where you found the `.ckpt` file.
+ - `dump_path`: the path to the converted model.
+
+ For example, you can take the `cldm_v15.yaml` file from the [ControlNet](https://github.com/lllyasviel/ControlNet/tree/main/models) repository because the TemporalNet model is a Stable Diffusion v1.5 and ControlNet model.
+
+4. Now you can run the script to convert the `.ckpt` file:
+
+```bash
+python ../diffusers/scripts/convert_original_stable_diffusion_to_diffusers.py --checkpoint_path temporalnetv3.ckpt --original_config_file cldm_v15.yaml --dump_path ./ --controlnet
+```
+
+5. Once the conversion is done, upload your converted model and test out the resulting [pull request](https://huggingface.co/CiaraRowles/TemporalNet/discussions/13)!
+
+```bash
+git push origin pr/13:refs/pr/13
+```
+
+## Keras .pb or .h5
+
+
+
+๐งช This is an experimental feature. Only Stable Diffusion v1 checkpoints are supported by the Convert KerasCV Space at the moment.
+
+
+
+[KerasCV](https://keras.io/keras_cv/) supports training for [Stable Diffusion](https://github.com/keras-team/keras-cv/blob/master/keras_cv/models/stable_diffusion) v1 and v2. However, it offers limited support for experimenting with Stable Diffusion models for inference and deployment whereas ๐ค Diffusers has a more complete set of features for this purpose, such as different [noise schedulers](https://huggingface.co/docs/diffusers/using-diffusers/schedulers), [flash attention](https://huggingface.co/docs/diffusers/optimization/xformers), and [other
+optimization techniques](https://huggingface.co/docs/diffusers/optimization/fp16).
+
+The [Convert KerasCV](https://huggingface.co/spaces/sayakpaul/convert-kerascv-sd-diffusers) Space converts `.pb` or `.h5` files to PyTorch, and then wraps them in a [`StableDiffusionPipeline`] so it is ready for inference. The converted checkpoint is stored in a repository on the Hugging Face Hub.
+
+For this example, let's convert the [`sayakpaul/textual-inversion-kerasio`](https://huggingface.co/sayakpaul/textual-inversion-kerasio/tree/main) checkpoint which was trained with Textual Inversion. It uses a special placeholder token learned during training to personalize images with cats.
+
+The Convert KerasCV Space allows you to input the following:
+
+* Your Hugging Face token.
+* Paths to download the UNet and text encoder weights from. Depending on how the model was trained, you don't necessarily need to provide the paths to both the UNet and text encoder. For example, Textual Inversion only requires the embeddings from the text encoder and a text-to-image model only requires the UNet weights.
+* Placeholder token is only applicable for textual inversion models.
+* The `output_repo_prefix` is the name of the repository where the converted model is stored.
+
+Click the **Submit** button to automatically convert the KerasCV checkpoint! Once the checkpoint is successfully converted, you'll see a link to the new repository containing the converted checkpoint. Follow the link to the new repository, and you'll see the Convert KerasCV Space generated a model card with an inference widget to try out the converted model.
+
+If you prefer to run inference with code, click on the **Use in Diffusers** button in the upper right corner of the model card to copy and paste the code snippet:
+
+```py
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "sayakpaul/textual-inversion-cat-kerascv_sd_diffusers_pipeline", use_safetensors=True
+)
+```
+
+Then, you can generate an image like:
+
+```py
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "sayakpaul/textual-inversion-cat-kerascv_sd_diffusers_pipeline", use_safetensors=True
+)
+pipeline.to("cuda")
+
+placeholder_token = ""
+prompt = f"two {placeholder_token} getting married, photorealistic, high quality"
+image = pipeline(prompt, num_inference_steps=50).images[0]
+```
+
+## A1111 LoRA files
+
+[Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) (A1111) is a popular web UI for Stable Diffusion that supports model sharing platforms like [Civitai](https://civitai.com/). Models trained with the Low-Rank Adaptation (LoRA) technique are especially popular because they're fast to train and have a much smaller file size than a fully finetuned model. ๐ค Diffusers supports loading A1111 LoRA checkpoints with [`~loaders.LoraLoaderMixin.load_lora_weights`]:
+
+```py
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+ "Lykon/dreamshaper-xl-1-0", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+```
+
+Download a LoRA checkpoint from Civitai; this example uses the [Blueprintify SD XL 1.0](https://civitai.com/models/150986/blueprintify-sd-xl-10) checkpoint, but feel free to try out any LoRA checkpoint!
+
+```py
+# uncomment to download the safetensor weights
+#!wget https://civitai.com/api/download/models/168776 -O blueprintify.safetensors
+```
+
+Load the LoRA checkpoint into the pipeline with the [`~loaders.LoraLoaderMixin.load_lora_weights`] method:
+
+```py
+pipeline.load_lora_weights(".", weight_name="blueprintify.safetensors")
+```
+
+Now you can use the pipeline to generate images:
+
+```py
+prompt = "bl3uprint, a highly detailed blueprint of the empire state building, explaining how to build all parts, many txt, blueprint grid backdrop"
+negative_prompt = "lowres, cropped, worst quality, low quality, normal quality, artifacts, signature, watermark, username, blurry, more than one bridge, bad architecture"
+
+image = pipeline(
+ prompt=prompt,
+ negative_prompt=negative_prompt,
+ generator=torch.manual_seed(0),
+).images[0]
+image
+```
+
+
+
+
diff --git a/docs/source/en/using-diffusers/other-modalities.md b/docs/source/en/using-diffusers/other-modalities.md
new file mode 100644
index 0000000..2589e8b
--- /dev/null
+++ b/docs/source/en/using-diffusers/other-modalities.md
@@ -0,0 +1,21 @@
+
+
+# Using Diffusers with other modalities
+
+Diffusers is in the process of expanding to modalities other than images.
+
+Example type | Colab | Pipeline |
+:-------------------------:|:-------------------------:|:-------------------------:|
+[Molecule conformation](https://www.nature.com/subjects/molecular-conformation#:~:text=Definition,to%20changes%20in%20their%20environment.) generation | [Open In Colab](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/geodiff_molecule_conformation.ipynb) | ❌
+
+More coming soon!
\ No newline at end of file
diff --git a/docs/source/en/using-diffusers/pipeline_overview.md b/docs/source/en/using-diffusers/pipeline_overview.md
new file mode 100644
index 0000000..82e02c1
--- /dev/null
+++ b/docs/source/en/using-diffusers/pipeline_overview.md
@@ -0,0 +1,17 @@
+
+
+# Overview
+
+A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionXLPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components.
+
+This section demonstrates how to use specific pipelines such as Stable Diffusion XL, ControlNet, and DiffEdit. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to create reproducible pipelines, and how to use and contribute community pipelines.
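+
+As a minimal sketch, loading a Stable Diffusion XL checkpoint with the base class returns the specialized pipeline type automatically:
+
+```py
+from diffusers import DiffusionPipeline
+
+# the pipeline class is detected from the checkpoint's model_index.json
+pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
+print(pipeline.__class__.__name__)  # StableDiffusionXLPipeline
+```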
diff --git a/docs/source/en/using-diffusers/push_to_hub.md b/docs/source/en/using-diffusers/push_to_hub.md
new file mode 100644
index 0000000..815c7d6
--- /dev/null
+++ b/docs/source/en/using-diffusers/push_to_hub.md
@@ -0,0 +1,177 @@
+
+
+# Push files to the Hub
+
+[[open-in-colab]]
+
+๐ค Diffusers provides a [`~diffusers.utils.PushToHubMixin`] for uploading your model, scheduler, or pipeline to the Hub. It is an easy way to store your files on the Hub, and also allows you to share your work with others. Under the hood, the [`~diffusers.utils.PushToHubMixin`]:
+
+1. creates a repository on the Hub
+2. saves your model, scheduler, or pipeline files so they can be reloaded later
+3. uploads the folder containing these files to the Hub
+
+This guide will show you how to use the [`~diffusers.utils.PushToHubMixin`] to upload your files to the Hub.
+
+You'll need to log in to your Hub account with your access [token](https://huggingface.co/settings/tokens) first:
+
+```py
+from huggingface_hub import notebook_login
+
+notebook_login()
+```
+
+## Models
+
+To push a model to the Hub, call [`~diffusers.utils.PushToHubMixin.push_to_hub`] and specify the repository id of the model to be stored on the Hub:
+
+```py
+from diffusers import ControlNetModel
+
+controlnet = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+)
+controlnet.push_to_hub("my-controlnet-model")
+```
+
+For models, you can also specify the [*variant*](loading#checkpoint-variants) of the weights to push to the Hub. For example, to push `fp16` weights:
+
+```py
+controlnet.push_to_hub("my-controlnet-model", variant="fp16")
+```
+
+The [`~diffusers.utils.PushToHubMixin.push_to_hub`] function saves the model's `config.json` file and the weights are automatically saved in the `safetensors` format.
+
+Now you can reload the model from your repository on the Hub:
+
+```py
+model = ControlNetModel.from_pretrained("your-namespace/my-controlnet-model")
+```
+
+## Scheduler
+
+To push a scheduler to the Hub, call [`~diffusers.utils.PushToHubMixin.push_to_hub`] and specify the repository id of the scheduler to be stored on the Hub:
+
+```py
+from diffusers import DDIMScheduler
+
+scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+)
+scheduler.push_to_hub("my-controlnet-scheduler")
+```
+
+The [`~diffusers.utils.PushToHubMixin.push_to_hub`] function saves the scheduler's `scheduler_config.json` file to the specified repository.
+
+Now you can reload the scheduler from your repository on the Hub:
+
+```py
+scheduler = DDIMScheduler.from_pretrained("your-namepsace/my-controlnet-scheduler")
+```
+
+## Pipeline
+
+You can also push an entire pipeline with all of its components to the Hub. For example, initialize the components of a [`StableDiffusionPipeline`] with the parameters you want:
+
+```py
+from diffusers import (
+ UNet2DConditionModel,
+ AutoencoderKL,
+ DDIMScheduler,
+ StableDiffusionPipeline,
+)
+from transformers import CLIPTextModel, CLIPTextConfig, CLIPTokenizer
+
+unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+)
+
+scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+)
+
+vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+)
+
+text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+)
+text_encoder = CLIPTextModel(text_encoder_config)
+tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+```
+
+Pass all of the components to the [`StableDiffusionPipeline`] and call [`~diffusers.utils.PushToHubMixin.push_to_hub`] to push the pipeline to the Hub:
+
+```py
+components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+}
+
+pipeline = StableDiffusionPipeline(**components)
+pipeline.push_to_hub("my-pipeline")
+```
+
+The [`~diffusers.utils.PushToHubMixin.push_to_hub`] function saves each component to a subfolder in the repository. Now you can reload the pipeline from your repository on the Hub:
+
+```py
+pipeline = StableDiffusionPipeline.from_pretrained("your-namespace/my-pipeline")
+```
+
+## Privacy
+
+Set `private=True` in the [`~diffusers.utils.PushToHubMixin.push_to_hub`] function to keep your model, scheduler, or pipeline files private:
+
+```py
+controlnet.push_to_hub("my-controlnet-model-private", private=True)
+```
+
+Private repositories are only visible to you, and other users won't be able to clone the repository, nor will it appear in search results. Even if a user has the URL to your private repository, they'll receive a `404 - Sorry, we can't find the page you are looking for` error. You must be [logged in](https://huggingface.co/docs/huggingface_hub/quick-start#login) to load a model from a private repository.
\ No newline at end of file
diff --git a/docs/source/en/using-diffusers/reproducibility.md b/docs/source/en/using-diffusers/reproducibility.md
new file mode 100644
index 0000000..7c61578
--- /dev/null
+++ b/docs/source/en/using-diffusers/reproducibility.md
@@ -0,0 +1,191 @@
+
+
+# Create reproducible pipelines
+
+[[open-in-colab]]
+
+Reproducibility is important for testing and replicating results, and it can even be used to [improve image quality](reusing_seeds). However, the randomness in diffusion models is a desired property because it allows the pipeline to generate different images every time it is run. While you can't expect to get the exact same results across platforms, you can expect results to be reproducible across releases and platforms within a certain tolerance range. Even then, tolerance varies depending on the diffusion pipeline and checkpoint.
+
+This is why it's important to understand how to control sources of randomness in diffusion models or use deterministic algorithms.
+
+
+
+๐ก We strongly recommend reading PyTorch's [statement about reproducibility](https://pytorch.org/docs/stable/notes/randomness.html):
+
+> Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.
+
+
+
+## Control randomness
+
+During inference, pipelines rely heavily on random sampling operations which include creating the
+Gaussian noise tensors to denoise and adding noise to the scheduling step.
+
+Take a look at the tensor values in the [`DDIMPipeline`] after two inference steps:
+
+```python
+from diffusers import DDIMPipeline
+import numpy as np
+
+model_id = "google/ddpm-cifar10-32"
+
+# load model and scheduler
+ddim = DDIMPipeline.from_pretrained(model_id, use_safetensors=True)
+
+# run pipeline for just two steps and return numpy tensor
+image = ddim(num_inference_steps=2, output_type="np").images
+print(np.abs(image).sum())
+```
+
+Running the code above prints one value, but if you run it again you get a different value. What is going on here?
+
+Every time the pipeline is run, [`torch.randn`](https://pytorch.org/docs/stable/generated/torch.randn.html) uses a different random seed to create Gaussian noise which is denoised stepwise. This leads to a different result each time it is run, which is great for diffusion pipelines since it generates a different random image each time.
+
+But if you need to reliably generate the same image, that'll depend on whether you're running the pipeline on a CPU or GPU.
+
+### CPU
+
+To generate reproducible results on a CPU, you'll need to use a PyTorch [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed:
+
+```python
+import torch
+from diffusers import DDIMPipeline
+import numpy as np
+
+model_id = "google/ddpm-cifar10-32"
+
+# load model and scheduler
+ddim = DDIMPipeline.from_pretrained(model_id, use_safetensors=True)
+
+# create a generator for reproducibility
+generator = torch.Generator(device="cpu").manual_seed(0)
+
+# run pipeline for just two steps and return numpy tensor
+image = ddim(num_inference_steps=2, output_type="np", generator=generator).images
+print(np.abs(image).sum())
+```
+
+Now when you run the code above, it always prints a value of `1491.1711` no matter what because the `Generator` object with the seed is passed to all the random functions of the pipeline.
+
+If you run this code example on your specific hardware and PyTorch version, you should get a similar, if not the same, result.
+
+
+
+๐ก It might be a bit unintuitive at first to pass `Generator` objects to the pipeline instead of
+just integer values representing the seed, but this is the recommended design when dealing with
+probabilistic models in PyTorch, as `Generator`s are *random states* that can be
+passed to multiple pipelines in a sequence.
+
+
+
+### GPU
+
+Writing a reproducible pipeline on a GPU is a bit trickier, and full reproducibility across different hardware is not guaranteed because matrix multiplication - which diffusion pipelines require a lot of - is less deterministic on a GPU than a CPU. For example, if you run the same code example above on a GPU:
+
+```python
+import torch
+from diffusers import DDIMPipeline
+import numpy as np
+
+model_id = "google/ddpm-cifar10-32"
+
+# load model and scheduler
+ddim = DDIMPipeline.from_pretrained(model_id, use_safetensors=True)
+ddim.to("cuda")
+
+# create a generator for reproducibility
+generator = torch.Generator(device="cuda").manual_seed(0)
+
+# run pipeline for just two steps and return numpy tensor
+image = ddim(num_inference_steps=2, output_type="np", generator=generator).images
+print(np.abs(image).sum())
+```
+
+The result is not the same even though you're using an identical seed because the GPU uses a different random number generator than the CPU.
+
+To circumvent this problem, ๐งจ Diffusers has a [`~diffusers.utils.torch_utils.randn_tensor`] function for creating random noise on the CPU, and then moving the tensor to a GPU if necessary. The `randn_tensor` function is used everywhere inside the pipeline, allowing the user to **always** pass a CPU `Generator` even if the pipeline is run on a GPU.
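+
+If you want to see what this looks like in isolation, here is a minimal sketch of calling [`~diffusers.utils.torch_utils.randn_tensor`] directly with a CPU `Generator` while placing the resulting tensor on the GPU (the shape below is arbitrary):
+
+```python
+import torch
+from diffusers.utils.torch_utils import randn_tensor
+
+# the noise is sampled with the CPU generator, then moved to the requested device
+generator = torch.manual_seed(0)
+noise = randn_tensor((1, 3, 32, 32), generator=generator, device=torch.device("cuda"))
+```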
+
+You'll see the results are much closer now!
+
+```python
+import torch
+from diffusers import DDIMPipeline
+import numpy as np
+
+model_id = "google/ddpm-cifar10-32"
+
+# load model and scheduler
+ddim = DDIMPipeline.from_pretrained(model_id, use_safetensors=True)
+ddim.to("cuda")
+
+# create a generator for reproducibility; notice you don't place it on the GPU!
+generator = torch.manual_seed(0)
+
+# run pipeline for just two steps and return numpy tensor
+image = ddim(num_inference_steps=2, output_type="np", generator=generator).images
+print(np.abs(image).sum())
+```
+
+
+
+๐ก If reproducibility is important, we recommend always passing a CPU generator.
+The performance loss is often negligible, and you'll generate much more similar
+values than if the pipeline had been run on a GPU.
+
+
+
+Finally, for more complex pipelines such as [`UnCLIPPipeline`], these are often extremely
+susceptible to precision error propagation. Don't expect similar results across
+different GPU hardware or PyTorch versions. In this case, you'll need to run
+exactly the same hardware and PyTorch version for full reproducibility.
+
+## Deterministic algorithms
+
+You can also configure PyTorch to use deterministic algorithms to create a reproducible pipeline. However, you should be aware that deterministic algorithms may be slower than nondeterministic ones and you may observe a decrease in performance. But if reproducibility is important to you, then this is the way to go!
+
+Nondeterministic behavior occurs when operations are launched in more than one CUDA stream. To avoid this, set the environment variable [`CUBLAS_WORKSPACE_CONFIG`](https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility) to `:16:8` to only use one buffer size during runtime.
+
+PyTorch typically benchmarks multiple algorithms to select the fastest one, but if you want reproducibility, you should disable this feature because the benchmark may select different algorithms each time. Lastly, pass `True` to [`torch.use_deterministic_algorithms`](https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html) to enable deterministic algorithms.
+
+```py
+import os
+import torch
+
+os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
+
+torch.backends.cudnn.benchmark = False
+torch.use_deterministic_algorithms(True)
+```
+
+Now when you run the same pipeline twice, you'll get identical results.
+
+```py
+import torch
+from diffusers import DDIMScheduler, StableDiffusionPipeline
+
+model_id = "runwayml/stable-diffusion-v1-5"
+pipe = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True).to("cuda")
+pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+g = torch.Generator(device="cuda")
+
+prompt = "A bear is playing a guitar on Times Square"
+
+g.manual_seed(0)
+result1 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="latent").images
+
+g.manual_seed(0)
+result2 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="latent").images
+
+print("L_inf dist =", abs(result1 - result2).max())
+"L_inf dist = tensor(0., device='cuda:0')"
+```
diff --git a/docs/source/en/using-diffusers/reusing_seeds.md b/docs/source/en/using-diffusers/reusing_seeds.md
new file mode 100644
index 0000000..bad567b
--- /dev/null
+++ b/docs/source/en/using-diffusers/reusing_seeds.md
@@ -0,0 +1,81 @@
+
+
+# Improve image quality with deterministic generation
+
+[[open-in-colab]]
+
+A common way to improve the quality of generated images is with *deterministic batch generation*: generate a batch of images and select one image to improve with a more detailed prompt in a second round of inference. The key is to pass a list of [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html#generator)s to the pipeline for batched image generation, and tie each `Generator` to a seed so you can reuse it for an image.
+
+Let's use [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) for example, and generate several versions of the following prompt:
+
+```py
+prompt = "Labrador in the style of Vermeer"
+```
+
+Instantiate a pipeline with [`DiffusionPipeline.from_pretrained`] and place it on a GPU (if available):
+
+```python
+import torch
+from diffusers import DiffusionPipeline
+from diffusers.utils import make_image_grid
+
+pipe = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+)
+pipe = pipe.to("cuda")
+```
+
+Now, define four different `Generator`s and assign each `Generator` a seed (`0` to `3`) so you can reuse a `Generator` later for a specific image:
+
+```python
+generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(4)]
+```
+
+
+
+To create a batched seed, you should use a list comprehension that iterates over the length specified in `range()`. This creates a unique `Generator` object for each image in the batch. If you only multiply the `Generator` by the batch size, this only creates one `Generator` object that is used sequentially for each image in the batch.
+
+For example, if you want to use the same seed to create 4 identical images:
+
+```py
+❌ [torch.Generator().manual_seed(seed)] * 4
+
+✅ [torch.Generator().manual_seed(seed) for _ in range(4)]
+```
+
+
+
+Generate the images and have a look:
+
+```python
+images = pipe(prompt, generator=generator, num_images_per_prompt=4).images
+make_image_grid(images, rows=2, cols=2)
+```
+
+
+
+In this example, you'll improve upon the first image - but in reality, you can use any image you want (even the image with double sets of eyes!). The first image used the `Generator` with seed `0`, so you'll reuse that `Generator` for the second round of inference. To improve the quality of the image, add some additional text to the prompt:
+
+```python
+prompt = [prompt + t for t in [", highly realistic", ", artsy", ", trending", ", colorful"]]
+generator = [torch.Generator(device="cuda").manual_seed(0) for i in range(4)]
+```
+
+Create four generators with seed `0`, and generate another batch of images, all of which should look like the first image from the previous round!
+
+```python
+images = pipe(prompt, generator=generator).images
+make_image_grid(images, rows=2, cols=2)
+```
+
+
diff --git a/docs/source/en/using-diffusers/schedulers.md b/docs/source/en/using-diffusers/schedulers.md
new file mode 100644
index 0000000..ac261de
--- /dev/null
+++ b/docs/source/en/using-diffusers/schedulers.md
@@ -0,0 +1,331 @@
+
+
+# Schedulers
+
+[[open-in-colab]]
+
+Diffusion pipelines are inherently a collection of diffusion models and schedulers that are largely independent of each other. This means you can swap out parts of the pipeline to better customize
+it for your use case. The best example of this is the [schedulers](../api/schedulers/overview).
+
+Whereas diffusion models usually simply define the forward pass from noise to a less noisy sample,
+schedulers define the whole denoising process, *i.e.*:
+- How many denoising steps?
+- Stochastic or deterministic?
+- What algorithm to use to find the denoised sample?
+
+They can be quite complex and often define a trade-off between **denoising speed** and **denoising quality**.
+It is extremely difficult to measure quantitatively which scheduler works best for a given diffusion pipeline, so it is often recommended to simply try them out and see which one works best for your use case.
+
+The following paragraphs show how to do so with the ๐งจ Diffusers library.
+
+## Load pipeline
+
+Let's start by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model in the [`DiffusionPipeline`]:
+
+```python
+from huggingface_hub import login
+from diffusers import DiffusionPipeline
+import torch
+
+login()
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+)
+```
+
+Next, we move it to GPU:
+
+```python
+pipeline.to("cuda")
+```
+
+## Access the scheduler
+
+The scheduler is always one of the components of the pipeline and is usually called `"scheduler"`.
+So it can be accessed via the `"scheduler"` property.
+
+```python
+pipeline.scheduler
+```
+
+**Output**:
+```
+PNDMScheduler {
+ "_class_name": "PNDMScheduler",
+ "_diffusers_version": "0.21.4",
+ "beta_end": 0.012,
+ "beta_schedule": "scaled_linear",
+ "beta_start": 0.00085,
+ "clip_sample": false,
+ "num_train_timesteps": 1000,
+ "set_alpha_to_one": false,
+ "skip_prk_steps": true,
+ "steps_offset": 1,
+ "timestep_spacing": "leading",
+ "trained_betas": null
+}
+```
+
+We can see that the scheduler is of type [`PNDMScheduler`].
+Cool, now let's compare its performance to other schedulers.
+First we define a prompt on which we will test all the different schedulers:
+
+```python
+prompt = "A photograph of an astronaut riding a horse on Mars, high resolution, high definition."
+```
+
+Next, we create a generator from a fixed seed to ensure the results are reproducible, and run the pipeline:
+
+```python
+generator = torch.Generator(device="cuda").manual_seed(8)
+image = pipeline(prompt, generator=generator).images[0]
+image
+```
+
+
+
+
+
+
+
+
+## Changing the scheduler
+
+Now we show how easy it is to change the scheduler of a pipeline. Every scheduler has a property [`~SchedulerMixin.compatibles`]
+which defines all compatible schedulers. You can take a look at all available, compatible schedulers for the Stable Diffusion pipeline as follows.
+
+```python
+pipeline.scheduler.compatibles
+```
+
+**Output**:
+```
+[diffusers.utils.dummy_torch_and_torchsde_objects.DPMSolverSDEScheduler,
+ diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler,
+ diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler,
+ diffusers.schedulers.scheduling_ddim.DDIMScheduler,
+ diffusers.schedulers.scheduling_ddpm.DDPMScheduler,
+ diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler,
+ diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler,
+ diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler,
+ diffusers.schedulers.scheduling_pndm.PNDMScheduler,
+ diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler,
+ diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler,
+ diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler,
+ diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler,
+ diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler]
+```
+
+Cool, lots of schedulers to look at. Feel free to have a look at their respective class definitions:
+
+- [`EulerDiscreteScheduler`],
+- [`LMSDiscreteScheduler`],
+- [`DDIMScheduler`],
+- [`DDPMScheduler`],
+- [`HeunDiscreteScheduler`],
+- [`DPMSolverMultistepScheduler`],
+- [`DEISMultistepScheduler`],
+- [`PNDMScheduler`],
+- [`EulerAncestralDiscreteScheduler`],
+- [`UniPCMultistepScheduler`],
+- [`KDPM2DiscreteScheduler`],
+- [`DPMSolverSinglestepScheduler`],
+- [`KDPM2AncestralDiscreteScheduler`].
+
+We will now run the same prompt with some of the other schedulers. To change the scheduler of the pipeline, you can make use of the
+convenient [`~ConfigMixin.config`] property in combination with the [`~ConfigMixin.from_config`] function.
+
+```python
+pipeline.scheduler.config
+```
+
+returns a dictionary of the configuration of the scheduler:
+
+**Output**:
+```py
+FrozenDict([('num_train_timesteps', 1000),
+ ('beta_start', 0.00085),
+ ('beta_end', 0.012),
+ ('beta_schedule', 'scaled_linear'),
+ ('trained_betas', None),
+ ('skip_prk_steps', True),
+ ('set_alpha_to_one', False),
+ ('prediction_type', 'epsilon'),
+ ('timestep_spacing', 'leading'),
+ ('steps_offset', 1),
+ ('_use_default_values', ['timestep_spacing', 'prediction_type']),
+ ('_class_name', 'PNDMScheduler'),
+ ('_diffusers_version', '0.21.4'),
+ ('clip_sample', False)])
+```
+
+This configuration can then be used to instantiate a scheduler
+of a different class that is compatible with the pipeline. Here,
+we change the scheduler to the [`DDIMScheduler`].
+
+```python
+from diffusers import DDIMScheduler
+
+pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
+```
+
+Cool, now we can run the pipeline again to compare the generation quality.
+
+```python
+generator = torch.Generator(device="cuda").manual_seed(8)
+image = pipeline(prompt, generator=generator).images[0]
+image
+```
+
+
+
+
+
+
+
+If you are a JAX/Flax user, please check [this section](#changing-the-scheduler-in-flax) instead.
+
+## Compare schedulers
+
+So far we have tried running the Stable Diffusion pipeline with two schedulers: [`PNDMScheduler`] and [`DDIMScheduler`].
+A number of better schedulers have been released that can be run with far fewer steps; let's compare them here:
+
+[`LMSDiscreteScheduler`] usually leads to better results:
+
+```python
+from diffusers import LMSDiscreteScheduler
+
+pipeline.scheduler = LMSDiscreteScheduler.from_config(pipeline.scheduler.config)
+
+generator = torch.Generator(device="cuda").manual_seed(8)
+image = pipeline(prompt, generator=generator).images[0]
+image
+```
+
+
+
+
+
+
+
+
+[`EulerDiscreteScheduler`] and [`EulerAncestralDiscreteScheduler`] can generate high-quality results with as few as 30 steps.
+
+```python
+from diffusers import EulerDiscreteScheduler
+
+pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
+
+generator = torch.Generator(device="cuda").manual_seed(8)
+image = pipeline(prompt, generator=generator, num_inference_steps=30).images[0]
+image
+```
+
+
+
+
+[`DPMSolverMultistepScheduler`] gives a reasonable speed/quality trade-off and can be run with as few as 20 steps.
+
+```python
+from diffusers import DPMSolverMultistepScheduler
+
+pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
+
+generator = torch.Generator(device="cuda").manual_seed(8)
+image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0]
+image
+```
+
+
+
+
+
+
+
+As you can see, most images look very similar and are arguably of comparable quality. Which scheduler to choose often depends on the specific use case, so a good approach is to run several
+schedulers and compare the results.
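+
+For example, a minimal sketch of such a comparison loop (reusing the `pipeline`, `prompt`, and seed from above; the scheduler classes listed are just an illustrative subset):
+
+```python
+import torch
+from diffusers import DDIMScheduler, EulerDiscreteScheduler, DPMSolverMultistepScheduler
+
+# swap in each candidate scheduler and generate with the same seed so the outputs are comparable
+for scheduler_cls in [DDIMScheduler, EulerDiscreteScheduler, DPMSolverMultistepScheduler]:
+    pipeline.scheduler = scheduler_cls.from_config(pipeline.scheduler.config)
+    generator = torch.Generator(device="cuda").manual_seed(8)
+    image = pipeline(prompt, generator=generator).images[0]
+    image.save(f"astronaut_{scheduler_cls.__name__}.png")
+```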
+
+## Changing the scheduler in Flax
+
+If you are a JAX/Flax user, you can also change the default pipeline scheduler. This is a complete example of how to run inference using the Flax Stable Diffusion pipeline and the super-fast [DPM-Solver++ scheduler](../api/schedulers/multistep_dpm_solver):
+
+```python
+import jax
+import numpy as np
+from flax.jax_utils import replicate
+from flax.training.common_utils import shard
+
+from diffusers import FlaxStableDiffusionPipeline, FlaxDPMSolverMultistepScheduler
+
+model_id = "runwayml/stable-diffusion-v1-5"
+scheduler, scheduler_state = FlaxDPMSolverMultistepScheduler.from_pretrained(
+ model_id,
+ subfolder="scheduler"
+)
+pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
+ model_id,
+ scheduler=scheduler,
+ revision="bf16",
+ dtype=jax.numpy.bfloat16,
+)
+params["scheduler"] = scheduler_state
+
+# Generate 1 image per parallel device (8 on TPUv2-8 or TPUv3-8)
+prompt = "a photo of an astronaut riding a horse on mars"
+num_samples = jax.device_count()
+prompt_ids = pipeline.prepare_inputs([prompt] * num_samples)
+
+prng_seed = jax.random.PRNGKey(0)
+num_inference_steps = 25
+
+# shard inputs and rng
+params = replicate(params)
+prng_seed = jax.random.split(prng_seed, jax.device_count())
+prompt_ids = shard(prompt_ids)
+
+images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
+images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
+```
+
+
+
+The following Flax schedulers are _not yet compatible_ with the Flax Stable Diffusion Pipeline:
+
+- `FlaxLMSDiscreteScheduler`
+- `FlaxDDPMScheduler`
+
+
diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md
new file mode 100644
index 0000000..582e49e
--- /dev/null
+++ b/docs/source/en/using-diffusers/sdxl.md
@@ -0,0 +1,452 @@
+
+
+# Stable Diffusion XL
+
+[[open-in-colab]]
+
+[Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways:
+
+1. the UNet is 3x larger and SDXL combines a second text encoder (OpenCLIP ViT-bigG/14) with the original text encoder to significantly increase the number of parameters
+2. introduces size and crop-conditioning to prevent training data from being discarded and to gain more control over how a generated image should be cropped
+3. introduces a two-stage model process; the *base* model (can also be run as a standalone model) generates an image as an input to the *refiner* model which adds additional high-quality details
+
+This guide will show you how to use SDXL for text-to-image, image-to-image, and inpainting.
+
+Before you begin, make sure you have the following libraries installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install -q diffusers transformers accelerate invisible-watermark>=0.2.0
+```
+
+
+
+We recommend installing the [invisible-watermark](https://pypi.org/project/invisible-watermark/) library to help identify images that are generated. If the invisible-watermark library is installed, it is used by default. To disable the watermarker:
+
+```py
+pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False)
+```
+
+
+
+## Load model checkpoints
+
+Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method:
+
+```py
+from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
+import torch
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
+).to("cuda")
+```
+
+You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally:
+
+```py
+from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
+import torch
+
+pipeline = StableDiffusionXLPipeline.from_single_file(
+ "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors",
+ torch_dtype=torch.float16
+).to("cuda")
+
+refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
+ "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16
+).to("cuda")
+```
+
+## Text-to-image
+
+For text-to-image, pass a text prompt. By default, SDXL generates a 1024x1024 image for the best results. You can try setting the `height` and `width` parameters to 768x768 or 512x512, but anything below 512x512 is not likely to work.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipeline_text2image(prompt=prompt).images[0]
+image
+```
+
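+For example, a quick sketch of requesting a different resolution with the same pipeline (as noted above, expect lower quality away from 1024x1024):
+
+```py
+# height and width override the default 1024x1024 resolution
+image = pipeline_text2image(prompt=prompt, height=768, width=768).images[0]
+image
+```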
+
+
+
+
+## Image-to-image
+
+For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image and a text prompt to condition the image with:
+
+```py
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import load_image, make_image_grid
+
+# use from_pipe to avoid consuming additional memory when loading a checkpoint
+pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")
+
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
+init_image = load_image(url)
+prompt = "a dog catching a frisbee in the jungle"
+image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+
+
+
+
+## Inpainting
+
+For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with.
+
+```py
+from diffusers import AutoPipelineForInpainting
+from diffusers.utils import load_image, make_image_grid
+
+# use from_pipe to avoid consuming additional memory when loading a checkpoint
+pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda")
+
+img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
+mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png"
+
+init_image = load_image(img_url)
+mask_image = load_image(mask_url)
+
+prompt = "A deep sea diver floating"
+image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+
+
+
+
+## Refine image quality
+
+SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner:
+
+1. use the base and refiner models together to produce a refined image
+2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL was originally trained)
+
+### Base + refiner model
+
+When you use the base and refiner model together to generate an image, this is known as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/). The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise.
+
+As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+base = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+refiner = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0",
+ text_encoder_2=base.text_encoder_2,
+ vae=base.vae,
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16",
+).to("cuda")
+```
+
+To use this approach, you need to define the number of timesteps for each model to run through their respective stages. For the base model, this is controlled by the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) parameter and for the refiner model, it is controlled by the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) parameter.
+
+
+
+The `denoising_end` and `denoising_start` parameters should be a float between 0 and 1. These parameters are represented as a proportion of discrete timesteps as defined by the scheduler. If you're also using the `strength` parameter, it'll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff.
+
+
+
+Let's set `denoising_end=0.8` so the base model performs the first 80% of the denoising over the **high-noise** timesteps, and set `denoising_start=0.8` so the refiner model performs the last 20% over the **low-noise** timesteps. The base model output should be in **latent** space instead of a PIL image.
+
+```py
+prompt = "A majestic lion jumping from a big stone at night"
+
+image = base(
+ prompt=prompt,
+ num_inference_steps=40,
+ denoising_end=0.8,
+ output_type="latent",
+).images
+image = refiner(
+ prompt=prompt,
+ num_inference_steps=40,
+ denoising_start=0.8,
+ image=image,
+).images[0]
+image
+```
+
+
+
+
+ default base model
+
+
+
+ ensemble of expert denoisers
+
+
+
+The refiner model can also be used for inpainting in the [`StableDiffusionXLInpaintPipeline`]:
+
+```py
+from diffusers import StableDiffusionXLInpaintPipeline
+from diffusers.utils import load_image, make_image_grid
+import torch
+
+base = StableDiffusionXLInpaintPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0",
+ text_encoder_2=base.text_encoder_2,
+ vae=base.vae,
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16",
+).to("cuda")
+
+img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+
+init_image = load_image(img_url)
+mask_image = load_image(mask_url)
+
+prompt = "A majestic tiger sitting on a bench"
+num_inference_steps = 75
+high_noise_frac = 0.7
+
+image = base(
+ prompt=prompt,
+ image=init_image,
+ mask_image=mask_image,
+ num_inference_steps=num_inference_steps,
+ denoising_end=high_noise_frac,
+ output_type="latent",
+).images
+image = refiner(
+ prompt=prompt,
+ image=image,
+ mask_image=mask_image,
+ num_inference_steps=num_inference_steps,
+ denoising_start=high_noise_frac,
+).images[0]
+make_image_grid([init_image, mask_image, image.resize((512, 512))], rows=1, cols=3)
+```
+
+This ensemble of expert denoisers method works well for all available schedulers!
+
+### Base to refiner model
+
+SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting.
+
+Load the base and refiner models:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+base = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+refiner = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0",
+ text_encoder_2=base.text_encoder_2,
+ vae=base.vae,
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16",
+).to("cuda")
+```
+
+Generate an image from the base model, and set the model output to **latent** space:
+
+```py
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+image = base(prompt=prompt, output_type="latent").images[0]
+```
+
+Pass the generated image to the refiner model:
+
+```py
+image = refiner(prompt=prompt, image=image[None, :]).images[0]
+```
+
+
+
+
+ base model
+
+
+
+ base model + refiner model
+
+
+
+For inpainting, load the base and the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner.
+
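+A rough sketch of what that might look like, reusing the `base` and `refiner` [`StableDiffusionXLInpaintPipeline`] instances and the image/mask from the ensemble inpainting example above (the step counts and the `strength` value are illustrative assumptions, not tuned settings):
+
+```py
+prompt = "A majestic tiger sitting on a bench"
+
+# fully denoise with the base model first
+image = base(
+    prompt=prompt,
+    image=init_image,
+    mask_image=mask_image,
+    num_inference_steps=75,
+).images[0]
+
+# then refine with fewer steps; strength controls how much the refiner changes the image
+image = refiner(
+    prompt=prompt,
+    image=image,
+    mask_image=mask_image,
+    num_inference_steps=30,
+    strength=0.3,
+).images[0]
+```
+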
+## Micro-conditioning
+
+SDXL training involves several additional conditioning techniques, which are referred to as *micro-conditioning*. These include original image size, target image size, and cropping parameters. The micro-conditionings can be used at inference time to create high-quality, centered images.
+
+
+
+You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance. They are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`].
+
+
+
+### Size conditioning
+
+There are two types of size conditioning:
+
+- [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset.
+
+- [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) conditioning comes from finetuning SDXL to support different image aspect ratios. During inference, if you use the default value of `(1024, 1024)`, you'll get an image that resembles the composition of square images in the dataset. We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options!
+
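+For example, a short sketch of passing these micro-conditioning values explicitly (the values shown here are simply the defaults):
+
+```py
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+# condition the generation on the original and target image sizes
+image = pipe(prompt=prompt, original_size=(1024, 1024), target_size=(1024, 1024)).images[0]
+```
+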
+🤗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions:
+
+```py
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipe(
+ prompt=prompt,
+ negative_original_size=(512, 512),
+ negative_target_size=(1024, 1024),
+).images[0]
+```
+
+
+
+ Images negatively conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).
+
+
+### Crop conditioning
+
+Images generated by previous Stable Diffusion models may sometimes appear to be cropped. This is because images are actually cropped during training so that all the images in a batch have the same size. By conditioning on crop coordinates, SDXL *learns* that no cropping - coordinates `(0, 0)` - usually correlates with centered subjects and complete faces (this is the default value in ๐ค Diffusers). You can experiment with different coordinates if you want to generate off-centered compositions!
+
+```py
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipeline(prompt=prompt, crops_coords_top_left=(256, 0)).images[0]
+image
+```
+
+
+
+
+
+You can also specify negative cropping coordinates to steer generation away from certain cropping parameters:
+
+```py
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipe(
+ prompt=prompt,
+ negative_original_size=(512, 512),
+ negative_crops_coords_top_left=(0, 0),
+ negative_target_size=(1024, 1024),
+).images[0]
+image
+```
+
+## Use a different prompt for each text-encoder
+
+SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using negative prompts):
+
+```py
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+# prompt is passed to OAI CLIP-ViT/L-14
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+# prompt_2 is passed to OpenCLIP-ViT/bigG-14
+prompt_2 = "Van Gogh painting"
+image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0]
+image
+```
+
+
+
+
+
+The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl) section.
+
+## Optimizations
+
+SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference.
+
+1. Offload the model to the CPU with [`~StableDiffusionXLPipeline.enable_model_cpu_offload`] to avoid out-of-memory errors:
+
+```diff
+- base.to("cuda")
+- refiner.to("cuda")
++ base.enable_model_cpu_offload()
++ refiner.enable_model_cpu_offload()
+```
+
+2. Use `torch.compile` for ~20% speed-up (you need `torch>=2.0`):
+
+```diff
++ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True)
++ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+3. Enable [xFormers](../optimization/xformers) to run SDXL if `torch<2.0`:
+
+```diff
++ base.enable_xformers_memory_efficient_attention()
++ refiner.enable_xformers_memory_efficient_attention()
+```
+
+## Other resources
+
+If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with ๐ค Diffusers.
diff --git a/docs/source/en/using-diffusers/sdxl_turbo.md b/docs/source/en/using-diffusers/sdxl_turbo.md
new file mode 100644
index 0000000..9ec0e94
--- /dev/null
+++ b/docs/source/en/using-diffusers/sdxl_turbo.md
@@ -0,0 +1,118 @@
+
+
+# Stable Diffusion XL Turbo
+
+[[open-in-colab]]
+
+SDXL Turbo is an adversarial time-distilled [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) model capable
+of running inference in as little as 1 step.
+
+This guide will show you how to use SDXL-Turbo for text-to-image and image-to-image.
+
+Before you begin, make sure you have the following libraries installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install -q diffusers transformers accelerate
+```
+
+## Load model checkpoints
+
+Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
+pipeline = pipeline.to("cuda")
+```
+
+You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally. For this loading method, you need to set `timestep_spacing="trailing"` (feel free to experiment with the other scheduler config values to get better results):
+
+```py
+from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler
+import torch
+
+pipeline = StableDiffusionXLPipeline.from_single_file(
+ "https://huggingface.co/stabilityai/sdxl-turbo/blob/main/sd_xl_turbo_1.0_fp16.safetensors",
+ torch_dtype=torch.float16, variant="fp16")
+pipeline = pipeline.to("cuda")
+pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config, timestep_spacing="trailing")
+```
+
+## Text-to-image
+
+For text-to-image, pass a text prompt. By default, SDXL Turbo generates a 512x512 image, and that resolution gives the best results. You can try setting the `height` and `width` parameters to 768x768 or 1024x1024, but you should expect quality degradations when doing so.
+
+Make sure to set `guidance_scale` to 0.0 to disable it, as the model was trained without guidance. A single inference step is enough to generate high-quality images.
+Increasing the number of steps to 2, 3 or 4 should improve image quality.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline_text2image = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
+pipeline_text2image = pipeline_text2image.to("cuda")
+
+prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe."
+
+image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
+image
+```
+
+
+
+
+
+## Image-to-image
+
+For image-to-image generation, make sure that `num_inference_steps * strength` is larger than or equal to 1.
+The image-to-image pipeline will run for `int(num_inference_steps * strength)` steps, e.g. `int(2 * 0.5) = 1` step in
+our example below.
+
+```py
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import load_image, make_image_grid
+
+# use from_pipe to avoid consuming additional memory when loading a checkpoint
+pipeline_image2image = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")
+
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
+init_image = init_image.resize((512, 512))
+
+prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
+
+image = pipeline_image2image(prompt, image=init_image, strength=0.5, guidance_scale=0.0, num_inference_steps=2).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+
+
+
+
+## Speed-up SDXL Turbo even more
+
+- Compile the UNet if you are using PyTorch version 2.0 or higher. The first inference run will be very slow, but subsequent ones will be much faster.
+
+```py
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+- When using the default VAE, keep it in `float32` to avoid costly `dtype` conversions before and after each generation. You only need to do this once before your first generation:
+
+```py
+pipeline.upcast_vae()
+```
+
+As an alternative, you can also use a [16-bit VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) created by community member [`@madebyollin`](https://huggingface.co/madebyollin) that does not need to be upcast to `float32`.
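+
+A short sketch of swapping in that VAE (assuming the checkpoint layout on the Hub is unchanged):
+
+```py
+from diffusers import AutoPipelineForText2Image, AutoencoderKL
+import torch
+
+# load the fp16-friendly VAE and pass it to the pipeline
+vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/sdxl-turbo", vae=vae, torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+```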
diff --git a/docs/source/en/using-diffusers/shap-e.md b/docs/source/en/using-diffusers/shap-e.md
new file mode 100644
index 0000000..588dde9
--- /dev/null
+++ b/docs/source/en/using-diffusers/shap-e.md
@@ -0,0 +1,192 @@
+
+
+# Shap-E
+
+[[open-in-colab]]
+
+Shap-E is a conditional model for generating 3D assets which could be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets that is post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps:
+
+1. an encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset
+2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications
+
+This guide will show you how to use Shap-E to start generating your own 3D assets!
+
+Before you begin, make sure you have the following libraries installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install -q diffusers transformers accelerate trimesh
+```
+
+## Text-to-3D
+
+To generate a gif of a 3D object, pass a text prompt to the [`ShapEPipeline`]. The pipeline generates a list of image frames which are used to create the 3D object.
+
+```py
+import torch
+from diffusers import ShapEPipeline
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16")
+pipe = pipe.to(device)
+
+guidance_scale = 15.0
+prompt = ["A firecracker", "A birthday cupcake"]
+
+images = pipe(
+ prompt,
+ guidance_scale=guidance_scale,
+ num_inference_steps=64,
+ frame_size=256,
+).images
+```
+
+Now use the [`~utils.export_to_gif`] function to turn the list of image frames into a gif of the 3D object.
+
+```py
+from diffusers.utils import export_to_gif
+
+export_to_gif(images[0], "firecracker_3d.gif")
+export_to_gif(images[1], "cake_3d.gif")
+```
+
+
+
+
+ prompt = "A firecracker"
+
+
+
+ prompt = "A birthday cupcake"
+
+
+
+## Image-to-3D
+
+To generate a 3D object from another image, use the [`ShapEImg2ImgPipeline`]. You can use an existing image or generate an entirely new one. Let's use the [Kandinsky 2.1](../api/pipelines/kandinsky) model to generate a new image.
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+prior_pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+
+prompt = "A cheeseburger, white background"
+
+image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple()
+image = pipeline(
+ prompt,
+ image_embeds=image_embeds,
+ negative_image_embeds=negative_image_embeds,
+).images[0]
+
+image.save("burger.png")
+```
+
+Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D representation of it.
+
+```py
+from PIL import Image
+from diffusers import ShapEImg2ImgPipeline
+from diffusers.utils import export_to_gif
+
+pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda")
+
+guidance_scale = 3.0
+image = Image.open("burger.png").resize((256, 256))
+
+images = pipe(
+ image,
+ guidance_scale=guidance_scale,
+ num_inference_steps=64,
+ frame_size=256,
+).images
+
+gif_path = export_to_gif(images[0], "burger_3d.gif")
+```
+
+
+
+
+ cheeseburger
+
+
+
+ 3D cheeseburger
+
+
+
+## Generate mesh
+
+Shap-E is a flexible model that can also generate textured mesh outputs to be rendered for downstream applications. In this example, you'll convert the output into a `glb` file because the 🤗 Datasets library supports mesh visualization of `glb` files which can be rendered by the [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer#dataset-preview).
+
+You can generate mesh outputs for both the [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`] by specifying the `output_type` parameter as `"mesh"`:
+
+```py
+import torch
+from diffusers import ShapEPipeline
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16")
+pipe = pipe.to(device)
+
+guidance_scale = 15.0
+prompt = "A birthday cupcake"
+
+images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images
+```
+
+Use the [`~utils.export_to_ply`] function to save the mesh output as a `ply` file:
+
+
+
+You can optionally save the mesh output as an `obj` file with the [`~utils.export_to_obj`] function. The ability to save the mesh output in a variety of formats makes it more flexible for downstream usage!
+
+
+
+```py
+from diffusers.utils import export_to_ply
+
+ply_path = export_to_ply(images[0], "3d_cake.ply")
+print(f"Saved to folder: {ply_path}")
+```
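+
+If you'd rather have an `obj` file, as noted in the tip above, a minimal sketch looks almost identical:
+
+```py
+from diffusers.utils import export_to_obj
+
+obj_path = export_to_obj(images[0], "3d_cake.obj")
+```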
+
+Then you can convert the `ply` file to a `glb` file with the trimesh library:
+
+```py
+import trimesh
+
+mesh = trimesh.load("3d_cake.ply")
+mesh_export = mesh.export("3d_cake.glb", file_type="glb")
+```
+
+By default, the mesh output is viewed from the bottom, but you can change the default viewpoint by applying a rotation transform:
+
+```py
+import trimesh
+import numpy as np
+
+mesh = trimesh.load("3d_cake.ply")
+rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
+mesh = mesh.apply_transform(rot)
+mesh_export = mesh.export("3d_cake.glb", file_type="glb")
+```
+
+Upload the mesh file to your dataset repository to visualize it with the Dataset viewer!
+
+
+
+
diff --git a/docs/source/en/using-diffusers/stable_diffusion_jax_how_to.md b/docs/source/en/using-diffusers/stable_diffusion_jax_how_to.md
new file mode 100644
index 0000000..5b2c688
--- /dev/null
+++ b/docs/source/en/using-diffusers/stable_diffusion_jax_how_to.md
@@ -0,0 +1,225 @@
+
+
+# JAX/Flax
+
+[[open-in-colab]]
+
+🤗 Diffusers supports Flax for super fast inference on Google TPUs, such as those available in Colab, Kaggle or Google Cloud Platform. This guide shows you how to run inference with Stable Diffusion using JAX/Flax.
+
+Before you begin, make sure you have the necessary libraries installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install -q jax==0.3.25 jaxlib==0.3.25 flax transformers ftfy
+#!pip install -q diffusers
+```
+
+You should also make sure you're using a TPU backend. While JAX does not run exclusively on TPUs, you'll get the best performance on a TPU because each server has 8 TPU accelerators working in parallel.
+
+If you are running this guide in Colab, select *Runtime* in the menu above, select the option *Change runtime type*, and then select *TPU* under the *Hardware accelerator* setting. Import JAX and quickly check whether you're using a TPU:
+
+```python
+import jax
+import jax.tools.colab_tpu
+jax.tools.colab_tpu.setup_tpu()
+
+num_devices = jax.device_count()
+device_type = jax.devices()[0].device_kind
+
+print(f"Found {num_devices} JAX devices of type {device_type}.")
+assert "TPU" in device_type, "Available device is not a TPU, please select TPU from Runtime > Change runtime type > Hardware accelerator"
+# Found 8 JAX devices of type Cloud TPU.
+```
+
+Great, now you can import the rest of the dependencies you'll need:
+
+```python
+import jax.numpy as jnp
+from jax import pmap
+from flax.jax_utils import replicate
+from flax.training.common_utils import shard
+
+from diffusers import FlaxStableDiffusionPipeline
+```
+
+## Load a model
+
+Flax is a functional framework, so models are stateless and parameters are stored outside of them. Loading a pretrained Flax pipeline returns *both* the pipeline and the model weights (or parameters). In this guide, you'll use `bfloat16`, a more efficient half-float type that is supported by TPUs (you can also use `float32` for full precision if you want).
+
+```python
+dtype = jnp.bfloat16
+pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ revision="bf16",
+ dtype=dtype,
+)
+```
+
+## Inference
+
+TPUs usually have 8 devices working in parallel, so let's use the same prompt for each device. This means you can perform inference on 8 devices at once, with each device generating one image. As a result, you'll get 8 images in the same amount of time it takes for one chip to generate a single image!
+
+
+
+Learn more details in the [How does parallelization work?](#how-does-parallelization-work) section.
+
+
+
+After replicating the prompt, get the tokenized text ids by calling the `prepare_inputs` function on the pipeline. The length of the tokenized text is set to 77 tokens as required by the configuration of the underlying CLIP text model.
+
+```python
+prompt = "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of field, close up, split lighting, cinematic"
+prompt = [prompt] * jax.device_count()
+prompt_ids = pipeline.prepare_inputs(prompt)
+prompt_ids.shape
+# (8, 77)
+```
+
+Model parameters and inputs have to be replicated across the 8 parallel devices. The parameters dictionary is replicated with [`flax.jax_utils.replicate`](https://flax.readthedocs.io/en/latest/api_reference/flax.jax_utils.html#flax.jax_utils.replicate) which traverses the dictionary and changes the shape of the weights so they are repeated 8 times. Arrays are replicated using `shard`.
+
+```python
+# parameters
+p_params = replicate(params)
+
+# arrays
+prompt_ids = shard(prompt_ids)
+prompt_ids.shape
+# (8, 1, 77)
+```
+
+This shape means each one of the 8 devices receives as an input a `jnp` array with shape `(1, 77)`, where `1` is the batch size per device. On TPUs with sufficient memory, you could have a batch size larger than `1` if you want to generate multiple images (per chip) at once.
+
+Next, create a random number generator to pass to the generation function. This is standard procedure in Flax, which is very serious and opinionated about random numbers. All functions that deal with random numbers are expected to receive a generator to ensure reproducibility, even when you're training across multiple distributed devices.
+
+The helper function below uses a seed to initialize a random number generator. As long as you use the same seed, you'll get the exact same results. Feel free to use different seeds when exploring results later in the guide.
+
+```python
+def create_key(seed=0):
+ return jax.random.PRNGKey(seed)
+```
+
+The key returned by the helper function, `rng`, is split 8 times so each device receives a different generator and generates a different image.
+
+```python
+rng = create_key(0)
+rng = jax.random.split(rng, jax.device_count())
+```
+
+To take advantage of JAX's optimized speed on a TPU, pass `jit=True` to the pipeline to compile the JAX code into an efficient representation and to ensure the model runs in parallel across the 8 devices.
+
+
+
+You need to ensure all your inputs have the same shape in subsequent calls, otherwise JAX will need to recompile the code which is slower.
+
+
+
+The first inference run takes more time because it needs to compile the code, but subsequent calls (even with different inputs) are much faster. For example, it took more than a minute to compile on a TPU v2-8, but then it takes about **7s** on a subsequent inference run!
+
+```py
+%%time
+images = pipeline(prompt_ids, p_params, rng, jit=True)[0]
+
+# CPU times: user 56.2 s, sys: 42.5 s, total: 1min 38s
+# Wall time: 1min 29s
+```
+
+The returned array has shape `(8, 1, 512, 512, 3)` which should be reshaped to remove the second dimension and get 8 images of `512 × 512 × 3`. Then you can use the [`~utils.numpy_to_pil`] function to convert the arrays into images.
+
+```python
+from diffusers.utils import make_image_grid
+
+images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
+images = pipeline.numpy_to_pil(images)
+make_image_grid(images, rows=2, cols=4)
+```
+
+
+
+## Using different prompts
+
+You don't necessarily have to use the same prompt on all devices. For example, to generate images from 8 different prompts:
+
+```python
+prompts = [
+ "Labrador in the style of Hokusai",
+ "Painting of a squirrel skating in New York",
+ "HAL-9000 in the style of Van Gogh",
+ "Times Square under water, with fish and a dolphin swimming around",
+ "Ancient Roman fresco showing a man working on his laptop",
+ "Close-up photograph of young black woman against urban background, high quality, bokeh",
+ "Armchair in the shape of an avocado",
+ "Clown astronaut in space, with Earth in the background",
+]
+
+prompt_ids = pipeline.prepare_inputs(prompts)
+prompt_ids = shard(prompt_ids)
+
+images = pipeline(prompt_ids, p_params, rng, jit=True).images
+images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
+images = pipeline.numpy_to_pil(images)
+
+make_image_grid(images, 2, 4)
+```
+
+
+
+## How does parallelization work?
+
+The Flax pipeline in 🤗 Diffusers automatically compiles the model and runs it in parallel on all available devices. Let's take a closer look at how that process works.
+
+JAX parallelization can be done in multiple ways. The easiest one revolves around using the [`jax.pmap`](https://jax.readthedocs.io/en/latest/_autosummary/jax.pmap.html) function to achieve single-program multiple-data (SPMD) parallelization. It means running several copies of the same code, each on different data inputs. More sophisticated approaches are possible, and you can go over to the JAX [documentation](https://jax.readthedocs.io/en/latest/index.html) to explore this topic in more detail if you are interested!
+
+`jax.pmap` does two things:
+
+1. Compiles (or "`jit`s") the code, similar to `jax.jit()`. Compilation does not happen when you call `pmap`; it happens only the first time the `pmap`ped function is called.
+2. Ensures the compiled code runs in parallel on all available devices.
+
+To demonstrate, call `pmap` on the pipeline's `_generate` method (this is a private method that generates images and may be renamed or removed in future releases of ๐ค Diffusers):
+
+```python
+p_generate = pmap(pipeline._generate)
+```
+
+After calling `pmap`, the prepared function `p_generate` will:
+
+1. Make a copy of the underlying function, `pipeline._generate`, on each device.
+2. Send each device a different portion of the input arguments (this is why it's necessary to call the *shard* function). In this case, `prompt_ids` has shape `(8, 1, 77)` so the array is split into 8 and each copy of `_generate` receives an input with shape `(1, 77)`.
+
+The most important thing to pay attention to here is the batch size (1 in this example), and the input dimensions that make sense for your code. You don't have to change anything else to make the code work in parallel.
+
+The first time you call the pipeline takes more time, but the calls afterward are much faster. The `block_until_ready` function is used to correctly measure inference time because JAX uses asynchronous dispatch and returns control to the Python loop as soon as it can. You don't need to use that in your code; blocking occurs automatically when you want to use the result of a computation that has not yet been materialized.
+
+```py
+%%time
+images = p_generate(prompt_ids, p_params, rng)
+images = images.block_until_ready()
+
+# CPU times: user 1min 15s, sys: 18.2 s, total: 1min 34s
+# Wall time: 1min 15s
+```
+
+Check your image dimensions to see if they're correct:
+
+```python
+images.shape
+# (8, 1, 512, 512, 3)
+```
+
+## Resources
+
+To learn more about how JAX works with Stable Diffusion, you may be interested in reading:
+
+* [Accelerating Stable Diffusion XL Inference with JAX on Cloud TPU v5e](https://hf.co/blog/sdxl_jax)
diff --git a/docs/source/en/using-diffusers/svd.md b/docs/source/en/using-diffusers/svd.md
new file mode 100644
index 0000000..c9c51f5
--- /dev/null
+++ b/docs/source/en/using-diffusers/svd.md
@@ -0,0 +1,121 @@
+
+
+# Stable Video Diffusion
+
+[[open-in-colab]]
+
+[Stable Video Diffusion (SVD)](https://huggingface.co/papers/2311.15127) is a powerful image-to-video generation model that can generate 2-4 second high resolution (576x1024) videos conditioned on an input image.
+
+This guide will show you how to use SVD to generate short videos from images.
+
+Before you begin, make sure you have the following libraries installed:
+
+```py
+!pip install -q -U diffusers transformers accelerate
+```
+
+There are two variants of this model, [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt). The SVD checkpoint is trained to generate 14 frames and the SVD-XT checkpoint is further finetuned to generate 25 frames.
+
+You'll use the SVD-XT checkpoint for this guide.
+
+```python
+import torch
+
+from diffusers import StableVideoDiffusionPipeline
+from diffusers.utils import load_image, export_to_video
+
+pipe = StableVideoDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
+)
+pipe.enable_model_cpu_offload()
+
+# Load the conditioning image
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
+image = image.resize((1024, 576))
+
+generator = torch.manual_seed(42)
+frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
+
+export_to_video(frames, "generated.mp4", fps=7)
+```
+
+
+
+
+ "source image of a rocket"
+
+
+
+ "generated video from source image"
+
+
+
+## torch.compile
+
+You can gain a 20-25% speedup at the expense of slightly increased memory by [compiling](../optimization/torch2.0#torchcompile) the UNet.
+
+```diff
+- pipe.enable_model_cpu_offload()
++ pipe.to("cuda")
++ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+## Reduce memory usage
+
+Video generation is very memory intensive because you're essentially generating `num_frames` all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, there are multiple options that trade off inference speed for lower memory usage:
+
+- enable model offloading: each component of the pipeline is offloaded to the CPU once it's not needed anymore.
+- enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward with a huge batch size.
+- reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all together. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least amount of memory (we recommend adjusting this value based on your GPU memory) but the video might have some flickering.
+
+```diff
+- pipe.enable_model_cpu_offload()
+- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
++ pipe.enable_model_cpu_offload()
++ pipe.unet.enable_forward_chunking()
++ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
+```
+
+Using all these tricks together should lower the memory requirement to less than 8GB of VRAM.
+
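+Putting these options together, a complete sketch might look like the following (the `decode_chunk_size` value is illustrative; tune it to your GPU):
+
+```python
+import torch
+
+from diffusers import StableVideoDiffusionPipeline
+from diffusers.utils import load_image, export_to_video
+
+pipe = StableVideoDiffusionPipeline.from_pretrained(
+    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
+)
+# offload submodules to the CPU when they are not in use
+pipe.enable_model_cpu_offload()
+# run the temporal feed-forward layers in a loop to lower peak memory
+pipe.unet.enable_forward_chunking()
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
+image = image.resize((1024, 576))
+
+generator = torch.manual_seed(42)
+# decode the VAE output in small chunks to save memory, at the cost of possible flickering
+frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
+export_to_video(frames, "generated.mp4", fps=7)
+```
+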
+## Micro-conditioning
+
+Stable Video Diffusion also accepts micro-conditioning, in addition to the conditioning image, which allows more control over the generated video:
+
+- `fps`: the frames per second of the generated video.
+- `motion_bucket_id`: the motion bucket id to use for the generated video. This can be used to control the motion of the generated video. Increasing the motion bucket id increases the motion of the generated video.
+- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the values the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.
+
+For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters:
+
+```python
+import torch
+
+from diffusers import StableVideoDiffusionPipeline
+from diffusers.utils import load_image, export_to_video
+
+pipe = StableVideoDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
+)
+pipe.enable_model_cpu_offload()
+
+# Load the conditioning image
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
+image = image.resize((1024, 576))
+
+generator = torch.manual_seed(42)
+frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
+export_to_video(frames, "generated.mp4", fps=7)
+```
+
+
diff --git a/docs/source/en/using-diffusers/text-img2vid.md b/docs/source/en/using-diffusers/text-img2vid.md
new file mode 100644
index 0000000..56cc85f
--- /dev/null
+++ b/docs/source/en/using-diffusers/text-img2vid.md
@@ -0,0 +1,497 @@
+
+
+# Text or image-to-video
+
+Driven by the success of text-to-image diffusion models, generative video models are able to generate short clips of video from a text prompt or an initial image. These models extend a pretrained diffusion model to generate videos by adding some type of temporal and/or spatial convolution layer to the architecture. A mixed dataset of images and videos is used to train the model, which learns to output a series of video frames based on the text or image conditioning.
+
+This guide will show you how to generate videos, how to configure video model parameters, and how to control video generation.
+
+## Popular models
+
+> [!TIP]
+> Discover other cool and trending video generation models on the Hub [here](https://huggingface.co/models?pipeline_tag=text-to-video&sort=trending)!
+
+[Stable Video Diffusion (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid), [I2VGen-XL](https://huggingface.co/ali-vilab/i2vgen-xl/), [AnimateDiff](https://huggingface.co/guoyww/animatediff), and [ModelScopeT2V](https://huggingface.co/ali-vilab/text-to-video-ms-1.7b) are popular models used for video diffusion. Each model is distinct. For example, AnimateDiff inserts a motion modeling module into a frozen text-to-image model to generate personalized animated images, whereas SVD is entirely pretrained from scratch with a three-stage training process to generate short high-quality videos.
+
+### Stable Video Diffusion
+
+[SVD](../api/pipelines/svd) is based on the Stable Diffusion 2.1 model and it is trained on images, then low-resolution videos, and finally a smaller dataset of high-resolution videos. This model generates a short 2-4 second video from an initial image. You can learn more details about the model, like micro-conditioning, in the [Stable Video Diffusion](../using-diffusers/svd) guide.
+
+Begin by loading the [`StableVideoDiffusionPipeline`] and passing an initial image to generate a video from.
+
+```py
+import torch
+from diffusers import StableVideoDiffusionPipeline
+from diffusers.utils import load_image, export_to_video
+
+pipeline = StableVideoDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
+image = image.resize((1024, 576))
+
+generator = torch.manual_seed(42)
+frames = pipeline(image, decode_chunk_size=8, generator=generator).frames[0]
+export_to_video(frames, "generated.mp4", fps=7)
+```
+
+
+
+
+ initial image
+
+
+
+ generated video
+
+
+
+### I2VGen-XL
+
+[I2VGen-XL](../api/pipelines/i2vgenxl) is a diffusion model that can generate higher resolution videos than SVD and it is also capable of accepting text prompts in addition to images. The model is trained with two hierarchical encoders (detail and global encoder) to better capture low and high-level details in images. These learned details are used to train a video diffusion model which refines the video resolution and details in the generated video.
+
+You can use I2VGen-XL by loading the [`I2VGenXLPipeline`], and passing a text and image prompt to generate a video.
+
+```py
+import torch
+from diffusers import I2VGenXLPipeline
+from diffusers.utils import export_to_gif, load_image
+
+pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
+pipeline.enable_model_cpu_offload()
+
+image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"
+image = load_image(image_url).convert("RGB")
+
+prompt = "Papers were floating in the air on a table in the library"
+negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
+generator = torch.manual_seed(8888)
+
+frames = pipeline(
+ prompt=prompt,
+ image=image,
+ num_inference_steps=50,
+ negative_prompt=negative_prompt,
+ guidance_scale=9.0,
+ generator=generator
+).frames[0]
+export_to_gif(frames, "i2v.gif")
+```
+
+
+
+
+ initial image
+
+
+
+ generated video
+
+
+
+### AnimateDiff
+
+[AnimateDiff](../api/pipelines/animatediff) is an adapter model that inserts a motion module into a pretrained diffusion model to animate an image. The adapter is trained on video clips to learn motion which is used to condition the generation process to create a video. It is faster and easier to only train the adapter and it can be loaded into most diffusion models, effectively turning them into "video models".
+
+Start by loading a [`MotionAdapter`].
+
+```py
+import torch
+from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
+from diffusers.utils import export_to_gif
+
+adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
+```
+
+Then load a finetuned Stable Diffusion model with the [`AnimateDiffPipeline`].
+
+```py
+pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16)
+scheduler = DDIMScheduler.from_pretrained(
+ "emilianJR/epiCRealism",
+ subfolder="scheduler",
+ clip_sample=False,
+ timestep_spacing="linspace",
+ beta_schedule="linear",
+ steps_offset=1,
+)
+pipeline.scheduler = scheduler
+pipeline.enable_vae_slicing()
+pipeline.enable_model_cpu_offload()
+```
+
+Create a prompt and generate the video.
+
+```py
+output = pipeline(
+ prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution",
+ negative_prompt="bad quality, worse quality, low resolution",
+ num_frames=16,
+ guidance_scale=7.5,
+ num_inference_steps=50,
+ generator=torch.Generator("cpu").manual_seed(49),
+)
+frames = output.frames[0]
+export_to_gif(frames, "animation.gif")
+```
+
+
+
+
+
+### ModelscopeT2V
+
+[ModelscopeT2V](../api/pipelines/text_to_video) adds spatial and temporal convolutions and attention to a UNet, and it is trained on image-text and video-text datasets to enhance what it learns during training. The model encodes a prompt into text embeddings that condition the UNet as it denoises a latent, which is then decoded by a VQGAN into a video.
+
+
+
+ModelScopeT2V generates watermarked videos due to the datasets it was trained on. To use a watermark-free model, try the [cerspense/zeroscope_v2_576w](https://huggingface.co/cerspense/zeroscope_v2_576w) model with the [`TextToVideoSDPipeline`] first, and then upscale its output with the [cerspense/zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL) checkpoint using the [`VideoToVideoSDPipeline`].
+
+
+
+Load a ModelScopeT2V checkpoint into the [`DiffusionPipeline`] along with a prompt to generate a video.
+
+```py
+import torch
+from diffusers import DiffusionPipeline
+from diffusers.utils import export_to_video
+
+pipeline = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
+pipeline.enable_model_cpu_offload()
+pipeline.enable_vae_slicing()
+
+prompt = "Confident teddy bear surfer rides the wave in the tropics"
+video_frames = pipeline(prompt).frames[0]
+export_to_video(video_frames, "modelscopet2v.mp4", fps=10)
+```
+
+
+
+
+
+## Configure model parameters
+
+There are a few important parameters you can configure in the pipeline that'll affect the video generation process and quality. Let's take a closer look at what these parameters do and how changing them affects the output.
+
+### Number of frames
+
+The `num_frames` parameter determines how many video frames are generated. A frame is an image that is played in a sequence with other frames to create motion or a video. Because the video is played back at a fixed number of frames per second, `num_frames` also determines the video duration (check a pipeline's API reference for the default value). To increase the video duration, you'll need to increase the `num_frames` parameter.
+
+```py
+import torch
+from diffusers import StableVideoDiffusionPipeline
+from diffusers.utils import load_image, export_to_video
+
+pipeline = StableVideoDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-video-diffusion-img2vid", torch_dtype=torch.float16, variant="fp16"
+)
+pipeline.enable_model_cpu_offload()
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
+image = image.resize((1024, 576))
+
+generator = torch.manual_seed(42)
+frames = pipeline(image, decode_chunk_size=8, generator=generator, num_frames=25).frames[0]
+export_to_video(frames, "generated.mp4", fps=7)
+```
+
+*`num_frames=14` and `num_frames=25`*
+
+### Guidance scale
+
+The `guidance_scale` parameter controls how closely the generated video is aligned with the text prompt or initial image. A higher `guidance_scale` value means your generated video is more aligned with the text prompt or initial image, while a lower `guidance_scale` value means your generated video is less aligned, which could give the model more "creativity" to interpret the conditioning input.
+
+
+
+SVD uses the `min_guidance_scale` and `max_guidance_scale` parameters for applying guidance to the first and last frames respectively.
+
+
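+For example, a minimal sketch with SVD might pass both values directly in the pipeline call (reusing the pipeline, image, and generator from the earlier SVD example; the values below are illustrative, not tuned recommendations):
+
+```py
+# assumes `pipeline`, `image`, and `generator` are set up as in the SVD example above
+frames = pipeline(
+    image,
+    decode_chunk_size=8,
+    generator=generator,
+    min_guidance_scale=1.0,  # guidance applied to the first frame
+    max_guidance_scale=3.0,  # guidance applied to the last frame
+).frames[0]
+```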
+
+```py
+import torch
+from diffusers import I2VGenXLPipeline
+from diffusers.utils import export_to_gif, load_image
+
+pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
+pipeline.enable_model_cpu_offload()
+
+image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"
+image = load_image(image_url).convert("RGB")
+
+prompt = "Papers were floating in the air on a table in the library"
+negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
+generator = torch.manual_seed(0)
+
+frames = pipeline(
+ prompt=prompt,
+ image=image,
+ num_inference_steps=50,
+ negative_prompt=negative_prompt,
+ guidance_scale=1.0,
+ generator=generator
+).frames[0]
+export_to_gif(frames, "i2v.gif")
+```
+
+*`guidance_scale=9.0` and `guidance_scale=1.0`*
+
+### Negative prompt
+
+A negative prompt deters the model from generating things you don't want it to. This parameter is commonly used to improve overall generation quality by removing poor or bad features such as "low resolution" or "bad details".
+
+```py
+import torch
+from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
+from diffusers.utils import export_to_gif
+
+adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
+
+pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16)
+scheduler = DDIMScheduler.from_pretrained(
+ "emilianJR/epiCRealism",
+ subfolder="scheduler",
+ clip_sample=False,
+ timestep_spacing="linspace",
+ beta_schedule="linear",
+ steps_offset=1,
+)
+pipeline.scheduler = scheduler
+pipeline.enable_vae_slicing()
+pipeline.enable_model_cpu_offload()
+
+output = pipeline(
+ prompt="360 camera shot of a sushi roll in a restaurant",
+ negative_prompt="Distorted, discontinuous, ugly, blurry, low resolution, motionless, static",
+ num_frames=16,
+ guidance_scale=7.5,
+ num_inference_steps=50,
+ generator=torch.Generator("cpu").manual_seed(0),
+)
+frames = output.frames[0]
+export_to_gif(frames, "animation.gif")
+```
+
+*no negative prompt and negative prompt applied*
+
+### Model-specific parameters
+
+There are some pipeline parameters that are unique to each model such as adjusting the motion in a video or adding noise to the initial image.
+
+
+
+
+Stable Video Diffusion provides additional micro-conditioning for the frame rate with the `fps` parameter and for motion with the `motion_bucket_id` parameter. Together, these parameters allow for adjusting the amount of motion in the generated video.
+
+There is also a `noise_aug_strength` parameter that increases the amount of noise added to the initial image. Varying this parameter affects how similar the generated video and initial image are. A higher `noise_aug_strength` also increases the amount of motion. To learn more, read the [Micro-conditioning](../using-diffusers/svd#micro-conditioning) guide.
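+
+As a rough sketch (reusing the pipeline, image, and generator from the SVD examples above; the values below are illustrative), these parameters are passed directly in the pipeline call:
+
+```py
+# assumes `pipeline`, `image`, and `generator` are set up as in the SVD examples above
+frames = pipeline(
+    image,
+    decode_chunk_size=8,
+    generator=generator,
+    fps=7,                   # frame rate micro-conditioning
+    motion_bucket_id=180,    # higher values add more motion
+    noise_aug_strength=0.1,  # more noise added to the initial image, also increases motion
+).frames[0]
+export_to_video(frames, "generated.mp4", fps=7)
+```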
+
+
+
+
+Text2Video-Zero computes the amount of motion to apply to each frame from randomly sampled latents. You can use the `motion_field_strength_x` and `motion_field_strength_y` parameters to control the amount of motion to apply to the x and y-axes of the video. The parameters `t0` and `t1` are the timesteps to apply motion to the latents.
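+
+A minimal sketch with the [`TextToVideoZeroPipeline`] might look like this (the motion values are illustrative, not recommendations):
+
+```py
+import torch
+import imageio
+from diffusers import TextToVideoZeroPipeline
+
+pipe = TextToVideoZeroPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
+
+prompt = "A panda is playing guitar on times square"
+# stronger motion along both axes, applied to the latents between timesteps t0 and t1
+result = pipe(prompt=prompt, motion_field_strength_x=20, motion_field_strength_y=20, t0=44, t1=47).images
+result = [(r * 255).astype("uint8") for r in result]
+imageio.mimsave("video.mp4", result, fps=4)
+```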
+
+
+
+
+## Control video generation
+
+Video generation can be controlled with a [`ControlNetModel`], similar to how text-to-image, image-to-image, and inpainting are controlled. The only difference is you need to use the [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor`] so each frame attends to the first frame.
+
+### Text2Video-Zero
+
+Text2Video-Zero video generation can be conditioned on pose and edge images for even greater control over a subject's motion in the generated video or to preserve the identity of a subject/object in the video. You can also use Text2Video-Zero with [InstructPix2Pix](../api/pipelines/pix2pix) for editing videos with text.
+
+
+
+
+Start by downloading a video and extracting the pose images from it.
+
+```py
+from huggingface_hub import hf_hub_download
+from PIL import Image
+import imageio
+
+filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4"
+repo_id = "PAIR/Text2Video-Zero"
+video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
+
+reader = imageio.get_reader(video_path, "ffmpeg")
+frame_count = 8
+pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
+```
+
+Load a [`ControlNetModel`] for pose estimation and a checkpoint into the [`StableDiffusionControlNetPipeline`]. Then you'll use the [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor`] for the UNet and ControlNet.
+
+```py
+import torch
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
+
+model_id = "runwayml/stable-diffusion-v1-5"
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
+pipeline = StableDiffusionControlNetPipeline.from_pretrained(
+ model_id, controlnet=controlnet, torch_dtype=torch.float16
+).to("cuda")
+
+pipeline.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+pipeline.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+```
+
+Fix the latents for all the frames, and then pass your prompt and extracted pose images to the model to generate a video.
+
+```py
+latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)
+
+prompt = "Darth Vader dancing in a desert"
+result = pipeline(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
+imageio.mimsave("video.mp4", result, fps=4)
+```
+
+
+
+
+Download a video and extract the edges from it.
+
+```py
+from huggingface_hub import hf_hub_download
+from PIL import Image
+import imageio
+
+filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4"
+repo_id = "PAIR/Text2Video-Zero"
+video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
+
+reader = imageio.get_reader(video_path, "ffmpeg")
+frame_count = 8
+pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
+```
+
+Load a [`ControlNetModel`] for canny edge and a checkpoint into the [`StableDiffusionControlNetPipeline`]. Then you'll use the [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor`] for the UNet and ControlNet.
+
+```py
+import torch
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
+
+model_id = "runwayml/stable-diffusion-v1-5"
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
+pipeline = StableDiffusionControlNetPipeline.from_pretrained(
+ model_id, controlnet=controlnet, torch_dtype=torch.float16
+).to("cuda")
+
+pipeline.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+pipeline.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+```
+
+Fix the latents for all the frames, and then pass your prompt and extracted edge images to the model to generate a video.
+
+```py
+latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)
+
+prompt = "Darth Vader dancing in a desert"
+result = pipeline(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
+imageio.mimsave("video.mp4", result, fps=4)
+```
+
+
+
+
+InstructPix2Pix allows you to use text to describe the changes you want to make to the video. Start by downloading and reading a video.
+
+```py
+from huggingface_hub import hf_hub_download
+from PIL import Image
+import imageio
+
+filename = "__assets__/pix2pix video/camel.mp4"
+repo_id = "PAIR/Text2Video-Zero"
+video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
+
+reader = imageio.get_reader(video_path, "ffmpeg")
+frame_count = 8
+video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
+```
+
+Load the [`StableDiffusionInstructPix2PixPipeline`] and set the [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor`] for the UNet.
+
+```py
+import torch
+from diffusers import StableDiffusionInstructPix2PixPipeline
+from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
+
+pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained("timbrooks/instruct-pix2pix", torch_dtype=torch.float16).to("cuda")
+pipeline.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=3))
+```
+
+Pass a prompt describing the change you want to apply to the video.
+
+```py
+prompt = "make it Van Gogh Starry Night style"
+result = pipeline(prompt=[prompt] * len(video), image=video).images
+imageio.mimsave("edited_video.mp4", result, fps=4)
+```
+
+
+
+
+## Optimize
+
+Video generation requires a lot of memory because you're generating many video frames at once. You can reduce your memory requirements at the expense of some inference speed. Try:
+
+1. offloading pipeline components to the CPU once they're no longer needed
+2. enabling feed-forward chunking so the feed-forward layer runs in a loop instead of all at once
+3. breaking up the number of frames the VAE has to decode into chunks instead of decoding them all at once
+
+```diff
+- pipeline.enable_model_cpu_offload()
+- frames = pipeline(image, decode_chunk_size=8, generator=generator).frames[0]
++ pipeline.enable_model_cpu_offload()
++ pipeline.unet.enable_forward_chunking()
++ frames = pipeline(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
+```
+
+If memory is not an issue and you want to optimize for speed, try wrapping the UNet with [`torch.compile`](../optimization/torch2.0#torchcompile).
+
+```diff
+- pipeline.enable_model_cpu_offload()
++ pipeline.to("cuda")
++ pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+```
diff --git a/docs/source/en/using-diffusers/textual_inversion_inference.md b/docs/source/en/using-diffusers/textual_inversion_inference.md
new file mode 100644
index 0000000..fd9e64b
--- /dev/null
+++ b/docs/source/en/using-diffusers/textual_inversion_inference.md
@@ -0,0 +1,118 @@
+
+
+# Textual inversion
+
+[[open-in-colab]]
+
+The [`StableDiffusionPipeline`] supports textual inversion, a technique that enables a model like Stable Diffusion to learn a new concept from just a few sample images. This gives you more control over the generated images and allows you to tailor the model towards specific concepts. You can get started quickly with a collection of community created concepts in the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer).
+
+This guide will show you how to run inference with textual inversion using a pre-learned concept from the Stable Diffusion Conceptualizer. If you're interested in teaching a model new concepts with textual inversion, take a look at the [Textual Inversion](../training/text_inversion) training guide.
+
+Import the necessary libraries:
+
+```py
+import torch
+from diffusers import StableDiffusionPipeline
+from diffusers.utils import make_image_grid
+```
+
+## Stable Diffusion 1 and 2
+
+Pick a Stable Diffusion checkpoint and a pre-learned concept from the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer):
+
+```py
+pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"
+repo_id_embeds = "sd-concepts-library/cat-toy"
+```
+
+Now you can load a pipeline, and pass the pre-learned concept to it:
+
+```py
+pipeline = StableDiffusionPipeline.from_pretrained(
+ pretrained_model_name_or_path, torch_dtype=torch.float16, use_safetensors=True
+).to("cuda")
+
+pipeline.load_textual_inversion(repo_id_embeds)
+```
+
+Create a prompt with the pre-learned concept by using the special placeholder token `<cat-toy>`, and choose the number of samples and rows of images you'd like to generate:
+
+```py
+prompt = "a grafitti in a favela wall with a on it"
+
+num_samples_per_row = 2
+num_rows = 2
+```
+
+Then run the pipeline (feel free to adjust the parameters like `num_inference_steps` and `guidance_scale` to see how they affect image quality), save the generated images and visualize them with the `make_image_grid` helper function you imported at the beginning:
+
+```py
+all_images = []
+for _ in range(num_rows):
+    images = pipeline(prompt, num_images_per_prompt=num_samples_per_row, num_inference_steps=50, guidance_scale=7.5).images
+    all_images.extend(images)
+
+grid = make_image_grid(all_images, num_rows, num_samples_per_row)
+grid
+```
+
+
+
+
+
+## Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) can also use textual inversion vectors for inference. In contrast to Stable Diffusion 1 and 2, SDXL has two text encoders so you'll need two textual inversion embeddings - one for each text encoder model.
+
+Let's download the SDXL textual inversion embeddings and have a closer look at their structure:
+
+```py
+from huggingface_hub import hf_hub_download
+from safetensors.torch import load_file
+
+file = hf_hub_download("dn118/unaestheticXL", filename="unaestheticXLv31.safetensors")
+state_dict = load_file(file)
+state_dict
+```
+
+```
+{'clip_g': tensor([[ 0.0077, -0.0112, 0.0065, ..., 0.0195, 0.0159, 0.0275],
+ ...,
+ [-0.0170, 0.0213, 0.0143, ..., -0.0302, -0.0240, -0.0362]],
+ 'clip_l': tensor([[ 0.0023, 0.0192, 0.0213, ..., -0.0385, 0.0048, -0.0011],
+ ...,
+ [ 0.0475, -0.0508, -0.0145, ..., 0.0070, -0.0089, -0.0163]],
+```
+
+There are two tensors, `"clip_g"` and `"clip_l"`. `"clip_g"` corresponds to the bigger text encoder in SDXL and refers to `pipe.text_encoder_2`, while `"clip_l"` refers to `pipe.text_encoder`.
+
+Now you can load each tensor separately by passing them along with the correct text encoder and tokenizer
+to [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`]:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16)
+pipe.to("cuda")
+
+pipe.load_textual_inversion(state_dict["clip_g"], token="unaestheticXLv31", text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)
+pipe.load_textual_inversion(state_dict["clip_l"], token="unaestheticXLv31", text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
+
+# the embedding should be used as a negative embedding, so we pass it as a negative prompt
+generator = torch.Generator().manual_seed(33)
+image = pipe("a woman standing in front of a mountain", negative_prompt="unaestheticXLv31", generator=generator).images[0]
+image
+```
diff --git a/docs/source/en/using-diffusers/unconditional_image_generation.md b/docs/source/en/using-diffusers/unconditional_image_generation.md
new file mode 100644
index 0000000..8767eab
--- /dev/null
+++ b/docs/source/en/using-diffusers/unconditional_image_generation.md
@@ -0,0 +1,55 @@
+
+
+# Unconditional image generation
+
+[[open-in-colab]]
+
+Unconditional image generation produces images that look like random samples from the data the model was trained on, because the denoising process is not guided by any additional context like a text prompt or image.
+
+To get started, use the [`DiffusionPipeline`] to load the [anton-l/ddpm-butterflies-128](https://huggingface.co/anton-l/ddpm-butterflies-128) checkpoint to generate images of butterflies. The [`DiffusionPipeline`] downloads and caches all the model components required to generate an image.
+
+```py
+from diffusers import DiffusionPipeline
+
+generator = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128").to("cuda")
+image = generator().images[0]
+image
+```
+
+
+
+Want to generate images of something else? Take a look at the training [guide](../training/unconditional_training) to learn how to train a model to generate your own images.
+
+
+
+The output image is a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object that can be saved:
+
+```py
+image.save("generated_image.png")
+```
+
+You can also try experimenting with the `num_inference_steps` parameter, which controls the number of denoising steps. More denoising steps typically produce higher quality images, but it'll take longer to generate. Feel free to play around with this parameter to see how it affects the image quality.
+
+```py
+image = generator(num_inference_steps=100).images[0]
+image
+```
+
+Try out the Space below to generate an image of a butterfly!
+
+
diff --git a/docs/source/en/using-diffusers/using_safetensors.md b/docs/source/en/using-diffusers/using_safetensors.md
new file mode 100644
index 0000000..a9ab7b8
--- /dev/null
+++ b/docs/source/en/using-diffusers/using_safetensors.md
@@ -0,0 +1,84 @@
+
+
+# Load safetensors
+
+[[open-in-colab]]
+
+[safetensors](https://github.com/huggingface/safetensors) is a safe and fast file format for storing and loading tensors. Typically, PyTorch model weights are saved or *pickled* into a `.bin` file with Python's [`pickle`](https://docs.python.org/3/library/pickle.html) utility. However, `pickle` is not secure and pickled files may contain malicious code that can be executed. safetensors is a secure alternative to `pickle`, making it ideal for sharing model weights.
+
+This guide will show you how to load `.safetensors` files, and how to convert Stable Diffusion model weights stored in other formats to `.safetensors`. Before you start, make sure you have safetensors installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install safetensors
+```
+
+If you look at the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main) repository, you'll see weights inside the `text_encoder`, `unet` and `vae` subfolders are stored in the `.safetensors` format. By default, ๐ค Diffusers automatically loads these `.safetensors` files from their subfolders if they're available in the model repository.
+
+For more explicit control, you can optionally set `use_safetensors=True` (if `safetensors` is not installed, you'll get an error message asking you to install it):
+
+```py
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
+```
+
+However, model weights are not necessarily stored in separate subfolders like in the example above. Sometimes, all the weights are stored in a single `.safetensors` file. In this case, if the weights are Stable Diffusion weights, you can load the file directly with the [`~diffusers.loaders.FromSingleFileMixin.from_single_file`] method:
+
+```py
+from diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_single_file(
+ "https://huggingface.co/WarriorMama777/OrangeMixs/blob/main/Models/AbyssOrangeMix/AbyssOrangeMix.safetensors"
+)
+```
+
+## Convert to safetensors
+
+Not all weights on the Hub are available in the `.safetensors` format, and you may encounter weights stored as `.bin`. In this case, use the [Convert Space](https://huggingface.co/spaces/diffusers/convert) to convert the weights to `.safetensors`. The Convert Space downloads the pickled weights, converts them, and opens a Pull Request to upload the newly converted `.safetensors` file on the Hub. This way, any malicious code contained in the pickled files ends up on the Hub - which has a [security scanner](https://huggingface.co/docs/hub/security-pickle#hubs-security-scanner) to detect unsafe files and suspicious pickle imports - instead of on your computer.
+
+You can use the model with the new `.safetensors` weights by specifying the reference to the Pull Request in the `revision` parameter (you can also test it in this [Check PR](https://huggingface.co/spaces/diffusers/check_pr) Space on the Hub), for example `refs/pr/22`:
+
+```py
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-1", revision="refs/pr/22", use_safetensors=True
+)
+```
+
+## Why use safetensors?
+
+There are several reasons for using safetensors:
+
+- Safety is the number one reason for using safetensors. As open-source and model distribution grows, it is important to be able to trust that the model weights you downloaded don't contain any malicious code. safetensors also limits the size of the file header, which prevents attacks based on parsing extremely large JSON files.
+- Loading speed between switching models is another reason to use safetensors, which performs zero-copy of the tensors. It is especially fast compared to `pickle` if you're loading the weights to CPU (the default case), and just as fast if not faster when directly loading the weights to GPU. You'll only notice the performance difference if the model is already loaded, and not if you're downloading the weights or loading the model for the first time.
+
+ The time it takes to load the entire pipeline:
+
+ ```py
+ from diffusers import StableDiffusionPipeline
+
+ pipeline = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", use_safetensors=True)
+ "Loaded in safetensors 0:00:02.033658"
+ "Loaded in PyTorch 0:00:02.663379"
+ ```
+
+ But the actual time it takes to load 500MB of the model weights is only:
+
+ ```bash
+ safetensors: 3.4873ms
+ PyTorch: 172.7537ms
+ ```
+
+- Lazy loading is also supported in safetensors, which is useful in distributed settings to only load some of the tensors. This format allowed the [BLOOM](https://huggingface.co/bigscience/bloom) model to be loaded in 45 seconds on 8 GPUs instead of 10 minutes with regular PyTorch weights.
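+
+For example, a minimal sketch of lazy loading with the `safetensors` library's `safe_open` API (the file name here is just a placeholder):
+
+```py
+from safetensors import safe_open
+
+# only the requested tensors are read from disk; the rest of the file is untouched
+with safe_open("model.safetensors", framework="pt", device="cpu") as f:
+    keys = list(f.keys())
+    first_tensor = f.get_tensor(keys[0])
+```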
diff --git a/docs/source/en/using-diffusers/weighted_prompts.md b/docs/source/en/using-diffusers/weighted_prompts.md
new file mode 100644
index 0000000..8f0c00d
--- /dev/null
+++ b/docs/source/en/using-diffusers/weighted_prompts.md
@@ -0,0 +1,271 @@
+
+
+# Prompt weighting
+
+[[open-in-colab]]
+
+Prompt weighting provides a way to emphasize or de-emphasize certain parts of a prompt, allowing for more control over the generated image. A prompt can include several concepts, which get turned into contextualized text embeddings. The embeddings are used by the model to condition its cross-attention layers to generate an image (read the Stable Diffusion [blog post](https://huggingface.co/blog/stable_diffusion) to learn more about how it works).
+
+Prompt weighting works by increasing or decreasing the scale of the text embedding vector that corresponds to its concept in the prompt because you may not necessarily want the model to focus on all concepts equally. The easiest way to prepare the prompt-weighted embeddings is to use [Compel](https://github.com/damian0815/compel), a text prompt-weighting and blending library. Once you have the prompt-weighted embeddings, you can pass them to any pipeline that has a [`prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.prompt_embeds) (and optionally [`negative_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.negative_prompt_embeds)) parameter, such as [`StableDiffusionPipeline`], [`StableDiffusionControlNetPipeline`], and [`StableDiffusionXLPipeline`].
+
+
+
+If your favorite pipeline doesn't have a `prompt_embeds` parameter, please open an [issue](https://github.com/huggingface/diffusers/issues/new/choose) so we can add it!
+
+
+
+This guide will show you how to weight and blend your prompts with Compel in ๐ค Diffusers.
+
+Before you begin, make sure you have the latest version of Compel installed:
+
+```py
+# uncomment to install in Colab
+#!pip install compel --upgrade
+```
+
+For this guide, let's generate an image with the prompt `"a red cat playing with a ball"` using the [`StableDiffusionPipeline`]:
+
+```py
+from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
+import torch
+
+pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_safetensors=True)
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+pipe.to("cuda")
+
+prompt = "a red cat playing with a ball"
+
+generator = torch.Generator(device="cpu").manual_seed(33)
+
+image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]
+image
+```
+
+
+
+
+
+## Weighting
+
+You'll notice there is no "ball" in the image! Let's use compel to upweight the concept of "ball" in the prompt. Create a [`Compel`](https://github.com/damian0815/compel/blob/main/doc/compel.md#compel-objects) object, and pass it a tokenizer and text encoder:
+
+```py
+from compel import Compel
+
+compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
+```
+
+compel uses `+` or `-` to increase or decrease the weight of a word in the prompt. To increase the weight of "ball":
+
+
+
+`+` corresponds to the value `1.1`, `++` corresponds to `1.1^2`, and so on. Similarly, `-` corresponds to `0.9` and `--` corresponds to `0.9^2`. Feel free to experiment with adding more `+` or `-` in your prompt!
+
+
+
+```py
+prompt = "a red cat playing with a ball++"
+```
+
+Pass the prompt to `compel_proc` to create the new prompt embeddings which are passed to the pipeline:
+
+```py
+prompt_embeds = compel_proc(prompt)
+generator = torch.manual_seed(33)
+
+image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
+image
+```
+
+
+
+
+
+To downweight parts of the prompt, use the `-` suffix:
+
+```py
+prompt = "a red------- cat playing with a ball"
+prompt_embeds = compel_proc(prompt)
+
+generator = torch.manual_seed(33)
+
+image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
+image
+```
+
+
+
+
+
+You can even up or downweight multiple concepts in the same prompt:
+
+```py
+prompt = "a red cat++ playing with a ball----"
+prompt_embeds = compel_proc(prompt)
+
+generator = torch.manual_seed(33)
+
+image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
+image
+```
+
+
+
+
+
+## Blending
+
+You can also create a weighted *blend* of prompts by adding `.blend()` to a list of prompts and passing it some weights. Your blend may not always produce the result you expect because it breaks some assumptions about how the text encoder functions, so just have fun and experiment with it!
+
+```py
+prompt_embeds = compel_proc('("a red cat playing with a ball", "jungle").blend(0.7, 0.8)')
+generator = torch.Generator(device="cuda").manual_seed(33)
+
+image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
+image
+```
+
+
+
+
+
+## Conjunction
+
+A conjunction diffuses each prompt independently and concatenates their results by their weighted sum. Add `.and()` to the end of a list of prompts to create a conjunction:
+
+```py
+prompt_embeds = compel_proc('["a red cat", "playing with a", "ball"].and()')
+generator = torch.Generator(device="cuda").manual_seed(55)
+
+image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
+image
+```
+
+
+
+
+
+## Textual inversion
+
+[Textual inversion](../training/text_inversion) is a technique for learning a specific concept from some images which you can use to generate new images conditioned on that concept.
+
+Create a pipeline and use the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] function to load the textual inversion embeddings (feel free to browse the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer) for 100+ trained concepts):
+
+```py
+import torch
+from diffusers import StableDiffusionPipeline
+from compel import Compel, DiffusersTextualInversionManager
+
+pipe = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16,
+ use_safetensors=True, variant="fp16").to("cuda")
+pipe.load_textual_inversion("sd-concepts-library/midjourney-style")
+```
+
+Compel provides a `DiffusersTextualInversionManager` class to simplify prompt weighting with textual inversion. Instantiate `DiffusersTextualInversionManager` and pass it to the `Compel` class:
+
+```py
+textual_inversion_manager = DiffusersTextualInversionManager(pipe)
+compel_proc = Compel(
+ tokenizer=pipe.tokenizer,
+ text_encoder=pipe.text_encoder,
+ textual_inversion_manager=textual_inversion_manager)
+```
+
+Incorporate the concept to condition a prompt by using the `<concept>` syntax:
+
+```py
+prompt_embeds = compel_proc('("A red cat++ playing with a ball <midjourney-style>")')
+
+image = pipe(prompt_embeds=prompt_embeds).images[0]
+image
+```
+
+
+
+
+
+## DreamBooth
+
+[DreamBooth](../training/dreambooth) is a technique for generating contextualized images of a subject given just a few images of the subject to train on. It is similar to textual inversion, but DreamBooth trains the full model whereas textual inversion only fine-tunes the text embeddings. This means you should use [`~DiffusionPipeline.from_pretrained`] to load the DreamBooth model (feel free to browse the [Stable Diffusion Dreambooth Concepts Library](https://huggingface.co/sd-dreambooth-library) for 100+ trained models):
+
+```py
+import torch
+from diffusers import DiffusionPipeline, UniPCMultistepScheduler
+from compel import Compel
+
+pipe = DiffusionPipeline.from_pretrained("sd-dreambooth-library/dndcoverart-v1", torch_dtype=torch.float16).to("cuda")
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+```
+
+Create a `Compel` class with a tokenizer and text encoder, and pass your prompt to it. Depending on the model you use, you'll need to incorporate the model's unique identifier into your prompt. For example, the `dndcoverart-v1` model uses the identifier `dndcoverart`:
+
+```py
+compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
+prompt_embeds = compel_proc('("magazine cover of a dndcoverart dragon, high quality, intricate details, larry elmore art style").and()')
+image = pipe(prompt_embeds=prompt_embeds).images[0]
+image
+```
+
+
+
+
+
+## Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) has two tokenizers and text encoders so its usage is a bit different. To address this, you should pass both tokenizers and encoders to the `Compel` class:
+
+```py
+from compel import Compel, ReturnedEmbeddingsType
+from diffusers import DiffusionPipeline
+from diffusers.utils import make_image_grid
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ variant="fp16",
+ use_safetensors=True,
+ torch_dtype=torch.float16
+).to("cuda")
+
+compel = Compel(
+ tokenizer=[pipeline.tokenizer, pipeline.tokenizer_2] ,
+ text_encoder=[pipeline.text_encoder, pipeline.text_encoder_2],
+ returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
+ requires_pooled=[False, True]
+)
+```
+
+This time, let's upweight "ball" by a factor of 1.5 for the first prompt, and downweight "ball" by 0.6 for the second prompt. The [`StableDiffusionXLPipeline`] also requires [`pooled_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLInpaintPipeline.__call__.pooled_prompt_embeds) (and optionally [`negative_pooled_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLInpaintPipeline.__call__.negative_pooled_prompt_embeds)) so you should pass those to the pipeline along with the conditioning tensors:
+
+```py
+# apply weights
+prompt = ["a red cat playing with a (ball)1.5", "a red cat playing with a (ball)0.6"]
+conditioning, pooled = compel(prompt)
+
+# generate image
+generator = [torch.Generator().manual_seed(33) for _ in range(len(prompt))]
+images = pipeline(prompt_embeds=conditioning, pooled_prompt_embeds=pooled, generator=generator, num_inference_steps=30).images
+make_image_grid(images, rows=1, cols=2)
+```
+
+*"a red cat playing with a (ball)1.5" and "a red cat playing with a (ball)0.6"*
+
diff --git a/docs/source/en/using-diffusers/write_own_pipeline.md b/docs/source/en/using-diffusers/write_own_pipeline.md
new file mode 100644
index 0000000..6d766d0
--- /dev/null
+++ b/docs/source/en/using-diffusers/write_own_pipeline.md
@@ -0,0 +1,293 @@
+
+
+# Understanding pipelines, models and schedulers
+
+[[open-in-colab]]
+
+๐งจ Diffusers is designed to be a user-friendly and flexible toolbox for building diffusion systems tailored to your use-case. At the core of the toolbox are models and schedulers. While the [`DiffusionPipeline`] bundles these components together for convenience, you can also unbundle the pipeline and use the models and schedulers separately to create new diffusion systems.
+
+In this tutorial, you'll learn how to use models and schedulers to assemble a diffusion system for inference, starting with a basic pipeline and then progressing to the Stable Diffusion pipeline.
+
+## Deconstruct a basic pipeline
+
+A pipeline is a quick and easy way to run a model for inference, requiring no more than four lines of code to generate an image:
+
+```py
+>>> from diffusers import DDPMPipeline
+
+>>> ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")
+>>> image = ddpm(num_inference_steps=25).images[0]
+>>> image
+```
+
+
+
+
+
+That was super easy, but how did the pipeline do that? Let's break down the pipeline and take a look at what's happening under the hood.
+
+In the example above, the pipeline contains a [`UNet2DModel`] model and a [`DDPMScheduler`]. The pipeline denoises an image by taking random noise the size of the desired output and passing it through the model several times. At each timestep, the model predicts the *noise residual* and the scheduler uses it to predict a less noisy image. The pipeline repeats this process until it reaches the end of the specified number of inference steps.
+
+To recreate the pipeline with the model and scheduler separately, let's write our own denoising process.
+
+1. Load the model and scheduler:
+
+```py
+>>> from diffusers import DDPMScheduler, UNet2DModel
+
+>>> scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
+>>> model = UNet2DModel.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")
+```
+
+2. Set the number of timesteps to run the denoising process for:
+
+```py
+>>> scheduler.set_timesteps(50)
+```
+
+3. Setting the scheduler timesteps creates a tensor with evenly spaced elements in it, 50 in this example. Each element corresponds to a timestep at which the model denoises an image. When you create the denoising loop later, you'll iterate over this tensor to denoise an image:
+
+```py
+>>> scheduler.timesteps
+tensor([980, 960, 940, 920, 900, 880, 860, 840, 820, 800, 780, 760, 740, 720,
+ 700, 680, 660, 640, 620, 600, 580, 560, 540, 520, 500, 480, 460, 440,
+ 420, 400, 380, 360, 340, 320, 300, 280, 260, 240, 220, 200, 180, 160,
+ 140, 120, 100, 80, 60, 40, 20, 0])
+```
+
+4. Create some random noise with the same shape as the desired output:
+
+```py
+>>> import torch
+
+>>> sample_size = model.config.sample_size
+>>> noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")
+```
+
+5. Now write a loop to iterate over the timesteps. At each timestep, the model does a [`UNet2DModel.forward`] pass and returns the noisy residual. The scheduler's [`~DDPMScheduler.step`] method takes the noisy residual, timestep, and input and it predicts the image at the previous timestep. This output becomes the next input to the model in the denoising loop, and it'll repeat until it reaches the end of the `timesteps` array.
+
+```py
+>>> input = noise
+
+>>> for t in scheduler.timesteps:
+...     with torch.no_grad():
+...         noisy_residual = model(input, t).sample
+...     previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
+...     input = previous_noisy_sample
+```
+
+This is the entire denoising process, and you can use this same pattern to write any diffusion system.
+
+6. The last step is to convert the denoised output into an image:
+
+```py
+>>> from PIL import Image
+>>> import numpy as np
+
+>>> image = (input / 2 + 0.5).clamp(0, 1).squeeze()
+>>> image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
+>>> image = Image.fromarray(image)
+>>> image
+```
+
+In the next section, you'll put your skills to the test and break down the more complex Stable Diffusion pipeline. The steps are more or less the same. You'll initialize the necessary components, and set the number of timesteps to create a `timesteps` array. The `timesteps` array is used in the denoising loop, and for each element in this array, the model predicts a less noisy image. The denoising loop iterates over the timesteps, and at each timestep, it outputs a noisy residual and the scheduler uses it to predict a less noisy image at the previous timestep. This process is repeated until you reach the end of the `timesteps` array.
+
+Let's try it out!
+
+## Deconstruct the Stable Diffusion pipeline
+
+Stable Diffusion is a text-to-image *latent diffusion* model. It is called a latent diffusion model because it works with a lower-dimensional representation of the image instead of the actual pixel space, which makes it more memory efficient. The encoder compresses the image into a smaller representation, and a decoder converts the compressed representation back into an image. For text-to-image models, you'll need a tokenizer and a text encoder to generate text embeddings. From the previous example, you already know you need a UNet model and a scheduler.
+
+As you can see, this is already more complex than the DDPM pipeline which only contains a UNet model. The Stable Diffusion pipeline uses three separate pretrained models.
+
+
+
+๐ก Read the [How does Stable Diffusion work?](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) blog for more details about how the VAE, UNet, and text encoder models work.
+
+
+
+Now that you know what you need for the Stable Diffusion pipeline, load all these components with the [`~ModelMixin.from_pretrained`] method. You can find them in the pretrained [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) checkpoint, and each component is stored in a separate subfolder:
+
+```py
+>>> from PIL import Image
+>>> import torch
+>>> from transformers import CLIPTextModel, CLIPTokenizer
+>>> from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
+
+>>> vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
+>>> tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
+>>> text_encoder = CLIPTextModel.from_pretrained(
+... "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True
+... )
+>>> unet = UNet2DConditionModel.from_pretrained(
+... "CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True
+... )
+```
+
+Instead of the default [`PNDMScheduler`], exchange it for the [`UniPCMultistepScheduler`] to see how easy it is to plug a different scheduler in:
+
+```py
+>>> from diffusers import UniPCMultistepScheduler
+
+>>> scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
+```
+
+To speed up inference, move the models to a GPU since, unlike the scheduler, they have trainable weights:
+
+```py
+>>> torch_device = "cuda"
+>>> vae.to(torch_device)
+>>> text_encoder.to(torch_device)
+>>> unet.to(torch_device)
+```
+
+### Create text embeddings
+
+The next step is to tokenize the text to generate embeddings. The text is used to condition the UNet model and steer the diffusion process towards something that resembles the input prompt.
+
+
+
+๐ก The `guidance_scale` parameter determines how much weight should be given to the prompt when generating an image.
+
+
+
+Feel free to choose any prompt you like if you want to generate something else!
+
+```py
+>>> prompt = ["a photograph of an astronaut riding a horse"]
+>>> height = 512 # default height of Stable Diffusion
+>>> width = 512 # default width of Stable Diffusion
+>>> num_inference_steps = 25 # Number of denoising steps
+>>> guidance_scale = 7.5 # Scale for classifier-free guidance
+>>> generator = torch.manual_seed(0) # Seed generator to create the initial latent noise
+>>> batch_size = len(prompt)
+```
+
+Tokenize the text and generate the embeddings from the prompt:
+
+```py
+>>> text_input = tokenizer(
+... prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
+... )
+
+>>> with torch.no_grad():
+... text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
+```
+
+You'll also need to generate the *unconditional text embeddings* which are the embeddings for the padding token. These need to have the same shape (`batch_size` and `seq_length`) as the conditional `text_embeddings`:
+
+```py
+>>> max_length = text_input.input_ids.shape[-1]
+>>> uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
+>>> uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]
+```
+
+Let's concatenate the conditional and unconditional embeddings into a batch to avoid doing two forward passes:
+
+```py
+>>> text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
+```
+
+### Create random noise
+
+Next, generate some initial random noise as a starting point for the diffusion process. This is the latent representation of the image, and it'll be gradually denoised. At this point, the `latent` image is smaller than the final image size, but that's okay because the model will transform it into the final 512x512 image dimensions later.
+
+
+
+๐ก The height and width are divided by 8 because the `vae` model has 3 down-sampling layers. You can check by running the following:
+
+```py
+2 ** (len(vae.config.block_out_channels) - 1) == 8
+```
+
+
+
+```py
+>>> latents = torch.randn(
+... (batch_size, unet.config.in_channels, height // 8, width // 8),
+... generator=generator,
+... device=torch_device,
+... )
+```
+
+### Denoise the image
+
+Start by scaling the input by *sigma*, the initial noise scale value, which is required for improved schedulers like [`UniPCMultistepScheduler`]:
+
+```py
+>>> latents = latents * scheduler.init_noise_sigma
+```
+
+The last step is to create the denoising loop that'll progressively transform the pure noise in `latents` to an image described by your prompt. Remember, the denoising loop needs to do three things:
+
+1. Set the scheduler's timesteps to use during denoising.
+2. Iterate over the timesteps.
+3. At each timestep, call the UNet model to predict the noise residual and pass it to the scheduler to compute the previous noisy sample.
+
+```py
+>>> from tqdm.auto import tqdm
+
+>>> scheduler.set_timesteps(num_inference_steps)
+
+>>> for t in tqdm(scheduler.timesteps):
+...     # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
+...     latent_model_input = torch.cat([latents] * 2)
+
+...     latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)
+
+...     # predict the noise residual
+...     with torch.no_grad():
+...         noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
+
+...     # perform guidance
+...     noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
+...     noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
+
+...     # compute the previous noisy sample x_t -> x_t-1
+...     latents = scheduler.step(noise_pred, t, latents).prev_sample
+```
+
+### Decode the image
+
+The final step is to use the `vae` to decode the latent representation into an image and get the decoded output with `sample`:
+
+```py
+# scale and decode the image latents with vae
+latents = 1 / 0.18215 * latents
+with torch.no_grad():
+    image = vae.decode(latents).sample
+```
+
+Lastly, convert the image to a `PIL.Image` to see your generated image!
+
+```py
+>>> image = (image / 2 + 0.5).clamp(0, 1).squeeze()
+>>> image = (image.permute(1, 2, 0) * 255).to(torch.uint8).cpu().numpy()
+>>> image = Image.fromarray(image)
+>>> image
+```
+
+
+
+
+
+## Next steps
+
+From basic to complex pipelines, you've seen that all you really need to write your own diffusion system is a denoising loop. The loop should set the scheduler's timesteps, iterate over them, and alternate between calling the UNet model to predict the noise residual and passing it to the scheduler to compute the previous noisy sample.
+
+This is really what ๐งจ Diffusers is designed for: to make it intuitive and easy to write your own diffusion system using models and schedulers.
+
+For your next steps, feel free to:
+
+* Learn how to [build and contribute a pipeline](../using-diffusers/contribute_pipeline) to ๐งจ Diffusers. We can't wait to see what you'll come up with!
+* Explore [existing pipelines](../api/pipelines/overview) in the library, and see if you can deconstruct and build a pipeline from scratch using the models and schedulers separately.
diff --git a/docs/source/ja/_toctree.yml b/docs/source/ja/_toctree.yml
new file mode 100644
index 0000000..000809b
--- /dev/null
+++ b/docs/source/ja/_toctree.yml
@@ -0,0 +1,16 @@
+- sections:
+ - local: index
+ title: ๐งจ Diffusers
+ - local: quicktour
+ title: ใฏใคใใฏใใขใผ
+ - local: stable_diffusion
+ title: ๆๅนใงๅน็ใฎ่ฏใๆกๆฃใขใใซ
+ - local: installation
+ title: ใคใณในใใผใซ
+ title: ใฏใใใซ
+- sections:
+ - local: tutorials/tutorial_overview
+ title: ๆฆ่ฆ
+ - local: tutorials/autopipeline
+ title: AutoPipeline
+ title: ใใฅใผใใชใขใซ
\ No newline at end of file
diff --git a/docs/source/ja/index.md b/docs/source/ja/index.md
new file mode 100644
index 0000000..8d4e8db
--- /dev/null
+++ b/docs/source/ja/index.md
@@ -0,0 +1,48 @@
+
+
+
+
+ใใชใใใใงใ๏ผ็จฎใ`1`ใฎ`Generator`ใซๅฏพๅฟใใ2็ช็ฎใฎ็ปๅใซใ่ขซๅไฝใฎๅนด้ฝขใซ้ขใใใใญในใใ่ฟฝๅ ใใฆใใใๅฐใๆใๅ ใใฆใฟใพใใใ๏ผ
+
+```python
+prompts = [
+ "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+]
+
+generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
+images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
+make_image_grid(images, 2, 2)
+```
+
+
+
+๊ฝค ์ธ์์ ์ ๋๋ค! `1`์ ์๋๋ฅผ ๊ฐ์ง `Generator`์ ํด๋นํ๋ ๋ ๋ฒ์งธ ์ด๋ฏธ์ง์ ํผ์ฌ์ฒด์ ๋์ด์ ๋ํ ํ ์คํธ๋ฅผ ์ถ๊ฐํ์ฌ ์กฐ๊ธ ๋ ์กฐ์ ํด ๋ณด๊ฒ ์ต๋๋ค:
+
+```python
+prompts = [
+ "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+]
+
+generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
+images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
+image_grid(images)
+```
+
+
+
+# Diffusers
+
+๐ค Diffusers รฉ uma biblioteca de modelos de difusรฃo de รบltima geraรงรฃo para geraรงรฃo de imagens, รกudio e atรฉ mesmo estruturas 3D de molรฉculas. Se vocรช estรก procurando uma soluรงรฃo de geraรงรฃo simples ou queira treinar seu prรณprio modelo de difusรฃo, ๐ค Diffusers รฉ uma modular caixa de ferramentas que suporta ambos. Nossa biblioteca รฉ desenhada com foco em [usabilidade em vez de desempenho](conceptual/philosophy#usability-over-performance), [simples em vez de fรกcil](conceptual/philosophy#simple-over-easy) e [customizรกvel em vez de abstraรงรตes](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction).
+
+A Biblioteca tem trรชs componentes principais:
+
+- Pipelines de รบltima geraรงรฃo para a geraรงรฃo em poucas linhas de cรณdigo. Tรชm muitos pipelines no ๐ค Diffusers, veja a tabela no pipeline [Visรฃo geral](api/pipelines/overview) para uma lista completa de pipelines disponรญveis e as tarefas que eles resolvem.
+- Intercambiรกveis [agendadores de ruรญdo](api/schedulers/overview) para balancear as compensaรงรตes entre velocidade e qualidade de geraรงรฃo.
+- [Modelos](api/models) prรฉ-treinados que podem ser usados como se fossem blocos de construรงรฃo, e combinados com agendadores, para criar seu prรณprio sistema de difusรฃo de ponta a ponta.
+
+
diff --git a/docs/source/pt/installation.md b/docs/source/pt/installation.md
new file mode 100644
index 0000000..0e45707
--- /dev/null
+++ b/docs/source/pt/installation.md
@@ -0,0 +1,156 @@
+
+
+# Instalaรงรฃo
+
+๐ค Diffusers รฉ testado no Python 3.8+, PyTorch 1.7.0+, e Flax. Siga as instruรงรตes de instalaรงรฃo abaixo para a biblioteca de deep learning que vocรช estรก utilizando:
+
+- [PyTorch](https://pytorch.org/get-started/locally/) instruรงรตes de instalaรงรฃo
+- [Flax](https://flax.readthedocs.io/en/latest/) instruรงรตes de instalaรงรฃo
+
+## Instalaรงรฃo com pip
+
+Recomenda-se instalar ๐ค Diffusers em um [ambiente virtual](https://docs.python.org/3/library/venv.html).
+Se vocรช nรฃo estรก familiarizado com ambiente virtuals, veja o [guia](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
+Um ambiente virtual deixa mais fรกcil gerenciar diferentes projetos e evitar problemas de compatibilidade entre dependรชncias.
+
+Comece criando um ambiente virtual no diretรณrio do projeto:
+
+```bash
+python -m venv .env
+```
+
+Ative o ambiente virtual:
+
+```bash
+source .env/bin/activate
+```
+
+Recomenda-se a instalaรงรฃo do ๐ค Transformers porque ๐ค Diffusers depende de seus modelos:
+
+
+
+```bash
+pip install diffusers["torch"] transformers
+```
+
+
+```bash
+pip install diffusers["flax"] transformers
+```
+
+
+
+## Instalaรงรฃo a partir do cรณdigo fonte
+
+Antes da instalaรงรฃo do ๐ค Diffusers a partir do cรณdigo fonte, certifique-se de ter o PyTorch e o ๐ค Accelerate instalados.
+
+Para instalar o ๐ค Accelerate:
+
+```bash
+pip install accelerate
+```
+
+entรฃo instale o ๐ค Diffusers do cรณdigo fonte:
+
+```bash
+pip install git+https://github.com/huggingface/diffusers
+```
+
+Esse comando instala a รบltima versรฃo em desenvolvimento `main` em vez da รบltima versรฃo estรกvel `stable`.
+A versรฃo `main` รฉ รบtil para se manter atualizado com os รบltimos desenvolvimentos.
+Por exemplo, se um bug foi corrigido desde o รบltimo lanรงamento estรกvel, mas um novo lanรงamento ainda nรฃo foi lanรงado.
+No entanto, isso significa que a versรฃo `main` pode nรฃo ser sempre estรกvel.
+Nรณs nos esforรงamos para manter a versรฃo `main` operacional, e a maioria dos problemas geralmente sรฃo resolvidos em algumas horas ou um dia.
+Se vocรช encontrar um problema, por favor abra uma [Issue](https://github.com/huggingface/diffusers/issues/new/choose), assim conseguimos arrumar o quanto antes!
+
+## Instalaรงรฃo editรกvel
+
+Vocรช precisarรก de uma instalaรงรฃo editรกvel se vocรช:
+
+- Usar a versรฃo `main` do cรณdigo fonte.
+- Contribuir para o ๐ค Diffusers e precisa testar mudanรงas no cรณdigo.
+
+Clone o repositรณrio e instale o ๐ค Diffusers com os seguintes comandos:
+
+```bash
+git clone https://github.com/huggingface/diffusers.git
+cd diffusers
+```
+
+
+
+```bash
+pip install -e ".[torch]"
+```
+
+
+```bash
+pip install -e ".[flax]"
+```
+
+
+
+These commands link the folder you cloned the repository to with your Python library paths.
+Python will now look inside the folder you cloned to, in addition to the normal library paths.
+For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.8/site-packages/`, Python will also search the `~/diffusers/` folder you cloned to.
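+
+If you want to confirm that Python is importing the library from your clone rather than from `site-packages`, printing the module path is a simple check:
+
+```py
+import diffusers
+
+print(diffusers.__file__)  # should point inside the ~/diffusers/ clone, not site-packages
+```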
+
+
+
+Vocรช deve deixar a pasta `diffusers` se vocรช quiser continuar usando a biblioteca.
+
+
+
+Agora vocรช pode facilmente atualizar seu clone para a รบltima versรฃo do ๐ค Diffusers com o seguinte comando:
+
+```bash
+cd ~/diffusers/
+git pull
+```
+
+Your Python environment will find the `main` version of 🤗 Diffusers on the next run.
+
+## Cache
+
+Model weights and files are downloaded from the Hub to a cache, which is usually your home directory. You can change the cache location with the `HF_HOME` or `HUGGINGFACE_HUB_CACHE` environment variables, or by configuring the `cache_dir` parameter in methods like [`~DiffusionPipeline.from_pretrained`].
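+
+For example, a minimal sketch of pointing `cache_dir` at a custom folder (the path below is just an illustration):
+
+```py
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    cache_dir="./my-diffusers-cache",  # hypothetical location; any writable directory works
+)
+```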
+
+Cached files allow you to run 🤗 Diffusers offline. To prevent 🤗 Diffusers from connecting to the internet, set the `HF_HUB_OFFLINE` environment variable to `True` and 🤗 Diffusers will only load previously cached files.
+
+```shell
+export HF_HUB_OFFLINE=True
+```
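+
+If you prefer to control this per call rather than through an environment variable, `from_pretrained` also accepts a `local_files_only` flag; a minimal sketch:
+
+```py
+from diffusers import DiffusionPipeline
+
+# Only use files already in the local cache; never reach out to the Hub.
+pipeline = DiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    local_files_only=True,
+)
+```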
+
+For more details on managing and cleaning the cache, take a look at the [caching](https://huggingface.co/docs/huggingface_hub/guides/manage-cache) guide.
+
+## Telemetry
+
+Our library gathers telemetry information during [`~DiffusionPipeline.from_pretrained`] requests.
+The data gathered includes the version of 🤗 Diffusers and PyTorch/Flax, the requested model or pipeline class,
+and the path to a pretrained checkpoint if it is hosted on the Hugging Face Hub.
+This usage data helps us debug issues and prioritize new features.
+Telemetry is only sent when loading models and pipelines from the Hub,
+and it is not collected if you're loading local files.
+
+We understand that not everyone wants to share additional information, and we respect your privacy.
+You can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal:
+
+On Linux/MacOS:
+
+```bash
+export DISABLE_TELEMETRY=YES
+```
+
+On Windows:
+
+```bash
+set DISABLE_TELEMETRY=YES
+```
diff --git a/docs/source/pt/quicktour.md b/docs/source/pt/quicktour.md
new file mode 100644
index 0000000..b1ea0b3
--- /dev/null
+++ b/docs/source/pt/quicktour.md
@@ -0,0 +1,314 @@
+
+
+[[open-in-colab]]
+
+# Quicktour
+
+Diffusion models are trained to denoise random Gaussian noise step-by-step to generate a sample of interest, such as an image or audio. This has sparked a tremendous amount of interest in generative AI, and you've probably already seen examples of diffusion-generated images on the internet. 🧨 Diffusers is a library aimed at making diffusion models broadly accessible to everyone.
+
+Whether you're a developer or an everyday user, this quicktour will introduce you to 🧨 Diffusers and help you get up and generating quickly! There are three main components of the library to know about:
+
+- The [`DiffusionPipeline`] is a high-level end-to-end class designed to rapidly generate samples from pretrained diffusion models for inference.
+- Popular pretrained [models](./api/models) and modules that can be used as building blocks to create diffusion systems.
+- Many different [schedulers](./api/schedulers/overview) - algorithms that control how noise is added for training, and how to generate denoised images during inference.
+
+This quicktour will show you how to use the [`DiffusionPipeline`] for inference, and then walk you through how to combine a model and a scheduler to replicate what's happening inside the [`DiffusionPipeline`].
+
+
+
+Esse tour rรกpido รฉ uma versรฃo simplificada da introduรงรฃo ๐งจ Diffusers [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb) para ajudar vocรช a comeรงar rรกpido. Se vocรช quer aprender mais sobre o objetivo do ๐งจ Diffusers, filosofia de design, e detalhes adicionais sobre a API principal, veja o notebook!
+
+
+
+Before you begin, make sure you have all the necessary libraries installed:
+
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install --upgrade diffusers accelerate transformers
+```
+
+- [๐ค Accelerate](https://huggingface.co/docs/accelerate/index) acelera o carregamento do modelo para geraรงรฃo e treinamento.
+- [๐ค Transformers](https://huggingface.co/docs/transformers/index) รฉ necessรกrio para executar os modelos mais populares de difusรฃo, como o [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview).
+
+## DiffusionPipeline
+
+O [`DiffusionPipeline`] รฉ a forma mais fรกcil de usar um sistema de difusรฃo prรฉ-treinado para geraรงรฃo. ร um sistema de ponta a ponta contendo o modelo e o agendador. Vocรช pode usar o [`DiffusionPipeline`] pronto para muitas tarefas. Dรช uma olhada na tabela abaixo para algumas tarefas suportadas, e para uma lista completa de tarefas suportadas, veja a tabela [Resumo do ๐งจ Diffusers](./api/pipelines/overview#diffusers-summary).
+
+| **Task**                               | **Description**                                                                                                             | **Pipeline**                                                                        |
+| -------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
+| Unconditional Image Generation          | generate an image from Gaussian noise                                                                                        | [unconditional_image_generation](./using-diffusers/unconditional_image_generation)   |
+| Text-Guided Image Generation            | generate an image given a text prompt                                                                                        | [conditional_image_generation](./using-diffusers/conditional_image_generation)       |
+| Text-Guided Image-to-Image Translation  | adapt an image guided by a text prompt                                                                                       | [img2img](./using-diffusers/img2img)                                                  |
+| Text-Guided Image-Inpainting            | fill the masked part of an image given the image, the mask, and a text prompt                                                | [inpaint](./using-diffusers/inpaint)                                                  |
+| Text-Guided Depth-to-Image Translation  | adapt parts of an image guided by a text prompt while preserving structure via depth estimation                              | [depth2img](./using-diffusers/depth2img)                                              |
+
+Start by creating an instance of a [`DiffusionPipeline`] and specify which pipeline checkpoint you would like to download.
+You can use the [`DiffusionPipeline`] for any [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) stored on the Hugging Face Hub.
+In this quicktour, you'll load the [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint for text-to-image generation.
+
+
+
+For [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) models, please carefully read the [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) first before running the model. 🧨 Diffusers implements a [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) to prevent offensive or harmful content, but the model's improved image generation capabilities can still produce potentially harmful content.
+
+
+
+Load the model with the [`~DiffusionPipeline.from_pretrained`] method:
+
+```python
+>>> from diffusers import DiffusionPipeline
+
+>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
+```
+
+The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the [`UNet2DConditionModel`] and [`PNDMScheduler`], among other things:
+
+```py
+>>> pipeline
+StableDiffusionPipeline {
+ "_class_name": "StableDiffusionPipeline",
+ "_diffusers_version": "0.13.1",
+ ...,
+ "scheduler": [
+ "diffusers",
+ "PNDMScheduler"
+ ],
+ ...,
+ "unet": [
+ "diffusers",
+ "UNet2DConditionModel"
+ ],
+ "vae": [
+ "diffusers",
+ "AutoencoderKL"
+ ]
+}
+```
+
+Nรณs fortemente recomendamos rodar o pipeline em uma placa de vรญdeo, pois o modelo consiste em aproximadamente 1.4 bilhรตes de parรขmetros.
+Vocรช pode mover o objeto gerador para uma placa de vรญdeo, assim como vocรช faria no PyTorch:
+
+```python
+>>> pipeline.to("cuda")
+```
+
+Agora vocรช pode passar o prompt de texto para o `pipeline` para gerar uma imagem, e entรฃo acessar a imagem sem ruรญdo. Por padrรฃo, a saรญda da imagem รฉ embrulhada em um objeto [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class).
+
+```python
+>>> image = pipeline("An image of a squirrel in Picasso style").images[0]
+>>> image
+```
+
+
+
+
+
+Save the image by calling `save`:
+
+```python
+>>> image.save("image_of_squirrel_painting.png")
+```
+
+### Local pipeline
+
+You can also use the pipeline locally. The only difference is that you need to download the weights first:
+
+```bash
+!git lfs install
+!git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
+```
+
+Then load the saved weights into the pipeline:
+
+```python
+>>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True)
+```
+
+Agora vocรช pode rodar o pipeline como vocรช faria na seรงรฃo acima.
+
+### Troca dos agendadores
+
+Agendadores diferentes tem diferentes velocidades de retirar o ruรญdo e compensaรงรตes de qualidade. A melhor forma de descobrir qual funciona melhor para vocรช รฉ testar eles! Uma das principais caracterรญsticas do ๐งจ Diffusers รฉ permitir que vocรช troque facilmente entre agendadores. Por exemplo, para substituir o [`PNDMScheduler`] padrรฃo com o [`EulerDiscreteScheduler`], carregue ele com o mรฉtodo [`~diffusers.ConfigMixin.from_config`]:
+
+```py
+>>> from diffusers import EulerDiscreteScheduler
+
+>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
+>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
+```
+
+Try generating an image with the new scheduler and see if you notice a difference!
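+
+For example, a minimal sketch that generates with the swapped-in scheduler (the prompt is the same one used earlier in this quicktour):
+
+```py
+>>> import torch
+
+>>> pipeline.to("cuda")
+>>> generator = torch.Generator("cuda").manual_seed(0)  # fix the seed so scheduler comparisons are fair
+>>> image = pipeline("An image of a squirrel in Picasso style", generator=generator).images[0]
+>>> image
+```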
+
+Na prรณxima seรงรฃo, vocรช irรก dar uma olhada mais de perto nos componentes - o modelo e o agendador - que compรตe o [`DiffusionPipeline`] e aprender como usar esses componentes para gerar uma imagem de um gato.
+
+## Models
+
+Most models take a noisy sample, and at each _timestep_ predict the _noise residual_ (other models learn to predict the previous sample directly, or the velocity or [`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)), the difference between a less noisy image and the input image. You can mix and match models to create other diffusion systems.
+
+Models are initiated with the [`~ModelMixin.from_pretrained`] method, which also locally caches the model weights so it is faster the next time you load the model. For this quicktour, you'll load the [`UNet2DModel`], a basic unconditional image generation model with a checkpoint trained on cat images:
+
+```py
+>>> from diffusers import UNet2DModel
+
+>>> repo_id = "google/ddpm-cat-256"
+>>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True)
+```
+
+To access the model parameters, call `model.config`:
+
+```py
+>>> model.config
+```
+
+The model configuration is a 🧊 frozen 🧊 dictionary, which means those parameters can't be changed after the model is created. This is intentional and ensures that the parameters used to define the model architecture at the start remain the same, while other parameters can still be adjusted during inference.
+
+Some of the most important parameters are listed below (a short sketch of reading them follows the list):
+
+- `sample_size`: the height and width dimension of the input sample.
+- `in_channels`: the number of input channels of the input sample.
+- `down_block_types` and `up_block_types`: the types of downsampling and upsampling blocks used to create the UNet architecture.
+- `block_out_channels`: the number of output channels of the downsampling blocks; also used in reverse order for the number of input channels of the upsampling blocks.
+- `layers_per_block`: the number of ResNet blocks present in each UNet block.
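+
+As a small illustration, each of the parameters above can be read directly off `model.config`; the values in the comments are the ones for this cat checkpoint:
+
+```py
+>>> model.config.sample_size  # 256
+>>> model.config.in_channels  # 3
+>>> model.config.layers_per_block
+>>> model.config.down_block_types
+```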
+
+To use the model for inference, create the image shape with random Gaussian noise. It should have a `batch` axis because the model can receive multiple random noises, a `channel` axis corresponding to the number of input channels, and a `sample_size` axis for the height and width of the image:
+
+```py
+>>> import torch
+
+>>> torch.manual_seed(0)
+
+>>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
+>>> noisy_sample.shape
+torch.Size([1, 3, 256, 256])
+```
+
+For inference, pass the noisy image and a `timestep` to the model. The `timestep` indicates how noisy the input image is, with more noise at the beginning and less at the end. This helps the model determine its position in the diffusion process, whether it is closer to the start or the end. Get the model output through the `sample` attribute:
+
+```py
+>>> with torch.no_grad():
+... noisy_residual = model(sample=noisy_sample, timestep=2).sample
+```
+
+To generate actual examples though, you'll need a scheduler to guide the denoising process. In the next section, you'll learn how to couple a model with a scheduler.
+
+## Schedulers
+
+Schedulers manage going from a noisy sample to a less noisy sample given the model output - in this case, the `noisy_residual`.
+
+
+
+๐งจ Diffusers รฉ uma caixa de ferramentas para construir sistemas de difusรฃo. Enquanto o [`DiffusionPipeline`] รฉ uma forma conveniente de comeรงar com um sistema de difusรฃo prรฉ-construรญdo, vocรช tambรฉm pode escolher seus prรณprios modelos e agendadores separadamente para construir um sistema de difusรฃo personalizado.
+
+
+
+For this quicktour, you'll instantiate the [`DDPMScheduler`] with the [`~diffusers.ConfigMixin.from_config`] method:
+
+```py
+>>> from diffusers import DDPMScheduler
+
+>>> scheduler = DDPMScheduler.from_config(repo_id)
+>>> scheduler
+DDPMScheduler {
+ "_class_name": "DDPMScheduler",
+ "_diffusers_version": "0.13.1",
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "beta_start": 0.0001,
+ "clip_sample": true,
+ "clip_sample_range": 1.0,
+ "num_train_timesteps": 1000,
+ "prediction_type": "epsilon",
+ "trained_betas": null,
+ "variance_type": "fixed_small"
+}
+```
+
+
+
+๐ก Perceba como o agendador รฉ instanciado de uma configuraรงรฃo. Diferentemente de um modelo, um agendador nรฃo tem pesos treinรกveis e รฉ livre de parรขmetros!
+
+
+
+Um dos parรขmetros mais importante sรฃo:
+
+- `num_train_timesteps`: o tamanho do processo de retirar ruรญdo ou em outras palavras, o nรบmero de _timesteps_ necessรกrios para o processo de ruรญdos Gausianos aleatรณrios dentro de uma amostra de dados.
+- `beta_schedule`: o tipo de agendados de ruรญdo para o uso de geraรงรฃo e treinamento.
+- `beta_start` e `beta_end`: para comeรงar e terminar os valores de ruรญdo para o agendador de ruรญdo.
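+
+To make these concrete, here is a minimal sketch that constructs a `DDPMScheduler` by passing them explicitly, using the same values as the configuration printed above:
+
+```py
+>>> from diffusers import DDPMScheduler
+
+>>> scheduler = DDPMScheduler(
+...     num_train_timesteps=1000,  # length of the denoising process
+...     beta_schedule="linear",    # type of noise schedule
+...     beta_start=0.0001,         # starting noise value
+...     beta_end=0.02,             # ending noise value
+... )
+```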
+
+To predict a slightly less noisy image, pass the following to the scheduler's [`~diffusers.DDPMScheduler.step`] method: the model output, the `timestep`, and the current `sample`.
+
+```py
+>>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample
+>>> less_noisy_sample.shape
+```
+
+The `less_noisy_sample` can be passed to the next `timestep`, where it'll get even less noisy! Let's bring it all together now and visualize the entire denoising process.
+
+First, create a function that post-processes and displays the denoised image as a `PIL.Image`:
+
+```py
+>>> import PIL.Image
+>>> import numpy as np
+
+
+>>> def display_sample(sample, i):
+... image_processed = sample.cpu().permute(0, 2, 3, 1)
+... image_processed = (image_processed + 1.0) * 127.5
+... image_processed = image_processed.numpy().astype(np.uint8)
+
+... image_pil = PIL.Image.fromarray(image_processed[0])
+... display(f"Image at step {i}")
+... display(image_pil)
+```
+
+To speed up the denoising process, move the input and the model to a GPU:
+
+```py
+>>> model.to("cuda")
+>>> noisy_sample = noisy_sample.to("cuda")
+```
+
+Now create a denoising loop that predicts the residual of the less noisy sample, and computes the less noisy sample with the scheduler:
+
+```py
+>>> import tqdm
+
+>>> sample = noisy_sample
+
+>>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
+... # 1. predict noise residual
+... with torch.no_grad():
+... residual = model(sample, t).sample
+
+... # 2. compute less noisy image and set x_t -> x_t-1
+... sample = scheduler.step(residual, t, sample).prev_sample
+
+... # 3. optionally look at image
+... if (i + 1) % 50 == 0:
+... display_sample(sample, i + 1)
+```
+
+Sit back and watch as a cat is generated from nothing but noise! 😻
+
+
+
+
+
+## Next steps
+
+Hopefully, you generated some cool images with 🧨 Diffusers in this quicktour! For your next steps, you can:
+
+- Train or finetune a model to generate your own images in the [training](./tutorials/basic_training) tutorial.
+- See example official and community [training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples) for a variety of use cases.
+- Learn more about loading, accessing, changing, and comparing schedulers in the [Using different Schedulers](./using-diffusers/schedulers) guide.
+- Explore prompt engineering, speed and memory optimizations, and tips and tricks for generating higher-quality images with the [Stable Diffusion](./stable_diffusion) guide.
+- Dive deeper into speeding up 🧨 Diffusers with guides on [optimized PyTorch on a GPU](./optimization/fp16), and inference guides for running [Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps) and [ONNX Runtime](./optimization/onnx).
diff --git a/docs/source/zh/_toctree.yml b/docs/source/zh/_toctree.yml
new file mode 100644
index 0000000..41d5e95
--- /dev/null
+++ b/docs/source/zh/_toctree.yml
@@ -0,0 +1,10 @@
+- sections:
+ - local: index
+ title: ๐งจ Diffusers
+ - local: quicktour
+ title: ๅฟซ้ๅ ฅ้จ
+ - local: stable_diffusion
+ title: ๆๆๅ้ซๆ็ๆฉๆฃ
+ - local: installation
+ title: ๅฎ่ฃ
+ title: ๅผๅง
diff --git a/docs/source/zh/index.md b/docs/source/zh/index.md
new file mode 100644
index 0000000..92c52bc
--- /dev/null
+++ b/docs/source/zh/index.md
@@ -0,0 +1,101 @@
+
+
+
+
+้ๅธธ็ไปคไบบๅฐ่ฑกๆทฑๅป! Let's tweak the second image - ๆ `Generator` ็็งๅญ่ฎพ็ฝฎไธบ `1` - ๆทปๅ ไธไบๅ ณไบๅนด้พ็ไธป้ขๆๆฌ:
+
+```python
+prompts = [
+ "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+ "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
+]
+
+generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
+images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
+make_image_grid(images, 2, 2)
+```
+
+
+
+
+
+## ๆๅ
+
+ๅจๆฌๆ็จไธญ, ๆจๅญฆไน ไบๅฆไฝไผๅ[`DiffusionPipeline`]ไปฅๆ้ซ่ฎก็ฎๅๅ ๅญๆ็๏ผไปฅๅๆ้ซ็ๆ่พๅบ็่ดจ้. ๅฆๆไฝ ๆๅ ด่ถฃ่ฎฉไฝ ็ pipeline ๆดๅฟซ, ๅฏไปฅ็ไธ็ไปฅไธ่ตๆบ:
+
+- ๅญฆไน [PyTorch 2.0](./optimization/torch2.0) ๅ [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) ๅฏไปฅ่ฎฉๆจ็้ๅบฆๆ้ซ 5 - 300% . ๅจ A100 GPU ไธ, ๆจ็้ๅบฆๅฏไปฅๆ้ซ 50% !
+- ๅฆๆไฝ ๆฒกๆณ็จ PyTorch 2, ๆไปฌๅปบ่ฎฎไฝ ๅฎ่ฃ [xFormers](./optimization/xformers)ใๅฎ็ๅ ๅญ้ซๆๆณจๆๅๆบๅถ๏ผ*memory-efficient attention mechanism*๏ผไธPyTorch 1.13.1้ ๅไฝฟ็จ๏ผ้ๅบฆๆดๅฟซ๏ผๅ ๅญๆถ่ๆดๅฐใ
+- ๅ ถไป็ไผๅๆๆฏ, ๅฆ๏ผๆจกๅๅธ่ฝฝ๏ผ*model offloading*๏ผ, ๅ ๅซๅจ [่ฟไปฝๆๅ](./optimization/fp16).
diff --git a/tests/__init__.py b/tests/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/conftest.py b/tests/conftest.py
new file mode 100644
index 0000000..4993ed9
--- /dev/null
+++ b/tests/conftest.py
@@ -0,0 +1,44 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# tests directory-specific settings - this file is run automatically
+# by pytest before any tests are run
+
+import sys
+import warnings
+from os.path import abspath, dirname, join
+
+
+# allow having multiple repository checkouts and not needing to remember to rerun
+# 'pip install -e .[dev]' when switching between checkouts and running tests.
+git_repo_path = abspath(join(dirname(dirname(__file__)), "src"))
+sys.path.insert(1, git_repo_path)
+
+# silence FutureWarning warnings in tests since often we can't act on them until
+# they become normal warnings - i.e. the tests still need to test the current functionality
+warnings.simplefilter(action="ignore", category=FutureWarning)
+
+
+def pytest_addoption(parser):
+ from diffusers.utils.testing_utils import pytest_addoption_shared
+
+ pytest_addoption_shared(parser)
+
+
+def pytest_terminal_summary(terminalreporter):
+ from diffusers.utils.testing_utils import pytest_terminal_summary_main
+
+ make_reports = terminalreporter.config.getoption("--make-reports")
+ if make_reports:
+ pytest_terminal_summary_main(terminalreporter, id=make_reports)
diff --git a/tests/fixtures/custom_pipeline/pipeline.py b/tests/fixtures/custom_pipeline/pipeline.py
new file mode 100644
index 0000000..601f51b
--- /dev/null
+++ b/tests/fixtures/custom_pipeline/pipeline.py
@@ -0,0 +1,101 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+
+# limitations under the License.
+
+
+from typing import Optional, Tuple, Union
+
+import torch
+
+from diffusers import DiffusionPipeline, ImagePipelineOutput
+
+
+class CustomLocalPipeline(DiffusionPipeline):
+ r"""
+ This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
+ library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+
+ Parameters:
+ unet ([`UNet2DModel`]): U-Net architecture to denoise the encoded image.
+ scheduler ([`SchedulerMixin`]):
+ A scheduler to be used in combination with `unet` to denoise the encoded image. Can be one of
+ [`DDPMScheduler`], or [`DDIMScheduler`].
+ """
+
+ def __init__(self, unet, scheduler):
+ super().__init__()
+ self.register_modules(unet=unet, scheduler=scheduler)
+
+ @torch.no_grad()
+ def __call__(
+ self,
+ batch_size: int = 1,
+ generator: Optional[torch.Generator] = None,
+ num_inference_steps: int = 50,
+ output_type: Optional[str] = "pil",
+ return_dict: bool = True,
+ **kwargs,
+ ) -> Union[ImagePipelineOutput, Tuple]:
+ r"""
+ Args:
+ batch_size (`int`, *optional*, defaults to 1):
+ The number of images to generate.
+ generator (`torch.Generator`, *optional*):
+ A [torch generator](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation
+ deterministic.
+ eta (`float`, *optional*, defaults to 0.0):
+ The eta parameter which controls the scale of the variance (0 is DDIM and 1 is one type of DDPM).
+ num_inference_steps (`int`, *optional*, defaults to 50):
+ The number of denoising steps. More denoising steps usually lead to a higher quality image at the
+ expense of slower inference.
+ output_type (`str`, *optional*, defaults to `"pil"`):
+                The output format of the generated image. Choose between
+ [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+ return_dict (`bool`, *optional*, defaults to `True`):
+ Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
+
+ Returns:
+ [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if
+            `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the
+ generated images.
+ """
+
+ # Sample gaussian noise to begin loop
+ image = torch.randn(
+ (batch_size, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
+ generator=generator,
+ )
+ image = image.to(self.device)
+
+ # set step values
+ self.scheduler.set_timesteps(num_inference_steps)
+
+ for t in self.progress_bar(self.scheduler.timesteps):
+ # 1. predict noise model_output
+ model_output = self.unet(image, t).sample
+
+ # 2. predict previous mean of image x_t-1 and add variance depending on eta
+ # eta corresponds to ฮท in paper and should be between [0, 1]
+ # do x_t -> x_t-1
+ image = self.scheduler.step(model_output, t, image).prev_sample
+
+ image = (image / 2 + 0.5).clamp(0, 1)
+ image = image.cpu().permute(0, 2, 3, 1).numpy()
+ if output_type == "pil":
+ image = self.numpy_to_pil(image)
+
+ if not return_dict:
+ return (image,), "This is a local test"
+
+ return ImagePipelineOutput(images=image), "This is a local test"
diff --git a/tests/fixtures/custom_pipeline/what_ever.py b/tests/fixtures/custom_pipeline/what_ever.py
new file mode 100644
index 0000000..8ceeb42
--- /dev/null
+++ b/tests/fixtures/custom_pipeline/what_ever.py
@@ -0,0 +1,101 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+
+# limitations under the License.
+
+
+from typing import Optional, Tuple, Union
+
+import torch
+
+from diffusers.pipelines.pipeline_utils import DiffusionPipeline, ImagePipelineOutput
+
+
+class CustomLocalPipeline(DiffusionPipeline):
+ r"""
+ This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
+ library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+
+ Parameters:
+ unet ([`UNet2DModel`]): U-Net architecture to denoise the encoded image.
+ scheduler ([`SchedulerMixin`]):
+ A scheduler to be used in combination with `unet` to denoise the encoded image. Can be one of
+ [`DDPMScheduler`], or [`DDIMScheduler`].
+ """
+
+ def __init__(self, unet, scheduler):
+ super().__init__()
+ self.register_modules(unet=unet, scheduler=scheduler)
+
+ @torch.no_grad()
+ def __call__(
+ self,
+ batch_size: int = 1,
+ generator: Optional[torch.Generator] = None,
+ num_inference_steps: int = 50,
+ output_type: Optional[str] = "pil",
+ return_dict: bool = True,
+ **kwargs,
+ ) -> Union[ImagePipelineOutput, Tuple]:
+ r"""
+ Args:
+ batch_size (`int`, *optional*, defaults to 1):
+ The number of images to generate.
+ generator (`torch.Generator`, *optional*):
+ A [torch generator](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation
+ deterministic.
+ eta (`float`, *optional*, defaults to 0.0):
+ The eta parameter which controls the scale of the variance (0 is DDIM and 1 is one type of DDPM).
+ num_inference_steps (`int`, *optional*, defaults to 50):
+ The number of denoising steps. More denoising steps usually lead to a higher quality image at the
+ expense of slower inference.
+ output_type (`str`, *optional*, defaults to `"pil"`):
+                The output format of the generated image. Choose between
+ [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+ return_dict (`bool`, *optional*, defaults to `True`):
+ Whether or not to return a [`~pipeline_utils.ImagePipelineOutput`] instead of a plain tuple.
+
+ Returns:
+ [`~pipeline_utils.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if
+            `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the
+ generated images.
+ """
+
+ # Sample gaussian noise to begin loop
+ image = torch.randn(
+ (batch_size, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
+ generator=generator,
+ )
+ image = image.to(self.device)
+
+ # set step values
+ self.scheduler.set_timesteps(num_inference_steps)
+
+ for t in self.progress_bar(self.scheduler.timesteps):
+ # 1. predict noise model_output
+ model_output = self.unet(image, t).sample
+
+ # 2. predict previous mean of image x_t-1 and add variance depending on eta
+ # eta corresponds to ฮท in paper and should be between [0, 1]
+ # do x_t -> x_t-1
+ image = self.scheduler.step(model_output, t, image).prev_sample
+
+ image = (image / 2 + 0.5).clamp(0, 1)
+ image = image.cpu().permute(0, 2, 3, 1).numpy()
+ if output_type == "pil":
+ image = self.numpy_to_pil(image)
+
+ if not return_dict:
+ return (image,), "This is a local test"
+
+ return ImagePipelineOutput(images=image), "This is a local test"
diff --git a/tests/fixtures/elise_format0.mid b/tests/fixtures/elise_format0.mid
new file mode 100644
index 0000000..33dbabe
Binary files /dev/null and b/tests/fixtures/elise_format0.mid differ
diff --git a/tests/lora/test_lora_layers_peft.py b/tests/lora/test_lora_layers_peft.py
new file mode 100644
index 0000000..67d28fe
--- /dev/null
+++ b/tests/lora/test_lora_layers_peft.py
@@ -0,0 +1,2324 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import copy
+import gc
+import importlib
+import os
+import tempfile
+import time
+import unittest
+
+import numpy as np
+import torch
+import torch.nn as nn
+from huggingface_hub import hf_hub_download
+from huggingface_hub.repocard import RepoCard
+from packaging import version
+from safetensors.torch import load_file
+from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ AutoPipelineForImage2Image,
+ AutoPipelineForText2Image,
+ ControlNetModel,
+ DDIMScheduler,
+ DiffusionPipeline,
+ EulerDiscreteScheduler,
+ LCMScheduler,
+ StableDiffusionPipeline,
+ StableDiffusionXLAdapterPipeline,
+ StableDiffusionXLControlNetPipeline,
+ StableDiffusionXLPipeline,
+ T2IAdapter,
+ UNet2DConditionModel,
+)
+from diffusers.utils.import_utils import is_accelerate_available, is_peft_available
+from diffusers.utils.testing_utils import (
+ floats_tensor,
+ load_image,
+ nightly,
+ numpy_cosine_similarity_distance,
+ require_peft_backend,
+ require_peft_version_greater,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+
+if is_accelerate_available():
+ from accelerate.utils import release_memory
+
+if is_peft_available():
+ from peft import LoraConfig
+ from peft.tuners.tuners_utils import BaseTunerLayer
+ from peft.utils import get_peft_model_state_dict
+
+
+def state_dicts_almost_equal(sd1, sd2):
+ sd1 = dict(sorted(sd1.items()))
+ sd2 = dict(sorted(sd2.items()))
+
+ models_are_equal = True
+ for ten1, ten2 in zip(sd1.values(), sd2.values()):
+ if (ten1 - ten2).abs().max() > 1e-3:
+ models_are_equal = False
+
+ return models_are_equal
+
+
+@require_peft_backend
+class PeftLoraLoaderMixinTests:
+ torch_device = "cuda" if torch.cuda.is_available() else "cpu"
+ pipeline_class = None
+ scheduler_cls = None
+ scheduler_kwargs = None
+ has_two_text_encoders = False
+ unet_kwargs = None
+ vae_kwargs = None
+
+ def get_dummy_components(self, scheduler_cls=None):
+        scheduler_cls = self.scheduler_cls if scheduler_cls is None else scheduler_cls
+ rank = 4
+
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(**self.unet_kwargs)
+
+ scheduler = scheduler_cls(**self.scheduler_kwargs)
+
+ torch.manual_seed(0)
+ vae = AutoencoderKL(**self.vae_kwargs)
+
+ text_encoder = CLIPTextModel.from_pretrained("peft-internal-testing/tiny-clip-text-2")
+ tokenizer = CLIPTokenizer.from_pretrained("peft-internal-testing/tiny-clip-text-2")
+
+ if self.has_two_text_encoders:
+ text_encoder_2 = CLIPTextModelWithProjection.from_pretrained("peft-internal-testing/tiny-clip-text-2")
+ tokenizer_2 = CLIPTokenizer.from_pretrained("peft-internal-testing/tiny-clip-text-2")
+
+ text_lora_config = LoraConfig(
+ r=rank,
+ lora_alpha=rank,
+ target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
+ init_lora_weights=False,
+ )
+
+ unet_lora_config = LoraConfig(
+ r=rank, lora_alpha=rank, target_modules=["to_q", "to_k", "to_v", "to_out.0"], init_lora_weights=False
+ )
+
+ if self.has_two_text_encoders:
+ pipeline_components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "image_encoder": None,
+ "feature_extractor": None,
+ }
+ else:
+ pipeline_components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+
+ return pipeline_components, text_lora_config, unet_lora_config
+
+ def get_dummy_inputs(self, with_generator=True):
+ batch_size = 1
+ sequence_length = 10
+ num_channels = 4
+ sizes = (32, 32)
+
+ generator = torch.manual_seed(0)
+ noise = floats_tensor((batch_size, num_channels) + sizes)
+ input_ids = torch.randint(1, sequence_length, size=(batch_size, sequence_length), generator=generator)
+
+ pipeline_inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ }
+ if with_generator:
+ pipeline_inputs.update({"generator": generator})
+
+ return noise, input_ids, pipeline_inputs
+
+ # copied from: https://colab.research.google.com/gist/sayakpaul/df2ef6e1ae6d8c10a49d859883b10860/scratchpad.ipynb
+ def get_dummy_tokens(self):
+ max_seq_length = 77
+
+ inputs = torch.randint(2, 56, size=(1, max_seq_length), generator=torch.manual_seed(0))
+
+ prepared_inputs = {}
+ prepared_inputs["input_ids"] = inputs
+ return prepared_inputs
+
+ def check_if_lora_correctly_set(self, model) -> bool:
+ """
+ Checks if the LoRA layers are correctly set with peft
+ """
+ for module in model.modules():
+ if isinstance(module, BaseTunerLayer):
+ return True
+ return False
+
+ def test_simple_inference(self):
+ """
+ Tests a simple inference and makes sure it works as expected
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, _ = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ _, _, inputs = self.get_dummy_inputs()
+ output_no_lora = pipe(**inputs).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ def test_simple_inference_with_text_lora(self):
+ """
+ Tests a simple inference with lora attached on the text encoder
+ and makes sure it works as expected
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, _ = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ output_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(
+ not np.allclose(output_lora, output_no_lora, atol=1e-3, rtol=1e-3), "Lora should change the output"
+ )
+
+ def test_simple_inference_with_text_lora_and_scale(self):
+ """
+ Tests a simple inference with lora attached on the text encoder + scale argument
+ and makes sure it works as expected
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, _ = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ output_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(
+ not np.allclose(output_lora, output_no_lora, atol=1e-3, rtol=1e-3), "Lora should change the output"
+ )
+
+ output_lora_scale = pipe(
+ **inputs, generator=torch.manual_seed(0), cross_attention_kwargs={"scale": 0.5}
+ ).images
+ self.assertTrue(
+ not np.allclose(output_lora, output_lora_scale, atol=1e-3, rtol=1e-3),
+ "Lora + scale should change the output",
+ )
+
+ output_lora_0_scale = pipe(
+ **inputs, generator=torch.manual_seed(0), cross_attention_kwargs={"scale": 0.0}
+ ).images
+ self.assertTrue(
+ np.allclose(output_no_lora, output_lora_0_scale, atol=1e-3, rtol=1e-3),
+ "Lora + 0 scale should lead to same result as no LoRA",
+ )
+
+ def test_simple_inference_with_text_lora_fused(self):
+ """
+ Tests a simple inference with lora attached into text encoder + fuses the lora weights into base model
+ and makes sure it works as expected
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, _ = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ pipe.fuse_lora()
+ # Fusing should still keep the LoRA layers
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+
+ if self.has_two_text_encoders:
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+            output_fused = pipe(**inputs, generator=torch.manual_seed(0)).images
+            self.assertFalse(
+                np.allclose(output_fused, output_no_lora, atol=1e-3, rtol=1e-3), "Fused lora should change the output"
+            )
+
+ def test_simple_inference_with_text_lora_unloaded(self):
+ """
+ Tests a simple inference with lora attached to text encoder, then unloads the lora weights
+ and makes sure it works as expected
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, _ = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ pipe.unload_lora_weights()
+ # unloading should remove the LoRA layers
+ self.assertFalse(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly unloaded in text encoder"
+ )
+
+ if self.has_two_text_encoders:
+ self.assertFalse(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2),
+ "Lora not correctly unloaded in text encoder 2",
+ )
+
+            output_unloaded = pipe(**inputs, generator=torch.manual_seed(0)).images
+            self.assertTrue(
+                np.allclose(output_unloaded, output_no_lora, atol=1e-3, rtol=1e-3),
+                "Unloading LoRA should give the same output as no LoRA",
+            )
+
+ def test_simple_inference_with_text_lora_save_load(self):
+ """
+ Tests a simple usecase where users could use saving utilities for LoRA.
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, _ = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ images_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ text_encoder_state_dict = get_peft_model_state_dict(pipe.text_encoder)
+ if self.has_two_text_encoders:
+ text_encoder_2_state_dict = get_peft_model_state_dict(pipe.text_encoder_2)
+
+ self.pipeline_class.save_lora_weights(
+ save_directory=tmpdirname,
+ text_encoder_lora_layers=text_encoder_state_dict,
+ text_encoder_2_lora_layers=text_encoder_2_state_dict,
+ safe_serialization=False,
+ )
+ else:
+ self.pipeline_class.save_lora_weights(
+ save_directory=tmpdirname,
+ text_encoder_lora_layers=text_encoder_state_dict,
+ safe_serialization=False,
+ )
+
+ self.assertTrue(os.path.isfile(os.path.join(tmpdirname, "pytorch_lora_weights.bin")))
+ pipe.unload_lora_weights()
+
+ pipe.load_lora_weights(os.path.join(tmpdirname, "pytorch_lora_weights.bin"))
+
+ images_lora_from_pretrained = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+
+ if self.has_two_text_encoders:
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ self.assertTrue(
+ np.allclose(images_lora, images_lora_from_pretrained, atol=1e-3, rtol=1e-3),
+ "Loading from saved checkpoints should give same results.",
+ )
+
+ def test_simple_inference_save_pretrained(self):
+ """
+ Tests a simple usecase where users could use saving utilities for LoRA through save_pretrained
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, _ = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ images_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname)
+
+ pipe_from_pretrained = self.pipeline_class.from_pretrained(tmpdirname)
+ pipe_from_pretrained.to(self.torch_device)
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe_from_pretrained.text_encoder),
+ "Lora not correctly set in text encoder",
+ )
+
+ if self.has_two_text_encoders:
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe_from_pretrained.text_encoder_2),
+ "Lora not correctly set in text encoder 2",
+ )
+
+ images_lora_save_pretrained = pipe_from_pretrained(**inputs, generator=torch.manual_seed(0)).images
+
+ self.assertTrue(
+ np.allclose(images_lora, images_lora_save_pretrained, atol=1e-3, rtol=1e-3),
+ "Loading from saved checkpoints should give same results.",
+ )
+
+ def test_simple_inference_with_text_unet_lora_save_load(self):
+ """
+ Tests a simple usecase where users could use saving utilities for LoRA for Unet + text encoder
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ pipe.unet.add_adapter(unet_lora_config)
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ images_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ text_encoder_state_dict = get_peft_model_state_dict(pipe.text_encoder)
+ unet_state_dict = get_peft_model_state_dict(pipe.unet)
+ if self.has_two_text_encoders:
+ text_encoder_2_state_dict = get_peft_model_state_dict(pipe.text_encoder_2)
+
+ self.pipeline_class.save_lora_weights(
+ save_directory=tmpdirname,
+ text_encoder_lora_layers=text_encoder_state_dict,
+ text_encoder_2_lora_layers=text_encoder_2_state_dict,
+ unet_lora_layers=unet_state_dict,
+ safe_serialization=False,
+ )
+ else:
+ self.pipeline_class.save_lora_weights(
+ save_directory=tmpdirname,
+ text_encoder_lora_layers=text_encoder_state_dict,
+ unet_lora_layers=unet_state_dict,
+ safe_serialization=False,
+ )
+
+ self.assertTrue(os.path.isfile(os.path.join(tmpdirname, "pytorch_lora_weights.bin")))
+ pipe.unload_lora_weights()
+
+ pipe.load_lora_weights(os.path.join(tmpdirname, "pytorch_lora_weights.bin"))
+
+ images_lora_from_pretrained = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ self.assertTrue(
+ np.allclose(images_lora, images_lora_from_pretrained, atol=1e-3, rtol=1e-3),
+ "Loading from saved checkpoints should give same results.",
+ )
+
+ def test_simple_inference_with_text_unet_lora_and_scale(self):
+ """
+ Tests a simple inference with lora attached on the text encoder + Unet + scale argument
+ and makes sure it works as expected
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ pipe.unet.add_adapter(unet_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ output_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(
+ not np.allclose(output_lora, output_no_lora, atol=1e-3, rtol=1e-3), "Lora should change the output"
+ )
+
+ output_lora_scale = pipe(
+ **inputs, generator=torch.manual_seed(0), cross_attention_kwargs={"scale": 0.5}
+ ).images
+ self.assertTrue(
+ not np.allclose(output_lora, output_lora_scale, atol=1e-3, rtol=1e-3),
+ "Lora + scale should change the output",
+ )
+
+ output_lora_0_scale = pipe(
+ **inputs, generator=torch.manual_seed(0), cross_attention_kwargs={"scale": 0.0}
+ ).images
+ self.assertTrue(
+ np.allclose(output_no_lora, output_lora_0_scale, atol=1e-3, rtol=1e-3),
+ "Lora + 0 scale should lead to same result as no LoRA",
+ )
+
+ self.assertTrue(
+ pipe.text_encoder.text_model.encoder.layers[0].self_attn.q_proj.scaling["default"] == 1.0,
+ "The scaling parameter has not been correctly restored!",
+ )
+
+ def test_simple_inference_with_text_lora_unet_fused(self):
+ """
+ Tests a simple inference with lora attached into text encoder + fuses the lora weights into base model
+ and makes sure it works as expected - with unet
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ pipe.unet.add_adapter(unet_lora_config)
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ pipe.fuse_lora()
+ # Fusing should still keep the LoRA layers
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in unet")
+
+ if self.has_two_text_encoders:
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+            output_fused = pipe(**inputs, generator=torch.manual_seed(0)).images
+            self.assertFalse(
+                np.allclose(output_fused, output_no_lora, atol=1e-3, rtol=1e-3), "Fused lora should change the output"
+            )
+
+ def test_simple_inference_with_text_unet_lora_unloaded(self):
+ """
+ Tests a simple inference with lora attached to text encoder and unet, then unloads the lora weights
+ and makes sure it works as expected
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ pipe.unet.add_adapter(unet_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ pipe.unload_lora_weights()
+ # unloading should remove the LoRA layers
+ self.assertFalse(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly unloaded in text encoder"
+ )
+ self.assertFalse(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly unloaded in Unet")
+
+ if self.has_two_text_encoders:
+ self.assertFalse(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2),
+ "Lora not correctly unloaded in text encoder 2",
+ )
+
+            output_unloaded = pipe(**inputs, generator=torch.manual_seed(0)).images
+            self.assertTrue(
+                np.allclose(output_unloaded, output_no_lora, atol=1e-3, rtol=1e-3),
+                "Unloading LoRA should give the same output as no LoRA",
+            )
+
+ def test_simple_inference_with_text_unet_lora_unfused(self):
+ """
+ Tests a simple inference with lora attached to text encoder and unet, then unloads the lora weights
+ and makes sure it works as expected
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ pipe.unet.add_adapter(unet_lora_config)
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ pipe.fuse_lora()
+
+ output_fused_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.unfuse_lora()
+
+ output_unfused_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ # unfusing should still keep the LoRA layers
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Unfuse should still keep LoRA layers"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Unfuse should still keep LoRA layers")
+
+ if self.has_two_text_encoders:
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Unfuse should still keep LoRA layers"
+ )
+
+ # Fuse and unfuse should lead to the same results
+ self.assertTrue(
+ np.allclose(output_fused_lora, output_unfused_lora, atol=1e-3, rtol=1e-3),
+ "Fused lora should change the output",
+ )
+
+ def test_simple_inference_with_text_unet_multi_adapter(self):
+ """
+ Tests a simple inference with lora attached to text encoder and unet, attaches
+ multiple adapters and sets them
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-2")
+
+ pipe.unet.add_adapter(unet_lora_config, "adapter-1")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-2")
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-1")
+ pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-2")
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ pipe.set_adapters("adapter-1")
+
+ output_adapter_1 = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.set_adapters("adapter-2")
+ output_adapter_2 = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.set_adapters(["adapter-1", "adapter-2"])
+
+ output_adapter_mixed = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ # Different adapters (and their combination) should give different results
+ self.assertFalse(
+ np.allclose(output_adapter_1, output_adapter_2, atol=1e-3, rtol=1e-3),
+ "Adapter 1 and 2 should give different results",
+ )
+
+ self.assertFalse(
+ np.allclose(output_adapter_1, output_adapter_mixed, atol=1e-3, rtol=1e-3),
+ "Adapter 1 and mixed adapters should give different results",
+ )
+
+ self.assertFalse(
+ np.allclose(output_adapter_2, output_adapter_mixed, atol=1e-3, rtol=1e-3),
+ "Adapter 2 and mixed adapters should give different results",
+ )
+
+ pipe.disable_lora()
+
+ output_disabled = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ self.assertTrue(
+ np.allclose(output_no_lora, output_disabled, atol=1e-3, rtol=1e-3),
+ "output with no lora and output with lora disabled should give same results",
+ )
+
+ def test_simple_inference_with_text_unet_multi_adapter_delete_adapter(self):
+ """
+ Tests a simple inference with lora attached to text encoder and unet, attaches
+ multiple adapters and sets/deletes them
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-2")
+
+ pipe.unet.add_adapter(unet_lora_config, "adapter-1")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-2")
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-1")
+ pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-2")
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ pipe.set_adapters("adapter-1")
+
+ output_adapter_1 = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.set_adapters("adapter-2")
+ output_adapter_2 = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.set_adapters(["adapter-1", "adapter-2"])
+
+ output_adapter_mixed = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ self.assertFalse(
+ np.allclose(output_adapter_1, output_adapter_2, atol=1e-3, rtol=1e-3),
+ "Adapter 1 and 2 should give different results",
+ )
+
+ self.assertFalse(
+ np.allclose(output_adapter_1, output_adapter_mixed, atol=1e-3, rtol=1e-3),
+ "Adapter 1 and mixed adapters should give different results",
+ )
+
+ self.assertFalse(
+ np.allclose(output_adapter_2, output_adapter_mixed, atol=1e-3, rtol=1e-3),
+ "Adapter 2 and mixed adapters should give different results",
+ )
+
+ pipe.delete_adapters("adapter-1")
+ output_deleted_adapter_1 = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ self.assertTrue(
+ np.allclose(output_deleted_adapter_1, output_adapter_2, atol=1e-3, rtol=1e-3),
+ "Adapter 1 and 2 should give different results",
+ )
+
+ pipe.delete_adapters("adapter-2")
+ output_deleted_adapters = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ self.assertTrue(
+ np.allclose(output_no_lora, output_deleted_adapters, atol=1e-3, rtol=1e-3),
+ "output with no lora and output with lora disabled should give same results",
+ )
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-2")
+
+ pipe.unet.add_adapter(unet_lora_config, "adapter-1")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-2")
+
+ pipe.set_adapters(["adapter-1", "adapter-2"])
+ pipe.delete_adapters(["adapter-1", "adapter-2"])
+
+ output_deleted_adapters = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ self.assertTrue(
+ np.allclose(output_no_lora, output_deleted_adapters, atol=1e-3, rtol=1e-3),
+ "output with no lora and output with lora disabled should give same results",
+ )
+
+ def test_simple_inference_with_text_unet_multi_adapter_weighted(self):
+ """
+ Tests a simple inference with lora attached to text encoder and unet, attaches
+ multiple adapters and sets them with different weights
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-2")
+
+ pipe.unet.add_adapter(unet_lora_config, "adapter-1")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-2")
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-1")
+ pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-2")
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ pipe.set_adapters("adapter-1")
+
+ output_adapter_1 = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.set_adapters("adapter-2")
+ output_adapter_2 = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.set_adapters(["adapter-1", "adapter-2"])
+
+ output_adapter_mixed = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ # Different adapters (and their combination) should give different results
+ self.assertFalse(
+ np.allclose(output_adapter_1, output_adapter_2, atol=1e-3, rtol=1e-3),
+ "Adapter 1 and 2 should give different results",
+ )
+
+ self.assertFalse(
+ np.allclose(output_adapter_1, output_adapter_mixed, atol=1e-3, rtol=1e-3),
+ "Adapter 1 and mixed adapters should give different results",
+ )
+
+ self.assertFalse(
+ np.allclose(output_adapter_2, output_adapter_mixed, atol=1e-3, rtol=1e-3),
+ "Adapter 2 and mixed adapters should give different results",
+ )
+
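+ # Passing per-adapter weights scales each LoRA's contribution, so the weighted
+ # combination should differ from the default equally-weighted mix above.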
+ pipe.set_adapters(["adapter-1", "adapter-2"], [0.5, 0.6])
+ output_adapter_mixed_weighted = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ self.assertFalse(
+ np.allclose(output_adapter_mixed_weighted, output_adapter_mixed, atol=1e-3, rtol=1e-3),
+ "Weighted adapter and mixed adapter should give different results",
+ )
+
+ pipe.disable_lora()
+
+ output_disabled = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ self.assertTrue(
+ np.allclose(output_no_lora, output_disabled, atol=1e-3, rtol=1e-3),
+ "output with no lora and output with lora disabled should give same results",
+ )
+
+ def test_lora_fuse_nan(self):
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
+
+ pipe.unet.add_adapter(unet_lora_config, "adapter-1")
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ # corrupt one LoRA weight with `inf` values
+ with torch.no_grad():
+ pipe.unet.mid_block.attentions[0].transformer_blocks[0].attn1.to_q.lora_A["adapter-1"].weight += float(
+ "inf"
+ )
+
+ # with `safe_fusing=True`, fusing should detect the non-finite weights and raise a ValueError
+ with self.assertRaises(ValueError):
+ pipe.fuse_lora(safe_fusing=True)
+
+ # without safe fusing no error is raised, but the corrupted weights propagate and every output value becomes NaN
+ pipe.fuse_lora(safe_fusing=False)
+
+ out = pipe("test", num_inference_steps=2, output_type="np").images
+
+ self.assertTrue(np.isnan(out).all())
+
+ def test_get_adapters(self):
+ """
+ Tests a simple use case where we attach multiple adapters and check that
+ `get_active_adapters` returns the expected active adapters
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-1")
+
+ adapter_names = pipe.get_active_adapters()
+ self.assertListEqual(adapter_names, ["adapter-1"])
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-2")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-2")
+
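+ # The most recently added adapter becomes the active one, so only "adapter-2" is reported as active here.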
+ adapter_names = pipe.get_active_adapters()
+ self.assertListEqual(adapter_names, ["adapter-2"])
+
+ pipe.set_adapters(["adapter-1", "adapter-2"])
+ self.assertListEqual(pipe.get_active_adapters(), ["adapter-1", "adapter-2"])
+
+ def test_get_list_adapters(self):
+ """
+ Tests a simple use case where we attach multiple adapters and check that
+ `get_list_adapters` returns the expected mapping of components to adapters
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-1")
+
+ adapter_names = pipe.get_list_adapters()
+ self.assertDictEqual(adapter_names, {"text_encoder": ["adapter-1"], "unet": ["adapter-1"]})
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-2")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-2")
+
+ adapter_names = pipe.get_list_adapters()
+ self.assertDictEqual(
+ adapter_names, {"text_encoder": ["adapter-1", "adapter-2"], "unet": ["adapter-1", "adapter-2"]}
+ )
+
+ pipe.set_adapters(["adapter-1", "adapter-2"])
+ self.assertDictEqual(
+ pipe.get_list_adapters(),
+ {"unet": ["adapter-1", "adapter-2"], "text_encoder": ["adapter-1", "adapter-2"]},
+ )
+
+ pipe.unet.add_adapter(unet_lora_config, "adapter-3")
+ self.assertDictEqual(
+ pipe.get_list_adapters(),
+ {"unet": ["adapter-1", "adapter-2", "adapter-3"], "text_encoder": ["adapter-1", "adapter-2"]},
+ )
+
+ @require_peft_version_greater(peft_version="0.6.2")
+ def test_simple_inference_with_text_lora_unet_fused_multi(self):
+ """
+ Tests a simple inference with lora attached to the text encoder and unet, fuses the lora weights into the
+ base model and makes sure it works as expected - for the multi-adapter case
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(output_no_lora.shape == (1, 64, 64, 3))
+
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-1")
+
+ # Attach a second adapter
+ pipe.text_encoder.add_adapter(text_lora_config, "adapter-2")
+ pipe.unet.add_adapter(unet_lora_config, "adapter-2")
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-1")
+ pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-2")
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ # set them to multi-adapter inference mode
+ pipe.set_adapters(["adapter-1", "adapter-2"])
+ outputs_all_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.set_adapters(["adapter-1"])
+ outputs_lora_1 = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ pipe.fuse_lora(adapter_names=["adapter-1"])
+
+ # Fusing adapter-1 should not change the output compared to running with only adapter-1 active
+ outputs_lora_1_fused = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ self.assertTrue(
+ np.allclose(outputs_lora_1, outputs_lora_1_fused, atol=1e-3, rtol=1e-3),
+ "Fused lora should not change the output",
+ )
+
+ pipe.unfuse_lora()
+ pipe.fuse_lora(adapter_names=["adapter-2", "adapter-1"])
+
+ # Fusing both adapters should match the multi-adapter (unfused) output
+ output_all_lora_fused = pipe(**inputs, generator=torch.manual_seed(0)).images
+ self.assertTrue(
+ np.allclose(output_all_lora_fused, outputs_all_lora, atol=1e-3, rtol=1e-3),
+ "Fused lora should not change the output",
+ )
+
+ @unittest.skip("This is failing for now - need to investigate")
+ def test_simple_inference_with_text_unet_lora_unfused_torch_compile(self):
+ """
+ Tests a simple inference with lora attached to text encoder and unet, compiles the models with
+ torch.compile and makes sure inference still works as expected
+ """
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _, _, inputs = self.get_dummy_inputs(with_generator=False)
+
+ pipe.text_encoder.add_adapter(text_lora_config)
+ pipe.unet.add_adapter(unet_lora_config)
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
+ )
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2.add_adapter(text_lora_config)
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2"
+ )
+
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+ pipe.text_encoder = torch.compile(pipe.text_encoder, mode="reduce-overhead", fullgraph=True)
+
+ if self.has_two_text_encoders:
+ pipe.text_encoder_2 = torch.compile(pipe.text_encoder_2, mode="reduce-overhead", fullgraph=True)
+
+ # Just makes sure inference runs without errors.
+ _ = pipe(**inputs, generator=torch.manual_seed(0)).images
+
+ def test_modify_padding_mode(self):
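+ # Switching every Conv2d to circular padding (commonly used for seamless / tileable outputs)
+ # should not break inference.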
+ def set_pad_mode(network, mode="circular"):
+ for _, module in network.named_modules():
+ if isinstance(module, torch.nn.Conv2d):
+ module.padding_mode = mode
+
+ for scheduler_cls in [DDIMScheduler, LCMScheduler]:
+ components, _, _ = self.get_dummy_components(scheduler_cls)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(self.torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ _pad_mode = "circular"
+ set_pad_mode(pipe.vae, _pad_mode)
+ set_pad_mode(pipe.unet, _pad_mode)
+
+ _, _, inputs = self.get_dummy_inputs()
+ _ = pipe(**inputs).images
+
+
+class StableDiffusionLoRATests(PeftLoraLoaderMixinTests, unittest.TestCase):
+ pipeline_class = StableDiffusionPipeline
+ scheduler_cls = DDIMScheduler
+ scheduler_kwargs = {
+ "beta_start": 0.00085,
+ "beta_end": 0.012,
+ "beta_schedule": "scaled_linear",
+ "clip_sample": False,
+ "set_alpha_to_one": False,
+ "steps_offset": 1,
+ }
+ unet_kwargs = {
+ "block_out_channels": (32, 64),
+ "layers_per_block": 2,
+ "sample_size": 32,
+ "in_channels": 4,
+ "out_channels": 4,
+ "down_block_types": ("DownBlock2D", "CrossAttnDownBlock2D"),
+ "up_block_types": ("CrossAttnUpBlock2D", "UpBlock2D"),
+ "cross_attention_dim": 32,
+ }
+ vae_kwargs = {
+ "block_out_channels": [32, 64],
+ "in_channels": 3,
+ "out_channels": 3,
+ "down_block_types": ["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ "up_block_types": ["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ "latent_channels": 4,
+ }
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ @slow
+ @require_torch_gpu
+ def test_integration_move_lora_cpu(self):
+ path = "runwayml/stable-diffusion-v1-5"
+ lora_id = "takuma104/lora-test-text-encoder-lora-target"
+
+ pipe = StableDiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16)
+ pipe.load_lora_weights(lora_id, adapter_name="adapter-1")
+ pipe.load_lora_weights(lora_id, adapter_name="adapter-2")
+ pipe = pipe.to("cuda")
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder),
+ "Lora not correctly set in text encoder",
+ )
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.unet),
+ "Lora not correctly set in text encoder",
+ )
+
+ # We will offload the first adapter to the CPU and check that the offloading
+ # has been performed correctly
+ pipe.set_lora_device(["adapter-1"], "cpu")
+
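+ # Only the LoRA parameters of "adapter-1" should now live on the CPU; "adapter-2" and the base weights stay on the GPU.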
+ for name, module in pipe.unet.named_modules():
+ if "adapter-1" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
+ self.assertTrue(module.weight.device == torch.device("cpu"))
+ elif "adapter-2" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
+ self.assertTrue(module.weight.device != torch.device("cpu"))
+
+ for name, module in pipe.text_encoder.named_modules():
+ if "adapter-1" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
+ self.assertTrue(module.weight.device == torch.device("cpu"))
+ elif "adapter-2" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
+ self.assertTrue(module.weight.device != torch.device("cpu"))
+
+ pipe.set_lora_device(["adapter-1"], 0)
+
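+ # Passing an integer device index (0 -> cuda:0) should move "adapter-1" back to the GPU.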
+ for n, m in pipe.unet.named_modules():
+ if "adapter-1" in n and not isinstance(m, (nn.Dropout, nn.Identity)):
+ self.assertTrue(m.weight.device != torch.device("cpu"))
+
+ for n, m in pipe.text_encoder.named_modules():
+ if "adapter-1" in n and not isinstance(m, (nn.Dropout, nn.Identity)):
+ self.assertTrue(m.weight.device != torch.device("cpu"))
+
+ pipe.set_lora_device(["adapter-1", "adapter-2"], "cuda")
+
+ for n, m in pipe.unet.named_modules():
+ if ("adapter-1" in n or "adapter-2" in n) and not isinstance(m, (nn.Dropout, nn.Identity)):
+ self.assertTrue(m.weight.device != torch.device("cpu"))
+
+ for n, m in pipe.text_encoder.named_modules():
+ if ("adapter-1" in n or "adapter-2" in n) and not isinstance(m, (nn.Dropout, nn.Identity)):
+ self.assertTrue(m.weight.device != torch.device("cpu"))
+
+ @slow
+ @require_torch_gpu
+ def test_integration_logits_with_scale(self):
+ path = "runwayml/stable-diffusion-v1-5"
+ lora_id = "takuma104/lora-test-text-encoder-lora-target"
+
+ pipe = StableDiffusionPipeline.from_pretrained(path, torch_dtype=torch.float32)
+ pipe.load_lora_weights(lora_id)
+ pipe = pipe.to("cuda")
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder),
+ "Lora not correctly set in text encoder 2",
+ )
+
+ prompt = "a red sks dog"
+
+ images = pipe(
+ prompt=prompt,
+ num_inference_steps=15,
+ cross_attention_kwargs={"scale": 0.5},
+ generator=torch.manual_seed(0),
+ output_type="np",
+ ).images
+
+ expected_slice_scale = np.array([0.307, 0.283, 0.310, 0.310, 0.300, 0.314, 0.336, 0.314, 0.321])
+
+ predicted_slice = images[0, -3:, -3:, -1].flatten()
+
+ self.assertTrue(np.allclose(expected_slice_scale, predicted_slice, atol=1e-3, rtol=1e-3))
+
+ @slow
+ @require_torch_gpu
+ def test_integration_logits_no_scale(self):
+ path = "runwayml/stable-diffusion-v1-5"
+ lora_id = "takuma104/lora-test-text-encoder-lora-target"
+
+ pipe = StableDiffusionPipeline.from_pretrained(path, torch_dtype=torch.float32)
+ pipe.load_lora_weights(lora_id)
+ pipe = pipe.to("cuda")
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.text_encoder),
+ "Lora not correctly set in text encoder",
+ )
+
+ prompt = "a red sks dog"
+
+ images = pipe(prompt=prompt, num_inference_steps=30, generator=torch.manual_seed(0), output_type="np").images
+
+ expected_slice_scale = np.array([0.074, 0.064, 0.073, 0.0842, 0.069, 0.0641, 0.0794, 0.076, 0.084])
+
+ predicted_slice = images[0, -3:, -3:, -1].flatten()
+
+ self.assertTrue(np.allclose(expected_slice_scale, predicted_slice, atol=1e-3, rtol=1e-3))
+
+ @nightly
+ @require_torch_gpu
+ def test_integration_logits_multi_adapter(self):
+ path = "stabilityai/stable-diffusion-xl-base-1.0"
+ lora_id = "CiroN2022/toy-face"
+
+ pipe = StableDiffusionXLPipeline.from_pretrained(path, torch_dtype=torch.float16)
+ pipe.load_lora_weights(lora_id, weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
+ pipe = pipe.to("cuda")
+
+ self.assertTrue(
+ self.check_if_lora_correctly_set(pipe.unet),
+ "Lora not correctly set in Unet",
+ )
+
+ prompt = "toy_face of a hacker with a hoodie"
+
+ lora_scale = 0.9
+
+ images = pipe(
+ prompt=prompt,
+ num_inference_steps=30,
+ generator=torch.manual_seed(0),
+ cross_attention_kwargs={"scale": lora_scale},
+ output_type="np",
+ ).images
+ expected_slice_scale = np.array([0.538, 0.539, 0.540, 0.540, 0.542, 0.539, 0.538, 0.541, 0.539])
+
+ predicted_slice = images[0, -3:, -3:, -1].flatten()
+ self.assertTrue(np.allclose(expected_slice_scale, predicted_slice, atol=1e-3, rtol=1e-3))
+
+ pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
+ pipe.set_adapters("pixel")
+
+ prompt = "pixel art, a hacker with a hoodie, simple, flat colors"
+ images = pipe(
+ prompt,
+ num_inference_steps=30,
+ guidance_scale=7.5,
+ cross_attention_kwargs={"scale": lora_scale},
+ generator=torch.manual_seed(0),
+ output_type="np",
+ ).images
+
+ predicted_slice = images[0, -3:, -3:, -1].flatten()
+ expected_slice_scale = np.array(
+ [0.61973065, 0.62018543, 0.62181497, 0.61933696, 0.6208608, 0.620576, 0.6200281, 0.62258327, 0.6259889]
+ )
+ self.assertTrue(np.allclose(expected_slice_scale, predicted_slice, atol=1e-3, rtol=1e-3))
+
+ # multi-adapter inference
+ pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0])
+ images = pipe(
+ prompt,
+ num_inference_steps=30,
+ guidance_scale=7.5,
+ cross_attention_kwargs={"scale": 1.0},
+ generator=torch.manual_seed(0),
+ output_type="np",
+ ).images
+ predicted_slice = images[0, -3:, -3:, -1].flatten()
+ expected_slice_scale = np.array([0.5888, 0.5897, 0.5946, 0.5888, 0.5935, 0.5946, 0.5857, 0.5891, 0.5909])
+ self.assertTrue(np.allclose(expected_slice_scale, predicted_slice, atol=1e-3, rtol=1e-3))
+
+ # Lora disabled
+ pipe.disable_lora()
+ images = pipe(
+ prompt,
+ num_inference_steps=30,
+ guidance_scale=7.5,
+ cross_attention_kwargs={"scale": lora_scale},
+ generator=torch.manual_seed(0),
+ output_type="np",
+ ).images
+ predicted_slice = images[0, -3:, -3:, -1].flatten()
+ expected_slice_scale = np.array([0.5456, 0.5466, 0.5487, 0.5458, 0.5469, 0.5454, 0.5446, 0.5479, 0.5487])
+ self.assertTrue(np.allclose(expected_slice_scale, predicted_slice, atol=1e-3, rtol=1e-3))
+
+
+class StableDiffusionXLLoRATests(PeftLoraLoaderMixinTests, unittest.TestCase):
+ has_two_text_encoders = True
+ pipeline_class = StableDiffusionXLPipeline
+ scheduler_cls = EulerDiscreteScheduler
+ scheduler_kwargs = {
+ "beta_start": 0.00085,
+ "beta_end": 0.012,
+ "beta_schedule": "scaled_linear",
+ "timestep_spacing": "leading",
+ "steps_offset": 1,
+ }
+ unet_kwargs = {
+ "block_out_channels": (32, 64),
+ "layers_per_block": 2,
+ "sample_size": 32,
+ "in_channels": 4,
+ "out_channels": 4,
+ "down_block_types": ("DownBlock2D", "CrossAttnDownBlock2D"),
+ "up_block_types": ("CrossAttnUpBlock2D", "UpBlock2D"),
+ "attention_head_dim": (2, 4),
+ "use_linear_projection": True,
+ "addition_embed_type": "text_time",
+ "addition_time_embed_dim": 8,
+ "transformer_layers_per_block": (1, 2),
+ "projection_class_embeddings_input_dim": 80, # 6 * 8 + 32
+ "cross_attention_dim": 64,
+ }
+ vae_kwargs = {
+ "block_out_channels": [32, 64],
+ "in_channels": 3,
+ "out_channels": 3,
+ "down_block_types": ["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ "up_block_types": ["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ "latent_channels": 4,
+ "sample_size": 128,
+ }
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+
+@slow
+@require_torch_gpu
+class LoraIntegrationTests(PeftLoraLoaderMixinTests, unittest.TestCase):
+ pipeline_class = StableDiffusionPipeline
+ scheduler_cls = DDIMScheduler
+ scheduler_kwargs = {
+ "beta_start": 0.00085,
+ "beta_end": 0.012,
+ "beta_schedule": "scaled_linear",
+ "clip_sample": False,
+ "set_alpha_to_one": False,
+ "steps_offset": 1,
+ }
+ unet_kwargs = {
+ "block_out_channels": (32, 64),
+ "layers_per_block": 2,
+ "sample_size": 32,
+ "in_channels": 4,
+ "out_channels": 4,
+ "down_block_types": ("DownBlock2D", "CrossAttnDownBlock2D"),
+ "up_block_types": ("CrossAttnUpBlock2D", "UpBlock2D"),
+ "cross_attention_dim": 32,
+ }
+ vae_kwargs = {
+ "block_out_channels": [32, 64],
+ "in_channels": 3,
+ "out_channels": 3,
+ "down_block_types": ["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ "up_block_types": ["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ "latent_channels": 4,
+ }
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_dreambooth_old_format(self):
+ generator = torch.Generator("cpu").manual_seed(0)
+
+ lora_model_id = "hf-internal-testing/lora_dreambooth_dog_example"
+ card = RepoCard.load(lora_model_id)
+ base_model_id = card.data.to_dict()["base_model"]
+
+ pipe = StableDiffusionPipeline.from_pretrained(base_model_id, safety_checker=None)
+ pipe = pipe.to(torch_device)
+ pipe.load_lora_weights(lora_model_id)
+
+ images = pipe(
+ "A photo of a sks dog floating in the river", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+
+ expected = np.array([0.7207, 0.6787, 0.6010, 0.7478, 0.6838, 0.6064, 0.6984, 0.6443, 0.5785])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-4))
+ release_memory(pipe)
+
+ def test_dreambooth_text_encoder_new_format(self):
+ generator = torch.Generator().manual_seed(0)
+
+ lora_model_id = "hf-internal-testing/lora-trained"
+ card = RepoCard.load(lora_model_id)
+ base_model_id = card.data.to_dict()["base_model"]
+
+ pipe = StableDiffusionPipeline.from_pretrained(base_model_id, safety_checker=None)
+ pipe = pipe.to(torch_device)
+ pipe.load_lora_weights(lora_model_id)
+
+ images = pipe("A photo of a sks dog", output_type="np", generator=generator, num_inference_steps=2).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+
+ expected = np.array([0.6628, 0.6138, 0.5390, 0.6625, 0.6130, 0.5463, 0.6166, 0.5788, 0.5359])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-4))
+ release_memory(pipe)
+
+ def test_a1111(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = StableDiffusionPipeline.from_pretrained("hf-internal-testing/Counterfeit-V2.5", safety_checker=None).to(
+ torch_device
+ )
+ lora_model_id = "hf-internal-testing/civitai-light-shadow-lora"
+ lora_filename = "light_and_shadow.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.3636, 0.3708, 0.3694, 0.3679, 0.3829, 0.3677, 0.3692, 0.3688, 0.3292])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipe)
+
+ def test_lycoris(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/Amixx", safety_checker=None, use_safetensors=True, variant="fp16"
+ ).to(torch_device)
+ lora_model_id = "hf-internal-testing/edgLycorisMugler-light"
+ lora_filename = "edgLycorisMugler-light.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.6463, 0.658, 0.599, 0.6542, 0.6512, 0.6213, 0.658, 0.6485, 0.6017])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipe)
+
+ def test_a1111_with_model_cpu_offload(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = StableDiffusionPipeline.from_pretrained("hf-internal-testing/Counterfeit-V2.5", safety_checker=None)
+ pipe.enable_model_cpu_offload()
+ lora_model_id = "hf-internal-testing/civitai-light-shadow-lora"
+ lora_filename = "light_and_shadow.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.3636, 0.3708, 0.3694, 0.3679, 0.3829, 0.3677, 0.3692, 0.3688, 0.3292])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipe)
+
+ def test_a1111_with_sequential_cpu_offload(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = StableDiffusionPipeline.from_pretrained("hf-internal-testing/Counterfeit-V2.5", safety_checker=None)
+ pipe.enable_sequential_cpu_offload()
+ lora_model_id = "hf-internal-testing/civitai-light-shadow-lora"
+ lora_filename = "light_and_shadow.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.3636, 0.3708, 0.3694, 0.3679, 0.3829, 0.3677, 0.3692, 0.3688, 0.3292])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipe)
+
+ def test_kohya_sd_v15_with_higher_dimensions(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", safety_checker=None).to(
+ torch_device
+ )
+ lora_model_id = "hf-internal-testing/urushisato-lora"
+ lora_filename = "urushisato_v15.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.7165, 0.6616, 0.5833, 0.7504, 0.6718, 0.587, 0.6871, 0.6361, 0.5694])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipe)
+
+ def test_vanilla_finetuning(self):
+ generator = torch.Generator().manual_seed(0)
+
+ lora_model_id = "hf-internal-testing/sd-model-finetuned-lora-t4"
+ card = RepoCard.load(lora_model_id)
+ base_model_id = card.data.to_dict()["base_model"]
+
+ pipe = StableDiffusionPipeline.from_pretrained(base_model_id, safety_checker=None)
+ pipe = pipe.to(torch_device)
+ pipe.load_lora_weights(lora_model_id)
+
+ images = pipe("A pokemon with blue eyes.", output_type="np", generator=generator, num_inference_steps=2).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+
+ expected = np.array([0.7406, 0.699, 0.5963, 0.7493, 0.7045, 0.6096, 0.6886, 0.6388, 0.583])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-4))
+ release_memory(pipe)
+
+ def test_unload_kohya_lora(self):
+ generator = torch.manual_seed(0)
+ prompt = "masterpiece, best quality, mountain"
+ num_inference_steps = 2
+
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", safety_checker=None).to(
+ torch_device
+ )
+ initial_images = pipe(
+ prompt, output_type="np", generator=generator, num_inference_steps=num_inference_steps
+ ).images
+ initial_images = initial_images[0, -3:, -3:, -1].flatten()
+
+ lora_model_id = "hf-internal-testing/civitai-colored-icons-lora"
+ lora_filename = "Colored_Icons_by_vizsumit.safetensors"
+
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+ generator = torch.manual_seed(0)
+ lora_images = pipe(
+ prompt, output_type="np", generator=generator, num_inference_steps=num_inference_steps
+ ).images
+ lora_images = lora_images[0, -3:, -3:, -1].flatten()
+
+ pipe.unload_lora_weights()
+ generator = torch.manual_seed(0)
+ unloaded_lora_images = pipe(
+ prompt, output_type="np", generator=generator, num_inference_steps=num_inference_steps
+ ).images
+ unloaded_lora_images = unloaded_lora_images[0, -3:, -3:, -1].flatten()
+
+ self.assertFalse(np.allclose(initial_images, lora_images))
+ self.assertTrue(np.allclose(initial_images, unloaded_lora_images, atol=1e-3))
+ release_memory(pipe)
+
+ def test_load_unload_load_kohya_lora(self):
+ # This test ensures that a Kohya-style LoRA can be safely unloaded and then loaded
+ # without introducing any side-effects. Even though the test uses a Kohya-style
+ # LoRA, the underlying adapter handling mechanism is format-agnostic.
+ generator = torch.manual_seed(0)
+ prompt = "masterpiece, best quality, mountain"
+ num_inference_steps = 2
+
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", safety_checker=None).to(
+ torch_device
+ )
+ initial_images = pipe(
+ prompt, output_type="np", generator=generator, num_inference_steps=num_inference_steps
+ ).images
+ initial_images = initial_images[0, -3:, -3:, -1].flatten()
+
+ lora_model_id = "hf-internal-testing/civitai-colored-icons-lora"
+ lora_filename = "Colored_Icons_by_vizsumit.safetensors"
+
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+ generator = torch.manual_seed(0)
+ lora_images = pipe(
+ prompt, output_type="np", generator=generator, num_inference_steps=num_inference_steps
+ ).images
+ lora_images = lora_images[0, -3:, -3:, -1].flatten()
+
+ pipe.unload_lora_weights()
+ generator = torch.manual_seed(0)
+ unloaded_lora_images = pipe(
+ prompt, output_type="np", generator=generator, num_inference_steps=num_inference_steps
+ ).images
+ unloaded_lora_images = unloaded_lora_images[0, -3:, -3:, -1].flatten()
+
+ self.assertFalse(np.allclose(initial_images, lora_images))
+ self.assertTrue(np.allclose(initial_images, unloaded_lora_images, atol=1e-3))
+
+ # make sure we can load a LoRA again after unloading and that it doesn't have
+ # any undesired side effects.
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+ generator = torch.manual_seed(0)
+ lora_images_again = pipe(
+ prompt, output_type="np", generator=generator, num_inference_steps=num_inference_steps
+ ).images
+ lora_images_again = lora_images_again[0, -3:, -3:, -1].flatten()
+
+ self.assertTrue(np.allclose(lora_images, lora_images_again, atol=1e-3))
+ release_memory(pipe)
+
+ def test_not_empty_state_dict(self):
+ # Makes sure https://github.com/huggingface/diffusers/issues/7054 does not happen again
+ pipe = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+ ).to("cuda")
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+ cached_file = hf_hub_download("hf-internal-testing/lcm-lora-test-sd-v1-5", "test_lora.safetensors")
+ lcm_lora = load_file(cached_file)
+
+ pipe.load_lora_weights(lcm_lora, adapter_name="lcm")
+ self.assertTrue(lcm_lora != {})
+ release_memory(pipe)
+
+ def test_load_unload_load_state_dict(self):
+ # Makes sure https://github.com/huggingface/diffusers/issues/7054 does not happen again
+ pipe = AutoPipelineForText2Image.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+ ).to("cuda")
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+ cached_file = hf_hub_download("hf-internal-testing/lcm-lora-test-sd-v1-5", "test_lora.safetensors")
+ lcm_lora = load_file(cached_file)
+ previous_state_dict = lcm_lora.copy()
+
+ pipe.load_lora_weights(lcm_lora, adapter_name="lcm")
+ self.assertDictEqual(lcm_lora, previous_state_dict)
+
+ pipe.unload_lora_weights()
+ pipe.load_lora_weights(lcm_lora, adapter_name="lcm")
+ self.assertDictEqual(lcm_lora, previous_state_dict)
+
+ release_memory(pipe)
+
+
+@slow
+@require_torch_gpu
+class LoraSDXLIntegrationTests(PeftLoraLoaderMixinTests, unittest.TestCase):
+ has_two_text_encoders = True
+ pipeline_class = StableDiffusionXLPipeline
+ scheduler_cls = EulerDiscreteScheduler
+ scheduler_kwargs = {
+ "beta_start": 0.00085,
+ "beta_end": 0.012,
+ "beta_schedule": "scaled_linear",
+ "timestep_spacing": "leading",
+ "steps_offset": 1,
+ }
+ unet_kwargs = {
+ "block_out_channels": (32, 64),
+ "layers_per_block": 2,
+ "sample_size": 32,
+ "in_channels": 4,
+ "out_channels": 4,
+ "down_block_types": ("DownBlock2D", "CrossAttnDownBlock2D"),
+ "up_block_types": ("CrossAttnUpBlock2D", "UpBlock2D"),
+ "attention_head_dim": (2, 4),
+ "use_linear_projection": True,
+ "addition_embed_type": "text_time",
+ "addition_time_embed_dim": 8,
+ "transformer_layers_per_block": (1, 2),
+ "projection_class_embeddings_input_dim": 80, # 6 * 8 + 32
+ "cross_attention_dim": 64,
+ }
+ vae_kwargs = {
+ "block_out_channels": [32, 64],
+ "in_channels": 3,
+ "out_channels": 3,
+ "down_block_types": ["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ "up_block_types": ["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ "latent_channels": 4,
+ "sample_size": 128,
+ }
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_sdxl_0_9_lora_one(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9")
+ lora_model_id = "hf-internal-testing/sdxl-0.9-daiton-lora"
+ lora_filename = "daiton-xl-lora-test.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+ pipe.enable_model_cpu_offload()
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.3838, 0.3482, 0.3588, 0.3162, 0.319, 0.3369, 0.338, 0.3366, 0.3213])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipe)
+
+ def test_sdxl_0_9_lora_two(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9")
+ lora_model_id = "hf-internal-testing/sdxl-0.9-costumes-lora"
+ lora_filename = "saijo.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+ pipe.enable_model_cpu_offload()
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.3137, 0.3269, 0.3355, 0.255, 0.2577, 0.2563, 0.2679, 0.2758, 0.2626])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipe)
+
+ def test_sdxl_0_9_lora_three(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9")
+ lora_model_id = "hf-internal-testing/sdxl-0.9-kamepan-lora"
+ lora_filename = "kame_sdxl_v2-000020-16rank.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+ pipe.enable_model_cpu_offload()
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.4015, 0.3761, 0.3616, 0.3745, 0.3462, 0.3337, 0.3564, 0.3649, 0.3468])
+
+ self.assertTrue(np.allclose(images, expected, atol=5e-3))
+ release_memory(pipe)
+
+ def test_sdxl_1_0_lora(self):
+ generator = torch.Generator("cpu").manual_seed(0)
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
+ pipe.enable_model_cpu_offload()
+ lora_model_id = "hf-internal-testing/sdxl-1.0-lora"
+ lora_filename = "sd_xl_offset_example-lora_1.0.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.4468, 0.4087, 0.4134, 0.366, 0.3202, 0.3505, 0.3786, 0.387, 0.3535])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-4))
+ release_memory(pipe)
+
+ def test_sdxl_lcm_lora(self):
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+ pipe.enable_model_cpu_offload()
+
+ generator = torch.Generator("cpu").manual_seed(0)
+
+ lora_model_id = "latent-consistency/lcm-lora-sdxl"
+
+ pipe.load_lora_weights(lora_model_id)
+
+ image = pipe(
+ "masterpiece, best quality, mountain", generator=generator, num_inference_steps=4, guidance_scale=0.5
+ ).images[0]
+
+ expected_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/lcm_lora/sdxl_lcm_lora.png"
+ )
+
+ image_np = pipe.image_processor.pil_to_numpy(image)
+ expected_image_np = pipe.image_processor.pil_to_numpy(expected_image)
+
+ max_diff = numpy_cosine_similarity_distance(image_np.flatten(), expected_image_np.flatten())
+ assert max_diff < 1e-4
+
+ pipe.unload_lora_weights()
+
+ release_memory(pipe)
+
+ def test_sdv1_5_lcm_lora(self):
+ pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
+ pipe.to("cuda")
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+ generator = torch.Generator("cpu").manual_seed(0)
+
+ lora_model_id = "latent-consistency/lcm-lora-sdv1-5"
+ pipe.load_lora_weights(lora_model_id)
+
+ image = pipe(
+ "masterpiece, best quality, mountain", generator=generator, num_inference_steps=4, guidance_scale=0.5
+ ).images[0]
+
+ expected_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/lcm_lora/sdv15_lcm_lora.png"
+ )
+
+ image_np = pipe.image_processor.pil_to_numpy(image)
+ expected_image_np = pipe.image_processor.pil_to_numpy(expected_image)
+
+ max_diff = numpy_cosine_similarity_distance(image_np.flatten(), expected_image_np.flatten())
+ assert max_diff < 1e-4
+
+ pipe.unload_lora_weights()
+
+ release_memory(pipe)
+
+ def test_sdv1_5_lcm_lora_img2img(self):
+ pipe = AutoPipelineForImage2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
+ pipe.to("cuda")
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/img2img/fantasy_landscape.png"
+ )
+
+ generator = torch.Generator("cpu").manual_seed(0)
+
+ lora_model_id = "latent-consistency/lcm-lora-sdv1-5"
+ pipe.load_lora_weights(lora_model_id)
+
+ image = pipe(
+ "snowy mountain",
+ generator=generator,
+ image=init_image,
+ strength=0.5,
+ num_inference_steps=4,
+ guidance_scale=0.5,
+ ).images[0]
+
+ expected_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/lcm_lora/sdv15_lcm_lora_img2img.png"
+ )
+
+ image_np = pipe.image_processor.pil_to_numpy(image)
+ expected_image_np = pipe.image_processor.pil_to_numpy(expected_image)
+
+ max_diff = numpy_cosine_similarity_distance(image_np.flatten(), expected_image_np.flatten())
+ assert max_diff < 1e-4
+
+ pipe.unload_lora_weights()
+
+ release_memory(pipe)
+
+ def test_sdxl_1_0_lora_fusion(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
+ lora_model_id = "hf-internal-testing/sdxl-1.0-lora"
+ lora_filename = "sd_xl_offset_example-lora_1.0.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+
+ pipe.fuse_lora()
+ # We need to unload the lora weights since in the previous API `fuse_lora` led to lora weights being
+ # silently deleted - otherwise this will CPU OOM
+ pipe.unload_lora_weights()
+
+ pipe.enable_model_cpu_offload()
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ # This way we also test equivalence between LoRA fusion and the non-fusion behaviour.
+ expected = np.array([0.4468, 0.4087, 0.4134, 0.366, 0.3202, 0.3505, 0.3786, 0.387, 0.3535])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-4))
+ release_memory(pipe)
+
+ def test_sdxl_1_0_lora_unfusion(self):
+ generator = torch.Generator("cpu").manual_seed(0)
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
+ lora_model_id = "hf-internal-testing/sdxl-1.0-lora"
+ lora_filename = "sd_xl_offset_example-lora_1.0.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+ pipe.fuse_lora()
+
+ pipe.enable_model_cpu_offload()
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=3
+ ).images
+ images_with_fusion = images.flatten()
+
+ pipe.unfuse_lora()
+ generator = torch.Generator("cpu").manual_seed(0)
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=3
+ ).images
+ images_without_fusion = images.flatten()
+
+ max_diff = numpy_cosine_similarity_distance(images_with_fusion, images_without_fusion)
+ assert max_diff < 1e-4
+
+ release_memory(pipe)
+
+ def test_sdxl_1_0_lora_unfusion_effectivity(self):
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
+ pipe.enable_model_cpu_offload()
+
+ generator = torch.Generator().manual_seed(0)
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+ original_image_slice = images[0, -3:, -3:, -1].flatten()
+
+ lora_model_id = "hf-internal-testing/sdxl-1.0-lora"
+ lora_filename = "sd_xl_offset_example-lora_1.0.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+ pipe.fuse_lora()
+
+ generator = torch.Generator().manual_seed(0)
+ _ = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ pipe.unfuse_lora()
+
+ # We need to unload the lora weights - in the old API unfuse led to unloading the adapter weights
+ pipe.unload_lora_weights()
+
+ generator = torch.Generator().manual_seed(0)
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+ images_without_fusion_slice = images[0, -3:, -3:, -1].flatten()
+
+ self.assertTrue(np.allclose(original_image_slice, images_without_fusion_slice, atol=1e-3))
+ release_memory(pipe)
+
+ def test_sdxl_1_0_lora_fusion_efficiency(self):
+ generator = torch.Generator().manual_seed(0)
+ lora_model_id = "hf-internal-testing/sdxl-1.0-lora"
+ lora_filename = "sd_xl_offset_example-lora_1.0.safetensors"
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename, torch_dtype=torch.float16)
+ pipe.enable_model_cpu_offload()
+
+ start_time = time.time()
+ for _ in range(3):
+ pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+ end_time = time.time()
+ elapsed_time_non_fusion = end_time - start_time
+
+ del pipe
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename, torch_dtype=torch.float16)
+ pipe.fuse_lora()
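+ # Fusing merges the LoRA deltas into the base weights, so inference skips the extra
+ # LoRA matmuls and the timed loop below should be faster than the unfused run above.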
+
+ # We need to unload the lora weights since in the previous API `fuse_lora` led to lora weights being
+ # silently deleted - otherwise this will CPU OOM
+ pipe.unload_lora_weights()
+ pipe.enable_model_cpu_offload()
+
+ generator = torch.Generator().manual_seed(0)
+ start_time = time.time()
+ for _ in range(3):
+ pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+ end_time = time.time()
+ elapsed_time_fusion = end_time - start_time
+
+ self.assertTrue(elapsed_time_fusion < elapsed_time_non_fusion)
+ release_memory(pipe)
+
+ def test_sdxl_1_0_last_ben(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
+ pipe.enable_model_cpu_offload()
+ lora_model_id = "TheLastBen/Papercut_SDXL"
+ lora_filename = "papercut.safetensors"
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+
+ images = pipe("papercut.safetensors", output_type="np", generator=generator, num_inference_steps=2).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.5244, 0.4347, 0.4312, 0.4246, 0.4398, 0.4409, 0.4884, 0.4938, 0.4094])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipe)
+
+ def test_sdxl_1_0_fuse_unfuse_all(self):
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
+ text_encoder_1_sd = copy.deepcopy(pipe.text_encoder.state_dict())
+ text_encoder_2_sd = copy.deepcopy(pipe.text_encoder_2.state_dict())
+ unet_sd = copy.deepcopy(pipe.unet.state_dict())
+
+ pipe.load_lora_weights(
+ "davizca87/sun-flower", weight_name="snfw3rXL-000004.safetensors", torch_dtype=torch.float16
+ )
+
+ fused_te_state_dict = pipe.text_encoder.state_dict()
+ fused_te_2_state_dict = pipe.text_encoder_2.state_dict()
+ unet_state_dict = pipe.unet.state_dict()
+
+ peft_ge_070 = version.parse(importlib.metadata.version("peft")) >= version.parse("0.7.0")
+
+ def remap_key(key, sd):
+ # some keys have moved around for PEFT >= 0.7.0, but they should still be loaded correctly
+ if (key in sd) or (not peft_ge_070):
+ return key
+
+ # instead of linear.weight, we now have linear.base_layer.weight, etc.
+ if key.endswith(".weight"):
+ key = key[:-7] + ".base_layer.weight"
+ elif key.endswith(".bias"):
+ key = key[:-5] + ".base_layer.bias"
+ return key
+
+ for key, value in text_encoder_1_sd.items():
+ key = remap_key(key, fused_te_state_dict)
+ self.assertTrue(torch.allclose(fused_te_state_dict[key], value))
+
+ for key, value in text_encoder_2_sd.items():
+ key = remap_key(key, fused_te_2_state_dict)
+ self.assertTrue(torch.allclose(fused_te_2_state_dict[key], value))
+
+ for key, value in unet_sd.items():
+ key = remap_key(key, unet_state_dict)
+ self.assertTrue(torch.allclose(unet_state_dict[key], value))
+
+ pipe.fuse_lora()
+ pipe.unload_lora_weights()
+
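+ # Fusing baked the LoRA deltas into the base weights, so even after unloading the
+ # adapters the state dicts should no longer match the originals.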
+ assert not state_dicts_almost_equal(text_encoder_1_sd, pipe.text_encoder.state_dict())
+ assert not state_dicts_almost_equal(text_encoder_2_sd, pipe.text_encoder_2.state_dict())
+ assert not state_dicts_almost_equal(unet_sd, pipe.unet.state_dict())
+ release_memory(pipe)
+ del unet_sd, text_encoder_1_sd, text_encoder_2_sd
+
+ def test_sdxl_1_0_lora_with_sequential_cpu_offloading(self):
+ generator = torch.Generator().manual_seed(0)
+
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
+ pipe.enable_sequential_cpu_offload()
+ lora_model_id = "hf-internal-testing/sdxl-1.0-lora"
+ lora_filename = "sd_xl_offset_example-lora_1.0.safetensors"
+
+ pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
+
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.4468, 0.4087, 0.4134, 0.366, 0.3202, 0.3505, 0.3786, 0.387, 0.3535])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipe)
+
+ def test_sd_load_civitai_empty_network_alpha(self):
+ """
+ This test simply checks that loading a LoRA with an empty network alpha works fine
+ See: https://github.com/huggingface/diffusers/issues/5606
+ """
+ pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
+ pipeline.enable_sequential_cpu_offload()
+ civitai_path = hf_hub_download("ybelkada/test-ahi-civitai", "ahi_lora_weights.safetensors")
+ pipeline.load_lora_weights(civitai_path, adapter_name="ahri")
+
+ images = pipeline(
+ "ahri, masterpiece, league of legends",
+ output_type="np",
+ generator=torch.manual_seed(156),
+ num_inference_steps=5,
+ ).images
+ images = images[0, -3:, -3:, -1].flatten()
+ expected = np.array([0.0, 0.0, 0.0, 0.002557, 0.020954, 0.001792, 0.006581, 0.00591, 0.002995])
+
+ self.assertTrue(np.allclose(images, expected, atol=1e-3))
+ release_memory(pipeline)
+
+ def test_controlnet_canny_lora(self):
+ controlnet = ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0")
+
+ pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet
+ )
+ pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors")
+ pipe.enable_sequential_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "corgi"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ )
+
+ images = pipe(prompt, image=image, generator=generator, output_type="np", num_inference_steps=3).images
+
+ assert images[0].shape == (768, 512, 3)
+
+ original_image = images[0, -3:, -3:, -1].flatten()
+ expected_image = np.array([0.4574, 0.4461, 0.4435, 0.4462, 0.4396, 0.439, 0.4474, 0.4486, 0.4333])
+ assert np.allclose(original_image, expected_image, atol=1e-04)
+ release_memory(pipe)
+
+ def test_sdxl_t2i_adapter_canny_lora(self):
+ adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-lineart-sdxl-1.0", torch_dtype=torch.float16).to(
+ "cpu"
+ )
+ pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ adapter=adapter,
+ torch_dtype=torch.float16,
+ variant="fp16",
+ )
+ pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors")
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "toy"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/toy_canny.png"
+ )
+
+ images = pipe(prompt, image=image, generator=generator, output_type="np", num_inference_steps=3).images
+
+ assert images[0].shape == (768, 512, 3)
+
+ image_slice = images[0, -3:, -3:, -1].flatten()
+ expected_slice = np.array([0.4284, 0.4337, 0.4319, 0.4255, 0.4329, 0.4280, 0.4338, 0.4420, 0.4226])
+ assert numpy_cosine_similarity_distance(image_slice, expected_slice) < 1e-4
+
+ @nightly
+ def test_sequential_fuse_unfuse(self):
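+ # Loading, fusing and unfusing several different LoRAs in a row should not corrupt the base weights,
+ # so round 1 and round 4 below must produce the same image.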
+ pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
+
+ # 1. round
+ pipe.load_lora_weights("Pclanglais/TintinIA", torch_dtype=torch.float16)
+ pipe.to("cuda")
+ pipe.fuse_lora()
+
+ generator = torch.Generator().manual_seed(0)
+ images = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+ image_slice = images[0, -3:, -3:, -1].flatten()
+
+ pipe.unfuse_lora()
+
+ # 2. round
+ pipe.load_lora_weights("ProomptEngineer/pe-balloon-diffusion-style", torch_dtype=torch.float16)
+ pipe.fuse_lora()
+ pipe.unfuse_lora()
+
+ # 3. round
+ pipe.load_lora_weights("ostris/crayon_style_lora_sdxl", torch_dtype=torch.float16)
+ pipe.fuse_lora()
+ pipe.unfuse_lora()
+
+ # 4. back to 1st round
+ pipe.load_lora_weights("Pclanglais/TintinIA", torch_dtype=torch.float16)
+ pipe.fuse_lora()
+
+ generator = torch.Generator().manual_seed(0)
+ images_2 = pipe(
+ "masterpiece, best quality, mountain", output_type="np", generator=generator, num_inference_steps=2
+ ).images
+ image_slice_2 = images_2[0, -3:, -3:, -1].flatten()
+
+ self.assertTrue(np.allclose(image_slice, image_slice_2, atol=1e-3))
+ release_memory(pipe)
diff --git a/tests/others/test_check_copies.py b/tests/others/test_check_copies.py
new file mode 100644
index 0000000..6e1c8fc
--- /dev/null
+++ b/tests/others/test_check_copies.py
@@ -0,0 +1,117 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import re
+import shutil
+import sys
+import tempfile
+import unittest
+
+
+git_repo_path = os.path.abspath(os.path.dirname(os.path.dirname(os.path.dirname(__file__))))
+sys.path.append(os.path.join(git_repo_path, "utils"))
+
+import check_copies # noqa: E402
+
+
+# This is the reference code that will be used in the tests.
+# If DDPMSchedulerOutput is changed in scheduling_ddpm.py, this code needs to be manually updated.
+REFERENCE_CODE = """ \"""
+ Output class for the scheduler's `step` function output.
+
+ Args:
+ prev_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images):
+ Computed sample `(x_{t-1})` of previous timestep. `prev_sample` should be used as next model input in the
+ denoising loop.
+ pred_original_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images):
+ The predicted denoised sample `(x_{0})` based on the model output from the current timestep.
+ `pred_original_sample` can be used to preview progress or for guidance.
+ \"""
+
+ prev_sample: torch.FloatTensor
+ pred_original_sample: Optional[torch.FloatTensor] = None
+"""
+
+
+class CopyCheckTester(unittest.TestCase):
+ def setUp(self):
+ self.diffusers_dir = tempfile.mkdtemp()
+ os.makedirs(os.path.join(self.diffusers_dir, "schedulers/"))
+ check_copies.DIFFUSERS_PATH = self.diffusers_dir
+ shutil.copy(
+ os.path.join(git_repo_path, "src/diffusers/schedulers/scheduling_ddpm.py"),
+ os.path.join(self.diffusers_dir, "schedulers/scheduling_ddpm.py"),
+ )
+
+ def tearDown(self):
+ check_copies.DIFFUSERS_PATH = "src/diffusers"
+ shutil.rmtree(self.diffusers_dir)
+
+ def check_copy_consistency(self, comment, class_name, class_code, overwrite_result=None):
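+ # Write `class_code` under the given "# Copied from ..." comment into a temp file and run the copy
+ # checker on it (optionally letting it overwrite an inconsistent copy).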
+ code = comment + f"\nclass {class_name}(nn.Module):\n" + class_code
+ if overwrite_result is not None:
+ expected = comment + f"\nclass {class_name}(nn.Module):\n" + overwrite_result
+ code = check_copies.run_ruff(code)
+ fname = os.path.join(self.diffusers_dir, "new_code.py")
+ with open(fname, "w", newline="\n") as f:
+ f.write(code)
+ if overwrite_result is None:
+ self.assertTrue(len(check_copies.is_copy_consistent(fname)) == 0)
+ else:
+ check_copies.is_copy_consistent(fname, overwrite=True)
+ with open(fname, "r") as f:
+ self.assertEqual(f.read(), expected)
+
+ def test_find_code_in_diffusers(self):
+ code = check_copies.find_code_in_diffusers("schedulers.scheduling_ddpm.DDPMSchedulerOutput")
+ self.assertEqual(code, REFERENCE_CODE)
+
+ def test_is_copy_consistent(self):
+ # Base copy consistency
+ self.check_copy_consistency(
+ "# Copied from diffusers.schedulers.scheduling_ddpm.DDPMSchedulerOutput",
+ "DDPMSchedulerOutput",
+ REFERENCE_CODE + "\n",
+ )
+
+ # With no empty line at the end
+ self.check_copy_consistency(
+ "# Copied from diffusers.schedulers.scheduling_ddpm.DDPMSchedulerOutput",
+ "DDPMSchedulerOutput",
+ REFERENCE_CODE,
+ )
+
+ # Copy consistency with rename
+ self.check_copy_consistency(
+ "# Copied from diffusers.schedulers.scheduling_ddpm.DDPMSchedulerOutput with DDPM->Test",
+ "TestSchedulerOutput",
+ re.sub("DDPM", "Test", REFERENCE_CODE),
+ )
+
+ # Copy consistency with a really long name
+ long_class_name = "TestClassWithAReallyLongNameBecauseSomePeopleLikeThatForSomeReason"
+ self.check_copy_consistency(
+ f"# Copied from diffusers.schedulers.scheduling_ddpm.DDPMSchedulerOutput with DDPM->{long_class_name}",
+ f"{long_class_name}SchedulerOutput",
+ re.sub("Bert", long_class_name, REFERENCE_CODE),
+ )
+
+ # Copy consistency with overwrite
+ self.check_copy_consistency(
+ "# Copied from diffusers.schedulers.scheduling_ddpm.DDPMSchedulerOutput with DDPM->Test",
+ "TestSchedulerOutput",
+ REFERENCE_CODE,
+ overwrite_result=re.sub("DDPM", "Test", REFERENCE_CODE),
+ )
diff --git a/tests/others/test_check_dummies.py b/tests/others/test_check_dummies.py
new file mode 100644
index 0000000..1890ffa
--- /dev/null
+++ b/tests/others/test_check_dummies.py
@@ -0,0 +1,122 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import sys
+import unittest
+
+
+git_repo_path = os.path.abspath(os.path.dirname(os.path.dirname(os.path.dirname(__file__))))
+sys.path.append(os.path.join(git_repo_path, "utils"))
+
+import check_dummies # noqa: E402
+from check_dummies import create_dummy_files, create_dummy_object, find_backend, read_init # noqa: E402
+
+
+# Align PATH_TO_DIFFUSERS in check_dummies with the current path
+check_dummies.PATH_TO_DIFFUSERS = os.path.join(git_repo_path, "src", "diffusers")
+
+
+class CheckDummiesTester(unittest.TestCase):
+ def test_find_backend(self):
+ simple_backend = find_backend(" if not is_torch_available():")
+ self.assertEqual(simple_backend, "torch")
+
+ # backend_with_underscore = find_backend(" if not is_tensorflow_text_available():")
+ # self.assertEqual(backend_with_underscore, "tensorflow_text")
+
+ double_backend = find_backend(" if not (is_torch_available() and is_transformers_available()):")
+ self.assertEqual(double_backend, "torch_and_transformers")
+
+ # double_backend_with_underscore = find_backend(
+ # " if not (is_sentencepiece_available() and is_tensorflow_text_available()):"
+ # )
+ # self.assertEqual(double_backend_with_underscore, "sentencepiece_and_tensorflow_text")
+
+ triple_backend = find_backend(
+ " if not (is_torch_available() and is_transformers_available() and is_onnx_available()):"
+ )
+ self.assertEqual(triple_backend, "torch_and_transformers_and_onnx")
+
+ def test_read_init(self):
+ objects = read_init()
+ # We don't assert on the exact list of keys so the set of backend-specific objects can grow smoothly
+ self.assertIn("torch", objects)
+ self.assertIn("torch_and_transformers", objects)
+ self.assertIn("flax_and_transformers", objects)
+ self.assertIn("torch_and_transformers_and_onnx", objects)
+
+ # Likewise, we can't assert on the exact content of a key
+ self.assertIn("UNet2DModel", objects["torch"])
+ self.assertIn("FlaxUNet2DConditionModel", objects["flax"])
+ self.assertIn("StableDiffusionPipeline", objects["torch_and_transformers"])
+ self.assertIn("FlaxStableDiffusionPipeline", objects["flax_and_transformers"])
+ self.assertIn("LMSDiscreteScheduler", objects["torch_and_scipy"])
+ self.assertIn("OnnxStableDiffusionPipeline", objects["torch_and_transformers_and_onnx"])
+
+ def test_create_dummy_object(self):
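+ # Dummy objects only raise a "requires backend" error when they are actually instantiated or called.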
+ dummy_constant = create_dummy_object("CONSTANT", "'torch'")
+ self.assertEqual(dummy_constant, "\nCONSTANT = None\n")
+
+ dummy_function = create_dummy_object("function", "'torch'")
+ self.assertEqual(
+ dummy_function, "\ndef function(*args, **kwargs):\n requires_backends(function, 'torch')\n"
+ )
+
+ expected_dummy_class = """
+class FakeClass(metaclass=DummyObject):
+ _backends = 'torch'
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, 'torch')
+
+ @classmethod
+ def from_config(cls, *args, **kwargs):
+ requires_backends(cls, 'torch')
+
+ @classmethod
+ def from_pretrained(cls, *args, **kwargs):
+ requires_backends(cls, 'torch')
+"""
+ dummy_class = create_dummy_object("FakeClass", "'torch'")
+ self.assertEqual(dummy_class, expected_dummy_class)
+
+ def test_create_dummy_files(self):
+ expected_dummy_pytorch_file = """# This file is autogenerated by the command `make fix-copies`, do not edit.
+from ..utils import DummyObject, requires_backends
+
+
+CONSTANT = None
+
+
+def function(*args, **kwargs):
+ requires_backends(function, ["torch"])
+
+
+class FakeClass(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+ @classmethod
+ def from_config(cls, *args, **kwargs):
+ requires_backends(cls, ["torch"])
+
+ @classmethod
+ def from_pretrained(cls, *args, **kwargs):
+ requires_backends(cls, ["torch"])
+"""
+ dummy_files = create_dummy_files({"torch": ["CONSTANT", "function", "FakeClass"]})
+ self.assertEqual(dummy_files["torch"], expected_dummy_pytorch_file)
diff --git a/tests/others/test_config.py b/tests/others/test_config.py
new file mode 100644
index 0000000..3492ec3
--- /dev/null
+++ b/tests/others/test_config.py
@@ -0,0 +1,288 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import tempfile
+import unittest
+
+from diffusers import (
+ DDIMScheduler,
+ DDPMScheduler,
+ DPMSolverMultistepScheduler,
+ EulerAncestralDiscreteScheduler,
+ EulerDiscreteScheduler,
+ PNDMScheduler,
+ logging,
+)
+from diffusers.configuration_utils import ConfigMixin, register_to_config
+from diffusers.utils.testing_utils import CaptureLogger
+
+
+class SampleObject(ConfigMixin):
+ config_name = "config.json"
+
+ @register_to_config
+ def __init__(
+ self,
+ a=2,
+ b=5,
+ c=(2, 5),
+ d="for diffusion",
+ e=[1, 3],
+ ):
+ pass
+
+
+class SampleObject2(ConfigMixin):
+ config_name = "config.json"
+
+ @register_to_config
+ def __init__(
+ self,
+ a=2,
+ b=5,
+ c=(2, 5),
+ d="for diffusion",
+ f=[1, 3],
+ ):
+ pass
+
+
+class SampleObject3(ConfigMixin):
+ config_name = "config.json"
+
+ @register_to_config
+ def __init__(
+ self,
+ a=2,
+ b=5,
+ c=(2, 5),
+ d="for diffusion",
+ e=[1, 3],
+ f=[1, 3],
+ ):
+ pass
+
+
+class SampleObject4(ConfigMixin):
+ config_name = "config.json"
+
+ @register_to_config
+ def __init__(
+ self,
+ a=2,
+ b=5,
+ c=(2, 5),
+ d="for diffusion",
+ e=[1, 5],
+ f=[5, 4],
+ ):
+ pass
+
+
+class ConfigTester(unittest.TestCase):
+ def test_load_not_from_mixin(self):
+ with self.assertRaises(ValueError):
+ ConfigMixin.load_config("dummy_path")
+
+ def test_register_to_config(self):
+ obj = SampleObject()
+ config = obj.config
+ assert config["a"] == 2
+ assert config["b"] == 5
+ assert config["c"] == (2, 5)
+ assert config["d"] == "for diffusion"
+ assert config["e"] == [1, 3]
+
+ # init ignores private arguments
+ obj = SampleObject(_name_or_path="lalala")
+ config = obj.config
+ assert config["a"] == 2
+ assert config["b"] == 5
+ assert config["c"] == (2, 5)
+ assert config["d"] == "for diffusion"
+ assert config["e"] == [1, 3]
+
+ # can override default
+ obj = SampleObject(c=6)
+ config = obj.config
+ assert config["a"] == 2
+ assert config["b"] == 5
+ assert config["c"] == 6
+ assert config["d"] == "for diffusion"
+ assert config["e"] == [1, 3]
+
+ # can use positional arguments.
+ obj = SampleObject(1, c=6)
+ config = obj.config
+ assert config["a"] == 1
+ assert config["b"] == 5
+ assert config["c"] == 6
+ assert config["d"] == "for diffusion"
+ assert config["e"] == [1, 3]
+
+ def test_save_load(self):
+ obj = SampleObject()
+ config = obj.config
+
+ assert config["a"] == 2
+ assert config["b"] == 5
+ assert config["c"] == (2, 5)
+ assert config["d"] == "for diffusion"
+ assert config["e"] == [1, 3]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ obj.save_config(tmpdirname)
+ new_obj = SampleObject.from_config(SampleObject.load_config(tmpdirname))
+ new_config = new_obj.config
+
+ # unfreeze configs
+ config = dict(config)
+ new_config = dict(new_config)
+
+ assert config.pop("c") == (2, 5) # instantiated as tuple
+ assert new_config.pop("c") == [2, 5] # saved & loaded as list because of json
+ config.pop("_use_default_values")
+ assert config == new_config
+
+ def test_load_ddim_from_pndm(self):
+ logger = logging.get_logger("diffusers.configuration_utils")
+ # 30 for warning
+ logger.setLevel(30)
+
+ with CaptureLogger(logger) as cap_logger:
+ ddim = DDIMScheduler.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", subfolder="scheduler"
+ )
+
+ assert ddim.__class__ == DDIMScheduler
+ # no warning should be thrown
+ assert cap_logger.out == ""
+
+ def test_load_euler_from_pndm(self):
+ logger = logging.get_logger("diffusers.configuration_utils")
+ # 30 for warning
+ logger.setLevel(30)
+
+ with CaptureLogger(logger) as cap_logger:
+ euler = EulerDiscreteScheduler.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", subfolder="scheduler"
+ )
+
+ assert euler.__class__ == EulerDiscreteScheduler
+ # no warning should be thrown
+ assert cap_logger.out == ""
+
+ def test_load_euler_ancestral_from_pndm(self):
+ logger = logging.get_logger("diffusers.configuration_utils")
+ # 30 for warning
+ logger.setLevel(30)
+
+ with CaptureLogger(logger) as cap_logger:
+ euler = EulerAncestralDiscreteScheduler.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", subfolder="scheduler"
+ )
+
+ assert euler.__class__ == EulerAncestralDiscreteScheduler
+ # no warning should be thrown
+ assert cap_logger.out == ""
+
+ def test_load_pndm(self):
+ logger = logging.get_logger("diffusers.configuration_utils")
+ # 30 for warning
+ logger.setLevel(30)
+
+ with CaptureLogger(logger) as cap_logger:
+ pndm = PNDMScheduler.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", subfolder="scheduler"
+ )
+
+ assert pndm.__class__ == PNDMScheduler
+ # no warning should be thrown
+ assert cap_logger.out == ""
+
+ def test_overwrite_config_on_load(self):
+ logger = logging.get_logger("diffusers.configuration_utils")
+ # 30 for warning
+ logger.setLevel(30)
+
+ with CaptureLogger(logger) as cap_logger:
+ ddpm = DDPMScheduler.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch",
+ subfolder="scheduler",
+ prediction_type="sample",
+ beta_end=8,
+ )
+
+ with CaptureLogger(logger) as cap_logger_2:
+ ddpm_2 = DDPMScheduler.from_pretrained("google/ddpm-celebahq-256", beta_start=88)
+
+ assert ddpm.__class__ == DDPMScheduler
+ assert ddpm.config.prediction_type == "sample"
+ assert ddpm.config.beta_end == 8
+ assert ddpm_2.config.beta_start == 88
+
+ # no warning should be thrown
+ assert cap_logger.out == ""
+ assert cap_logger_2.out == ""
+
+ def test_load_dpmsolver(self):
+ logger = logging.get_logger("diffusers.configuration_utils")
+ # 30 for warning
+ logger.setLevel(30)
+
+ with CaptureLogger(logger) as cap_logger:
+ dpm = DPMSolverMultistepScheduler.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", subfolder="scheduler"
+ )
+
+ assert dpm.__class__ == DPMSolverMultistepScheduler
+ # no warning should be thrown
+ assert cap_logger.out == ""
+
+ def test_use_default_values(self):
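+ # `_use_default_values` records which config fields were left at their defaults, so a class loading
+ # this config later can substitute its own defaults for those fields.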
+ # let's first save a config that should be in the form
+ # a=2,
+ # b=5,
+ # c=(2, 5),
+ # d="for diffusion",
+ # e=[1, 3],
+
+ config = SampleObject()
+
+ config_dict = {k: v for k, v in config.config.items() if not k.startswith("_")}
+
+ # make sure that default config has all keys in `_use_default_values`
+ assert set(config_dict.keys()) == set(config.config._use_default_values)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ config.save_config(tmpdirname)
+
+ # now loading it with SampleObject2 should put f into `_use_default_values`
+ config = SampleObject2.from_config(tmpdirname)
+
+ assert "f" in config._use_default_values
+ assert config.f == [1, 3]
+
+ # now loading the config should **NOT** use [1, 3] for `f`, but SampleObject4's default [5, 4] value
+ # **BECAUSE** it is part of `config._use_default_values`
+ new_config = SampleObject4.from_config(config.config)
+ assert new_config.f == [5, 4]
+
+ config.config._use_default_values.pop()
+ new_config_2 = SampleObject4.from_config(config.config)
+ assert new_config_2.f == [1, 3]
+
+ # Nevertheless "e" should still be correctly loaded to [1, 3] from SampleObject2 instead of defaulting to [1, 5]
+ assert new_config_2.e == [1, 3]
diff --git a/tests/others/test_dependencies.py b/tests/others/test_dependencies.py
new file mode 100644
index 0000000..c0839ef
--- /dev/null
+++ b/tests/others/test_dependencies.py
@@ -0,0 +1,50 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import inspect
+import unittest
+from importlib import import_module
+
+
+class DependencyTester(unittest.TestCase):
+ def test_diffusers_import(self):
+ try:
+ import diffusers # noqa: F401
+ except ImportError:
+ assert False
+
+ def test_backend_registration(self):
+ import diffusers
+ from diffusers.dependency_versions_table import deps
+
+ all_classes = inspect.getmembers(diffusers, inspect.isclass)
+
+ for cls_name, cls_module in all_classes:
+ if "dummy_" in cls_module.__module__:
+ for backend in cls_module._backends:
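+ # backend flags use underscores while the pip package names in the deps table use dashes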
+ if backend == "k_diffusion":
+ backend = "k-diffusion"
+ elif backend == "invisible_watermark":
+ backend = "invisible-watermark"
+ assert backend in deps, f"{backend} is not in the deps table!"
+
+ def test_pipeline_imports(self):
+ import diffusers
+ import diffusers.pipelines
+
+ all_classes = inspect.getmembers(diffusers, inspect.isclass)
+ for cls_name, cls_module in all_classes:
+ if hasattr(diffusers.pipelines, cls_name):
+ pipeline_folder_module = ".".join(str(cls_module.__module__).split(".")[:3])
+ _ = import_module(pipeline_folder_module, str(cls_name))
diff --git a/tests/others/test_ema.py b/tests/others/test_ema.py
new file mode 100644
index 0000000..48437c5
--- /dev/null
+++ b/tests/others/test_ema.py
@@ -0,0 +1,159 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import tempfile
+import unittest
+
+import torch
+
+from diffusers import UNet2DConditionModel
+from diffusers.training_utils import EMAModel
+from diffusers.utils.testing_utils import enable_full_determinism, skip_mps, torch_device
+
+
+enable_full_determinism()
+
+
+class EMAModelTests(unittest.TestCase):
+ model_id = "hf-internal-testing/tiny-stable-diffusion-pipe"
+ batch_size = 1
+ prompt_length = 77
+ text_encoder_hidden_dim = 32
+ num_in_channels = 4
+ latent_height = latent_width = 64
+ generator = torch.manual_seed(0)
+
+ def get_models(self, decay=0.9999):
+ unet = UNet2DConditionModel.from_pretrained(self.model_id, subfolder="unet")
+ unet = unet.to(torch_device)
+ ema_unet = EMAModel(unet.parameters(), decay=decay, model_cls=UNet2DConditionModel, model_config=unet.config)
+ return unet, ema_unet
+
+ def get_dummy_inputs(self):
+ noisy_latents = torch.randn(
+ self.batch_size, self.num_in_channels, self.latent_height, self.latent_width, generator=self.generator
+ ).to(torch_device)
+ timesteps = torch.randint(0, 1000, size=(self.batch_size,), generator=self.generator).to(torch_device)
+ encoder_hidden_states = torch.randn(
+ self.batch_size, self.prompt_length, self.text_encoder_hidden_dim, generator=self.generator
+ ).to(torch_device)
+ return noisy_latents, timesteps, encoder_hidden_states
+
+ def simulate_backprop(self, unet):
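+ # Randomly perturb every parameter to mimic the weight changes of an optimizer step.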
+ updated_state_dict = {}
+ for k, param in unet.state_dict().items():
+ updated_param = torch.randn_like(param) + (param * torch.randn_like(param))
+ updated_state_dict.update({k: updated_param})
+ unet.load_state_dict(updated_state_dict)
+ return unet
+
+ def test_optimization_steps_updated(self):
+ unet, ema_unet = self.get_models()
+ # Take the first (hypothetical) EMA step.
+ ema_unet.step(unet.parameters())
+ assert ema_unet.optimization_step == 1
+
+ # Take two more.
+ for _ in range(2):
+ ema_unet.step(unet.parameters())
+ assert ema_unet.optimization_step == 3
+
+ def test_shadow_params_not_updated(self):
+ unet, ema_unet = self.get_models()
+ # Since the `unet` is not being updated (i.e., backprop'd)
+ # there won't be any difference between the `params` of `unet`
+ # and `ema_unet` even if we call `ema_unet.step(unet.parameters())`.
+ ema_unet.step(unet.parameters())
+ orig_params = list(unet.parameters())
+ for s_param, param in zip(ema_unet.shadow_params, orig_params):
+ assert torch.allclose(s_param, param)
+
+ # The above holds true even if we call `ema.step()` multiple times since
+ # `unet` params are still not being updated.
+ for _ in range(4):
+ ema_unet.step(unet.parameters())
+ for s_param, param in zip(ema_unet.shadow_params, orig_params):
+ assert torch.allclose(s_param, param)
+
+ def test_shadow_params_updated(self):
+ unet, ema_unet = self.get_models()
+ # Keep a copy of the parameters before the simulated update; the EMA shadow
+ # params are initialized from these, so they should drift away from them once
+ # an EMA step is taken on the updated parameters.
+ orig_params = [param.clone() for param in unet.parameters()]
+
+ # Here we simulate the parameter updates for `unet`. Since there might
+ # be some parameters which are initialized to zero we take extra care to
+ # initialize their values to something non-zero before the multiplication.
+ unet_pseudo_updated_step_one = self.simulate_backprop(unet)
+
+ # Take the EMA step.
+ ema_unet.step(unet_pseudo_updated_step_one.parameters())
+
+ # Now the EMA'd parameters should no longer equal the pre-update parameters.
+ for s_param, param in zip(ema_unet.shadow_params, orig_params):
+ assert not torch.allclose(s_param, param)
+
+ # Ensure this is still the case when we take multiple EMA steps.
+ for _ in range(4):
+ ema_unet.step(unet.parameters())
+ for s_param, param in zip(ema_unet.shadow_params, orig_params):
+ assert not torch.allclose(s_param, param)
+
+ def test_consecutive_shadow_params_updated(self):
+ # If we call EMA step after a backpropagation consecutively for two times,
+ # the shadow params from those two steps should be different.
+ unet, ema_unet = self.get_models()
+
+ # First backprop + EMA
+ unet_step_one = self.simulate_backprop(unet)
+ ema_unet.step(unet_step_one.parameters())
+ # clone so the in-place EMA update in the next step does not alias these tensors
+ step_one_shadow_params = [param.clone() for param in ema_unet.shadow_params]
+
+ # Second backprop + EMA
+ unet_step_two = self.simulate_backprop(unet_step_one)
+ ema_unet.step(unet_step_two.parameters())
+ step_two_shadow_params = ema_unet.shadow_params
+
+ for step_one, step_two in zip(step_one_shadow_params, step_two_shadow_params):
+ assert not torch.allclose(step_one, step_two)
+
+ def test_zero_decay(self):
+ # If there's no decay even if there are backprops, EMA steps
+ # won't take any effect i.e., the shadow params would remain the
+ # same.
+ unet, ema_unet = self.get_models(decay=0.0)
+ unet_step_one = self.simulate_backprop(unet)
+ ema_unet.step(unet_step_one.parameters())
+ step_one_shadow_params = ema_unet.shadow_params
+
+ unet_step_two = self.simulate_backprop(unet_step_one)
+ ema_unet.step(unet_step_two.parameters())
+ step_two_shadow_params = ema_unet.shadow_params
+
+ for step_one, step_two in zip(step_one_shadow_params, step_two_shadow_params):
+ assert torch.allclose(step_one, step_two)
+
+ @skip_mps
+ def test_serialization(self):
+ unet, ema_unet = self.get_models()
+ noisy_latents, timesteps, encoder_hidden_states = self.get_dummy_inputs()
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ ema_unet.save_pretrained(tmpdir)
+ loaded_unet = UNet2DConditionModel.from_pretrained(tmpdir, model_cls=UNet2DConditionModel)
+ loaded_unet = loaded_unet.to(unet.device)
+
+ # Since no EMA step has been performed the outputs should match.
+ output = unet(noisy_latents, timesteps, encoder_hidden_states).sample
+ output_loaded = loaded_unet(noisy_latents, timesteps, encoder_hidden_states).sample
+
+ assert torch.allclose(output, output_loaded, atol=1e-4)
diff --git a/tests/others/test_hub_utils.py b/tests/others/test_hub_utils.py
new file mode 100644
index 0000000..7a0c29d
--- /dev/null
+++ b/tests/others/test_hub_utils.py
@@ -0,0 +1,29 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import unittest
+from pathlib import Path
+from tempfile import TemporaryDirectory
+
+from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
+
+
+class CreateModelCardTest(unittest.TestCase):
+ def test_generate_model_card_with_library_name(self):
+ with TemporaryDirectory() as tmpdir:
+ file_path = Path(tmpdir) / "README.md"
+ file_path.write_text("---\nlibrary_name: foo\n---\nContent\n")
+ model_card = load_or_create_model_card(file_path)
+ populate_model_card(model_card)
+ assert model_card.data.library_name == "foo"
diff --git a/tests/others/test_image_processor.py b/tests/others/test_image_processor.py
new file mode 100644
index 0000000..3397ca9
--- /dev/null
+++ b/tests/others/test_image_processor.py
@@ -0,0 +1,310 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import PIL.Image
+import torch
+
+from diffusers.image_processor import VaeImageProcessor
+
+
+class ImageProcessorTest(unittest.TestCase):
+ @property
+ def dummy_sample(self):
+ batch_size = 1
+ num_channels = 3
+ height = 8
+ width = 8
+
+ sample = torch.rand((batch_size, num_channels, height, width))
+
+ return sample
+
+ @property
+ def dummy_mask(self):
+ batch_size = 1
+ num_channels = 1
+ height = 8
+ width = 8
+
+ sample = torch.rand((batch_size, num_channels, height, width))
+
+ return sample
+
+ def to_np(self, image):
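+ # Normalize PIL and torch inputs to channel-last numpy arrays so outputs can be compared uniformly.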
+ if isinstance(image[0], PIL.Image.Image):
+ return np.stack([np.array(i) for i in image], axis=0)
+ elif isinstance(image, torch.Tensor):
+ return image.cpu().numpy().transpose(0, 2, 3, 1)
+ return image
+
+ def test_vae_image_processor_pt(self):
+ image_processor = VaeImageProcessor(do_resize=False, do_normalize=True)
+
+ input_pt = self.dummy_sample
+ input_np = self.to_np(input_pt)
+
+ for output_type in ["pt", "np", "pil"]:
+ out = image_processor.postprocess(
+ image_processor.preprocess(input_pt),
+ output_type=output_type,
+ )
+ out_np = self.to_np(out)
+ in_np = (input_np * 255).round() if output_type == "pil" else input_np
+ assert (
+ np.abs(in_np - out_np).max() < 1e-6
+ ), f"decoded output does not match input for output_type {output_type}"
+
+ def test_vae_image_processor_np(self):
+ image_processor = VaeImageProcessor(do_resize=False, do_normalize=True)
+ input_np = self.dummy_sample.cpu().numpy().transpose(0, 2, 3, 1)
+
+ for output_type in ["pt", "np", "pil"]:
+ out = image_processor.postprocess(image_processor.preprocess(input_np), output_type=output_type)
+
+ out_np = self.to_np(out)
+ in_np = (input_np * 255).round() if output_type == "pil" else input_np
+ assert (
+ np.abs(in_np - out_np).max() < 1e-6
+ ), f"decoded output does not match input for output_type {output_type}"
+
+ def test_vae_image_processor_pil(self):
+ image_processor = VaeImageProcessor(do_resize=False, do_normalize=True)
+
+ input_np = self.dummy_sample.cpu().numpy().transpose(0, 2, 3, 1)
+ input_pil = image_processor.numpy_to_pil(input_np)
+
+ for output_type in ["pt", "np", "pil"]:
+ out = image_processor.postprocess(image_processor.preprocess(input_pil), output_type=output_type)
+ for i, o in zip(input_pil, out):
+ in_np = np.array(i)
+ out_np = self.to_np(out) if output_type == "pil" else (self.to_np(out) * 255).round()
+ assert (
+ np.abs(in_np - out_np).max() < 1e-6
+ ), f"decoded output does not match input for output_type {output_type}"
+
+ def test_preprocess_input_3d(self):
+ image_processor = VaeImageProcessor(do_resize=False, do_normalize=False)
+
+ input_pt_4d = self.dummy_sample
+ input_pt_3d = input_pt_4d.squeeze(0)
+
+ out_pt_4d = image_processor.postprocess(
+ image_processor.preprocess(input_pt_4d),
+ output_type="np",
+ )
+ out_pt_3d = image_processor.postprocess(
+ image_processor.preprocess(input_pt_3d),
+ output_type="np",
+ )
+
+ input_np_4d = self.to_np(self.dummy_sample)
+ input_np_3d = input_np_4d.squeeze(0)
+
+ out_np_4d = image_processor.postprocess(
+ image_processor.preprocess(input_np_4d),
+ output_type="np",
+ )
+ out_np_3d = image_processor.postprocess(
+ image_processor.preprocess(input_np_3d),
+ output_type="np",
+ )
+
+ assert np.abs(out_pt_4d - out_pt_3d).max() < 1e-6
+ assert np.abs(out_np_4d - out_np_3d).max() < 1e-6
+
+ def test_preprocess_input_list(self):
+ image_processor = VaeImageProcessor(do_resize=False, do_normalize=False)
+
+ input_pt_4d = self.dummy_sample
+ input_pt_list = list(input_pt_4d)
+
+ out_pt_4d = image_processor.postprocess(
+ image_processor.preprocess(input_pt_4d),
+ output_type="np",
+ )
+
+ out_pt_list = image_processor.postprocess(
+ image_processor.preprocess(input_pt_list),
+ output_type="np",
+ )
+
+ input_np_4d = self.to_np(self.dummy_sample)
+ input_np_list = list(input_np_4d)
+
+ out_np_4d = image_processor.postprocess(
+ image_processor.preprocess(input_np_4d),
+ output_type="np",
+ )
+
+ out_np_list = image_processor.postprocess(
+ image_processor.preprocess(input_np_list),
+ output_type="np",
+ )
+
+ assert np.abs(out_pt_4d - out_pt_list).max() < 1e-6
+ assert np.abs(out_np_4d - out_np_list).max() < 1e-6
+
+ def test_preprocess_input_mask_3d(self):
+ image_processor = VaeImageProcessor(
+ do_resize=False, do_normalize=False, do_binarize=True, do_convert_grayscale=True
+ )
+
+ input_pt_4d = self.dummy_mask
+ input_pt_3d = input_pt_4d.squeeze(0)
+ input_pt_2d = input_pt_3d.squeeze(0)
+
+ out_pt_4d = image_processor.postprocess(
+ image_processor.preprocess(input_pt_4d),
+ output_type="np",
+ )
+ out_pt_3d = image_processor.postprocess(
+ image_processor.preprocess(input_pt_3d),
+ output_type="np",
+ )
+
+ out_pt_2d = image_processor.postprocess(
+ image_processor.preprocess(input_pt_2d),
+ output_type="np",
+ )
+
+ input_np_4d = self.to_np(self.dummy_mask)
+ input_np_3d = input_np_4d.squeeze(0)
+ input_np_3d_1 = input_np_4d.squeeze(-1)
+ input_np_2d = input_np_3d.squeeze(-1)
+
+ out_np_4d = image_processor.postprocess(
+ image_processor.preprocess(input_np_4d),
+ output_type="np",
+ )
+ out_np_3d = image_processor.postprocess(
+ image_processor.preprocess(input_np_3d),
+ output_type="np",
+ )
+
+ out_np_3d_1 = image_processor.postprocess(
+ image_processor.preprocess(input_np_3d_1),
+ output_type="np",
+ )
+
+ out_np_2d = image_processor.postprocess(
+ image_processor.preprocess(input_np_2d),
+ output_type="np",
+ )
+
+ assert np.abs(out_pt_4d - out_pt_3d).max() == 0
+ assert np.abs(out_pt_4d - out_pt_2d).max() == 0
+ assert np.abs(out_np_4d - out_np_3d).max() == 0
+ assert np.abs(out_np_4d - out_np_3d_1).max() == 0
+ assert np.abs(out_np_4d - out_np_2d).max() == 0
+
+ def test_preprocess_input_mask_list(self):
+ image_processor = VaeImageProcessor(do_resize=False, do_normalize=False, do_convert_grayscale=True)
+
+ input_pt_4d = self.dummy_mask
+ input_pt_3d = input_pt_4d.squeeze(0)
+ input_pt_2d = input_pt_3d.squeeze(0)
+
+ inputs_pt = [input_pt_4d, input_pt_3d, input_pt_2d]
+ inputs_pt_list = [[input_pt] for input_pt in inputs_pt]
+
+ for input_pt, input_pt_list in zip(inputs_pt, inputs_pt_list):
+ out_pt = image_processor.postprocess(
+ image_processor.preprocess(input_pt),
+ output_type="np",
+ )
+ out_pt_list = image_processor.postprocess(
+ image_processor.preprocess(input_pt_list),
+ output_type="np",
+ )
+ assert np.abs(out_pt - out_pt_list).max() < 1e-6
+
+ input_np_4d = self.to_np(self.dummy_mask)
+ input_np_3d = input_np_4d.squeeze(0)
+ input_np_2d = input_np_3d.squeeze(-1)
+
+ inputs_np = [input_np_4d, input_np_3d, input_np_2d]
+ inputs_np_list = [[input_np] for input_np in inputs_np]
+
+ for input_np, input_np_list in zip(inputs_np, inputs_np_list):
+ out_np = image_processor.postprocess(
+ image_processor.preprocess(input_np),
+ output_type="np",
+ )
+ out_np_list = image_processor.postprocess(
+ image_processor.preprocess(input_np_list),
+ output_type="np",
+ )
+ assert np.abs(out_np - out_np_list).max() < 1e-6
+
+ def test_preprocess_input_mask_3d_batch(self):
+ image_processor = VaeImageProcessor(do_resize=False, do_normalize=False, do_convert_grayscale=True)
+
+ # create a dummy mask input with batch_size 2
+ dummy_mask_batch = torch.cat([self.dummy_mask] * 2, axis=0)
+
+ # squeeze out the channel dimension
+ input_pt_3d = dummy_mask_batch.squeeze(1)
+ input_np_3d = self.to_np(dummy_mask_batch).squeeze(-1)
+
+ input_pt_3d_list = list(input_pt_3d)
+ input_np_3d_list = list(input_np_3d)
+
+ out_pt_3d = image_processor.postprocess(
+ image_processor.preprocess(input_pt_3d),
+ output_type="np",
+ )
+ out_pt_3d_list = image_processor.postprocess(
+ image_processor.preprocess(input_pt_3d_list),
+ output_type="np",
+ )
+
+ assert np.abs(out_pt_3d - out_pt_3d_list).max() < 1e-6
+
+ out_np_3d = image_processor.postprocess(
+ image_processor.preprocess(input_np_3d),
+ output_type="np",
+ )
+ out_np_3d_list = image_processor.postprocess(
+ image_processor.preprocess(input_np_3d_list),
+ output_type="np",
+ )
+
+ assert np.abs(out_np_3d - out_np_3d_list).max() < 1e-6
+
+ def test_vae_image_processor_resize_pt(self):
+ image_processor = VaeImageProcessor(do_resize=True, vae_scale_factor=1)
+ input_pt = self.dummy_sample
+ b, c, h, w = input_pt.shape
+ scale = 2
+ out_pt = image_processor.resize(image=input_pt, height=h // scale, width=w // scale)
+ exp_pt_shape = (b, c, h // scale, w // scale)
+ assert (
+ out_pt.shape == exp_pt_shape
+ ), f"resized image output shape '{out_pt.shape}' didn't match expected shape '{exp_pt_shape}'."
+
+ def test_vae_image_processor_resize_np(self):
+ image_processor = VaeImageProcessor(do_resize=True, vae_scale_factor=1)
+ input_pt = self.dummy_sample
+ b, c, h, w = input_pt.shape
+ scale = 2
+ input_np = self.to_np(input_pt)
+ out_np = image_processor.resize(image=input_np, height=h // scale, width=w // scale)
+ exp_np_shape = (b, h // scale, w // scale, c)
+ assert (
+ out_np.shape == exp_np_shape
+ ), f"resized image output shape '{out_np.shape}' didn't match expected shape '{exp_np_shape}'."
diff --git a/tests/others/test_outputs.py b/tests/others/test_outputs.py
new file mode 100644
index 0000000..cf709d9
--- /dev/null
+++ b/tests/others/test_outputs.py
@@ -0,0 +1,93 @@
+import pickle as pkl
+import unittest
+from dataclasses import dataclass
+from typing import List, Union
+
+import numpy as np
+import PIL.Image
+
+from diffusers.utils.outputs import BaseOutput
+from diffusers.utils.testing_utils import require_torch
+
+
+@dataclass
+class CustomOutput(BaseOutput):
+ images: Union[List[PIL.Image.Image], np.ndarray]
+
+
+class ConfigTester(unittest.TestCase):
+ def test_outputs_single_attribute(self):
+ outputs = CustomOutput(images=np.random.rand(1, 3, 4, 4))
+
+ # check every way of getting the attribute
+ assert isinstance(outputs.images, np.ndarray)
+ assert outputs.images.shape == (1, 3, 4, 4)
+ assert isinstance(outputs["images"], np.ndarray)
+ assert outputs["images"].shape == (1, 3, 4, 4)
+ assert isinstance(outputs[0], np.ndarray)
+ assert outputs[0].shape == (1, 3, 4, 4)
+
+ # test with a non-tensor attribute
+ outputs = CustomOutput(images=[PIL.Image.new("RGB", (4, 4))])
+
+ # check every way of getting the attribute
+ assert isinstance(outputs.images, list)
+ assert isinstance(outputs.images[0], PIL.Image.Image)
+ assert isinstance(outputs["images"], list)
+ assert isinstance(outputs["images"][0], PIL.Image.Image)
+ assert isinstance(outputs[0], list)
+ assert isinstance(outputs[0][0], PIL.Image.Image)
+
+ def test_outputs_dict_init(self):
+ # test output reinitialization with a `dict` for compatibility with `accelerate`
+ outputs = CustomOutput({"images": np.random.rand(1, 3, 4, 4)})
+
+ # check every way of getting the attribute
+ assert isinstance(outputs.images, np.ndarray)
+ assert outputs.images.shape == (1, 3, 4, 4)
+ assert isinstance(outputs["images"], np.ndarray)
+ assert outputs["images"].shape == (1, 3, 4, 4)
+ assert isinstance(outputs[0], np.ndarray)
+ assert outputs[0].shape == (1, 3, 4, 4)
+
+ # test with a non-tensor attribute
+ outputs = CustomOutput({"images": [PIL.Image.new("RGB", (4, 4))]})
+
+ # check every way of getting the attribute
+ assert isinstance(outputs.images, list)
+ assert isinstance(outputs.images[0], PIL.Image.Image)
+ assert isinstance(outputs["images"], list)
+ assert isinstance(outputs["images"][0], PIL.Image.Image)
+ assert isinstance(outputs[0], list)
+ assert isinstance(outputs[0][0], PIL.Image.Image)
+
+ def test_outputs_serialization(self):
+ outputs_orig = CustomOutput(images=[PIL.Image.new("RGB", (4, 4))])
+ serialized = pkl.dumps(outputs_orig)
+ outputs_copy = pkl.loads(serialized)
+
+ # Check original and copy are equal
+ assert dir(outputs_orig) == dir(outputs_copy)
+ assert dict(outputs_orig) == dict(outputs_copy)
+ assert vars(outputs_orig) == vars(outputs_copy)
+
+ @require_torch
+ def test_torch_pytree(self):
+ # ensure torch.utils._pytree treats ModelOutput subclasses as nodes (and not leaves)
+ # this is important for DistributedDataParallel gradient synchronization with static_graph=True
+ import torch
+ import torch.utils._pytree
+
+ data = np.random.rand(1, 3, 4, 4)
+ x = CustomOutput(images=data)
+ self.assertFalse(torch.utils._pytree._is_leaf(x))
+
+ expected_flat_outs = [data]
+ expected_tree_spec = torch.utils._pytree.TreeSpec(CustomOutput, ["images"], [torch.utils._pytree.LeafSpec()])
+
+ actual_flat_outs, actual_tree_spec = torch.utils._pytree.tree_flatten(x)
+ self.assertEqual(expected_flat_outs, actual_flat_outs)
+ self.assertEqual(expected_tree_spec, actual_tree_spec)
+
+ unflattened_x = torch.utils._pytree.tree_unflatten(actual_flat_outs, actual_tree_spec)
+ self.assertEqual(x, unflattened_x)
diff --git a/tests/others/test_training.py b/tests/others/test_training.py
new file mode 100644
index 0000000..863ba6e
--- /dev/null
+++ b/tests/others/test_training.py
@@ -0,0 +1,86 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import torch
+
+from diffusers import DDIMScheduler, DDPMScheduler, UNet2DModel
+from diffusers.training_utils import set_seed
+from diffusers.utils.testing_utils import slow
+
+
+torch.backends.cuda.matmul.allow_tf32 = False
+
+
+class TrainingTests(unittest.TestCase):
+ def get_model_optimizer(self, resolution=32):
+ set_seed(0)
+ model = UNet2DModel(sample_size=resolution, in_channels=3, out_channels=3)
+ optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)
+ return model, optimizer
+
+ @slow
+ def test_training_step_equality(self):
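+ # DDPM and DDIM share the same forward (noising) process, so with identical configs and batches
+ # both schedulers should yield the same noisy samples and model predictions.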
+ device = "cpu" # ensure full determinism without setting the CUBLAS_WORKSPACE_CONFIG env variable
+ ddpm_scheduler = DDPMScheduler(
+ num_train_timesteps=1000,
+ beta_start=0.0001,
+ beta_end=0.02,
+ beta_schedule="linear",
+ clip_sample=True,
+ )
+ ddim_scheduler = DDIMScheduler(
+ num_train_timesteps=1000,
+ beta_start=0.0001,
+ beta_end=0.02,
+ beta_schedule="linear",
+ clip_sample=True,
+ )
+
+ assert ddpm_scheduler.config.num_train_timesteps == ddim_scheduler.config.num_train_timesteps
+
+ # shared batches for DDPM and DDIM
+ set_seed(0)
+ clean_images = [torch.randn((4, 3, 32, 32)).clip(-1, 1).to(device) for _ in range(4)]
+ noise = [torch.randn((4, 3, 32, 32)).to(device) for _ in range(4)]
+ timesteps = [torch.randint(0, 1000, (4,)).long().to(device) for _ in range(4)]
+
+ # train with a DDPM scheduler
+ model, optimizer = self.get_model_optimizer(resolution=32)
+ model.train().to(device)
+ for i in range(4):
+ optimizer.zero_grad()
+ ddpm_noisy_images = ddpm_scheduler.add_noise(clean_images[i], noise[i], timesteps[i])
+ ddpm_noise_pred = model(ddpm_noisy_images, timesteps[i]).sample
+ loss = torch.nn.functional.mse_loss(ddpm_noise_pred, noise[i])
+ loss.backward()
+ optimizer.step()
+ del model, optimizer
+
+ # recreate the model and optimizer, and retry with DDIM
+ model, optimizer = self.get_model_optimizer(resolution=32)
+ model.train().to(device)
+ for i in range(4):
+ optimizer.zero_grad()
+ ddim_noisy_images = ddim_scheduler.add_noise(clean_images[i], noise[i], timesteps[i])
+ ddim_noise_pred = model(ddim_noisy_images, timesteps[i]).sample
+ loss = torch.nn.functional.mse_loss(ddim_noise_pred, noise[i])
+ loss.backward()
+ optimizer.step()
+ del model, optimizer
+
+ self.assertTrue(torch.allclose(ddpm_noisy_images, ddim_noisy_images, atol=1e-5))
+ self.assertTrue(torch.allclose(ddpm_noise_pred, ddim_noise_pred, atol=1e-5))
diff --git a/tests/others/test_utils.py b/tests/others/test_utils.py
new file mode 100755
index 0000000..9ebae06
--- /dev/null
+++ b/tests/others/test_utils.py
@@ -0,0 +1,213 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import unittest
+from distutils.util import strtobool
+
+import pytest
+
+from diffusers import __version__
+from diffusers.utils import deprecate
+
+
+# Used to test the hub
+USER = "__DUMMY_TRANSFORMERS_USER__"
+ENDPOINT_STAGING = "https://hub-ci.huggingface.co"
+
+# Not critical, only usable on the sandboxed CI instance.
+TOKEN = "hf_94wBhPGp6KrrTH3KDchhKpRxZwd6dmHWLL"
+
+
+class DeprecateTester(unittest.TestCase):
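+ # A target version above the installed one keeps a deprecation pending (it only warns);
+ # an already-passed version makes `deprecate` raise (see test_deprecate_incorrect_version).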
+ higher_version = ".".join([str(int(__version__.split(".")[0]) + 1)] + __version__.split(".")[1:])
+ lower_version = "0.0.1"
+
+ def test_deprecate_function_arg(self):
+ kwargs = {"deprecated_arg": 4}
+
+ with self.assertWarns(FutureWarning) as warning:
+ output = deprecate("deprecated_arg", self.higher_version, "message", take_from=kwargs)
+
+ assert output == 4
+ assert (
+ str(warning.warning)
+ == f"The `deprecated_arg` argument is deprecated and will be removed in version {self.higher_version}."
+ " message"
+ )
+
+ def test_deprecate_function_arg_tuple(self):
+ kwargs = {"deprecated_arg": 4}
+
+ with self.assertWarns(FutureWarning) as warning:
+ output = deprecate(("deprecated_arg", self.higher_version, "message"), take_from=kwargs)
+
+ assert output == 4
+ assert (
+ str(warning.warning)
+ == f"The `deprecated_arg` argument is deprecated and will be removed in version {self.higher_version}."
+ " message"
+ )
+
+ def test_deprecate_function_args(self):
+ kwargs = {"deprecated_arg_1": 4, "deprecated_arg_2": 8}
+ with self.assertWarns(FutureWarning) as warning:
+ output_1, output_2 = deprecate(
+ ("deprecated_arg_1", self.higher_version, "Hey"),
+ ("deprecated_arg_2", self.higher_version, "Hey"),
+ take_from=kwargs,
+ )
+ assert output_1 == 4
+ assert output_2 == 8
+ assert (
+ str(warning.warnings[0].message)
+ == "The `deprecated_arg_1` argument is deprecated and will be removed in version"
+ f" {self.higher_version}. Hey"
+ )
+ assert (
+ str(warning.warnings[1].message)
+ == "The `deprecated_arg_2` argument is deprecated and will be removed in version"
+ f" {self.higher_version}. Hey"
+ )
+
+ def test_deprecate_function_incorrect_arg(self):
+ kwargs = {"deprecated_arg": 4}
+
+ with self.assertRaises(TypeError) as error:
+ deprecate(("wrong_arg", self.higher_version, "message"), take_from=kwargs)
+
+ assert "test_deprecate_function_incorrect_arg in" in str(error.exception)
+ assert "line" in str(error.exception)
+ assert "got an unexpected keyword argument `deprecated_arg`" in str(error.exception)
+
+ def test_deprecate_arg_no_kwarg(self):
+ with self.assertWarns(FutureWarning) as warning:
+ deprecate(("deprecated_arg", self.higher_version, "message"))
+
+ assert (
+ str(warning.warning)
+ == f"`deprecated_arg` is deprecated and will be removed in version {self.higher_version}. message"
+ )
+
+ def test_deprecate_args_no_kwarg(self):
+ with self.assertWarns(FutureWarning) as warning:
+ deprecate(
+ ("deprecated_arg_1", self.higher_version, "Hey"),
+ ("deprecated_arg_2", self.higher_version, "Hey"),
+ )
+ assert (
+ str(warning.warnings[0].message)
+ == f"`deprecated_arg_1` is deprecated and will be removed in version {self.higher_version}. Hey"
+ )
+ assert (
+ str(warning.warnings[1].message)
+ == f"`deprecated_arg_2` is deprecated and will be removed in version {self.higher_version}. Hey"
+ )
+
+ def test_deprecate_class_obj(self):
+ class Args:
+ arg = 5
+
+ with self.assertWarns(FutureWarning) as warning:
+ arg = deprecate(("arg", self.higher_version, "message"), take_from=Args())
+
+ assert arg == 5
+ assert (
+ str(warning.warning)
+ == f"The `arg` attribute is deprecated and will be removed in version {self.higher_version}. message"
+ )
+
+ def test_deprecate_class_objs(self):
+ class Args:
+ arg = 5
+ foo = 7
+
+ with self.assertWarns(FutureWarning) as warning:
+ arg_1, arg_2 = deprecate(
+ ("arg", self.higher_version, "message"),
+ ("foo", self.higher_version, "message"),
+ ("does not exist", self.higher_version, "message"),
+ take_from=Args(),
+ )
+
+ assert arg_1 == 5
+ assert arg_2 == 7
+ assert (
+ str(warning.warning)
+ == f"The `arg` attribute is deprecated and will be removed in version {self.higher_version}. message"
+ )
+ assert (
+ str(warning.warnings[0].message)
+ == f"The `arg` attribute is deprecated and will be removed in version {self.higher_version}. message"
+ )
+ assert (
+ str(warning.warnings[1].message)
+ == f"The `foo` attribute is deprecated and will be removed in version {self.higher_version}. message"
+ )
+
+ def test_deprecate_incorrect_version(self):
+ kwargs = {"deprecated_arg": 4}
+
+ with self.assertRaises(ValueError) as error:
+ deprecate(("wrong_arg", self.lower_version, "message"), take_from=kwargs)
+
+ assert (
+ str(error.exception)
+ == "The deprecation tuple ('wrong_arg', '0.0.1', 'message') should be removed since diffusers' version"
+ f" {__version__} is >= {self.lower_version}"
+ )
+
+ def test_deprecate_incorrect_no_standard_warn(self):
+ with self.assertWarns(FutureWarning) as warning:
+ deprecate(("deprecated_arg", self.higher_version, "This message is better!!!"), standard_warn=False)
+
+ assert str(warning.warning) == "This message is better!!!"
+
+ def test_deprecate_stacklevel(self):
+ with self.assertWarns(FutureWarning) as warning:
+ deprecate(("deprecated_arg", self.higher_version, "This message is better!!!"), standard_warn=False)
+ assert str(warning.warning) == "This message is better!!!"
+ assert "diffusers/tests/others/test_utils.py" in warning.filename
+
+
+def parse_flag_from_env(key, default=False):
+ try:
+ value = os.environ[key]
+ except KeyError:
+ # KEY isn't set, default to `default`.
+ _value = default
+ else:
+ # KEY is set, convert it to True or False.
+ try:
+ _value = strtobool(value)
+ except ValueError:
+ # More values are supported, but let's keep the message simple.
+ raise ValueError(f"If set, {key} must be yes or no.")
+ return _value
+
+
+_run_staging = parse_flag_from_env("HUGGINGFACE_CO_STAGING", default=False)
+
+
+def is_staging_test(test_case):
+ """
+ Decorator marking a test as a staging test.
+
+ Those tests will run using the staging environment of huggingface.co instead of the real model hub.
+ """
+ if not _run_staging:
+ return unittest.skip("test is staging test")(test_case)
+ else:
+ return pytest.mark.is_staging_test()(test_case)
diff --git a/tests/pipelines/__init__.py b/tests/pipelines/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/amused/__init__.py b/tests/pipelines/amused/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/amused/test_amused.py b/tests/pipelines/amused/test_amused.py
new file mode 100644
index 0000000..f03751e
--- /dev/null
+++ b/tests/pipelines/amused/test_amused.py
@@ -0,0 +1,181 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import AmusedPipeline, AmusedScheduler, UVit2DModel, VQModel
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, slow, torch_device
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class AmusedPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = AmusedPipeline
+ params = TEXT_TO_IMAGE_PARAMS | {"encoder_hidden_states", "negative_encoder_hidden_states"}
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ transformer = UVit2DModel(
+ hidden_size=32,
+ use_bias=False,
+ hidden_dropout=0.0,
+ cond_embed_dim=32,
+ micro_cond_encode_dim=2,
+ micro_cond_embed_dim=10,
+ encoder_hidden_size=32,
+ vocab_size=32,
+ codebook_size=32,
+ in_channels=32,
+ block_out_channels=32,
+ num_res_blocks=1,
+ downsample=True,
+ upsample=True,
+ block_num_heads=1,
+ num_hidden_layers=1,
+ num_attention_heads=1,
+ attention_dropout=0.0,
+ intermediate_size=32,
+ layer_norm_eps=1e-06,
+ ln_elementwise_affine=True,
+ )
+ scheduler = AmusedScheduler(mask_token_id=31)
+ torch.manual_seed(0)
+ vqvae = VQModel(
+ act_fn="silu",
+ block_out_channels=[32],
+ down_block_types=[
+ "DownEncoderBlock2D",
+ ],
+ in_channels=3,
+ latent_channels=32,
+ layers_per_block=2,
+ norm_num_groups=32,
+ num_vq_embeddings=32,
+ out_channels=3,
+ sample_size=32,
+ up_block_types=[
+ "UpDecoderBlock2D",
+ ],
+ mid_block_add_attention=False,
+ lookup_from_codebook=True,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=64,
+ layer_norm_eps=1e-05,
+ num_attention_heads=8,
+ num_hidden_layers=3,
+ pad_token_id=1,
+ vocab_size=1000,
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "transformer": transformer,
+ "scheduler": scheduler,
+ "vqvae": vqvae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ "height": 4,
+ "width": 4,
+ }
+ return inputs
+
+ def test_inference_batch_consistent(self, batch_sizes=[2]):
+ self._test_inference_batch_consistent(batch_sizes=batch_sizes, batch_generator=False)
+
+ @unittest.skip("aMUSEd does not support lists of generators")
+ def test_inference_batch_single_identical(self):
+ ...
+
+
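+# Slow tests load the released amused checkpoints end to end and compare a 3x3 corner
+# slice of the generated image against reference values within a small tolerance.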
+@slow
+@require_torch_gpu
+class AmusedPipelineSlowTests(unittest.TestCase):
+ def test_amused_256(self):
+ pipe = AmusedPipeline.from_pretrained("amused/amused-256")
+ pipe.to(torch_device)
+
+ image = pipe("dog", generator=torch.Generator().manual_seed(0), num_inference_steps=2, output_type="np").images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 256, 256, 3)
+ expected_slice = np.array([0.4011, 0.3992, 0.3790, 0.3856, 0.3772, 0.3711, 0.3919, 0.3850, 0.3625])
+ assert np.abs(image_slice - expected_slice).max() < 3e-3
+
+ def test_amused_256_fp16(self):
+ pipe = AmusedPipeline.from_pretrained("amused/amused-256", variant="fp16", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+
+ image = pipe("dog", generator=torch.Generator().manual_seed(0), num_inference_steps=2, output_type="np").images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 256, 256, 3)
+ expected_slice = np.array([0.0554, 0.05129, 0.0344, 0.0452, 0.0476, 0.0271, 0.0495, 0.0527, 0.0158])
+ assert np.abs(image_slice - expected_slice).max() < 7e-3
+
+ def test_amused_512(self):
+ pipe = AmusedPipeline.from_pretrained("amused/amused-512")
+ pipe.to(torch_device)
+
+ image = pipe("dog", generator=torch.Generator().manual_seed(0), num_inference_steps=2, output_type="np").images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.9960, 0.9960, 0.9946, 0.9980, 0.9947, 0.9932, 0.9960, 0.9961, 0.9947])
+ assert np.abs(image_slice - expected_slice).max() < 3e-3
+
+ def test_amused_512_fp16(self):
+ pipe = AmusedPipeline.from_pretrained("amused/amused-512", variant="fp16", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+
+ image = pipe("dog", generator=torch.Generator().manual_seed(0), num_inference_steps=2, output_type="np").images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.9983, 1.0, 1.0, 1.0, 1.0, 0.9989, 0.9994, 0.9976, 0.9977])
+ assert np.abs(image_slice - expected_slice).max() < 3e-3
diff --git a/tests/pipelines/amused/test_amused_img2img.py b/tests/pipelines/amused/test_amused_img2img.py
new file mode 100644
index 0000000..efbca1f
--- /dev/null
+++ b/tests/pipelines/amused/test_amused_img2img.py
@@ -0,0 +1,235 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import AmusedImg2ImgPipeline, AmusedScheduler, UVit2DModel, VQModel
+from diffusers.utils import load_image
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, slow, torch_device
+
+from ..pipeline_params import TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS, TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class AmusedImg2ImgPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = AmusedImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width", "latents"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ required_optional_params = PipelineTesterMixin.required_optional_params - {
+ "latents",
+ }
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ transformer = UVit2DModel(
+ hidden_size=32,
+ use_bias=False,
+ hidden_dropout=0.0,
+ cond_embed_dim=32,
+ micro_cond_encode_dim=2,
+ micro_cond_embed_dim=10,
+ encoder_hidden_size=32,
+ vocab_size=32,
+ codebook_size=32,
+ in_channels=32,
+ block_out_channels=32,
+ num_res_blocks=1,
+ downsample=True,
+ upsample=True,
+ block_num_heads=1,
+ num_hidden_layers=1,
+ num_attention_heads=1,
+ attention_dropout=0.0,
+ intermediate_size=32,
+ layer_norm_eps=1e-06,
+ ln_elementwise_affine=True,
+ )
+ scheduler = AmusedScheduler(mask_token_id=31)
+ torch.manual_seed(0)
+ vqvae = VQModel(
+ act_fn="silu",
+ block_out_channels=[32],
+ down_block_types=[
+ "DownEncoderBlock2D",
+ ],
+ in_channels=3,
+ latent_channels=32,
+ layers_per_block=2,
+ norm_num_groups=32,
+ num_vq_embeddings=32,
+ out_channels=3,
+ sample_size=32,
+ up_block_types=[
+ "UpDecoderBlock2D",
+ ],
+ mid_block_add_attention=False,
+ lookup_from_codebook=True,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=64,
+ layer_norm_eps=1e-05,
+ num_attention_heads=8,
+ num_hidden_layers=3,
+ pad_token_id=1,
+ vocab_size=1000,
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "transformer": transformer,
+ "scheduler": scheduler,
+ "vqvae": vqvae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ image = torch.full((1, 3, 4, 4), 1.0, dtype=torch.float32, device=device)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ "image": image,
+ }
+ return inputs
+
+ def test_inference_batch_consistent(self, batch_sizes=[2]):
+ self._test_inference_batch_consistent(batch_sizes=batch_sizes, batch_generator=False)
+
+ @unittest.skip("aMUSEd does not support lists of generators")
+ def test_inference_batch_single_identical(self):
+ ...
+
+
+@slow
+@require_torch_gpu
+class AmusedImg2ImgPipelineSlowTests(unittest.TestCase):
+ def test_amused_256(self):
+ pipe = AmusedImg2ImgPipeline.from_pretrained("amused/amused-256")
+ pipe.to(torch_device)
+
+ image = (
+ load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains.jpg")
+ .resize((256, 256))
+ .convert("RGB")
+ )
+
+ image = pipe(
+ "winter mountains",
+ image,
+ generator=torch.Generator().manual_seed(0),
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 256, 256, 3)
+ expected_slice = np.array([0.9993, 1.0, 0.9996, 1.0, 0.9995, 0.9925, 0.9990, 0.9954, 1.0])
+
+ assert np.abs(image_slice - expected_slice).max() < 1e-2
+
+ def test_amused_256_fp16(self):
+ pipe = AmusedImg2ImgPipeline.from_pretrained("amused/amused-256", torch_dtype=torch.float16, variant="fp16")
+ pipe.to(torch_device)
+
+ image = (
+ load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains.jpg")
+ .resize((256, 256))
+ .convert("RGB")
+ )
+
+ image = pipe(
+ "winter mountains",
+ image,
+ generator=torch.Generator().manual_seed(0),
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 256, 256, 3)
+ expected_slice = np.array([0.9980, 0.9980, 0.9940, 0.9944, 0.9960, 0.9908, 1.0, 1.0, 0.9986])
+
+ assert np.abs(image_slice - expected_slice).max() < 1e-2
+
+ def test_amused_512(self):
+ pipe = AmusedImg2ImgPipeline.from_pretrained("amused/amused-512")
+ pipe.to(torch_device)
+
+ image = (
+ load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains.jpg")
+ .resize((512, 512))
+ .convert("RGB")
+ )
+
+ image = pipe(
+ "winter mountains",
+ image,
+ generator=torch.Generator().manual_seed(0),
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.1344, 0.0985, 0.0, 0.1194, 0.1809, 0.0765, 0.0854, 0.1371, 0.0933])
+ assert np.abs(image_slice - expected_slice).max() < 0.1
+
+ def test_amused_512_fp16(self):
+ pipe = AmusedImg2ImgPipeline.from_pretrained("amused/amused-512", variant="fp16", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+
+ image = (
+ load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains.jpg")
+ .resize((512, 512))
+ .convert("RGB")
+ )
+
+ image = pipe(
+ "winter mountains",
+ image,
+ generator=torch.Generator().manual_seed(0),
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.1536, 0.1767, 0.0227, 0.1079, 0.2400, 0.1427, 0.1511, 0.1564, 0.1542])
+ assert np.abs(image_slice - expected_slice).max() < 0.1
diff --git a/tests/pipelines/amused/test_amused_inpaint.py b/tests/pipelines/amused/test_amused_inpaint.py
new file mode 100644
index 0000000..d397f8d
--- /dev/null
+++ b/tests/pipelines/amused/test_amused_inpaint.py
@@ -0,0 +1,273 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import AmusedInpaintPipeline, AmusedScheduler, UVit2DModel, VQModel
+from diffusers.utils import load_image
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, slow, torch_device
+
+from ..pipeline_params import TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS, TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class AmusedInpaintPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = AmusedInpaintPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS - {"width", "height"}
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ required_optional_params = PipelineTesterMixin.required_optional_params - {
+ "latents",
+ }
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ transformer = UVit2DModel(
+ hidden_size=32,
+ use_bias=False,
+ hidden_dropout=0.0,
+ cond_embed_dim=32,
+ micro_cond_encode_dim=2,
+ micro_cond_embed_dim=10,
+ encoder_hidden_size=32,
+ vocab_size=32,
+ codebook_size=32,
+ in_channels=32,
+ block_out_channels=32,
+ num_res_blocks=1,
+ downsample=True,
+ upsample=True,
+ block_num_heads=1,
+ num_hidden_layers=1,
+ num_attention_heads=1,
+ attention_dropout=0.0,
+ intermediate_size=32,
+ layer_norm_eps=1e-06,
+ ln_elementwise_affine=True,
+ )
+ scheduler = AmusedScheduler(mask_token_id=31)
+ torch.manual_seed(0)
+ vqvae = VQModel(
+ act_fn="silu",
+ block_out_channels=[32],
+ down_block_types=[
+ "DownEncoderBlock2D",
+ ],
+ in_channels=3,
+ latent_channels=32,
+ layers_per_block=2,
+ norm_num_groups=32,
+ num_vq_embeddings=32,
+ out_channels=3,
+ sample_size=32,
+ up_block_types=[
+ "UpDecoderBlock2D",
+ ],
+ mid_block_add_attention=False,
+ lookup_from_codebook=True,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=64,
+ layer_norm_eps=1e-05,
+ num_attention_heads=8,
+ num_hidden_layers=3,
+ pad_token_id=1,
+ vocab_size=1000,
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "transformer": transformer,
+ "scheduler": scheduler,
+ "vqvae": vqvae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ image = torch.full((1, 3, 4, 4), 1.0, dtype=torch.float32, device=device)
+ mask_image = torch.full((1, 1, 4, 4), 1.0, dtype=torch.float32, device=device)
+ mask_image[0, 0, 0, 0] = 0
+ mask_image[0, 0, 0, 1] = 0
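+ # The all-ones mask marks the whole image for repainting; only the two zeroed pixels are kept from the input.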
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ "image": image,
+ "mask_image": mask_image,
+ }
+ return inputs
+
+ def test_inference_batch_consistent(self, batch_sizes=[2]):
+ self._test_inference_batch_consistent(batch_sizes=batch_sizes, batch_generator=False)
+
+ @unittest.skip("aMUSEd does not support lists of generators")
+ def test_inference_batch_single_identical(self):
+ ...
+
+
+@slow
+@require_torch_gpu
+class AmusedInpaintPipelineSlowTests(unittest.TestCase):
+ def test_amused_256(self):
+ pipe = AmusedInpaintPipeline.from_pretrained("amused/amused-256")
+ pipe.to(torch_device)
+
+ image = (
+ load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1.jpg")
+ .resize((256, 256))
+ .convert("RGB")
+ )
+
+ mask_image = (
+ load_image(
+ "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1_mask.png"
+ )
+ .resize((256, 256))
+ .convert("L")
+ )
+
+ image = pipe(
+ "winter mountains",
+ image,
+ mask_image,
+ generator=torch.Generator().manual_seed(0),
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 256, 256, 3)
+ expected_slice = np.array([0.0699, 0.0716, 0.0608, 0.0715, 0.0797, 0.0638, 0.0802, 0.0924, 0.0634])
+ assert np.abs(image_slice - expected_slice).max() < 0.1
+
+ def test_amused_256_fp16(self):
+ pipe = AmusedInpaintPipeline.from_pretrained("amused/amused-256", variant="fp16", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+
+ image = (
+ load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1.jpg")
+ .resize((256, 256))
+ .convert("RGB")
+ )
+
+ mask_image = (
+ load_image(
+ "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1_mask.png"
+ )
+ .resize((256, 256))
+ .convert("L")
+ )
+
+ image = pipe(
+ "winter mountains",
+ image,
+ mask_image,
+ generator=torch.Generator().manual_seed(0),
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 256, 256, 3)
+ expected_slice = np.array([0.0735, 0.0749, 0.0650, 0.0739, 0.0805, 0.0667, 0.0802, 0.0923, 0.0622])
+ assert np.abs(image_slice - expected_slice).max() < 0.1
+
+ def test_amused_512(self):
+ pipe = AmusedInpaintPipeline.from_pretrained("amused/amused-512")
+ pipe.to(torch_device)
+
+ image = (
+ load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1.jpg")
+ .resize((512, 512))
+ .convert("RGB")
+ )
+
+ mask_image = (
+ load_image(
+ "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1_mask.png"
+ )
+ .resize((512, 512))
+ .convert("L")
+ )
+
+ image = pipe(
+ "winter mountains",
+ image,
+ mask_image,
+ generator=torch.Generator().manual_seed(0),
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0005, 0.0])
+ assert np.abs(image_slice - expected_slice).max() < 0.05
+
+ def test_amused_512_fp16(self):
+ pipe = AmusedInpaintPipeline.from_pretrained("amused/amused-512", variant="fp16", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+
+ image = (
+ load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1.jpg")
+ .resize((512, 512))
+ .convert("RGB")
+ )
+
+ mask_image = (
+ load_image(
+ "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/open_muse/mountains_1_mask.png"
+ )
+ .resize((512, 512))
+ .convert("L")
+ )
+
+ image = pipe(
+ "winter mountains",
+ image,
+ mask_image,
+ generator=torch.Generator().manual_seed(0),
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0025, 0.0])
+ assert np.abs(image_slice - expected_slice).max() < 3e-3
diff --git a/tests/pipelines/animatediff/__init__.py b/tests/pipelines/animatediff/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/animatediff/test_animatediff.py b/tests/pipelines/animatediff/test_animatediff.py
new file mode 100644
index 0000000..288f856
--- /dev/null
+++ b/tests/pipelines/animatediff/test_animatediff.py
@@ -0,0 +1,358 @@
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+import diffusers
+from diffusers import (
+ AnimateDiffPipeline,
+ AutoencoderKL,
+ DDIMScheduler,
+ MotionAdapter,
+ UNet2DConditionModel,
+ UNetMotionModel,
+)
+from diffusers.utils import is_xformers_available, logging
+from diffusers.utils.testing_utils import numpy_cosine_similarity_distance, require_torch_gpu, slow, torch_device
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import IPAdapterTesterMixin, PipelineTesterMixin, SDFunctionTesterMixin
+
+
+def to_np(tensor):
+ if isinstance(tensor, torch.Tensor):
+ tensor = tensor.detach().cpu().numpy()
+
+ return tensor
+
+
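+# AnimateDiff fast tests build a toy UNet2DConditionModel plus MotionAdapter; the pipeline
+# merges them into a UNetMotionModel at construction time, which the tests below rely on.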
+class AnimateDiffPipelineFastTests(
+ IPAdapterTesterMixin, SDFunctionTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = AnimateDiffPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "generator",
+ "latents",
+ "return_dict",
+ "callback_on_step_end",
+ "callback_on_step_end_tensor_inputs",
+ ]
+ )
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=2,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="linear",
+ clip_sample=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ motion_adapter = MotionAdapter(
+ block_out_channels=(32, 64),
+ motion_layers_per_block=2,
+ motion_norm_num_groups=2,
+ motion_num_attention_heads=4,
+ )
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "motion_adapter": motion_adapter,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 7.5,
+ "output_type": "pt",
+ }
+ return inputs
+
+ def test_motion_unet_loading(self):
+ components = self.get_dummy_components()
+ pipe = AnimateDiffPipeline(**components)
+
+ assert isinstance(pipe.unet, UNetMotionModel)
+
+ @unittest.skip("Attention slicing is not enabled in this pipeline")
+ def test_attention_slicing_forward_pass(self):
+ pass
+
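+ # Batches the dummy inputs (prompts of varying length, one generator per sample) and checks
+ # that the first batched output matches the single-sample output within expected_max_diff.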
+ def test_inference_batch_single_identical(
+ self,
+ batch_size=2,
+ expected_max_diff=1e-4,
+ additional_params_copy_to_batched_inputs=["num_inference_steps"],
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_dummy_inputs(torch_device)
+ # Reset generator in case it has been used in self.get_dummy_inputs
+ inputs["generator"] = self.get_generator(0)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ batched_inputs = {}
+ batched_inputs.update(inputs)
+
+ for name in self.batch_params:
+ if name not in inputs:
+ continue
+
+ value = inputs[name]
+ if name == "prompt":
+ len_prompt = len(value)
+ batched_inputs[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+ batched_inputs[name][-1] = 100 * "very long"
+
+ else:
+ batched_inputs[name] = batch_size * [value]
+
+ if "generator" in inputs:
+ batched_inputs["generator"] = [self.get_generator(i) for i in range(batch_size)]
+
+ if "batch_size" in inputs:
+ batched_inputs["batch_size"] = batch_size
+
+ for arg in additional_params_copy_to_batched_inputs:
+ batched_inputs[arg] = inputs[arg]
+
+ output = pipe(**inputs)
+ output_batch = pipe(**batched_inputs)
+
+ assert output_batch[0].shape[0] == batch_size
+
+ max_diff = np.abs(to_np(output_batch[0][0]) - to_np(output[0][0])).max()
+ assert max_diff < expected_max_diff
+
+ @unittest.skipIf(torch_device != "cuda", reason="CUDA and CPU are required to switch devices")
+ def test_to_device(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe.to("cpu")
+ # pipeline creates a new motion UNet under the hood. So we need to check the device from pipe.components
+ model_devices = [
+ component.device.type for component in pipe.components.values() if hasattr(component, "device")
+ ]
+ self.assertTrue(all(device == "cpu" for device in model_devices))
+
+ output_cpu = pipe(**self.get_dummy_inputs("cpu"))[0]
+ self.assertTrue(np.isnan(output_cpu).sum() == 0)
+
+ pipe.to("cuda")
+ model_devices = [
+ component.device.type for component in pipe.components.values() if hasattr(component, "device")
+ ]
+ self.assertTrue(all(device == "cuda" for device in model_devices))
+
+ output_cuda = pipe(**self.get_dummy_inputs("cuda"))[0]
+ self.assertTrue(np.isnan(to_np(output_cuda)).sum() == 0)
+
+ def test_to_dtype(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ # pipeline creates a new motion UNet under the hood. So we need to check the dtype from pipe.components
+ model_dtypes = [component.dtype for component in pipe.components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float32 for dtype in model_dtypes))
+
+ pipe.to(dtype=torch.float16)
+ model_dtypes = [component.dtype for component in pipe.components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float16 for dtype in model_dtypes))
+
+ def test_prompt_embeds(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.to(torch_device)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs.pop("prompt")
+ inputs["prompt_embeds"] = torch.randn((1, 4, 32), device=torch_device)
+ pipe(**inputs)
+
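+ # FreeInit iteratively re-initializes the low-frequency components of the initial noise, so enabling
+ # it should change the output frames while disabling it should restore the baseline result.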
+ def test_free_init(self):
+ components = self.get_dummy_components()
+ pipe: AnimateDiffPipeline = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.to(torch_device)
+
+ inputs_normal = self.get_dummy_inputs(torch_device)
+ frames_normal = pipe(**inputs_normal).frames[0]
+
+ pipe.enable_free_init(
+ num_iters=2,
+ use_fast_sampling=True,
+ method="butterworth",
+ order=4,
+ spatial_stop_frequency=0.25,
+ temporal_stop_frequency=0.25,
+ )
+ inputs_enable_free_init = self.get_dummy_inputs(torch_device)
+ frames_enable_free_init = pipe(**inputs_enable_free_init).frames[0]
+
+ pipe.disable_free_init()
+ inputs_disable_free_init = self.get_dummy_inputs(torch_device)
+ frames_disable_free_init = pipe(**inputs_disable_free_init).frames[0]
+
+ sum_enabled = np.abs(to_np(frames_normal) - to_np(frames_enable_free_init)).sum()
+ max_diff_disabled = np.abs(to_np(frames_normal) - to_np(frames_disable_free_init)).max()
+ self.assertGreater(
+ sum_enabled, 1e1, "Enabling of FreeInit should lead to results different from the default pipeline results"
+ )
+ self.assertLess(
+ max_diff_disabled,
+ 1e-4,
+ "Disabling of FreeInit should lead to results similar to the default pipeline results",
+ )
+
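+ # Compares default attention against xformers memory-efficient attention; outputs must agree within 1e-4.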
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_without_offload = pipe(**inputs).frames[0]
+ output_without_offload = (
+ output_without_offload.cpu() if torch.is_tensor(output_without_offload) else output_without_offload
+ )
+
+ pipe.enable_xformers_memory_efficient_attention()
+ inputs = self.get_dummy_inputs(torch_device)
+ output_with_offload = pipe(**inputs).frames[0]
+ output_with_offload = (
+ output_with_offload.cpu() if torch.is_tensor(output_with_offload) else output_with_offload
+ )
+
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, 1e-4, "XFormers attention should not affect the inference results")
+
+
+@slow
+@require_torch_gpu
+class AnimateDiffPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_animatediff(self):
+ adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
+ pipe = AnimateDiffPipeline.from_pretrained("frankjoshua/toonyou_beta6", motion_adapter=adapter)
+ pipe = pipe.to(torch_device)
+ pipe.scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="linear",
+ steps_offset=1,
+ clip_sample=False,
+ )
+ pipe.enable_vae_slicing()
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "night, b&w photo of old house, post apocalypse, forest, storm weather, wind, rocks, 8k uhd, dslr, soft lighting, high quality, film grain"
+ negative_prompt = "bad quality, worse quality"
+
+ generator = torch.Generator("cpu").manual_seed(0)
+ output = pipe(
+ prompt,
+ negative_prompt=negative_prompt,
+ num_frames=16,
+ generator=generator,
+ guidance_scale=7.5,
+ num_inference_steps=3,
+ output_type="np",
+ )
+
+ image = output.frames[0]
+ assert image.shape == (16, 512, 512, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [
+ 0.11357737,
+ 0.11285847,
+ 0.11180121,
+ 0.11084166,
+ 0.11414117,
+ 0.09785956,
+ 0.10742754,
+ 0.10510018,
+ 0.08045256,
+ ]
+ )
+ assert numpy_cosine_similarity_distance(image_slice.flatten(), expected_slice.flatten()) < 1e-3
diff --git a/tests/pipelines/animatediff/test_animatediff_video2video.py b/tests/pipelines/animatediff/test_animatediff_video2video.py
new file mode 100644
index 0000000..6cc54d9
--- /dev/null
+++ b/tests/pipelines/animatediff/test_animatediff_video2video.py
@@ -0,0 +1,304 @@
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+import diffusers
+from diffusers import (
+ AnimateDiffVideoToVideoPipeline,
+ AutoencoderKL,
+ DDIMScheduler,
+ MotionAdapter,
+ UNet2DConditionModel,
+ UNetMotionModel,
+)
+from diffusers.utils import is_xformers_available, logging
+from diffusers.utils.testing_utils import torch_device
+
+from ..pipeline_params import TEXT_TO_IMAGE_PARAMS, VIDEO_TO_VIDEO_BATCH_PARAMS
+from ..test_pipelines_common import IPAdapterTesterMixin, PipelineTesterMixin
+
+
+def to_np(tensor):
+ if isinstance(tensor, torch.Tensor):
+ tensor = tensor.detach().cpu().numpy()
+
+ return tensor
+
+
+class AnimateDiffVideoToVideoPipelineFastTests(IPAdapterTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = AnimateDiffVideoToVideoPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = VIDEO_TO_VIDEO_BATCH_PARAMS
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "generator",
+ "latents",
+ "return_dict",
+ "callback_on_step_end",
+ "callback_on_step_end_tensor_inputs",
+ ]
+ )
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=2,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="linear",
+ clip_sample=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ motion_adapter = MotionAdapter(
+ block_out_channels=(32, 64),
+ motion_layers_per_block=2,
+ motion_norm_num_groups=2,
+ motion_num_attention_heads=4,
+ )
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "motion_adapter": motion_adapter,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ video_height = 32
+ video_width = 32
+ video_num_frames = 2
+ video = [Image.new("RGB", (video_width, video_height))] * video_num_frames
+
+ inputs = {
+ "video": video,
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 7.5,
+ "output_type": "pt",
+ }
+ return inputs
+
+ def test_motion_unet_loading(self):
+ components = self.get_dummy_components()
+ pipe = AnimateDiffVideoToVideoPipeline(**components)
+
+ assert isinstance(pipe.unet, UNetMotionModel)
+
+ @unittest.skip("Attention slicing is not enabled in this pipeline")
+ def test_attention_slicing_forward_pass(self):
+ pass
+
+ def test_inference_batch_single_identical(
+ self,
+ batch_size=2,
+ expected_max_diff=1e-4,
+ additional_params_copy_to_batched_inputs=["num_inference_steps"],
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_dummy_inputs(torch_device)
+ # Reset generator in case it has been used in self.get_dummy_inputs
+ inputs["generator"] = self.get_generator(0)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ batched_inputs = {}
+ batched_inputs.update(inputs)
+
+ for name in self.batch_params:
+ if name not in inputs:
+ continue
+
+ value = inputs[name]
+ if name == "prompt":
+ len_prompt = len(value)
+ batched_inputs[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+ batched_inputs[name][-1] = 100 * "very long"
+
+ else:
+ batched_inputs[name] = batch_size * [value]
+
+ if "generator" in inputs:
+ batched_inputs["generator"] = [self.get_generator(i) for i in range(batch_size)]
+
+ if "batch_size" in inputs:
+ batched_inputs["batch_size"] = batch_size
+
+ for arg in additional_params_copy_to_batched_inputs:
+ batched_inputs[arg] = inputs[arg]
+
+ output = pipe(**inputs)
+ output_batch = pipe(**batched_inputs)
+
+ assert output_batch[0].shape[0] == batch_size
+
+ max_diff = np.abs(to_np(output_batch[0][0]) - to_np(output[0][0])).max()
+ assert max_diff < expected_max_diff
+
+ @unittest.skipIf(torch_device != "cuda", reason="CUDA and CPU are required to switch devices")
+ def test_to_device(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe.to("cpu")
+ # pipeline creates a new motion UNet under the hood. So we need to check the device from pipe.components
+ model_devices = [
+ component.device.type for component in pipe.components.values() if hasattr(component, "device")
+ ]
+ self.assertTrue(all(device == "cpu" for device in model_devices))
+
+ output_cpu = pipe(**self.get_dummy_inputs("cpu"))[0]
+ self.assertTrue(np.isnan(output_cpu).sum() == 0)
+
+ pipe.to("cuda")
+ model_devices = [
+ component.device.type for component in pipe.components.values() if hasattr(component, "device")
+ ]
+ self.assertTrue(all(device == "cuda" for device in model_devices))
+
+ output_cuda = pipe(**self.get_dummy_inputs("cuda"))[0]
+ self.assertTrue(np.isnan(to_np(output_cuda)).sum() == 0)
+
+ def test_to_dtype(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ # pipeline creates a new motion UNet under the hood. So we need to check the dtype from pipe.components
+ model_dtypes = [component.dtype for component in pipe.components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float32 for dtype in model_dtypes))
+
+ pipe.to(dtype=torch.float16)
+ model_dtypes = [component.dtype for component in pipe.components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float16 for dtype in model_dtypes))
+
+ def test_prompt_embeds(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.to(torch_device)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs.pop("prompt")
+ inputs["prompt_embeds"] = torch.randn((1, 4, 32), device=torch_device)
+ pipe(**inputs)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_without_offload = pipe(**inputs).frames[0]
+ output_without_offload = (
+ output_without_offload.cpu() if torch.is_tensor(output_without_offload) else output_without_offload
+ )
+
+ pipe.enable_xformers_memory_efficient_attention()
+ inputs = self.get_dummy_inputs(torch_device)
+ output_with_offload = pipe(**inputs).frames[0]
+ output_with_offload = (
+ output_with_offload.cpu() if torch.is_tensor(output_with_offload) else output_with_offload
+ )
+
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, 1e-4, "XFormers attention should not affect the inference results")
+
+ def test_free_init(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.to(torch_device)
+
+ inputs_normal = self.get_dummy_inputs(torch_device)
+ frames_normal = pipe(**inputs_normal).frames[0]
+
+ pipe.enable_free_init(
+ num_iters=2,
+ use_fast_sampling=True,
+ method="butterworth",
+ order=4,
+ spatial_stop_frequency=0.25,
+ temporal_stop_frequency=0.25,
+ )
+ inputs_enable_free_init = self.get_dummy_inputs(torch_device)
+ frames_enable_free_init = pipe(**inputs_enable_free_init).frames[0]
+
+ pipe.disable_free_init()
+ inputs_disable_free_init = self.get_dummy_inputs(torch_device)
+ frames_disable_free_init = pipe(**inputs_disable_free_init).frames[0]
+
+ sum_enabled = np.abs(to_np(frames_normal) - to_np(frames_enable_free_init)).sum()
+ max_diff_disabled = np.abs(to_np(frames_normal) - to_np(frames_disable_free_init)).max()
+ self.assertGreater(
+ sum_enabled, 1e1, "Enabling of FreeInit should lead to results different from the default pipeline results"
+ )
+ self.assertLess(
+ max_diff_disabled,
+ 1e-4,
+ "Disabling of FreeInit should lead to results similar to the default pipeline results",
+ )
diff --git a/tests/pipelines/audioldm/__init__.py b/tests/pipelines/audioldm/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/audioldm/test_audioldm.py b/tests/pipelines/audioldm/test_audioldm.py
new file mode 100644
index 0000000..84b5788
--- /dev/null
+++ b/tests/pipelines/audioldm/test_audioldm.py
@@ -0,0 +1,447 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+import torch.nn.functional as F
+from transformers import (
+ ClapTextConfig,
+ ClapTextModelWithProjection,
+ RobertaTokenizer,
+ SpeechT5HifiGan,
+ SpeechT5HifiGanConfig,
+)
+
+from diffusers import (
+ AudioLDMPipeline,
+ AutoencoderKL,
+ DDIMScheduler,
+ LMSDiscreteScheduler,
+ PNDMScheduler,
+ UNet2DConditionModel,
+)
+from diffusers.utils import is_xformers_available
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, torch_device
+
+from ..pipeline_params import TEXT_TO_AUDIO_BATCH_PARAMS, TEXT_TO_AUDIO_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
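+# AudioLDM fast tests pair a tiny latent-diffusion UNet/VAE with a CLAP text encoder and a
+# SpeechT5 HiFi-GAN vocoder; the pipeline returns 1-D waveforms rather than images.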
+class AudioLDMPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = AudioLDMPipeline
+ params = TEXT_TO_AUDIO_PARAMS
+ batch_params = TEXT_TO_AUDIO_BATCH_PARAMS
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "num_waveforms_per_prompt",
+ "generator",
+ "latents",
+ "output_type",
+ "return_dict",
+ "callback",
+ "callback_steps",
+ ]
+ )
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=(32, 64),
+ class_embed_type="simple_projection",
+ projection_class_embeddings_input_dim=32,
+ class_embeddings_concat=True,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=1,
+ out_channels=1,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = ClapTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ projection_dim=32,
+ )
+ text_encoder = ClapTextModelWithProjection(text_encoder_config)
+ tokenizer = RobertaTokenizer.from_pretrained("hf-internal-testing/tiny-random-roberta", model_max_length=77)
+
+ vocoder_config = SpeechT5HifiGanConfig(
+ model_in_dim=8,
+ sampling_rate=16000,
+ upsample_initial_channel=16,
+ upsample_rates=[2, 2],
+ upsample_kernel_sizes=[4, 4],
+ resblock_kernel_sizes=[3, 7],
+ resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5]],
+ normalize_before=False,
+ )
+
+ vocoder = SpeechT5HifiGan(vocoder_config)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "vocoder": vocoder,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A hammer hitting a wooden surface",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ }
+ return inputs
+
+ def test_audioldm_ddim(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDMPipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = audioldm_pipe(**inputs)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 256
+
+ audio_slice = audio[:10]
+ expected_slice = np.array(
+ [-0.0050, 0.0050, -0.0060, 0.0033, -0.0026, 0.0033, -0.0027, 0.0033, -0.0028, 0.0033]
+ )
+
+ assert np.abs(audio_slice - expected_slice).max() < 1e-2
+
+ def test_audioldm_prompt_embeds(self):
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDMPipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = audioldm_pipe(**inputs)
+ audio_1 = output.audios[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ text_inputs = audioldm_pipe.tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=audioldm_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ prompt_embeds = audioldm_pipe.text_encoder(
+ text_inputs,
+ )
+ prompt_embeds = prompt_embeds.text_embeds
+ # additional L_2 normalization over each hidden-state
+ prompt_embeds = F.normalize(prompt_embeds, dim=-1)
+
+ inputs["prompt_embeds"] = prompt_embeds
+
+ # forward
+ output = audioldm_pipe(**inputs)
+ audio_2 = output.audios[0]
+
+ assert np.abs(audio_1 - audio_2).max() < 1e-2
+
+ def test_audioldm_negative_prompt_embeds(self):
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDMPipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = audioldm_pipe(**inputs)
+ audio_1 = output.audios[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ embeds = []
+ for p in [prompt, negative_prompt]:
+ text_inputs = audioldm_pipe.tokenizer(
+ p,
+ padding="max_length",
+ max_length=audioldm_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ text_embeds = audioldm_pipe.text_encoder(
+ text_inputs,
+ )
+ text_embeds = text_embeds.text_embeds
+ # additional L_2 normalization over each hidden-state
+ text_embeds = F.normalize(text_embeds, dim=-1)
+
+ embeds.append(text_embeds)
+
+ inputs["prompt_embeds"], inputs["negative_prompt_embeds"] = embeds
+
+ # forward
+ output = audioldm_pipe(**inputs)
+ audio_2 = output.audios[0]
+
+ assert np.abs(audio_1 - audio_2).max() < 1e-2
+
+ def test_audioldm_negative_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+ audioldm_pipe = AudioLDMPipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ negative_prompt = "egg cracking"
+ output = audioldm_pipe(**inputs, negative_prompt=negative_prompt)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 256
+
+ audio_slice = audio[:10]
+ expected_slice = np.array(
+ [-0.0051, 0.0050, -0.0060, 0.0034, -0.0026, 0.0033, -0.0027, 0.0033, -0.0028, 0.0032]
+ )
+
+ assert np.abs(audio_slice - expected_slice).max() < 1e-2
+
+ def test_audioldm_num_waveforms_per_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+ audioldm_pipe = AudioLDMPipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A hammer hitting a wooden surface"
+
+ # test num_waveforms_per_prompt=1 (default)
+ audios = audioldm_pipe(prompt, num_inference_steps=2).audios
+
+ assert audios.shape == (1, 256)
+
+ # test num_waveforms_per_prompt=1 (default) for batch of prompts
+ batch_size = 2
+ audios = audioldm_pipe([prompt] * batch_size, num_inference_steps=2).audios
+
+ assert audios.shape == (batch_size, 256)
+
+ # test num_waveforms_per_prompt for single prompt
+ num_waveforms_per_prompt = 2
+ audios = audioldm_pipe(prompt, num_inference_steps=2, num_waveforms_per_prompt=num_waveforms_per_prompt).audios
+
+ assert audios.shape == (num_waveforms_per_prompt, 256)
+
+ # test num_waveforms_per_prompt for batch of prompts
+ batch_size = 2
+ audios = audioldm_pipe(
+ [prompt] * batch_size, num_inference_steps=2, num_waveforms_per_prompt=num_waveforms_per_prompt
+ ).audios
+
+ assert audios.shape == (batch_size * num_waveforms_per_prompt, 256)
+
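+ # The requested duration should be recovered as waveform length divided by the vocoder sampling rate.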
+ def test_audioldm_audio_length_in_s(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDMPipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+ vocoder_sampling_rate = audioldm_pipe.vocoder.config.sampling_rate
+
+ inputs = self.get_dummy_inputs(device)
+ output = audioldm_pipe(audio_length_in_s=0.016, **inputs)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) / vocoder_sampling_rate == 0.016
+
+ output = audioldm_pipe(audio_length_in_s=0.032, **inputs)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) / vocoder_sampling_rate == 0.032
+
+ def test_audioldm_vocoder_model_in_dim(self):
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDMPipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ prompt = ["hey"]
+
+ output = audioldm_pipe(prompt, num_inference_steps=1)
+ audio_shape = output.audios.shape
+ assert audio_shape == (1, 256)
+
+ config = audioldm_pipe.vocoder.config
+ config.model_in_dim *= 2
+ audioldm_pipe.vocoder = SpeechT5HifiGan(config).to(torch_device)
+ output = audioldm_pipe(prompt, num_inference_steps=1)
+ audio_shape = output.audios.shape
+ # waveform shape is unchanged, we just have 2x the number of mel channels in the spectrogram
+ assert audio_shape == (1, 256)
+
+ def test_attention_slicing_forward_pass(self):
+ self._test_attention_slicing_forward_pass(test_mean_pixel_difference=False)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical()
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(test_mean_pixel_difference=False)
+
+
+@nightly
+class AudioLDMPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
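+ # Fixed NumPy latents plus a CPU-seeded generator keep the nightly outputs reproducible across runs.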
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 8, 128, 16))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "A hammer hitting a wooden surface",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 2.5,
+ }
+ return inputs
+
+ def test_audioldm(self):
+ audioldm_pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm")
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 25
+ audio = audioldm_pipe(**inputs).audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 81920
+
+ audio_slice = audio[77230:77240]
+ expected_slice = np.array(
+ [-0.4884, -0.4607, 0.0023, 0.5007, 0.5896, 0.5151, 0.3813, -0.0208, -0.3687, -0.4315]
+ )
+ max_diff = np.abs(expected_slice - audio_slice).max()
+ assert max_diff < 1e-2
+
+
+@nightly
+class AudioLDMPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 8, 128, 16))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "A hammer hitting a wooden surface",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 2.5,
+ }
+ return inputs
+
+ def test_audioldm_lms(self):
+ audioldm_pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm")
+ audioldm_pipe.scheduler = LMSDiscreteScheduler.from_config(audioldm_pipe.scheduler.config)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ audio = audioldm_pipe(**inputs).audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 81920
+
+ audio_slice = audio[27780:27790]
+ expected_slice = np.array([-0.2131, -0.0873, -0.0124, -0.0189, 0.0569, 0.1373, 0.1883, 0.2886, 0.3297, 0.2212])
+ max_diff = np.abs(expected_slice - audio_slice).max()
+ assert max_diff < 3e-2
diff --git a/tests/pipelines/audioldm2/__init__.py b/tests/pipelines/audioldm2/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/audioldm2/test_audioldm2.py b/tests/pipelines/audioldm2/test_audioldm2.py
new file mode 100644
index 0000000..58b1aef
--- /dev/null
+++ b/tests/pipelines/audioldm2/test_audioldm2.py
@@ -0,0 +1,569 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import (
+ ClapAudioConfig,
+ ClapConfig,
+ ClapFeatureExtractor,
+ ClapModel,
+ ClapTextConfig,
+ GPT2Config,
+ GPT2Model,
+ RobertaTokenizer,
+ SpeechT5HifiGan,
+ SpeechT5HifiGanConfig,
+ T5Config,
+ T5EncoderModel,
+ T5Tokenizer,
+)
+
+from diffusers import (
+ AudioLDM2Pipeline,
+ AudioLDM2ProjectionModel,
+ AudioLDM2UNet2DConditionModel,
+ AutoencoderKL,
+ DDIMScheduler,
+ LMSDiscreteScheduler,
+ PNDMScheduler,
+)
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, torch_device
+
+from ..pipeline_params import TEXT_TO_AUDIO_BATCH_PARAMS, TEXT_TO_AUDIO_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
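+# AudioLDM2 conditions on two text encoders (CLAP and T5) plus a GPT-2 language model and a
+# projection model; the dummy components reproduce that layout at tiny sizes.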
+class AudioLDM2PipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = AudioLDM2Pipeline
+ params = TEXT_TO_AUDIO_PARAMS
+ batch_params = TEXT_TO_AUDIO_BATCH_PARAMS
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "num_waveforms_per_prompt",
+ "generator",
+ "latents",
+ "output_type",
+ "return_dict",
+ "callback",
+ "callback_steps",
+ ]
+ )
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = AudioLDM2UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=([None, 16, 32], [None, 16, 32]),
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=1,
+ out_channels=1,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_branch_config = ClapTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=16,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=2,
+ num_hidden_layers=2,
+ pad_token_id=1,
+ vocab_size=1000,
+ projection_dim=16,
+ )
+ audio_branch_config = ClapAudioConfig(
+ spec_size=64,
+ window_size=4,
+ num_mel_bins=64,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ depths=[2, 2],
+ num_attention_heads=[2, 2],
+ num_hidden_layers=2,
+ hidden_size=192,
+ projection_dim=16,
+ patch_size=2,
+ patch_stride=2,
+ patch_embed_input_channels=4,
+ )
+ text_encoder_config = ClapConfig.from_text_audio_configs(
+ text_config=text_branch_config, audio_config=audio_branch_config, projection_dim=16
+ )
+ text_encoder = ClapModel(text_encoder_config)
+ tokenizer = RobertaTokenizer.from_pretrained("hf-internal-testing/tiny-random-roberta", model_max_length=77)
+ feature_extractor = ClapFeatureExtractor.from_pretrained(
+ "hf-internal-testing/tiny-random-ClapModel", hop_length=7900
+ )
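+ # The unusually large hop_length above presumably keeps the CLAP mel spectrogram (and with it
+ # the audio branch) tiny, so the fast tests stay cheap on CPU.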
+
+ torch.manual_seed(0)
+ text_encoder_2_config = T5Config(
+ vocab_size=32100,
+ d_model=32,
+ d_ff=37,
+ d_kv=8,
+ num_heads=2,
+ num_layers=2,
+ )
+ text_encoder_2 = T5EncoderModel(text_encoder_2_config)
+ tokenizer_2 = T5Tokenizer.from_pretrained("hf-internal-testing/tiny-random-T5Model", model_max_length=77)
+
+ torch.manual_seed(0)
+ language_model_config = GPT2Config(
+ n_embd=16,
+ n_head=2,
+ n_layer=2,
+ vocab_size=1000,
+ n_ctx=99,
+ n_positions=99,
+ )
+ language_model = GPT2Model(language_model_config)
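+ # cap generation length so the GPT-2 "language model" step stays short in the fast tests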
+ language_model.config.max_new_tokens = 8
+
+ torch.manual_seed(0)
+ projection_model = AudioLDM2ProjectionModel(text_encoder_dim=16, text_encoder_1_dim=32, langauge_model_dim=16)
+
+ vocoder_config = SpeechT5HifiGanConfig(
+ model_in_dim=8,
+ sampling_rate=16000,
+ upsample_initial_channel=16,
+ upsample_rates=[2, 2],
+ upsample_kernel_sizes=[4, 4],
+ resblock_kernel_sizes=[3, 7],
+ resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5]],
+ normalize_before=False,
+ )
+
+ vocoder = SpeechT5HifiGan(vocoder_config)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer": tokenizer,
+ "tokenizer_2": tokenizer_2,
+ "feature_extractor": feature_extractor,
+ "language_model": language_model,
+ "projection_model": projection_model,
+ "vocoder": vocoder,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A hammer hitting a wooden surface",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ }
+ return inputs
+
+ def test_audioldm2_ddim(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDM2Pipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = audioldm_pipe(**inputs)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 256
+
+ audio_slice = audio[:10]
+ expected_slice = np.array(
+ [0.0025, 0.0018, 0.0018, -0.0023, -0.0026, -0.0020, -0.0026, -0.0021, -0.0027, -0.0020]
+ )
+
+ assert np.abs(audio_slice - expected_slice).max() < 1e-4
+
+ def test_audioldm2_prompt_embeds(self):
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDM2Pipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = audioldm_pipe(**inputs)
+ audio_1 = output.audios[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ text_inputs = audioldm_pipe.tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=audioldm_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ clap_prompt_embeds = audioldm_pipe.text_encoder.get_text_features(text_inputs)
+ clap_prompt_embeds = clap_prompt_embeds[:, None, :]
+
+ text_inputs = audioldm_pipe.tokenizer_2(
+ prompt,
+ padding="max_length",
+ max_length=True,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ t5_prompt_embeds = audioldm_pipe.text_encoder_2(
+ text_inputs,
+ )
+ t5_prompt_embeds = t5_prompt_embeds[0]
+
+ projection_embeds = audioldm_pipe.projection_model(clap_prompt_embeds, t5_prompt_embeds)[0]
+ generated_prompt_embeds = audioldm_pipe.generate_language_model(projection_embeds, max_new_tokens=8)
+
+ inputs["prompt_embeds"] = t5_prompt_embeds
+ inputs["generated_prompt_embeds"] = generated_prompt_embeds
+
+ # forward
+ output = audioldm_pipe(**inputs)
+ audio_2 = output.audios[0]
+
+ assert np.abs(audio_1 - audio_2).max() < 1e-2
+
+ def test_audioldm2_negative_prompt_embeds(self):
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDM2Pipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = audioldm_pipe(**inputs)
+ audio_1 = output.audios[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ embeds = []
+ generated_embeds = []
+ for p in [prompt, negative_prompt]:
+ text_inputs = audioldm_pipe.tokenizer(
+ p,
+ padding="max_length",
+ max_length=audioldm_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ clap_prompt_embeds = audioldm_pipe.text_encoder.get_text_features(text_inputs)
+ clap_prompt_embeds = clap_prompt_embeds[:, None, :]
+
+ text_inputs = audioldm_pipe.tokenizer_2(
+ p,
+ padding="max_length",
+ max_length=True if len(embeds) == 0 else embeds[0].shape[1],
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ t5_prompt_embeds = audioldm_pipe.text_encoder_2(
+ text_inputs,
+ )
+ t5_prompt_embeds = t5_prompt_embeds[0]
+
+ projection_embeds = audioldm_pipe.projection_model(clap_prompt_embeds, t5_prompt_embeds)[0]
+ generated_prompt_embeds = audioldm_pipe.generate_language_model(projection_embeds, max_new_tokens=8)
+
+ embeds.append(t5_prompt_embeds)
+ generated_embeds.append(generated_prompt_embeds)
+
+ inputs["prompt_embeds"], inputs["negative_prompt_embeds"] = embeds
+ inputs["generated_prompt_embeds"], inputs["negative_generated_prompt_embeds"] = generated_embeds
+
+ # forward
+ output = audioldm_pipe(**inputs)
+ audio_2 = output.audios[0]
+
+ assert np.abs(audio_1 - audio_2).max() < 1e-2
+
+ def test_audioldm2_negative_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+ audioldm_pipe = AudioLDM2Pipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ negative_prompt = "egg cracking"
+ output = audioldm_pipe(**inputs, negative_prompt=negative_prompt)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 256
+
+ audio_slice = audio[:10]
+ expected_slice = np.array(
+ [0.0025, 0.0018, 0.0018, -0.0023, -0.0026, -0.0020, -0.0026, -0.0021, -0.0027, -0.0020]
+ )
+
+ assert np.abs(audio_slice - expected_slice).max() < 1e-4
+
+ def test_audioldm2_num_waveforms_per_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+ audioldm_pipe = AudioLDM2Pipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A hammer hitting a wooden surface"
+
+ # test num_waveforms_per_prompt=1 (default)
+ audios = audioldm_pipe(prompt, num_inference_steps=2).audios
+
+ assert audios.shape == (1, 256)
+
+ # test num_waveforms_per_prompt=1 (default) for batch of prompts
+ batch_size = 2
+ audios = audioldm_pipe([prompt] * batch_size, num_inference_steps=2).audios
+
+ assert audios.shape == (batch_size, 256)
+
+ # test num_waveforms_per_prompt for single prompt
+ num_waveforms_per_prompt = 2
+ audios = audioldm_pipe(prompt, num_inference_steps=2, num_waveforms_per_prompt=num_waveforms_per_prompt).audios
+
+ assert audios.shape == (num_waveforms_per_prompt, 256)
+
+ # test num_waveforms_per_prompt for batch of prompts
+ batch_size = 2
+ audios = audioldm_pipe(
+ [prompt] * batch_size, num_inference_steps=2, num_waveforms_per_prompt=num_waveforms_per_prompt
+ ).audios
+
+ assert audios.shape == (batch_size * num_waveforms_per_prompt, 256)
+
+ def test_audioldm2_audio_length_in_s(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDM2Pipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+ vocoder_sampling_rate = audioldm_pipe.vocoder.config.sampling_rate
+
+ inputs = self.get_dummy_inputs(device)
+ output = audioldm_pipe(audio_length_in_s=0.016, **inputs)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) / vocoder_sampling_rate == 0.016
+
+ output = audioldm_pipe(audio_length_in_s=0.032, **inputs)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) / vocoder_sampling_rate == 0.032
+
+ def test_audioldm2_vocoder_model_in_dim(self):
+ components = self.get_dummy_components()
+ audioldm_pipe = AudioLDM2Pipeline(**components)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ prompt = ["hey"]
+
+ output = audioldm_pipe(prompt, num_inference_steps=1)
+ audio_shape = output.audios.shape
+ assert audio_shape == (1, 256)
+
+ config = audioldm_pipe.vocoder.config
+ config.model_in_dim *= 2
+ audioldm_pipe.vocoder = SpeechT5HifiGan(config).to(torch_device)
+ output = audioldm_pipe(prompt, num_inference_steps=1)
+ audio_shape = output.audios.shape
+ # waveform shape is unchanged, we just have 2x the number of mel channels in the spectrogram
+ assert audio_shape == (1, 256)
+
+ def test_attention_slicing_forward_pass(self):
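+ # test_mean_pixel_difference is an image-specific check in the shared tester, so it is
+ # disabled here for audio outputs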
+ self._test_attention_slicing_forward_pass(test_mean_pixel_difference=False)
+
+ @unittest.skip("Raises a not implemented error in AudioLDM2")
+ def test_xformers_attention_forwardGenerator_pass(self):
+ pass
+
+ def test_dict_tuple_outputs_equivalent(self):
+ # increase tolerance from 1e-4 -> 2e-4 to account for large composite model
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=2e-4)
+
+ def test_inference_batch_single_identical(self):
+ # increase tolerance from 1e-4 -> 2e-4 to account for large composite model
+ self._test_inference_batch_single_identical(expected_max_diff=2e-4)
+
+ def test_save_load_local(self):
+ # increase tolerance from 1e-4 -> 2e-4 to account for large composite model
+ super().test_save_load_local(expected_max_difference=2e-4)
+
+ def test_save_load_optional_components(self):
+ # increase tolerance from 1e-4 -> 2e-4 to account for large composite model
+ super().test_save_load_optional_components(expected_max_difference=2e-4)
+
+ def test_to_dtype(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ # The method component.dtype returns the dtype of the first parameter registered in the model, not the
+ # dtype of the entire model. In the case of CLAP, the first parameter is a float64 constant (logit scale)
+ model_dtypes = {key: component.dtype for key, component in components.items() if hasattr(component, "dtype")}
+
+ # Without the logit scale parameters, everything is float32
+ model_dtypes.pop("text_encoder")
+ self.assertTrue(all(dtype == torch.float32 for dtype in model_dtypes.values()))
+
+ # the CLAP sub-models are float32
+ model_dtypes["clap_text_branch"] = components["text_encoder"].text_model.dtype
+ self.assertTrue(all(dtype == torch.float32 for dtype in model_dtypes.values()))
+
+ # Once we send to fp16, all params are in half-precision, including the logit scale
+ pipe.to(dtype=torch.float16)
+ model_dtypes = {key: component.dtype for key, component in components.items() if hasattr(component, "dtype")}
+ self.assertTrue(all(dtype == torch.float16 for dtype in model_dtypes.values()))
+
+ def test_sequential_cpu_offload_forward_pass(self):
+ pass
+
+
+@nightly
+class AudioLDM2PipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 8, 128, 16))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "A hammer hitting a wooden surface",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 2.5,
+ }
+ return inputs
+
+ def test_audioldm2(self):
+ audioldm_pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 25
+ audio = audioldm_pipe(**inputs).audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 81952
+
+ # check the portion of the generated audio with the largest dynamic range (reduces flakiness)
+ audio_slice = audio[17275:17285]
+ expected_slice = np.array([0.0791, 0.0666, 0.1158, 0.1227, 0.1171, -0.2880, -0.1940, -0.0283, -0.0126, 0.1127])
+ max_diff = np.abs(expected_slice - audio_slice).max()
+ assert max_diff < 1e-3
+
+ def test_audioldm2_lms(self):
+ audioldm_pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")
+ audioldm_pipe.scheduler = LMSDiscreteScheduler.from_config(audioldm_pipe.scheduler.config)
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ audio = audioldm_pipe(**inputs).audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 81952
+
+ # check the portion of the generated audio with the largest dynamic range (reduces flakiness)
+ audio_slice = audio[31390:31400]
+ expected_slice = np.array(
+ [-0.1318, -0.0577, 0.0446, -0.0573, 0.0659, 0.1074, -0.2600, 0.0080, -0.2190, -0.4301]
+ )
+ max_diff = np.abs(expected_slice - audio_slice).max()
+ assert max_diff < 1e-3
+
+ def test_audioldm2_large(self):
+ audioldm_pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2-large")
+ audioldm_pipe = audioldm_pipe.to(torch_device)
+ audioldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ audio = audioldm_pipe(**inputs).audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 81952
+
+ # check the portion of the generated audio with the largest dynamic range (reduces flakiness)
+ audio_slice = audio[8825:8835]
+ expected_slice = np.array(
+ [-0.1829, -0.1461, 0.0759, -0.1493, -0.1396, 0.5783, 0.3001, -0.3038, -0.0639, -0.2244]
+ )
+ max_diff = np.abs(expected_slice - audio_slice).max()
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/blipdiffusion/__init__.py b/tests/pipelines/blipdiffusion/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/blipdiffusion/test_blipdiffusion.py b/tests/pipelines/blipdiffusion/test_blipdiffusion.py
new file mode 100644
index 0000000..c5eaa38
--- /dev/null
+++ b/tests/pipelines/blipdiffusion/test_blipdiffusion.py
@@ -0,0 +1,196 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTokenizer
+from transformers.models.blip_2.configuration_blip_2 import Blip2Config
+from transformers.models.clip.configuration_clip import CLIPTextConfig
+
+from diffusers import AutoencoderKL, BlipDiffusionPipeline, PNDMScheduler, UNet2DConditionModel
+from diffusers.utils.testing_utils import enable_full_determinism
+from diffusers.pipelines.blip_diffusion.blip_image_processing import BlipImageProcessor
+from diffusers.pipelines.blip_diffusion.modeling_blip2 import Blip2QFormerModel
+from diffusers.pipelines.blip_diffusion.modeling_ctx_clip import ContextCLIPTextModel
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class BlipDiffusionPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = BlipDiffusionPipeline
+ params = [
+ "prompt",
+ "reference_image",
+ "source_subject_category",
+ "target_subject_category",
+ ]
+ batch_params = [
+ "prompt",
+ "reference_image",
+ "source_subject_category",
+ "target_subject_category",
+ ]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "num_inference_steps",
+ "neg_prompt",
+ "guidance_scale",
+ "prompt_strength",
+ "prompt_reps",
+ ]
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ vocab_size=1000,
+ hidden_size=16,
+ intermediate_size=16,
+ projection_dim=16,
+ num_hidden_layers=1,
+ num_attention_heads=1,
+ max_position_embeddings=77,
+ )
+ text_encoder = ContextCLIPTextModel(text_encoder_config)
+
+ vae = AutoencoderKL(
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownEncoderBlock2D",),
+ up_block_types=("UpDecoderBlock2D",),
+ block_out_channels=(32,),
+ layers_per_block=1,
+ act_fn="silu",
+ latent_channels=4,
+ norm_num_groups=16,
+ sample_size=16,
+ )
+
+ blip_vision_config = {
+ "hidden_size": 16,
+ "intermediate_size": 16,
+ "num_hidden_layers": 1,
+ "num_attention_heads": 1,
+ "image_size": 224,
+ "patch_size": 14,
+ "hidden_act": "quick_gelu",
+ }
+
+ blip_qformer_config = {
+ "vocab_size": 1000,
+ "hidden_size": 16,
+ "num_hidden_layers": 1,
+ "num_attention_heads": 1,
+ "intermediate_size": 16,
+ "max_position_embeddings": 512,
+ "cross_attention_frequency": 1,
+ "encoder_hidden_size": 16,
+ }
+ qformer_config = Blip2Config(
+ vision_config=blip_vision_config,
+ qformer_config=blip_qformer_config,
+ num_query_tokens=16,
+ tokenizer="hf-internal-testing/tiny-random-bert",
+ )
+ qformer = Blip2QFormerModel(qformer_config)
+
+ unet = UNet2DConditionModel(
+ block_out_channels=(16, 32),
+ norm_num_groups=16,
+ layers_per_block=1,
+ sample_size=16,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=16,
+ )
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ scheduler = PNDMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ set_alpha_to_one=False,
+ skip_prk_steps=True,
+ )
+
+ vae.eval()
+ qformer.eval()
+ text_encoder.eval()
+
+ image_processor = BlipImageProcessor()
+
+ components = {
+ "text_encoder": text_encoder,
+ "vae": vae,
+ "qformer": qformer,
+ "unet": unet,
+ "tokenizer": tokenizer,
+ "scheduler": scheduler,
+ "image_processor": image_processor,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ np.random.seed(seed)
+ reference_image = np.random.rand(32, 32, 3) * 255
+ reference_image = Image.fromarray(reference_image.astype("uint8")).convert("RGBA")
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "swimming underwater",
+ "generator": generator,
+ "reference_image": reference_image,
+ "source_subject_category": "dog",
+ "target_subject_category": "dog",
+ "height": 32,
+ "width": 32,
+ "guidance_scale": 7.5,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_blipdiffusion(self):
+ device = "cpu"
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ image = pipe(**self.get_dummy_inputs(device))[0]
+ image_slice = image[0, -3:, -3:, 0]
+
+ assert image.shape == (1, 16, 16, 4)
+
+ expected_slice = np.array([0.7096, 0.5900, 0.6703, 0.4032, 0.7766, 0.3629, 0.5447, 0.4149, 0.8172])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {image_slice.flatten()}, but got {image_slice.flatten()}"
diff --git a/tests/pipelines/consistency_models/__init__.py b/tests/pipelines/consistency_models/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/consistency_models/test_consistency_models.py b/tests/pipelines/consistency_models/test_consistency_models.py
new file mode 100644
index 0000000..2cf7c0a
--- /dev/null
+++ b/tests/pipelines/consistency_models/test_consistency_models.py
@@ -0,0 +1,294 @@
+import gc
+import unittest
+
+import numpy as np
+import torch
+from torch.backends.cuda import sdp_kernel
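+# Note: torch.backends.cuda.sdp_kernel is a context manager selecting which scaled-dot-product
+# attention backends may be used; newer PyTorch releases expose the same switch as
+# torch.nn.attention.sdpa_kernel, but the tests below keep the torch 2.0 spelling.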
+
+from diffusers import (
+ CMStochasticIterativeScheduler,
+ ConsistencyModelPipeline,
+ UNet2DModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ nightly,
+ require_torch_2,
+ require_torch_gpu,
+ torch_device,
+)
+from diffusers.utils.torch_utils import randn_tensor
+
+from ..pipeline_params import UNCONDITIONAL_IMAGE_GENERATION_BATCH_PARAMS, UNCONDITIONAL_IMAGE_GENERATION_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class ConsistencyModelPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = ConsistencyModelPipeline
+ params = UNCONDITIONAL_IMAGE_GENERATION_PARAMS
+ batch_params = UNCONDITIONAL_IMAGE_GENERATION_BATCH_PARAMS
+
+ # Override required_optional_params to remove num_images_per_prompt
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "generator",
+ "latents",
+ "output_type",
+ "return_dict",
+ "callback",
+ "callback_steps",
+ ]
+ )
+
+ @property
+ def dummy_uncond_unet(self):
+ unet = UNet2DModel.from_pretrained(
+ "diffusers/consistency-models-test",
+ subfolder="test_unet",
+ )
+ return unet
+
+ @property
+ def dummy_cond_unet(self):
+ unet = UNet2DModel.from_pretrained(
+ "diffusers/consistency-models-test",
+ subfolder="test_unet_class_cond",
+ )
+ return unet
+
+ def get_dummy_components(self, class_cond=False):
+ if class_cond:
+ unet = self.dummy_cond_unet
+ else:
+ unet = self.dummy_uncond_unet
+
+ # Default to CM multistep sampler
+ scheduler = CMStochasticIterativeScheduler(
+ num_train_timesteps=40,
+ sigma_min=0.002,
+ sigma_max=80.0,
+ )
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
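+ # num_inference_steps=None together with explicit `timesteps` exercises the multistep
+ # consistency sampler; the one-step tests below override these with num_inference_steps=1.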
+ inputs = {
+ "batch_size": 1,
+ "num_inference_steps": None,
+ "timesteps": [22, 0],
+ "generator": generator,
+ "output_type": "np",
+ }
+
+ return inputs
+
+ def test_consistency_model_pipeline_multistep(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = ConsistencyModelPipeline(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = pipe(**inputs).images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.3572, 0.6273, 0.4031, 0.3961, 0.4321, 0.5730, 0.5266, 0.4780, 0.5004])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_consistency_model_pipeline_multistep_class_cond(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(class_cond=True)
+ pipe = ConsistencyModelPipeline(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["class_labels"] = 0
+ image = pipe(**inputs).images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.3572, 0.6273, 0.4031, 0.3961, 0.4321, 0.5730, 0.5266, 0.4780, 0.5004])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_consistency_model_pipeline_onestep(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = ConsistencyModelPipeline(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["num_inference_steps"] = 1
+ inputs["timesteps"] = None
+ image = pipe(**inputs).images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.5004, 0.5004, 0.4994, 0.5008, 0.4976, 0.5018, 0.4990, 0.4982, 0.4987])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_consistency_model_pipeline_onestep_class_cond(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(class_cond=True)
+ pipe = ConsistencyModelPipeline(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["num_inference_steps"] = 1
+ inputs["timesteps"] = None
+ inputs["class_labels"] = 0
+ image = pipe(**inputs).images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.5004, 0.5004, 0.4994, 0.5008, 0.4976, 0.5018, 0.4990, 0.4982, 0.4987])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+
+@nightly
+@require_torch_gpu
+class ConsistencyModelPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, seed=0, get_fixed_latents=False, device="cpu", dtype=torch.float32, shape=(1, 3, 64, 64)):
+ generator = torch.manual_seed(seed)
+
+ inputs = {
+ "num_inference_steps": None,
+ "timesteps": [22, 0],
+ "class_labels": 0,
+ "generator": generator,
+ "output_type": "np",
+ }
+
+ if get_fixed_latents:
+ latents = self.get_fixed_latents(seed=seed, device=device, dtype=dtype, shape=shape)
+ inputs["latents"] = latents
+
+ return inputs
+
+ def get_fixed_latents(self, seed=0, device="cpu", dtype=torch.float32, shape=(1, 3, 64, 64)):
+ if isinstance(device, str):
+ device = torch.device(device)
+ generator = torch.Generator(device=device).manual_seed(seed)
+ latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
+ return latents
+
+ def test_consistency_model_cd_multistep(self):
+ unet = UNet2DModel.from_pretrained("diffusers/consistency_models", subfolder="diffusers_cd_imagenet64_l2")
+ scheduler = CMStochasticIterativeScheduler(
+ num_train_timesteps=40,
+ sigma_min=0.002,
+ sigma_max=80.0,
+ )
+ pipe = ConsistencyModelPipeline(unet=unet, scheduler=scheduler)
+ pipe.to(torch_device=torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images
+ assert image.shape == (1, 64, 64, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ expected_slice = np.array([0.0146, 0.0158, 0.0092, 0.0086, 0.0000, 0.0000, 0.0000, 0.0000, 0.0058])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_consistency_model_cd_onestep(self):
+ unet = UNet2DModel.from_pretrained("diffusers/consistency_models", subfolder="diffusers_cd_imagenet64_l2")
+ scheduler = CMStochasticIterativeScheduler(
+ num_train_timesteps=40,
+ sigma_min=0.002,
+ sigma_max=80.0,
+ )
+ pipe = ConsistencyModelPipeline(unet=unet, scheduler=scheduler)
+ pipe.to(torch_device=torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs()
+ inputs["num_inference_steps"] = 1
+ inputs["timesteps"] = None
+ image = pipe(**inputs).images
+ assert image.shape == (1, 64, 64, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ expected_slice = np.array([0.0059, 0.0003, 0.0000, 0.0023, 0.0052, 0.0007, 0.0165, 0.0081, 0.0095])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ @require_torch_2
+ def test_consistency_model_cd_multistep_flash_attn(self):
+ unet = UNet2DModel.from_pretrained("diffusers/consistency_models", subfolder="diffusers_cd_imagenet64_l2")
+ scheduler = CMStochasticIterativeScheduler(
+ num_train_timesteps=40,
+ sigma_min=0.002,
+ sigma_max=80.0,
+ )
+ pipe = ConsistencyModelPipeline(unet=unet, scheduler=scheduler)
+ pipe.to(torch_device=torch_device, torch_dtype=torch.float16)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(get_fixed_latents=True, device=torch_device)
+ # Ensure usage of flash attention in torch 2.0
+ with sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
+ image = pipe(**inputs).images
+ assert image.shape == (1, 64, 64, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ expected_slice = np.array([0.1845, 0.1371, 0.1211, 0.2035, 0.1954, 0.1323, 0.1773, 0.1593, 0.1314])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ @require_torch_2
+ def test_consistency_model_cd_onestep_flash_attn(self):
+ unet = UNet2DModel.from_pretrained("diffusers/consistency_models", subfolder="diffusers_cd_imagenet64_l2")
+ scheduler = CMStochasticIterativeScheduler(
+ num_train_timesteps=40,
+ sigma_min=0.002,
+ sigma_max=80.0,
+ )
+ pipe = ConsistencyModelPipeline(unet=unet, scheduler=scheduler)
+ pipe.to(torch_device=torch_device, torch_dtype=torch.float16)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(get_fixed_latents=True, device=torch_device)
+ inputs["num_inference_steps"] = 1
+ inputs["timesteps"] = None
+ # Ensure usage of flash attention in torch 2.0
+ with sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
+ image = pipe(**inputs).images
+ assert image.shape == (1, 64, 64, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ expected_slice = np.array([0.1623, 0.2009, 0.2387, 0.1731, 0.1168, 0.1202, 0.2031, 0.1327, 0.2447])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
diff --git a/tests/pipelines/controlnet/__init__.py b/tests/pipelines/controlnet/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/controlnet/test_controlnet.py b/tests/pipelines/controlnet/test_controlnet.py
new file mode 100644
index 0000000..114a36b
--- /dev/null
+++ b/tests/pipelines/controlnet/test_controlnet.py
@@ -0,0 +1,1151 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import tempfile
+import traceback
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ ControlNetModel,
+ DDIMScheduler,
+ EulerDiscreteScheduler,
+ LCMScheduler,
+ StableDiffusionControlNetPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.pipelines.controlnet.pipeline_controlnet import MultiControlNetModel
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_image,
+ load_numpy,
+ numpy_cosine_similarity_distance,
+ require_python39_or_higher,
+ require_torch_2,
+ require_torch_gpu,
+ run_test_in_subprocess,
+ slow,
+ torch_device,
+)
+from diffusers.utils.torch_utils import randn_tensor
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_BATCH_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+# Will be run via run_test_in_subprocess
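+# (torch.compile mutates global dynamo/inductor state, so running the compile test in its own
+# subprocess keeps that state from leaking into the rest of the test session)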
+def _test_stable_diffusion_compile(in_queue, out_queue, timeout):
+ error = None
+ try:
+ _ = in_queue.get(timeout=timeout)
+
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.to("cuda")
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe.unet.to(memory_format=torch.channels_last)
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+ pipe.controlnet.to(memory_format=torch.channels_last)
+ pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "bird"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ ).resize((512, 512))
+
+ output = pipe(prompt, image, num_inference_steps=10, generator=generator, output_type="np")
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny_out_full.npy"
+ )
+ expected_image = np.resize(expected_image, (512, 512, 3))
+
+ assert np.abs(expected_image - image).max() < 1.0
+
+ except Exception:
+ error = f"{traceback.format_exc()}"
+
+ results = {"error": error}
+ out_queue.put(results, timeout=timeout)
+ out_queue.join()
+
+
+class ControlNetPipelineFastTests(
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionControlNetPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=1,
+ time_cond_proj_dim=time_cond_proj_dim,
+ )
+ torch.manual_seed(0)
+ controlnet = ControlNetModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ norm_num_groups=1,
+ )
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[4, 8],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+ image = randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ )
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "image": image,
+ }
+
+ return inputs
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ def test_controlnet_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionControlNetPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array(
+ [0.52700454, 0.3930534, 0.25509018, 0.7132304, 0.53696585, 0.46568912, 0.7095368, 0.7059624, 0.4744786]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_controlnet_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionControlNetPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array(
+ [0.52700454, 0.3930534, 0.25509018, 0.7132304, 0.53696585, 0.46568912, 0.7095368, 0.7059624, 0.4744786]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+
+class StableDiffusionMultiControlNetPipelineFastTests(
+ IPAdapterTesterMixin, PipelineTesterMixin, PipelineKarrasSchedulerTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionControlNetPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=1,
+ )
+ torch.manual_seed(0)
+
+ def init_weights(m):
+ if isinstance(m, torch.nn.Conv2d):
+ torch.nn.init.normal_(m.weight)
+ m.bias.data.fill_(1.0)
+
+ controlnet1 = ControlNetModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ norm_num_groups=1,
+ )
+ controlnet1.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ controlnet2 = ControlNetModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ norm_num_groups=1,
+ )
+ controlnet2.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[4, 8],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ controlnet = MultiControlNetModel([controlnet1, controlnet2])
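+ # MultiControlNetModel wraps several ControlNets so the pipeline can take per-ControlNet
+ # images, conditioning scales and guidance windows as lists (see test_control_guidance_switch)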
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+
+ images = [
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ ]
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "image": images,
+ }
+
+ return inputs
+
+ def test_control_guidance_switch(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+
+ scale = 10.0
+ steps = 4
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_1 = pipe(**inputs)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_2 = pipe(**inputs, control_guidance_start=0.1, control_guidance_end=0.2)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_3 = pipe(**inputs, control_guidance_start=[0.1, 0.3], control_guidance_end=[0.2, 0.7])[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_4 = pipe(**inputs, control_guidance_start=0.4, control_guidance_end=[0.5, 0.8])[0]
+
+ # make sure that all outputs are different
+ assert np.sum(np.abs(output_1 - output_2)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_3)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_4)) > 1e-3
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ def test_save_pretrained_raise_not_implemented_exception(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ with tempfile.TemporaryDirectory() as tmpdir:
+ try:
+ # save_pretrained is not implemented for Multi-ControlNet
+ pipe.save_pretrained(tmpdir)
+ except NotImplementedError:
+ pass
+
+ def test_inference_multiple_prompt_input(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionControlNetPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["prompt"] = [inputs["prompt"], inputs["prompt"]]
+ inputs["image"] = [inputs["image"], inputs["image"]]
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ assert image.shape == (2, 64, 64, 3)
+
+ image_1, image_2 = image
+ # make sure that the outputs are different
+ assert np.sum(np.abs(image_1 - image_2)) > 1e-3
+
+ # multiple prompts, single image conditioning
+ inputs = self.get_dummy_inputs(device)
+ inputs["prompt"] = [inputs["prompt"], inputs["prompt"]]
+ output_1 = sd_pipe(**inputs)
+
+ assert np.abs(image - output_1.images).max() < 1e-3
+
+
+class StableDiffusionMultiControlNetOneModelPipelineFastTests(
+ IPAdapterTesterMixin, PipelineTesterMixin, PipelineKarrasSchedulerTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionControlNetPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=1,
+ )
+ torch.manual_seed(0)
+
+ def init_weights(m):
+ if isinstance(m, torch.nn.Conv2d):
+ torch.nn.init.normal_(m.weight)
+ m.bias.data.fill_(1.0)
+
+ controlnet = ControlNetModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ norm_num_groups=1,
+ )
+ controlnet.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[4, 8],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ controlnet = MultiControlNetModel([controlnet])
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+
+ images = [
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ ]
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "image": images,
+ }
+
+ return inputs
+
+ def test_control_guidance_switch(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+
+ scale = 10.0
+ steps = 4
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_1 = pipe(**inputs)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_2 = pipe(**inputs, control_guidance_start=0.1, control_guidance_end=0.2)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_3 = pipe(
+ **inputs,
+ control_guidance_start=[0.1],
+ control_guidance_end=[0.2],
+ )[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_4 = pipe(**inputs, control_guidance_start=0.4, control_guidance_end=[0.5])[0]
+
+ # make sure that all outputs are different
+ assert np.sum(np.abs(output_1 - output_2)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_3)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_4)) > 1e-3
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ def test_save_pretrained_raise_not_implemented_exception(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ with tempfile.TemporaryDirectory() as tmpdir:
+ try:
+ # save_pretrained is not implemented for Multi-ControlNet
+ pipe.save_pretrained(tmpdir)
+ except NotImplementedError:
+ pass
+
+
+@slow
+@require_torch_gpu
+class ControlNetPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_canny(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "bird"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ )
+
+ output = pipe(prompt, image, generator=generator, output_type="np", num_inference_steps=3)
+
+ image = output.images[0]
+
+ assert image.shape == (768, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny_out.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 9e-2
+
+ def test_depth(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "Stormtrooper's lecture"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/stormtrooper_depth.png"
+ )
+
+ output = pipe(prompt, image, generator=generator, output_type="np", num_inference_steps=3)
+
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/stormtrooper_depth_out.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 8e-1
+
+ def test_hed(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-hed")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "oil painting of handsome old man, masterpiece"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/man_hed.png"
+ )
+
+ output = pipe(prompt, image, generator=generator, output_type="np", num_inference_steps=3)
+
+ image = output.images[0]
+
+ assert image.shape == (704, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/man_hed_out.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 8e-2
+
+ def test_mlsd(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-mlsd")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "room"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_mlsd.png"
+ )
+
+ output = pipe(prompt, image, generator=generator, output_type="np", num_inference_steps=3)
+
+ image = output.images[0]
+
+ assert image.shape == (704, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_mlsd_out.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 5e-2
+
+ def test_normal(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-normal")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "cute toy"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/cute_toy_normal.png"
+ )
+
+ output = pipe(prompt, image, generator=generator, output_type="np", num_inference_steps=3)
+
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/cute_toy_normal_out.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 5e-2
+
+ def test_openpose(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "Chef in the kitchen"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png"
+ )
+
+ output = pipe(prompt, image, generator=generator, output_type="np", num_inference_steps=3)
+
+ image = output.images[0]
+
+ assert image.shape == (768, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/chef_pose_out.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 8e-2
+
+ def test_scribble(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(5)
+ prompt = "bag"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bag_scribble.png"
+ )
+
+ output = pipe(prompt, image, generator=generator, output_type="np", num_inference_steps=3)
+
+ image = output.images[0]
+
+ assert image.shape == (640, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bag_scribble_out.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 8e-2
+
+ def test_seg(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(5)
+ prompt = "house"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/house_seg.png"
+ )
+
+ output = pipe(prompt, image, generator=generator, output_type="np", num_inference_steps=3)
+
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/house_seg_out.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 8e-2
+
+ def test_sequential_cpu_offloading(self):
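+ # reset CUDA memory statistics so max_memory_allocated() below measures only this pipeline run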
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.set_progress_bar_config(disable=None)
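+ # attention slicing and sequential CPU offload both trade speed for a lower peak GPU memory footprint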
+ pipe.enable_attention_slicing()
+ pipe.enable_sequential_cpu_offload()
+
+ prompt = "house"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/house_seg.png"
+ )
+
+ _ = pipe(
+ prompt,
+ image,
+ num_inference_steps=2,
+ output_type="np",
+ )
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 4 GB is allocated
+ assert mem_bytes < 4 * 10**9
+
+ def test_canny_guess_mode(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = ""
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ )
+
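+ # guess_mode lets the ControlNet infer the content from the conditioning image alone, which is why the prompt above is empty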
+ output = pipe(
+ prompt,
+ image,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=3,
+ guidance_scale=3.0,
+ guess_mode=True,
+ )
+
+ image = output.images[0]
+ assert image.shape == (768, 512, 3)
+
+ image_slice = image[-3:, -3:, -1]
+ expected_slice = np.array([0.2724, 0.2846, 0.2724, 0.3843, 0.3682, 0.2736, 0.4675, 0.3862, 0.2887])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_canny_guess_mode_euler(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
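+ # swap in an Euler scheduler built from the existing scheduler config to exercise guess mode with a different sampler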
+ pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = ""
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ )
+
+ output = pipe(
+ prompt,
+ image,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=3,
+ guidance_scale=3.0,
+ guess_mode=True,
+ )
+
+ image = output.images[0]
+ assert image.shape == (768, 512, 3)
+
+ image_slice = image[-3:, -3:, -1]
+ expected_slice = np.array([0.1655, 0.1721, 0.1623, 0.1685, 0.1711, 0.1646, 0.1651, 0.1631, 0.1494])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ @require_python39_or_higher
+ @require_torch_2
+ def test_stable_diffusion_compile(self):
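+ # run in a subprocess so torch.compile state stays isolated from the rest of the test session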
+ run_test_in_subprocess(test_case=self, target_func=_test_stable_diffusion_compile, inputs=None)
+
+ def test_v11_shuffle_global_pool_conditions(self):
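+ # the shuffle ControlNet checkpoint enables global_pool_conditions in its config, which is the code path this test covers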
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11e_sd15_shuffle")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "New York"
+ image = load_image(
+ "https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"
+ )
+
+ output = pipe(
+ prompt,
+ image,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=3,
+ guidance_scale=7.0,
+ )
+
+ image = output.images[0]
+ assert image.shape == (512, 640, 3)
+
+ image_slice = image[-3:, -3:, -1]
+ expected_slice = np.array([0.1338, 0.1597, 0.1202, 0.1687, 0.1377, 0.1017, 0.2070, 0.1574, 0.1348])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_load_local(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_canny")
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.enable_model_cpu_offload()
+
+ controlnet = ControlNetModel.from_single_file(
+ "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"
+ )
+ pipe_sf = StableDiffusionControlNetPipeline.from_single_file(
+ "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors",
+ safety_checker=None,
+ controlnet=controlnet,
+ scheduler_type="pndm",
+ )
+ pipe_sf.unet.set_default_attn_processor()
+ pipe_sf.enable_model_cpu_offload()
+
+ control_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ ).resize((512, 512))
+ prompt = "bird"
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe(
+ prompt,
+ image=control_image,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=3,
+ ).images[0]
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output_sf = pipe_sf(
+ prompt,
+ image=control_image,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=3,
+ ).images[0]
+
+ max_diff = numpy_cosine_similarity_distance(output_sf.flatten(), output.flatten())
+ assert max_diff < 1e-3
+
+ def test_single_file_component_configs(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_canny", variant="fp16")
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", variant="fp16", safety_checker=None, controlnet=controlnet
+ )
+
+ controlnet_single_file = ControlNetModel.from_single_file(
+ "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"
+ )
+ single_file_pipe = StableDiffusionControlNetPipeline.from_single_file(
+ "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors",
+ safety_checker=None,
+ controlnet=controlnet_single_file,
+ scheduler_type="pndm",
+ )
+
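+ # these keys are loading metadata and are expected to differ between single-file and pretrained loading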
+ PARAMS_TO_IGNORE = ["torch_dtype", "_name_or_path", "architectures", "_use_default_values"]
+ for param_name, param_value in single_file_pipe.controlnet.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.controlnet.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.unet.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.unet.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.vae.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.vae.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionMultiControlNetPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_pose_and_canny(self):
+ controlnet_canny = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
+ controlnet_pose = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")
+
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=[controlnet_pose, controlnet_canny]
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "bird and Chef"
+ image_canny = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ )
+ image_pose = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png"
+ )
+
+ output = pipe(prompt, [image_pose, image_canny], generator=generator, output_type="np", num_inference_steps=3)
+
+ image = output.images[0]
+
+ assert image.shape == (768, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose_canny_out.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 5e-2
diff --git a/tests/pipelines/controlnet/test_controlnet_blip_diffusion.py b/tests/pipelines/controlnet/test_controlnet_blip_diffusion.py
new file mode 100644
index 0000000..fe4c9da
--- /dev/null
+++ b/tests/pipelines/controlnet/test_controlnet_blip_diffusion.py
@@ -0,0 +1,216 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTokenizer
+from transformers.models.blip_2.configuration_blip_2 import Blip2Config
+from transformers.models.clip.configuration_clip import CLIPTextConfig
+
+from diffusers import (
+ AutoencoderKL,
+ BlipDiffusionControlNetPipeline,
+ ControlNetModel,
+ PNDMScheduler,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import enable_full_determinism
+from src.diffusers.pipelines.blip_diffusion.blip_image_processing import BlipImageProcessor
+from src.diffusers.pipelines.blip_diffusion.modeling_blip2 import Blip2QFormerModel
+from src.diffusers.pipelines.blip_diffusion.modeling_ctx_clip import ContextCLIPTextModel
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class BlipDiffusionControlNetPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = BlipDiffusionControlNetPipeline
+ params = [
+ "prompt",
+ "reference_image",
+ "source_subject_category",
+ "target_subject_category",
+ "condtioning_image",
+ ]
+ batch_params = [
+ "prompt",
+ "reference_image",
+ "source_subject_category",
+ "target_subject_category",
+ "condtioning_image",
+ ]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "num_inference_steps",
+ "neg_prompt",
+ "guidance_scale",
+ "prompt_strength",
+ "prompt_reps",
+ ]
+
+ def get_dummy_components(self):
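+ # all sub-models below use tiny configs so this fast test stays cheap enough to run on CPU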
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ vocab_size=1000,
+ hidden_size=16,
+ intermediate_size=16,
+ projection_dim=16,
+ num_hidden_layers=1,
+ num_attention_heads=1,
+ max_position_embeddings=77,
+ )
+ text_encoder = ContextCLIPTextModel(text_encoder_config)
+
+ vae = AutoencoderKL(
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownEncoderBlock2D",),
+ up_block_types=("UpDecoderBlock2D",),
+ block_out_channels=(32,),
+ layers_per_block=1,
+ act_fn="silu",
+ latent_channels=4,
+ norm_num_groups=16,
+ sample_size=16,
+ )
+
+ blip_vision_config = {
+ "hidden_size": 16,
+ "intermediate_size": 16,
+ "num_hidden_layers": 1,
+ "num_attention_heads": 1,
+ "image_size": 224,
+ "patch_size": 14,
+ "hidden_act": "quick_gelu",
+ }
+
+ blip_qformer_config = {
+ "vocab_size": 1000,
+ "hidden_size": 16,
+ "num_hidden_layers": 1,
+ "num_attention_heads": 1,
+ "intermediate_size": 16,
+ "max_position_embeddings": 512,
+ "cross_attention_frequency": 1,
+ "encoder_hidden_size": 16,
+ }
+ qformer_config = Blip2Config(
+ vision_config=blip_vision_config,
+ qformer_config=blip_qformer_config,
+ num_query_tokens=16,
+ tokenizer="hf-internal-testing/tiny-random-bert",
+ )
+ qformer = Blip2QFormerModel(qformer_config)
+
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 16),
+ layers_per_block=1,
+ norm_num_groups=4,
+ sample_size=16,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=16,
+ )
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ scheduler = PNDMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ set_alpha_to_one=False,
+ skip_prk_steps=True,
+ )
+ controlnet = ControlNetModel(
+ block_out_channels=(4, 16),
+ layers_per_block=1,
+ in_channels=4,
+ norm_num_groups=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=16,
+ conditioning_embedding_out_channels=(8, 16),
+ )
+
+ vae.eval()
+ qformer.eval()
+ text_encoder.eval()
+
+ image_processor = BlipImageProcessor()
+
+ components = {
+ "text_encoder": text_encoder,
+ "vae": vae,
+ "qformer": qformer,
+ "unet": unet,
+ "tokenizer": tokenizer,
+ "scheduler": scheduler,
+ "controlnet": controlnet,
+ "image_processor": image_processor,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ np.random.seed(seed)
+ reference_image = np.random.rand(32, 32, 3) * 255
+ reference_image = Image.fromarray(reference_image.astype("uint8")).convert("RGBA")
+ cond_image = np.random.rand(32, 32, 3) * 255
+ cond_image = Image.fromarray(cond_image.astype("uint8")).convert("RGBA")
+
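+ # seeded per-device generators are not reliably supported on mps, so fall back to the global RNG there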
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "swimming underwater",
+ "generator": generator,
+ "reference_image": reference_image,
+ "condtioning_image": cond_image,
+ "source_subject_category": "dog",
+ "target_subject_category": "dog",
+ "height": 32,
+ "width": 32,
+ "guidance_scale": 7.5,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_blipdiffusion_controlnet(self):
+ device = "cpu"
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ image = pipe(**self.get_dummy_inputs(device))[0]
+ image_slice = image[0, -3:, -3:, 0]
+
+ assert image.shape == (1, 16, 16, 4)
+ expected_slice = np.array([0.7953, 0.7136, 0.6597, 0.4779, 0.7389, 0.4111, 0.5826, 0.4150, 0.8422])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
diff --git a/tests/pipelines/controlnet/test_controlnet_img2img.py b/tests/pipelines/controlnet/test_controlnet_img2img.py
new file mode 100644
index 0000000..89e2b38
--- /dev/null
+++ b/tests/pipelines/controlnet/test_controlnet_img2img.py
@@ -0,0 +1,479 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This model implementation is heavily inspired by https://github.com/haofanwang/ControlNet-for-Diffusers/
+
+import gc
+import random
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ ControlNetModel,
+ DDIMScheduler,
+ StableDiffusionControlNetImg2ImgPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.pipelines.controlnet.pipeline_controlnet import MultiControlNetModel
+from diffusers.utils import load_image
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_numpy,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+from diffusers.utils.torch_utils import randn_tensor
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+)
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+class ControlNetImg2ImgPipelineFastTests(
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionControlNetImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS.union({"control_image"})
+ image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=1,
+ )
+ torch.manual_seed(0)
+ controlnet = ControlNetModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ norm_num_groups=1,
+ )
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[4, 8],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+ control_image = randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ )
+ image = floats_tensor(control_image.shape, rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "image": image,
+ "control_image": control_image,
+ }
+
+ return inputs
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+
+class StableDiffusionMultiControlNetPipelineFastTests(
+ IPAdapterTesterMixin, PipelineTesterMixin, PipelineKarrasSchedulerTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionControlNetImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=1,
+ )
+ torch.manual_seed(0)
+
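+ # give the controlnet down blocks non-zero weights so each ControlNet's residuals visibly affect the output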
+ def init_weights(m):
+ if isinstance(m, torch.nn.Conv2d):
+ torch.nn.init.normal_(m.weight)
+ m.bias.data.fill_(1.0)
+
+ controlnet1 = ControlNetModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ norm_num_groups=1,
+ )
+ controlnet1.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ controlnet2 = ControlNetModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ norm_num_groups=1,
+ )
+ controlnet2.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[4, 8],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ controlnet = MultiControlNetModel([controlnet1, controlnet2])
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+
+ control_image = [
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ ]
+
+ image = floats_tensor(control_image[0].shape, rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "image": image,
+ "control_image": control_image,
+ }
+
+ return inputs
+
+ def test_control_guidance_switch(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+
+ scale = 10.0
+ steps = 4
+
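+ # run the same inputs with different control guidance windows, passed as scalars or as per-controlnet lists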
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_1 = pipe(**inputs)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_2 = pipe(**inputs, control_guidance_start=0.1, control_guidance_end=0.2)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_3 = pipe(**inputs, control_guidance_start=[0.1, 0.3], control_guidance_end=[0.2, 0.7])[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_4 = pipe(**inputs, control_guidance_start=0.4, control_guidance_end=[0.5, 0.8])[0]
+
+ # make sure that all outputs are different
+ assert np.sum(np.abs(output_1 - output_2)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_3)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_4)) > 1e-3
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ def test_save_pretrained_raise_not_implemented_exception(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ with tempfile.TemporaryDirectory() as tmpdir:
+ try:
+ # save_pretrained is not implemented for Multi-ControlNet
+ pipe.save_pretrained(tmpdir)
+ except NotImplementedError:
+ pass
+
+
+@slow
+@require_torch_gpu
+class ControlNetImg2ImgPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_canny(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
+
+ pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "evil space-punk bird"
+ control_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ ).resize((512, 512))
+ image = load_image(
+ "https://huggingface.co/lllyasviel/sd-controlnet-canny/resolve/main/images/bird.png"
+ ).resize((512, 512))
+
+ output = pipe(
+ prompt,
+ image,
+ control_image=control_image,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=50,
+ strength=0.6,
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/img2img.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 9e-2
+
+ def test_load_local(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_canny")
+ pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.enable_model_cpu_offload()
+
+ controlnet = ControlNetModel.from_single_file(
+ "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"
+ )
+ pipe_sf = StableDiffusionControlNetImg2ImgPipeline.from_single_file(
+ "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors",
+ safety_checker=None,
+ controlnet=controlnet,
+ scheduler_type="pndm",
+ )
+ pipe_sf.unet.set_default_attn_processor()
+ pipe_sf.enable_model_cpu_offload()
+
+ control_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ ).resize((512, 512))
+ image = load_image(
+ "https://huggingface.co/lllyasviel/sd-controlnet-canny/resolve/main/images/bird.png"
+ ).resize((512, 512))
+ prompt = "bird"
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe(
+ prompt,
+ image=image,
+ control_image=control_image,
+ strength=0.9,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=3,
+ ).images[0]
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output_sf = pipe_sf(
+ prompt,
+ image=image,
+ control_image=control_image,
+ strength=0.9,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=3,
+ ).images[0]
+
+ max_diff = numpy_cosine_similarity_distance(output_sf.flatten(), output.flatten())
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/controlnet/test_controlnet_inpaint.py b/tests/pipelines/controlnet/test_controlnet_inpaint.py
new file mode 100644
index 0000000..67e0da4
--- /dev/null
+++ b/tests/pipelines/controlnet/test_controlnet_inpaint.py
@@ -0,0 +1,605 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This model implementation is heavily based on:
+
+import gc
+import random
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ ControlNetModel,
+ DDIMScheduler,
+ StableDiffusionControlNetInpaintPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.pipelines.controlnet.pipeline_controlnet import MultiControlNetModel
+from diffusers.utils import load_image
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_numpy,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+from diffusers.utils.torch_utils import randn_tensor
+
+from ..pipeline_params import (
+ TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_INPAINTING_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class ControlNetInpaintPipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionControlNetInpaintPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ image_params = frozenset({"control_image"}) # skip `image` and `mask` for now, only test for control_image
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=9,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ torch.manual_seed(0)
+ controlnet = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ )
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+ control_image = randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ )
+ init_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ init_image = init_image.cpu().permute(0, 2, 3, 1)[0]
+
+ image = Image.fromarray(np.uint8(init_image)).convert("RGB").resize((64, 64))
+ mask_image = Image.fromarray(np.uint8(init_image + 4)).convert("RGB").resize((64, 64))
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "image": image,
+ "mask_image": mask_image,
+ "control_image": control_image,
+ }
+
+ return inputs
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+
+class ControlNetSimpleInpaintPipelineFastTests(ControlNetInpaintPipelineFastTests):
+ pipeline_class = StableDiffusionControlNetInpaintPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ image_params = frozenset([])
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ torch.manual_seed(0)
+ controlnet = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ )
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+
+class MultiControlNetInpaintPipelineFastTests(
+ PipelineTesterMixin, PipelineKarrasSchedulerTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionControlNetInpaintPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=9,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ torch.manual_seed(0)
+
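+ # give the controlnet down blocks non-zero weights so each ControlNet's residuals visibly affect the output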
+ def init_weights(m):
+ if isinstance(m, torch.nn.Conv2d):
+ torch.nn.init.normal_(m.weight)
+ m.bias.data.fill_(1.0)
+
+ controlnet1 = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ )
+ controlnet1.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ controlnet2 = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ cross_attention_dim=32,
+ conditioning_embedding_out_channels=(16, 32),
+ )
+ controlnet2.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ controlnet = MultiControlNetModel([controlnet1, controlnet2])
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+
+ control_image = [
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ ]
+ init_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ init_image = init_image.cpu().permute(0, 2, 3, 1)[0]
+
+ image = Image.fromarray(np.uint8(init_image)).convert("RGB").resize((64, 64))
+ mask_image = Image.fromarray(np.uint8(init_image + 4)).convert("RGB").resize((64, 64))
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "image": image,
+ "mask_image": mask_image,
+ "control_image": control_image,
+ }
+
+ return inputs
+
+ def test_control_guidance_switch(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+
+ scale = 10.0
+ steps = 4
+
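+ # exercise scalar and per-controlnet list forms of control_guidance_start / control_guidance_end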
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_1 = pipe(**inputs)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_2 = pipe(**inputs, control_guidance_start=0.1, control_guidance_end=0.2)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_3 = pipe(**inputs, control_guidance_start=[0.1, 0.3], control_guidance_end=[0.2, 0.7])[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_4 = pipe(**inputs, control_guidance_start=0.4, control_guidance_end=[0.5, 0.8])[0]
+
+ # make sure that all outputs are different
+ assert np.sum(np.abs(output_1 - output_2)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_3)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_4)) > 1e-3
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ def test_save_pretrained_raise_not_implemented_exception(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ with tempfile.TemporaryDirectory() as tmpdir:
+ try:
+ # save_pretrained is not implemented for Multi-ControlNet
+ pipe.save_pretrained(tmpdir)
+ except NotImplementedError:
+ pass
+
+
+@slow
+@require_torch_gpu
+class ControlNetInpaintPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_canny(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
+
+ pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None, controlnet=controlnet
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image = load_image(
+ "https://huggingface.co/lllyasviel/sd-controlnet-canny/resolve/main/images/bird.png"
+ ).resize((512, 512))
+
+ mask_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/input_bench_mask.png"
+ ).resize((512, 512))
+
+ prompt = "pitch black hole"
+
+ control_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ ).resize((512, 512))
+
+ output = pipe(
+ prompt,
+ image=image,
+ mask_image=mask_image,
+ control_image=control_image,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=3,
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/inpaint.npy"
+ )
+
+ assert np.abs(expected_image - image).max() < 9e-2
+
+ def test_inpaint(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint")
+
+ pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", safety_checker=None, controlnet=controlnet
+ )
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(33)
+
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main/stable_diffusion_inpaint/boy.png"
+ )
+ init_image = init_image.resize((512, 512))
+
+ mask_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main/stable_diffusion_inpaint/boy_mask.png"
+ )
+ mask_image = mask_image.resize((512, 512))
+
+ prompt = "a handsome man with ray-ban sunglasses"
+
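+ # build the inpaint conditioning image: RGB scaled to [0, 1] with masked pixels set to -1.0, the convention expected by control_v11p_sd15_inpaint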
+ def make_inpaint_condition(image, image_mask):
+ image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
+ image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0
+
+ assert image.shape[0:2] == image_mask.shape[0:2], "image and image_mask must have the same image size"
+ image[image_mask > 0.5] = -1.0 # set as masked pixel
+ image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
+ image = torch.from_numpy(image)
+ return image
+
+ control_image = make_inpaint_condition(init_image, mask_image)
+
+ output = pipe(
+ prompt,
+ image=init_image,
+ mask_image=mask_image,
+ control_image=control_image,
+ guidance_scale=9.0,
+ eta=1.0,
+ generator=generator,
+ num_inference_steps=20,
+ output_type="np",
+ )
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/boy_ray_ban.npy"
+ )
+
+ assert numpy_cosine_similarity_distance(expected_image.flatten(), image.flatten()) < 1e-2
+
+ def test_load_local(self):
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_canny")
+ pipe_1 = StableDiffusionControlNetInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None, controlnet=controlnet
+ )
+
+ controlnet = ControlNetModel.from_single_file(
+ "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"
+ )
+ pipe_2 = StableDiffusionControlNetInpaintPipeline.from_single_file(
+ "https://huggingface.co/runwayml/stable-diffusion-inpainting/blob/main/sd-v1-5-inpainting.ckpt",
+ safety_checker=None,
+ controlnet=controlnet,
+ )
+ control_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ ).resize((512, 512))
+ image = load_image(
+ "https://huggingface.co/lllyasviel/sd-controlnet-canny/resolve/main/images/bird.png"
+ ).resize((512, 512))
+ mask_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/input_bench_mask.png"
+ ).resize((512, 512))
+
+ pipes = [pipe_1, pipe_2]
+ images = []
+ for pipe in pipes:
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "bird"
+ output = pipe(
+ prompt,
+ image=image,
+ control_image=control_image,
+ mask_image=mask_image,
+ strength=0.9,
+ generator=generator,
+ output_type="np",
+ num_inference_steps=3,
+ )
+ images.append(output.images[0])
+
+ del pipe
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ max_diff = numpy_cosine_similarity_distance(images[0].flatten(), images[1].flatten())
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/controlnet/test_controlnet_inpaint_sdxl.py b/tests/pipelines/controlnet/test_controlnet_inpaint_sdxl.py
new file mode 100644
index 0000000..5f38263
--- /dev/null
+++ b/tests/pipelines/controlnet/test_controlnet_inpaint_sdxl.py
@@ -0,0 +1,304 @@
+# coding=utf-8
+# Copyright 2024 Harutatsu Akiyama, Jinbin Bai, and HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ ControlNetModel,
+ EulerDiscreteScheduler,
+ StableDiffusionXLControlNetInpaintPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import enable_full_determinism, floats_tensor, require_torch_gpu, torch_device
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_BATCH_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import (
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+class ControlNetPipelineSDXLFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionXLControlNetInpaintPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = frozenset(IMAGE_TO_IMAGE_IMAGE_PARAMS.union({"mask_image", "control_image"}))
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ controlnet = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ conditioning_embedding_out_channels=(16, 32),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0, img_res=64):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ # Get random floats in [0, 1] as image
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ mask_image = torch.ones_like(image)
+ controlnet_embedder_scale_factor = 2
+ control_image = (
+ floats_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ rng=random.Random(seed),
+ )
+ .to(device)
+ .cpu()
+ )
+ control_image = control_image.cpu().permute(0, 2, 3, 1)[0]
+ # Convert image, mask_image and control_image to [0, 255]
+ image = 255 * image
+ mask_image = 255 * mask_image
+ control_image = 255 * control_image
+ # Convert to PIL image
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((img_res, img_res))
+ mask_image = Image.fromarray(np.uint8(mask_image)).convert("L").resize((img_res, img_res))
+ control_image = Image.fromarray(np.uint8(control_image)).convert("RGB").resize((img_res, img_res))
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "image": init_image,
+ "mask_image": mask_image,
+ "control_image": control_image,
+ }
+ return inputs
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ @require_torch_gpu
+ def test_stable_diffusion_xl_offloads(self):
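+        # No offloading, model-level CPU offload, and sequential CPU offload should all
+        # produce (nearly) identical images.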
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ pipe.unet.set_default_attn_processor()
+
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_stable_diffusion_xl_multi_prompts(self):
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+
+ # forward with single prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt_2"] = inputs["prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt_2"] = "different prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ # manually set a negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same negative_prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = inputs["negative_prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = "different negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ def test_controlnet_sdxl_guess(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["guess_mode"] = True
+
+ output = sd_pipe(**inputs)
+ image_slice = output.images[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [0.5381963, 0.4836803, 0.45821992, 0.5577731, 0.51210403, 0.4794795, 0.59282357, 0.5647199, 0.43100584]
+ )
+
+ # make sure that it's equal
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-4
+
+ # TODO(Patrick, Sayak) - skip for now as this requires more refiner tests
+ def test_save_load_optional_components(self):
+ pass
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
diff --git a/tests/pipelines/controlnet/test_controlnet_sdxl.py b/tests/pipelines/controlnet/test_controlnet_sdxl.py
new file mode 100644
index 0000000..c82ce6c
--- /dev/null
+++ b/tests/pipelines/controlnet/test_controlnet_sdxl.py
@@ -0,0 +1,1173 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import copy
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ ControlNetModel,
+ EulerDiscreteScheduler,
+ HeunDiscreteScheduler,
+ LCMScheduler,
+ StableDiffusionXLControlNetPipeline,
+ StableDiffusionXLImg2ImgPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.models.unets.unet_2d_blocks import UNetMidBlock2D
+from diffusers.pipelines.controlnet.pipeline_controlnet import MultiControlNetModel
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_image,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+from diffusers.utils.torch_utils import randn_tensor
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_BATCH_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+ SDXLOptionalComponentsTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+class StableDiffusionXLControlNetPipelineFastTests(
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineTesterMixin,
+ SDXLOptionalComponentsTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionXLControlNetPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ time_cond_proj_dim=time_cond_proj_dim,
+ )
+ torch.manual_seed(0)
+ controlnet = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ conditioning_embedding_out_channels=(16, 32),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ torch.manual_seed(0)
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+ image = randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ )
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ "image": image,
+ }
+
+ return inputs
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+ @require_torch_gpu
+ def test_stable_diffusion_xl_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ pipe.unet.set_default_attn_processor()
+
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_stable_diffusion_xl_multi_prompts(self):
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+
+ # forward with single prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt_2"] = inputs["prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt_2"] = "different prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ # manually set a negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same negative_prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = inputs["negative_prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = "different negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ # copied from test_stable_diffusion_xl.py
+ def test_stable_diffusion_xl_prompt_embeds(self):
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # forward without prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt"] = 2 * [inputs["prompt"]]
+ inputs["num_images_per_prompt"] = 2
+
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 2 * [inputs.pop("prompt")]
+
+ (
+ prompt_embeds,
+ negative_prompt_embeds,
+ pooled_prompt_embeds,
+ negative_pooled_prompt_embeds,
+ ) = sd_pipe.encode_prompt(prompt)
+
+ output = sd_pipe(
+ **inputs,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_prompt_embeds,
+ pooled_prompt_embeds=pooled_prompt_embeds,
+ negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
+ )
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # make sure that it's equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_controlnet_sdxl_guess(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["guess_mode"] = True
+
+ output = sd_pipe(**inputs)
+ image_slice = output.images[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [0.7330834, 0.590667, 0.5667336, 0.6029023, 0.5679491, 0.5968194, 0.4032986, 0.47612396, 0.5089609]
+ )
+
+ # make sure that it's equal
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-4
+
+ def test_controlnet_sdxl_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
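+        # `time_cond_proj_dim` gives the dummy UNet the guidance-scale embedding input used
+        # by LCM-distilled models; the scheduler is swapped to `LCMScheduler` to match.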
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLControlNetPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.7799, 0.614, 0.6162, 0.7082, 0.6662, 0.5833, 0.4148, 0.5182, 0.4866])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ # copied from test_stable_diffusion_xl.py:test_stable_diffusion_two_xl_mixture_of_denoiser_fast
+ # with `StableDiffusionXLControlNetPipeline` instead of `StableDiffusionXLPipeline`
+ def test_controlnet_sdxl_two_mixture_of_denoiser_fast(self):
+ components = self.get_dummy_components()
+ pipe_1 = StableDiffusionXLControlNetPipeline(**components).to(torch_device)
+ pipe_1.unet.set_default_attn_processor()
+
+ components_without_controlnet = {k: v for k, v in components.items() if k != "controlnet"}
+ pipe_2 = StableDiffusionXLImg2ImgPipeline(**components_without_controlnet).to(torch_device)
+ pipe_2.unet.set_default_attn_processor()
+
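+        # Helper: split the denoising run across the two pipelines. `pipe_1` (base) handles
+        # timesteps >= `split` via `denoising_end`, `pipe_2` (img2img) handles the rest via
+        # `denoising_start`; a monkey-patched `scheduler.step` records the timesteps each
+        # pipeline actually executed so they can be checked against `expected_tss`.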
+ def assert_run_mixture(
+ num_steps,
+ split,
+ scheduler_cls_orig,
+ expected_tss,
+ num_train_timesteps=pipe_1.scheduler.config.num_train_timesteps,
+ ):
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = num_steps
+
+ class scheduler_cls(scheduler_cls_orig):
+ pass
+
+ pipe_1.scheduler = scheduler_cls.from_config(pipe_1.scheduler.config)
+ pipe_2.scheduler = scheduler_cls.from_config(pipe_2.scheduler.config)
+
+ # Let's retrieve the number of timesteps we want to use
+ pipe_1.scheduler.set_timesteps(num_steps)
+ expected_steps = pipe_1.scheduler.timesteps.tolist()
+
+ if pipe_1.scheduler.order == 2:
+ expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss))
+ expected_steps_2 = expected_steps_1[-1:] + list(filter(lambda ts: ts < split, expected_tss))
+ expected_steps = expected_steps_1 + expected_steps_2
+ else:
+ expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss))
+ expected_steps_2 = list(filter(lambda ts: ts < split, expected_tss))
+
+ # now we monkey patch step `done_steps`
+ # list into the step function for testing
+ done_steps = []
+ old_step = copy.copy(scheduler_cls.step)
+
+ def new_step(self, *args, **kwargs):
+ done_steps.append(args[1].cpu().item()) # args[1] is always the passed `t`
+ return old_step(self, *args, **kwargs)
+
+ scheduler_cls.step = new_step
+
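+            # `split` is an absolute training timestep; `denoising_end`/`denoising_start`
+            # expect a fraction of the schedule, hence `1.0 - split / num_train_timesteps`.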
+ inputs_1 = {
+ **inputs,
+ **{
+ "denoising_end": 1.0 - (split / num_train_timesteps),
+ "output_type": "latent",
+ },
+ }
+ latents = pipe_1(**inputs_1).images[0]
+
+ assert expected_steps_1 == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ inputs_2 = {
+ **inputs,
+ **{
+ "denoising_start": 1.0 - (split / num_train_timesteps),
+ "image": latents,
+ },
+ }
+ pipe_2(**inputs_2).images[0]
+
+ assert expected_steps_2 == done_steps[len(expected_steps_1) :]
+ assert expected_steps == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ steps = 10
+ for split in [300, 700]:
+ for scheduler_cls_timesteps in [
+ (EulerDiscreteScheduler, [901, 801, 701, 601, 501, 401, 301, 201, 101, 1]),
+ (
+ HeunDiscreteScheduler,
+ [
+ 901.0,
+ 801.0,
+ 801.0,
+ 701.0,
+ 701.0,
+ 601.0,
+ 601.0,
+ 501.0,
+ 501.0,
+ 401.0,
+ 401.0,
+ 301.0,
+ 301.0,
+ 201.0,
+ 201.0,
+ 101.0,
+ 101.0,
+ 1.0,
+ 1.0,
+ ],
+ ),
+ ]:
+ assert_run_mixture(steps, split, scheduler_cls_timesteps[0], scheduler_cls_timesteps[1])
+
+
+class StableDiffusionXLMultiControlNetPipelineFastTests(
+ PipelineTesterMixin, PipelineKarrasSchedulerTesterMixin, SDXLOptionalComponentsTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionXLControlNetPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ torch.manual_seed(0)
+
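+        # Re-initialize the Conv2d weights of each ControlNet's (zero-initialized) output
+        # blocks, presumably so both ControlNets contribute non-zero, distinct residuals.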
+ def init_weights(m):
+ if isinstance(m, torch.nn.Conv2d):
+                torch.nn.init.normal_(m.weight)
+ m.bias.data.fill_(1.0)
+
+ controlnet1 = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ conditioning_embedding_out_channels=(16, 32),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ controlnet1.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ controlnet2 = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ conditioning_embedding_out_channels=(16, 32),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ controlnet2.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ controlnet = MultiControlNetModel([controlnet1, controlnet2])
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+
+ images = [
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ ]
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ "image": images,
+ }
+
+ return inputs
+
+ def test_control_guidance_switch(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+
+ scale = 10.0
+ steps = 4
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_1 = pipe(**inputs)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_2 = pipe(**inputs, control_guidance_start=0.1, control_guidance_end=0.2)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_3 = pipe(**inputs, control_guidance_start=[0.1, 0.3], control_guidance_end=[0.2, 0.7])[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_4 = pipe(**inputs, control_guidance_start=0.4, control_guidance_end=[0.5, 0.8])[0]
+
+ # make sure that all outputs are different
+ assert np.sum(np.abs(output_1 - output_2)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_3)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_4)) > 1e-3
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ def test_save_load_optional_components(self):
+ return self._test_save_load_optional_components()
+
+
+class StableDiffusionXLMultiControlNetOneModelPipelineFastTests(
+ PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, SDXLOptionalComponentsTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionXLControlNetPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = frozenset([]) # TO_DO: add image_params once refactored VaeImageProcessor.preprocess
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ torch.manual_seed(0)
+
+ def init_weights(m):
+ if isinstance(m, torch.nn.Conv2d):
+                torch.nn.init.normal_(m.weight)
+ m.bias.data.fill_(1.0)
+
+ controlnet = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ conditioning_embedding_out_channels=(16, 32),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ controlnet.controlnet_down_blocks.apply(init_weights)
+
+ torch.manual_seed(0)
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ controlnet = MultiControlNetModel([controlnet])
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ controlnet_embedder_scale_factor = 2
+ images = [
+ randn_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ generator=generator,
+ device=torch.device(device),
+ ),
+ ]
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ "image": images,
+ }
+
+ return inputs
+
+ def test_control_guidance_switch(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+
+ scale = 10.0
+ steps = 4
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_1 = pipe(**inputs)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_2 = pipe(**inputs, control_guidance_start=0.1, control_guidance_end=0.2)[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_3 = pipe(
+ **inputs,
+ control_guidance_start=[0.1],
+ control_guidance_end=[0.2],
+ )[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = steps
+ inputs["controlnet_conditioning_scale"] = scale
+ output_4 = pipe(**inputs, control_guidance_start=0.4, control_guidance_end=[0.5])[0]
+
+ # make sure that all outputs are different
+ assert np.sum(np.abs(output_1 - output_2)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_3)) > 1e-3
+ assert np.sum(np.abs(output_1 - output_4)) > 1e-3
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+ def test_negative_conditions(self):
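+        # Passing SDXL negative micro-conditioning (original size, crop coords, target size)
+        # should change the generated image.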
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice_without_neg_cond = image[0, -3:, -3:, -1]
+
+ image = pipe(
+ **inputs,
+ negative_original_size=(512, 512),
+ negative_crops_coords_top_left=(0, 0),
+ negative_target_size=(1024, 1024),
+ ).images
+ image_slice_with_neg_cond = image[0, -3:, -3:, -1]
+
+ self.assertTrue(np.abs(image_slice_without_neg_cond - image_slice_with_neg_cond).max() > 1e-2)
+
+
+@slow
+@require_torch_gpu
+class ControlNetSDXLPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_canny(self):
+ controlnet = ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0")
+
+ pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet
+ )
+ pipe.enable_sequential_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "bird"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ )
+
+ images = pipe(prompt, image=image, generator=generator, output_type="np", num_inference_steps=3).images
+
+ assert images[0].shape == (768, 512, 3)
+
+ original_image = images[0, -3:, -3:, -1].flatten()
+ expected_image = np.array([0.4185, 0.4127, 0.4089, 0.4046, 0.4115, 0.4096, 0.4081, 0.4112, 0.3913])
+ assert np.allclose(original_image, expected_image, atol=1e-04)
+
+ def test_depth(self):
+ controlnet = ControlNetModel.from_pretrained("diffusers/controlnet-depth-sdxl-1.0")
+
+ pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet
+ )
+ pipe.enable_sequential_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "Stormtrooper's lecture"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/stormtrooper_depth.png"
+ )
+
+ images = pipe(prompt, image=image, generator=generator, output_type="np", num_inference_steps=3).images
+
+ assert images[0].shape == (512, 512, 3)
+
+ original_image = images[0, -3:, -3:, -1].flatten()
+ expected_image = np.array([0.4399, 0.5112, 0.5478, 0.4314, 0.472, 0.4823, 0.4647, 0.4957, 0.4853])
+ assert np.allclose(original_image, expected_image, atol=1e-04)
+
+ def test_download_ckpt_diff_format_is_same(self):
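+        # Load the same SDXL base checkpoint via `from_single_file` and `from_pretrained`
+        # and check that both pipelines produce near-identical images.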
+ controlnet = ControlNetModel.from_pretrained("diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16)
+ single_file_url = (
+ "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors"
+ )
+ pipe_single_file = StableDiffusionXLControlNetPipeline.from_single_file(
+ single_file_url, controlnet=controlnet, torch_dtype=torch.float16
+ )
+ pipe_single_file.unet.set_default_attn_processor()
+ pipe_single_file.enable_model_cpu_offload()
+ pipe_single_file.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ prompt = "Stormtrooper's lecture"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/stormtrooper_depth.png"
+ )
+ single_file_images = pipe_single_file(
+ prompt, image=image, generator=generator, output_type="np", num_inference_steps=2
+ ).images
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.enable_model_cpu_offload()
+ images = pipe(prompt, image=image, generator=generator, output_type="np", num_inference_steps=2).images
+
+ assert images[0].shape == (512, 512, 3)
+ assert single_file_images[0].shape == (512, 512, 3)
+
+ max_diff = numpy_cosine_similarity_distance(images[0].flatten(), single_file_images[0].flatten())
+ assert max_diff < 5e-2
+
+ def test_single_file_component_configs(self):
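+        # Component configs (text encoders, UNet, VAE) should match between single-file and
+        # `from_pretrained` loading, ignoring load-time-only keys such as `torch_dtype`.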
+ controlnet = ControlNetModel.from_pretrained(
+ "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16, variant="fp16"
+ )
+ pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ variant="fp16",
+ controlnet=controlnet,
+ torch_dtype=torch.float16,
+ )
+
+ single_file_url = (
+ "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors"
+ )
+ single_file_pipe = StableDiffusionXLControlNetPipeline.from_single_file(
+ single_file_url, controlnet=controlnet, torch_dtype=torch.float16
+ )
+
+ for param_name, param_value in single_file_pipe.text_encoder.config.to_dict().items():
+ if param_name in ["torch_dtype", "architectures", "_name_or_path"]:
+ continue
+ assert pipe.text_encoder.config.to_dict()[param_name] == param_value
+
+ for param_name, param_value in single_file_pipe.text_encoder_2.config.to_dict().items():
+ if param_name in ["torch_dtype", "architectures", "_name_or_path"]:
+ continue
+ assert pipe.text_encoder_2.config.to_dict()[param_name] == param_value
+
+ PARAMS_TO_IGNORE = ["torch_dtype", "_name_or_path", "architectures", "_use_default_values"]
+ for param_name, param_value in single_file_pipe.unet.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.unet.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.vae.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.vae.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+
+class StableDiffusionSSD1BControlNetPipelineFastTests(StableDiffusionXLControlNetPipelineFastTests):
+ def test_controlnet_sdxl_guess(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["guess_mode"] = True
+
+ output = sd_pipe(**inputs)
+ image_slice = output.images[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [0.6831671, 0.5702532, 0.5459845, 0.6299793, 0.58563006, 0.6033695, 0.4493941, 0.46132287, 0.5035841]
+ )
+
+ # make sure that it's equal
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-4
+
+ def test_controlnet_sdxl_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLControlNetPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.6850, 0.5135, 0.5545, 0.7033, 0.6617, 0.5971, 0.4165, 0.5480, 0.5070])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_conditioning_channels(self):
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ mid_block_type="UNetMidBlock2D",
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ time_cond_proj_dim=None,
+ )
+
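+        # `from_unet` should mirror the UNet's mid-block type and honor the requested number
+        # of conditioning channels.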
+ controlnet = ControlNetModel.from_unet(unet, conditioning_channels=4)
+ assert type(controlnet.mid_block) == UNetMidBlock2D
+ assert controlnet.conditioning_channels == 4
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ mid_block_type="UNetMidBlock2D",
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ time_cond_proj_dim=time_cond_proj_dim,
+ )
+ torch.manual_seed(0)
+ controlnet = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ conditioning_embedding_out_channels=(16, 32),
+ mid_block_type="UNetMidBlock2D",
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ torch.manual_seed(0)
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
diff --git a/tests/pipelines/controlnet/test_controlnet_sdxl_img2img.py b/tests/pipelines/controlnet/test_controlnet_sdxl_img2img.py
new file mode 100644
index 0000000..7d2ba8c
--- /dev/null
+++ b/tests/pipelines/controlnet/test_controlnet_sdxl_img2img.py
@@ -0,0 +1,351 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ ControlNetModel,
+ EulerDiscreteScheduler,
+ StableDiffusionXLControlNetImg2ImgPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import enable_full_determinism, floats_tensor, require_torch_gpu, torch_device
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+)
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+class ControlNetPipelineSDXLImg2ImgFastTests(
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionXLControlNetImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self, skip_first_text_encoder=False):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64 if not skip_first_text_encoder else 32,
+ )
+ torch.manual_seed(0)
+ controlnet = ControlNetModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ in_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ conditioning_embedding_out_channels=(16, 32),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ torch.manual_seed(0)
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "controlnet": controlnet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder if not skip_first_text_encoder else None,
+ "tokenizer": tokenizer if not skip_first_text_encoder else None,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "image_encoder": None,
+ "feature_extractor": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ controlnet_embedder_scale_factor = 2
+ image = floats_tensor(
+ (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
+ rng=random.Random(seed),
+ ).to(device)
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "image": image,
+ "control_image": image,
+ }
+
+ return inputs
+
+ def test_stable_diffusion_xl_controlnet_img2img(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array(
+ [0.5557202, 0.46418434, 0.46983826, 0.623529, 0.5557242, 0.49262643, 0.6070508, 0.5702978, 0.43777135]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_xl_controlnet_img2img_guess(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["guess_mode"] = True
+
+ output = sd_pipe(**inputs)
+ image_slice = output.images[0, -3:, -3:, -1]
+ assert output.images.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array(
+ [0.5557202, 0.46418434, 0.46983826, 0.623529, 0.5557242, 0.49262643, 0.6070508, 0.5702978, 0.43777135]
+ )
+
+ # make sure that it's equal
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ # TODO(Patrick, Sayak) - skip for now as this requires more refiner tests
+ def test_save_load_optional_components(self):
+ pass
+
+ @require_torch_gpu
+ def test_stable_diffusion_xl_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ pipe.unet.set_default_attn_processor()
+
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_stable_diffusion_xl_multi_prompts(self):
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+
+ # forward with single prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt_2"] = inputs["prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt_2"] = "different prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ # manually set a negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same negative_prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = inputs["negative_prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = "different negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ # copied from test_stable_diffusion_xl.py
+ def test_stable_diffusion_xl_prompt_embeds(self):
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # forward without prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt"] = 2 * [inputs["prompt"]]
+ inputs["num_images_per_prompt"] = 2
+
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 2 * [inputs.pop("prompt")]
+
+ (
+ prompt_embeds,
+ negative_prompt_embeds,
+ pooled_prompt_embeds,
+ negative_pooled_prompt_embeds,
+ ) = sd_pipe.encode_prompt(prompt)
+
+ output = sd_pipe(
+ **inputs,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_prompt_embeds,
+ pooled_prompt_embeds=pooled_prompt_embeds,
+ negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
+ )
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # make sure that it's equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
diff --git a/tests/pipelines/controlnet/test_flax_controlnet.py b/tests/pipelines/controlnet/test_flax_controlnet.py
new file mode 100644
index 0000000..db19bd8
--- /dev/null
+++ b/tests/pipelines/controlnet/test_flax_controlnet.py
@@ -0,0 +1,127 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+from diffusers import FlaxControlNetModel, FlaxStableDiffusionControlNetPipeline
+from diffusers.utils import is_flax_available, load_image
+from diffusers.utils.testing_utils import require_flax, slow
+
+
+if is_flax_available():
+ import jax
+ import jax.numpy as jnp
+ from flax.jax_utils import replicate
+ from flax.training.common_utils import shard
+
+
+@slow
+@require_flax
+class FlaxControlNetPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+
+ def test_canny(self):
+ controlnet, controlnet_params = FlaxControlNetModel.from_pretrained(
+ "lllyasviel/sd-controlnet-canny", from_pt=True, dtype=jnp.bfloat16
+ )
+ pipe, params = FlaxStableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", controlnet=controlnet, from_pt=True, dtype=jnp.bfloat16
+ )
+ params["controlnet"] = controlnet_params
+
+ prompts = "bird"
+ num_samples = jax.device_count()
+ prompt_ids = pipe.prepare_text_inputs([prompts] * num_samples)
+
+ canny_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
+ )
+ processed_image = pipe.prepare_image_inputs([canny_image] * num_samples)
+
+ rng = jax.random.PRNGKey(0)
+ rng = jax.random.split(rng, jax.device_count())
+
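+        # Replicate the params and shard the inputs across all available devices for the
+        # pmapped (`jit=True`) call; one sample is generated per device.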
+ p_params = replicate(params)
+ prompt_ids = shard(prompt_ids)
+ processed_image = shard(processed_image)
+
+ images = pipe(
+ prompt_ids=prompt_ids,
+ image=processed_image,
+ params=p_params,
+ prng_seed=rng,
+ num_inference_steps=50,
+ jit=True,
+ ).images
+ assert images.shape == (jax.device_count(), 1, 768, 512, 3)
+
+ images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
+ image_slice = images[0, 253:256, 253:256, -1]
+
+ output_slice = jnp.asarray(jax.device_get(image_slice.flatten()))
+ expected_slice = jnp.array(
+ [0.167969, 0.116699, 0.081543, 0.154297, 0.132812, 0.108887, 0.169922, 0.169922, 0.205078]
+ )
+ print(f"output_slice: {output_slice}")
+ assert jnp.abs(output_slice - expected_slice).max() < 1e-2
+
+ def test_pose(self):
+ controlnet, controlnet_params = FlaxControlNetModel.from_pretrained(
+ "lllyasviel/sd-controlnet-openpose", from_pt=True, dtype=jnp.bfloat16
+ )
+ pipe, params = FlaxStableDiffusionControlNetPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", controlnet=controlnet, from_pt=True, dtype=jnp.bfloat16
+ )
+ params["controlnet"] = controlnet_params
+
+ prompts = "Chef in the kitchen"
+ num_samples = jax.device_count()
+ prompt_ids = pipe.prepare_text_inputs([prompts] * num_samples)
+
+ pose_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png"
+ )
+ processed_image = pipe.prepare_image_inputs([pose_image] * num_samples)
+
+ rng = jax.random.PRNGKey(0)
+ rng = jax.random.split(rng, jax.device_count())
+
+ p_params = replicate(params)
+ prompt_ids = shard(prompt_ids)
+ processed_image = shard(processed_image)
+
+ images = pipe(
+ prompt_ids=prompt_ids,
+ image=processed_image,
+ params=p_params,
+ prng_seed=rng,
+ num_inference_steps=50,
+ jit=True,
+ ).images
+ assert images.shape == (jax.device_count(), 1, 768, 512, 3)
+
+ images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
+ image_slice = images[0, 253:256, 253:256, -1]
+
+ output_slice = jnp.asarray(jax.device_get(image_slice.flatten()))
+ expected_slice = jnp.array(
+ [[0.271484, 0.261719, 0.275391, 0.277344, 0.279297, 0.291016, 0.294922, 0.302734, 0.302734]]
+ )
+ print(f"output_slice: {output_slice}")
+ assert jnp.abs(output_slice - expected_slice).max() < 1e-2
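
The Flax ControlNet tests above exercise the data-parallel inference path: parameters are `replicate`d across devices, inputs are `shard`ed, and the pipeline is called with `jit=True`. A minimal sketch of that pattern outside the test harness, using the same checkpoints and preprocessing calls the tests rely on (the prompt and step count are illustrative):

```python
import jax
import jax.numpy as jnp
from flax.jax_utils import replicate
from flax.training.common_utils import shard

from diffusers import FlaxControlNetModel, FlaxStableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load the ControlNet and the Stable Diffusion pipeline in bfloat16, converting from PyTorch weights.
controlnet, controlnet_params = FlaxControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", from_pt=True, dtype=jnp.bfloat16
)
pipe, params = FlaxStableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, from_pt=True, dtype=jnp.bfloat16
)
params["controlnet"] = controlnet_params

# One sample per device: tokenize the prompt and preprocess the conditioning image.
num_samples = jax.device_count()
prompt_ids = pipe.prepare_text_inputs(["bird"] * num_samples)
canny_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
)
processed_image = pipe.prepare_image_inputs([canny_image] * num_samples)

# Replicate the parameters, shard the per-device inputs, then run the pmapped pipeline.
rng = jax.random.split(jax.random.PRNGKey(0), jax.device_count())
images = pipe(
    prompt_ids=shard(prompt_ids),
    image=shard(processed_image),
    params=replicate(params),
    prng_seed=rng,
    num_inference_steps=50,
    jit=True,
).images  # shape: (devices, 1, 768, 512, 3), as asserted in the tests

```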
diff --git a/tests/pipelines/dance_diffusion/__init__.py b/tests/pipelines/dance_diffusion/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/dance_diffusion/test_dance_diffusion.py b/tests/pipelines/dance_diffusion/test_dance_diffusion.py
new file mode 100644
index 0000000..212505c
--- /dev/null
+++ b/tests/pipelines/dance_diffusion/test_dance_diffusion.py
@@ -0,0 +1,161 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import DanceDiffusionPipeline, IPNDMScheduler, UNet1DModel
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, require_torch_gpu, skip_mps, torch_device
+
+from ..pipeline_params import UNCONDITIONAL_AUDIO_GENERATION_BATCH_PARAMS, UNCONDITIONAL_AUDIO_GENERATION_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class DanceDiffusionPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = DanceDiffusionPipeline
+ params = UNCONDITIONAL_AUDIO_GENERATION_PARAMS
+ required_optional_params = PipelineTesterMixin.required_optional_params - {
+ "callback",
+ "latents",
+ "callback_steps",
+ "output_type",
+ "num_images_per_prompt",
+ }
+ batch_params = UNCONDITIONAL_AUDIO_GENERATION_BATCH_PARAMS
+ test_attention_slicing = False
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet1DModel(
+ block_out_channels=(32, 32, 64),
+ extra_in_channels=16,
+ sample_size=512,
+ sample_rate=16_000,
+ in_channels=2,
+ out_channels=2,
+ flip_sin_to_cos=True,
+ use_timestep_embedding=False,
+ time_embedding_type="fourier",
+ mid_block_type="UNetMidBlock1D",
+ down_block_types=("DownBlock1DNoSkip", "DownBlock1D", "AttnDownBlock1D"),
+ up_block_types=("AttnUpBlock1D", "UpBlock1D", "UpBlock1DNoSkip"),
+ )
+ scheduler = IPNDMScheduler()
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "batch_size": 1,
+ "generator": generator,
+ "num_inference_steps": 4,
+ }
+ return inputs
+
+ def test_dance_diffusion(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = DanceDiffusionPipeline(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = pipe(**inputs)
+ audio = output.audios
+
+ audio_slice = audio[0, -3:, -3:]
+
+ assert audio.shape == (1, 2, components["unet"].sample_size)
+ expected_slice = np.array([-0.7265, 1.0000, -0.8388, 0.1175, 0.9498, -1.0000])
+ assert np.abs(audio_slice.flatten() - expected_slice).max() < 1e-2
+
+ @skip_mps
+ def test_save_load_local(self):
+ return super().test_save_load_local()
+
+ @skip_mps
+ def test_dict_tuple_outputs_equivalent(self):
+ return super().test_dict_tuple_outputs_equivalent(expected_max_difference=3e-3)
+
+ @skip_mps
+ def test_save_load_optional_components(self):
+ return super().test_save_load_optional_components()
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ return super().test_attention_slicing_forward_pass()
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+
+@nightly
+@require_torch_gpu
+class PipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_dance_diffusion(self):
+ device = torch_device
+
+ pipe = DanceDiffusionPipeline.from_pretrained("harmonai/maestro-150k")
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(0)
+ output = pipe(generator=generator, num_inference_steps=100, audio_length_in_s=4.096)
+ audio = output.audios
+
+ audio_slice = audio[0, -3:, -3:]
+
+ assert audio.shape == (1, 2, pipe.unet.sample_size)
+ expected_slice = np.array([-0.0192, -0.0231, -0.0318, -0.0059, 0.0002, -0.0020])
+
+ assert np.abs(audio_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_dance_diffusion_fp16(self):
+ device = torch_device
+
+ pipe = DanceDiffusionPipeline.from_pretrained("harmonai/maestro-150k", torch_dtype=torch.float16)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(0)
+ output = pipe(generator=generator, num_inference_steps=100, audio_length_in_s=4.096)
+ audio = output.audios
+
+ audio_slice = audio[0, -3:, -3:]
+
+ assert audio.shape == (1, 2, pipe.unet.sample_size)
+ expected_slice = np.array([-0.0367, -0.0488, -0.0771, -0.0525, -0.0444, -0.0341])
+
+ assert np.abs(audio_slice.flatten() - expected_slice).max() < 1e-2
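
Stripped of its assertions, the integration test above amounts to a short unconditional audio generation recipe. A minimal sketch follows; the WAV export at the end is illustrative and assumes `scipy` is available, that `audios` comes back as a NumPy `(channels, samples)` array as the slicing above treats it, and that the sample rate can be read from the UNet config:

```python
import torch
from scipy.io.wavfile import write as write_wav

from diffusers import DanceDiffusionPipeline

pipe = DanceDiffusionPipeline.from_pretrained("harmonai/maestro-150k")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

generator = torch.manual_seed(0)
output = pipe(generator=generator, num_inference_steps=100, audio_length_in_s=4.096)

# `audios` has shape (batch, channels, samples); transpose to (samples, channels) for scipy,
# which writes float data as a float WAV file.
audio = output.audios[0]
write_wav("maestro_sample.wav", pipe.unet.config.sample_rate, audio.T)
```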
diff --git a/tests/pipelines/ddim/__init__.py b/tests/pipelines/ddim/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/ddim/test_ddim.py b/tests/pipelines/ddim/test_ddim.py
new file mode 100644
index 0000000..0d84a8e
--- /dev/null
+++ b/tests/pipelines/ddim/test_ddim.py
@@ -0,0 +1,143 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import DDIMPipeline, DDIMScheduler, UNet2DModel
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, slow, torch_device
+
+from ..pipeline_params import UNCONDITIONAL_IMAGE_GENERATION_BATCH_PARAMS, UNCONDITIONAL_IMAGE_GENERATION_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class DDIMPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = DDIMPipeline
+ params = UNCONDITIONAL_IMAGE_GENERATION_PARAMS
+ required_optional_params = PipelineTesterMixin.required_optional_params - {
+ "num_images_per_prompt",
+ "latents",
+ "callback",
+ "callback_steps",
+ }
+ batch_params = UNCONDITIONAL_IMAGE_GENERATION_BATCH_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=3,
+ out_channels=3,
+ down_block_types=("DownBlock2D", "AttnDownBlock2D"),
+ up_block_types=("AttnUpBlock2D", "UpBlock2D"),
+ )
+ scheduler = DDIMScheduler()
+ components = {"unet": unet, "scheduler": scheduler}
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "batch_size": 1,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_inference(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ self.assertEqual(image.shape, (1, 32, 32, 3))
+ expected_slice = np.array(
+ [1.000e00, 5.717e-01, 4.717e-01, 1.000e00, 0.000e00, 1.000e00, 3.000e-04, 0.000e00, 9.000e-04]
+ )
+ max_diff = np.abs(image_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=3e-3)
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=3e-3)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+
+@slow
+@require_torch_gpu
+class DDIMPipelineIntegrationTests(unittest.TestCase):
+ def test_inference_cifar10(self):
+ model_id = "google/ddpm-cifar10-32"
+
+ unet = UNet2DModel.from_pretrained(model_id)
+ scheduler = DDIMScheduler()
+
+ ddim = DDIMPipeline(unet=unet, scheduler=scheduler)
+ ddim.to(torch_device)
+ ddim.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(0)
+ image = ddim(generator=generator, eta=0.0, output_type="numpy").images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.1723, 0.1617, 0.1600, 0.1626, 0.1497, 0.1513, 0.1505, 0.1442, 0.1453])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_inference_ema_bedroom(self):
+ model_id = "google/ddpm-ema-bedroom-256"
+
+ unet = UNet2DModel.from_pretrained(model_id)
+ scheduler = DDIMScheduler.from_pretrained(model_id)
+
+ ddpm = DDIMPipeline(unet=unet, scheduler=scheduler)
+ ddpm.to(torch_device)
+ ddpm.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(0)
+ image = ddpm(generator=generator, output_type="numpy").images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 256, 256, 3)
+ expected_slice = np.array([0.0060, 0.0201, 0.0344, 0.0024, 0.0018, 0.0002, 0.0022, 0.0000, 0.0069])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
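
The DDIM integration test also doubles as a compact recipe for assembling a pipeline from its parts: a pretrained `UNet2DModel` paired with a freshly configured `DDIMScheduler`. A minimal sketch of that composition for unconditional CIFAR-10 generation (the PIL conversion and file name at the end are illustrative):

```python
import torch
from PIL import Image

from diffusers import DDIMPipeline, DDIMScheduler, UNet2DModel

# Reuse the pretrained denoiser but pair it with a DDIM scheduler instead of the original DDPM one.
unet = UNet2DModel.from_pretrained("google/ddpm-cifar10-32")
scheduler = DDIMScheduler()
pipe = DDIMPipeline(unet=unet, scheduler=scheduler)
pipe.to("cuda" if torch.cuda.is_available() else "cpu")

generator = torch.manual_seed(0)
image = pipe(generator=generator, eta=0.0, output_type="numpy").images[0]  # (32, 32, 3) in [0, 1]

Image.fromarray((image * 255).round().astype("uint8")).save("ddim_cifar10.png")
```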
diff --git a/tests/pipelines/ddpm/__init__.py b/tests/pipelines/ddpm/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/ddpm/test_ddpm.py b/tests/pipelines/ddpm/test_ddpm.py
new file mode 100644
index 0000000..bf25ced
--- /dev/null
+++ b/tests/pipelines/ddpm/test_ddpm.py
@@ -0,0 +1,111 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import DDPMPipeline, DDPMScheduler, UNet2DModel
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, slow, torch_device
+
+
+enable_full_determinism()
+
+
+class DDPMPipelineFastTests(unittest.TestCase):
+ @property
+ def dummy_uncond_unet(self):
+ torch.manual_seed(0)
+ model = UNet2DModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=3,
+ out_channels=3,
+ down_block_types=("DownBlock2D", "AttnDownBlock2D"),
+ up_block_types=("AttnUpBlock2D", "UpBlock2D"),
+ )
+ return model
+
+ def test_fast_inference(self):
+ device = "cpu"
+ unet = self.dummy_uncond_unet
+ scheduler = DDPMScheduler()
+
+ ddpm = DDPMPipeline(unet=unet, scheduler=scheduler)
+ ddpm.to(device)
+ ddpm.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image = ddpm(generator=generator, num_inference_steps=2, output_type="numpy").images
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_from_tuple = ddpm(generator=generator, num_inference_steps=2, output_type="numpy", return_dict=False)[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array(
+ [9.956e-01, 5.785e-01, 4.675e-01, 9.930e-01, 0.0, 1.000, 1.199e-03, 2.648e-04, 5.101e-04]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_inference_predict_sample(self):
+ unet = self.dummy_uncond_unet
+ scheduler = DDPMScheduler(prediction_type="sample")
+
+ ddpm = DDPMPipeline(unet=unet, scheduler=scheduler)
+ ddpm.to(torch_device)
+ ddpm.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(0)
+ image = ddpm(generator=generator, num_inference_steps=2, output_type="numpy").images
+
+ generator = torch.manual_seed(0)
+ image_eps = ddpm(generator=generator, num_inference_steps=2, output_type="numpy")[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_eps_slice = image_eps[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ tolerance = 1e-2 if torch_device != "mps" else 3e-2
+ assert np.abs(image_slice.flatten() - image_eps_slice.flatten()).max() < tolerance
+
+
+@slow
+@require_torch_gpu
+class DDPMPipelineIntegrationTests(unittest.TestCase):
+ def test_inference_cifar10(self):
+ model_id = "google/ddpm-cifar10-32"
+
+ unet = UNet2DModel.from_pretrained(model_id)
+ scheduler = DDPMScheduler.from_pretrained(model_id)
+
+ ddpm = DDPMPipeline(unet=unet, scheduler=scheduler)
+ ddpm.to(torch_device)
+ ddpm.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(0)
+ image = ddpm(generator=generator, output_type="numpy").images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.4200, 0.3588, 0.1939, 0.3847, 0.3382, 0.2647, 0.4155, 0.3582, 0.3385])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
diff --git a/tests/pipelines/deepfloyd_if/__init__.py b/tests/pipelines/deepfloyd_if/__init__.py
new file mode 100644
index 0000000..094254a
--- /dev/null
+++ b/tests/pipelines/deepfloyd_if/__init__.py
@@ -0,0 +1,272 @@
+import tempfile
+
+import numpy as np
+import torch
+from transformers import AutoTokenizer, T5EncoderModel
+
+from diffusers import DDPMScheduler, UNet2DConditionModel
+from diffusers.models.attention_processor import AttnAddedKVProcessor
+from diffusers.pipelines.deepfloyd_if import IFWatermarker
+from diffusers.utils.testing_utils import torch_device
+
+from ..test_pipelines_common import to_np
+
+
+# WARN: the hf-internal-testing/tiny-random-t5 text encoder has some non-determinism in the `save_load` tests.
+
+
+class IFPipelineTesterMixin:
+ def _get_dummy_components(self):
+ torch.manual_seed(0)
+ text_encoder = T5EncoderModel.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ torch.manual_seed(0)
+ tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ sample_size=32,
+ layers_per_block=1,
+ block_out_channels=[32, 64],
+ down_block_types=[
+ "ResnetDownsampleBlock2D",
+ "SimpleCrossAttnDownBlock2D",
+ ],
+ mid_block_type="UNetMidBlock2DSimpleCrossAttn",
+ up_block_types=["SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"],
+ in_channels=3,
+ out_channels=6,
+ cross_attention_dim=32,
+ encoder_hid_dim=32,
+ attention_head_dim=8,
+ addition_embed_type="text",
+ addition_embed_type_num_heads=2,
+ cross_attention_norm="group_norm",
+ resnet_time_scale_shift="scale_shift",
+ act_fn="gelu",
+ )
+ unet.set_attn_processor(AttnAddedKVProcessor()) # For reproducibility tests
+
+ torch.manual_seed(0)
+ scheduler = DDPMScheduler(
+ num_train_timesteps=1000,
+ beta_schedule="squaredcos_cap_v2",
+ beta_start=0.0001,
+ beta_end=0.02,
+ thresholding=True,
+ dynamic_thresholding_ratio=0.95,
+ sample_max_value=1.0,
+ prediction_type="epsilon",
+ variance_type="learned_range",
+ )
+
+ torch.manual_seed(0)
+ watermarker = IFWatermarker()
+
+ return {
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "unet": unet,
+ "scheduler": scheduler,
+ "watermarker": watermarker,
+ "safety_checker": None,
+ "feature_extractor": None,
+ }
+
+ def _get_superresolution_dummy_components(self):
+ torch.manual_seed(0)
+ text_encoder = T5EncoderModel.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ torch.manual_seed(0)
+ tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ sample_size=32,
+ layers_per_block=[1, 2],
+ block_out_channels=[32, 64],
+ down_block_types=[
+ "ResnetDownsampleBlock2D",
+ "SimpleCrossAttnDownBlock2D",
+ ],
+ mid_block_type="UNetMidBlock2DSimpleCrossAttn",
+ up_block_types=["SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"],
+ in_channels=6,
+ out_channels=6,
+ cross_attention_dim=32,
+ encoder_hid_dim=32,
+ attention_head_dim=8,
+ addition_embed_type="text",
+ addition_embed_type_num_heads=2,
+ cross_attention_norm="group_norm",
+ resnet_time_scale_shift="scale_shift",
+ act_fn="gelu",
+ class_embed_type="timestep",
+ mid_block_scale_factor=1.414,
+ time_embedding_act_fn="gelu",
+ time_embedding_dim=32,
+ )
+ unet.set_attn_processor(AttnAddedKVProcessor()) # For reproducibility tests
+
+ torch.manual_seed(0)
+ scheduler = DDPMScheduler(
+ num_train_timesteps=1000,
+ beta_schedule="squaredcos_cap_v2",
+ beta_start=0.0001,
+ beta_end=0.02,
+ thresholding=True,
+ dynamic_thresholding_ratio=0.95,
+ sample_max_value=1.0,
+ prediction_type="epsilon",
+ variance_type="learned_range",
+ )
+
+ torch.manual_seed(0)
+ image_noising_scheduler = DDPMScheduler(
+ num_train_timesteps=1000,
+ beta_schedule="squaredcos_cap_v2",
+ beta_start=0.0001,
+ beta_end=0.02,
+ )
+
+ torch.manual_seed(0)
+ watermarker = IFWatermarker()
+
+ return {
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "unet": unet,
+ "scheduler": scheduler,
+ "image_noising_scheduler": image_noising_scheduler,
+ "watermarker": watermarker,
+ "safety_checker": None,
+ "feature_extractor": None,
+ }
+
+    # This test is modified from the base class because the IF pipelines set the text
+    # encoder as optional, with the intention that the user encodes the prompt once and
+    # then passes the embeddings directly to the pipeline. The base class test uses the
+    # unmodified arguments from `self.get_dummy_inputs`, which would pass the unencoded
+    # prompt to the pipeline when the text encoder is set to None, raising an error.
+    # So this test reflects the intended usage of setting the text encoder to None.
+ def _test_save_load_optional_components(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ prompt = inputs["prompt"]
+ generator = inputs["generator"]
+ num_inference_steps = inputs["num_inference_steps"]
+ output_type = inputs["output_type"]
+
+ if "image" in inputs:
+ image = inputs["image"]
+ else:
+ image = None
+
+ if "mask_image" in inputs:
+ mask_image = inputs["mask_image"]
+ else:
+ mask_image = None
+
+ if "original_image" in inputs:
+ original_image = inputs["original_image"]
+ else:
+ original_image = None
+
+ prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(prompt)
+
+ # inputs with prompt converted to embeddings
+ inputs = {
+ "prompt_embeds": prompt_embeds,
+ "negative_prompt_embeds": negative_prompt_embeds,
+ "generator": generator,
+ "num_inference_steps": num_inference_steps,
+ "output_type": output_type,
+ }
+
+ if image is not None:
+ inputs["image"] = image
+
+ if mask_image is not None:
+ inputs["mask_image"] = mask_image
+
+ if original_image is not None:
+ inputs["original_image"] = original_image
+
+ # set all optional components to None
+ for optional_component in pipe._optional_components:
+ setattr(pipe, optional_component, None)
+
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ pipe_loaded.unet.set_attn_processor(AttnAddedKVProcessor()) # For reproducibility tests
+
+ for optional_component in pipe._optional_components:
+ self.assertTrue(
+ getattr(pipe_loaded, optional_component) is None,
+ f"`{optional_component}` did not stay set to None after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ generator = inputs["generator"]
+ num_inference_steps = inputs["num_inference_steps"]
+ output_type = inputs["output_type"]
+
+ # inputs with prompt converted to embeddings
+ inputs = {
+ "prompt_embeds": prompt_embeds,
+ "negative_prompt_embeds": negative_prompt_embeds,
+ "generator": generator,
+ "num_inference_steps": num_inference_steps,
+ "output_type": output_type,
+ }
+
+ if image is not None:
+ inputs["image"] = image
+
+ if mask_image is not None:
+ inputs["mask_image"] = mask_image
+
+ if original_image is not None:
+ inputs["original_image"] = original_image
+
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(max_diff, 1e-4)
+
+ # Modified from `PipelineTesterMixin` to set the attn processor as it's not serialized.
+ # This should be handled in the base test and then this method can be removed.
+ def _test_save_load_local(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ pipe_loaded.unet.set_attn_processor(AttnAddedKVProcessor()) # For reproducibility tests
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(max_diff, 1e-4)
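
The comment in `_test_save_load_optional_components` above describes the intended IF workflow: encode the prompt once with the T5 text encoder, then hand the embeddings to the pipeline directly. A minimal sketch of that pattern with the stage I checkpoint used by the slow tests, using the `encode_prompt` and `prompt_embeds`/`negative_prompt_embeds` arguments exercised in this mixin (the prompt string is illustrative):

```python
import torch

from diffusers import IFPipeline

# Stage I pipeline in fp16 with CPU offload, as in the slow tests.
pipe = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# Encode once; the returned embeddings can be reused across many generations,
# or passed to a pipeline whose optional text encoder has been set to None.
prompt_embeds, negative_prompt_embeds = pipe.encode_prompt("anime turtle")

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    num_inference_steps=50,
    output_type="pil",
).images[0]
```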
diff --git a/tests/pipelines/deepfloyd_if/test_if.py b/tests/pipelines/deepfloyd_if/test_if.py
new file mode 100644
index 0000000..96fd013
--- /dev/null
+++ b/tests/pipelines/deepfloyd_if/test_if.py
@@ -0,0 +1,120 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import torch
+
+from diffusers import (
+ IFPipeline,
+)
+from diffusers.models.attention_processor import AttnAddedKVProcessor
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import load_numpy, require_torch_gpu, skip_mps, slow, torch_device
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+from . import IFPipelineTesterMixin
+
+
+@skip_mps
+class IFPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, unittest.TestCase):
+ pipeline_class = IFPipeline
+ params = TEXT_TO_IMAGE_PARAMS - {"width", "height", "latents"}
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+
+ def get_dummy_components(self):
+ return self._get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+
+ return inputs
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self):
+ # Due to non-determinism in save load of the hf-internal-testing/tiny-random-t5 text encoder
+ super().test_save_load_float16(expected_max_diff=1e-1)
+
+ def test_attention_slicing_forward_pass(self):
+ self._test_attention_slicing_forward_pass(expected_max_diff=1e-2)
+
+ def test_save_load_local(self):
+ self._test_save_load_local()
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(
+ expected_max_diff=1e-2,
+ )
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=1e-3)
+
+
+@slow
+@require_torch_gpu
+class IFPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_if_text_to_image(self):
+ pipe = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
+ pipe.unet.set_attn_processor(AttnAddedKVProcessor())
+ pipe.enable_model_cpu_offload()
+
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.empty_cache()
+ torch.cuda.reset_peak_memory_stats()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe(
+ prompt="anime turtle",
+ num_inference_steps=2,
+ generator=generator,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes < 12 * 10**9
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/if/test_if.npy"
+ )
+ assert_mean_pixel_difference(image, expected_image)
+ pipe.remove_all_hooks()
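
All of the IF slow tests share the same memory-budget pattern: load in fp16, enable CPU offload, reset the CUDA peak-memory counters, run, and assert that the peak stayed under roughly 12 GB. Stripped of the pipeline specifics, the measurement scaffold looks roughly like the helper below; `run_pipeline` is a hypothetical stand-in for any of the pipeline calls in these tests, and a CUDA device is assumed:

```python
import torch


def peak_cuda_memory_bytes(run_pipeline):
    """Run a callable and report the peak CUDA memory it allocated (assumes a CUDA device)."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    result = run_pipeline()

    peak = torch.cuda.max_memory_allocated()
    return result, peak


# Hypothetical usage inside a test: fail if the run needed more than ~12 GB, as the IF slow tests do.
# result, peak = peak_cuda_memory_bytes(lambda: pipe(prompt="anime turtle", num_inference_steps=2))
# assert peak < 12 * 10**9
```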
diff --git a/tests/pipelines/deepfloyd_if/test_if_img2img.py b/tests/pipelines/deepfloyd_if/test_if_img2img.py
new file mode 100644
index 0000000..17a5e37
--- /dev/null
+++ b/tests/pipelines/deepfloyd_if/test_if_img2img.py
@@ -0,0 +1,131 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import torch
+
+from diffusers import IFImg2ImgPipeline
+from diffusers.models.attention_processor import AttnAddedKVProcessor
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import floats_tensor, load_numpy, require_torch_gpu, skip_mps, slow, torch_device
+
+from ..pipeline_params import (
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+)
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+from . import IFPipelineTesterMixin
+
+
+@skip_mps
+class IFImg2ImgPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, unittest.TestCase):
+ pipeline_class = IFImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"width", "height"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+
+ def get_dummy_components(self):
+ return self._get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+
+ return inputs
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=1e-3)
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self):
+ # Due to non-determinism in save load of the hf-internal-testing/tiny-random-t5 text encoder
+ super().test_save_load_float16(expected_max_diff=1e-1)
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=1e-1)
+
+ def test_attention_slicing_forward_pass(self):
+ self._test_attention_slicing_forward_pass(expected_max_diff=1e-2)
+
+ def test_save_load_local(self):
+ self._test_save_load_local()
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(
+ expected_max_diff=1e-2,
+ )
+
+
+@slow
+@require_torch_gpu
+class IFImg2ImgPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_if_img2img(self):
+ pipe = IFImg2ImgPipeline.from_pretrained(
+ "DeepFloyd/IF-I-L-v1.0",
+ variant="fp16",
+ torch_dtype=torch.float16,
+ )
+ pipe.unet.set_attn_processor(AttnAddedKVProcessor())
+ pipe.enable_model_cpu_offload()
+
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(0)).to(torch_device)
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe(
+ prompt="anime turtle",
+ image=image,
+ num_inference_steps=2,
+ generator=generator,
+ output_type="np",
+ )
+ image = output.images[0]
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes < 12 * 10**9
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/if/test_if_img2img.npy"
+ )
+ assert_mean_pixel_difference(image, expected_image)
+
+ pipe.remove_all_hooks()
diff --git a/tests/pipelines/deepfloyd_if/test_if_img2img_superresolution.py b/tests/pipelines/deepfloyd_if/test_if_img2img_superresolution.py
new file mode 100644
index 0000000..d37f7f4
--- /dev/null
+++ b/tests/pipelines/deepfloyd_if/test_if_img2img_superresolution.py
@@ -0,0 +1,136 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import torch
+
+from diffusers import IFImg2ImgSuperResolutionPipeline
+from diffusers.models.attention_processor import AttnAddedKVProcessor
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import floats_tensor, load_numpy, require_torch_gpu, skip_mps, slow, torch_device
+
+from ..pipeline_params import (
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+)
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+from . import IFPipelineTesterMixin
+
+
+@skip_mps
+class IFImg2ImgSuperResolutionPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, unittest.TestCase):
+ pipeline_class = IFImg2ImgSuperResolutionPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"width", "height"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS.union({"original_image"})
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+
+ def get_dummy_components(self):
+ return self._get_superresolution_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ original_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = floats_tensor((1, 3, 16, 16), rng=random.Random(seed)).to(device)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "original_image": original_image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+
+ return inputs
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=1e-3)
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self):
+ # Due to non-determinism in save load of the hf-internal-testing/tiny-random-t5 text encoder
+ super().test_save_load_float16(expected_max_diff=1e-1)
+
+ def test_attention_slicing_forward_pass(self):
+ self._test_attention_slicing_forward_pass(expected_max_diff=1e-2)
+
+ def test_save_load_local(self):
+ self._test_save_load_local()
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(
+ expected_max_diff=1e-2,
+ )
+
+
+@slow
+@require_torch_gpu
+class IFImg2ImgSuperResolutionPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_if_img2img_superresolution(self):
+ pipe = IFImg2ImgSuperResolutionPipeline.from_pretrained(
+ "DeepFloyd/IF-II-L-v1.0",
+ variant="fp16",
+ torch_dtype=torch.float16,
+ )
+ pipe.unet.set_attn_processor(AttnAddedKVProcessor())
+ pipe.enable_model_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ original_image = floats_tensor((1, 3, 256, 256), rng=random.Random(0)).to(torch_device)
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(0)).to(torch_device)
+
+ output = pipe(
+ prompt="anime turtle",
+ image=image,
+ original_image=original_image,
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (256, 256, 3)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes < 12 * 10**9
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/if/test_if_img2img_superresolution_stage_II.npy"
+ )
+ assert_mean_pixel_difference(image, expected_image)
+
+ pipe.remove_all_hooks()
diff --git a/tests/pipelines/deepfloyd_if/test_if_inpainting.py b/tests/pipelines/deepfloyd_if/test_if_inpainting.py
new file mode 100644
index 0000000..85dea36
--- /dev/null
+++ b/tests/pipelines/deepfloyd_if/test_if_inpainting.py
@@ -0,0 +1,134 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import torch
+
+from diffusers import IFInpaintingPipeline
+from diffusers.models.attention_processor import AttnAddedKVProcessor
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import floats_tensor, load_numpy, require_torch_gpu, skip_mps, slow, torch_device
+
+from ..pipeline_params import (
+ TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_INPAINTING_PARAMS,
+)
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+from . import IFPipelineTesterMixin
+
+
+@skip_mps
+class IFInpaintingPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, unittest.TestCase):
+ pipeline_class = IFInpaintingPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS - {"width", "height"}
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+
+ def get_dummy_components(self):
+ return self._get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ mask_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "mask_image": mask_image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+
+ return inputs
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=1e-3)
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self):
+ # Due to non-determinism in save load of the hf-internal-testing/tiny-random-t5 text encoder
+ super().test_save_load_float16(expected_max_diff=1e-1)
+
+ def test_attention_slicing_forward_pass(self):
+ self._test_attention_slicing_forward_pass(expected_max_diff=1e-2)
+
+ def test_save_load_local(self):
+ self._test_save_load_local()
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(
+ expected_max_diff=1e-2,
+ )
+
+
+@slow
+@require_torch_gpu
+class IFInpaintingPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_if_inpainting(self):
+ pipe = IFInpaintingPipeline.from_pretrained(
+ "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
+ )
+ pipe.unet.set_attn_processor(AttnAddedKVProcessor())
+ pipe.enable_model_cpu_offload()
+
+        # Reset CUDA memory stats before running the inpainting pipeline
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(0)).to(torch_device)
+ mask_image = floats_tensor((1, 3, 64, 64), rng=random.Random(1)).to(torch_device)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe(
+ prompt="anime prompts",
+ image=image,
+ mask_image=mask_image,
+ num_inference_steps=2,
+ generator=generator,
+ output_type="np",
+ )
+ image = output.images[0]
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes < 12 * 10**9
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/if/test_if_inpainting.npy"
+ )
+ assert_mean_pixel_difference(image, expected_image)
+ pipe.remove_all_hooks()
diff --git a/tests/pipelines/deepfloyd_if/test_if_inpainting_superresolution.py b/tests/pipelines/deepfloyd_if/test_if_inpainting_superresolution.py
new file mode 100644
index 0000000..f8e782d
--- /dev/null
+++ b/tests/pipelines/deepfloyd_if/test_if_inpainting_superresolution.py
@@ -0,0 +1,143 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import torch
+
+from diffusers import IFInpaintingSuperResolutionPipeline
+from diffusers.models.attention_processor import AttnAddedKVProcessor
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import floats_tensor, load_numpy, require_torch_gpu, skip_mps, slow, torch_device
+
+from ..pipeline_params import (
+ TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_INPAINTING_PARAMS,
+)
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+from . import IFPipelineTesterMixin
+
+
+@skip_mps
+class IFInpaintingSuperResolutionPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, unittest.TestCase):
+ pipeline_class = IFInpaintingSuperResolutionPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS - {"width", "height"}
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS.union({"original_image"})
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+
+ def get_dummy_components(self):
+ return self._get_superresolution_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ image = floats_tensor((1, 3, 16, 16), rng=random.Random(seed)).to(device)
+ original_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ mask_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "original_image": original_image,
+ "mask_image": mask_image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+
+ return inputs
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=1e-3)
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self):
+ # Due to non-determinism in save load of the hf-internal-testing/tiny-random-t5 text encoder
+ super().test_save_load_float16(expected_max_diff=1e-1)
+
+ def test_attention_slicing_forward_pass(self):
+ self._test_attention_slicing_forward_pass(expected_max_diff=1e-2)
+
+ def test_save_load_local(self):
+ self._test_save_load_local()
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(
+ expected_max_diff=1e-2,
+ )
+
+
+@slow
+@require_torch_gpu
+class IFInpaintingSuperResolutionPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_if_inpainting_superresolution(self):
+ pipe = IFInpaintingSuperResolutionPipeline.from_pretrained(
+ "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16
+ )
+ pipe.unet.set_attn_processor(AttnAddedKVProcessor())
+ pipe.enable_model_cpu_offload()
+
+ # Super resolution test
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(0)).to(torch_device)
+ original_image = floats_tensor((1, 3, 256, 256), rng=random.Random(0)).to(torch_device)
+ mask_image = floats_tensor((1, 3, 256, 256), rng=random.Random(1)).to(torch_device)
+
+ output = pipe(
+ prompt="anime turtle",
+ image=image,
+ original_image=original_image,
+ mask_image=mask_image,
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (256, 256, 3)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes < 12 * 10**9
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/if/test_if_inpainting_superresolution_stage_II.npy"
+ )
+ assert_mean_pixel_difference(image, expected_image)
+
+ pipe.remove_all_hooks()
diff --git a/tests/pipelines/deepfloyd_if/test_if_superresolution.py b/tests/pipelines/deepfloyd_if/test_if_superresolution.py
new file mode 100644
index 0000000..ca20517
--- /dev/null
+++ b/tests/pipelines/deepfloyd_if/test_if_superresolution.py
@@ -0,0 +1,130 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import torch
+
+from diffusers import IFSuperResolutionPipeline
+from diffusers.models.attention_processor import AttnAddedKVProcessor
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import floats_tensor, load_numpy, require_torch_gpu, skip_mps, slow, torch_device
+
+from ..pipeline_params import TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS, TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+from . import IFPipelineTesterMixin
+
+
+@skip_mps
+class IFSuperResolutionPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, unittest.TestCase):
+ pipeline_class = IFSuperResolutionPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"width", "height"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+
+ def get_dummy_components(self):
+ return self._get_superresolution_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+
+ return inputs
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=1e-3)
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self):
+ # Due to non-determinism in save load of the hf-internal-testing/tiny-random-t5 text encoder
+ super().test_save_load_float16(expected_max_diff=1e-1)
+
+ def test_attention_slicing_forward_pass(self):
+ self._test_attention_slicing_forward_pass(expected_max_diff=1e-2)
+
+ def test_save_load_local(self):
+ self._test_save_load_local()
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(
+ expected_max_diff=1e-2,
+ )
+
+
+@slow
+@require_torch_gpu
+class IFSuperResolutionPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_if_superresolution(self):
+ pipe = IFSuperResolutionPipeline.from_pretrained(
+ "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16
+ )
+ pipe.unet.set_attn_processor(AttnAddedKVProcessor())
+ pipe.enable_model_cpu_offload()
+
+ # Super resolution test
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(0)).to(torch_device)
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe(
+ prompt="anime turtle",
+ image=image,
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (256, 256, 3)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes < 12 * 10**9
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/if/test_if_superresolution_stage_II.npy"
+ )
+ assert_mean_pixel_difference(image, expected_image)
+
+ pipe.remove_all_hooks()
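
Taken together, the stage I test in `test_if.py` and this stage II test describe the usual IF cascade: generate a 64x64 image with `IFPipeline`, then upscale it to 256x256 with `IFSuperResolutionPipeline`, reusing the prompt embeddings. A sketch of that chaining, assuming stage I accepts `output_type="pt"` so its tensor output can be fed straight into stage II:

```python
import torch

from diffusers import IFPipeline, IFSuperResolutionPipeline

stage_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

stage_2 = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

prompt_embeds, negative_prompt_embeds = stage_1.encode_prompt("anime turtle")
generator = torch.manual_seed(0)

# Stage I: 64x64 image, kept as a torch tensor so it can be passed directly to stage II.
image = stage_1(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    generator=generator,
    output_type="pt",
).images

# Stage II: upscale to 256x256, conditioned on the same prompt embeddings.
image = stage_2(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    generator=generator,
    output_type="pil",
).images[0]
```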
diff --git a/tests/pipelines/dit/__init__.py b/tests/pipelines/dit/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/dit/test_dit.py b/tests/pipelines/dit/test_dit.py
new file mode 100644
index 0000000..1f36776
--- /dev/null
+++ b/tests/pipelines/dit/test_dit.py
@@ -0,0 +1,151 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import AutoencoderKL, DDIMScheduler, DiTPipeline, DPMSolverMultistepScheduler, Transformer2DModel
+from diffusers.utils import is_xformers_available
+from diffusers.utils.testing_utils import enable_full_determinism, load_numpy, nightly, require_torch_gpu, torch_device
+
+from ..pipeline_params import (
+ CLASS_CONDITIONED_IMAGE_GENERATION_BATCH_PARAMS,
+ CLASS_CONDITIONED_IMAGE_GENERATION_PARAMS,
+)
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class DiTPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = DiTPipeline
+ params = CLASS_CONDITIONED_IMAGE_GENERATION_PARAMS
+ required_optional_params = PipelineTesterMixin.required_optional_params - {
+ "latents",
+ "num_images_per_prompt",
+ "callback",
+ "callback_steps",
+ }
+ batch_params = CLASS_CONDITIONED_IMAGE_GENERATION_BATCH_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ transformer = Transformer2DModel(
+ sample_size=16,
+ num_layers=2,
+ patch_size=4,
+ attention_head_dim=8,
+ num_attention_heads=2,
+ in_channels=4,
+ out_channels=8,
+ attention_bias=True,
+ activation_fn="gelu-approximate",
+ num_embeds_ada_norm=1000,
+ norm_type="ada_norm_zero",
+ norm_elementwise_affine=False,
+ )
+ vae = AutoencoderKL()
+ scheduler = DDIMScheduler()
+ components = {"transformer": transformer.eval(), "vae": vae.eval(), "scheduler": scheduler}
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "class_labels": [1],
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_inference(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ self.assertEqual(image.shape, (1, 16, 16, 3))
+ expected_slice = np.array([0.2946, 0.6601, 0.4329, 0.3296, 0.4144, 0.5319, 0.7273, 0.5013, 0.4457])
+ max_diff = np.abs(image_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=1e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=1e-3)
+
+
+@nightly
+@require_torch_gpu
+class DiTPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_dit_256(self):
+ generator = torch.manual_seed(0)
+
+ pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256")
+ pipe.to("cuda")
+
+ words = ["vase", "umbrella", "white shark", "white wolf"]
+ ids = pipe.get_label_ids(words)
+
+ images = pipe(ids, generator=generator, num_inference_steps=40, output_type="np").images
+
+ for word, image in zip(words, images):
+ expected_image = load_numpy(
+ f"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/dit/{word}.npy"
+ )
+ assert np.abs((expected_image - image).max()) < 1e-2
+
+ def test_dit_512(self):
+ pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-512")
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+ pipe.to("cuda")
+
+ words = ["vase", "umbrella"]
+ ids = pipe.get_label_ids(words)
+
+ generator = torch.manual_seed(0)
+ images = pipe(ids, generator=generator, num_inference_steps=25, output_type="np").images
+
+ for word, image in zip(words, images):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ f"/dit/{word}_512.npy"
+ )
+
+ assert np.abs((expected_image - image).max()) < 1e-1
diff --git a/tests/pipelines/i2vgen_xl/__init__.py b/tests/pipelines/i2vgen_xl/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/i2vgen_xl/test_i2vgenxl.py b/tests/pipelines/i2vgen_xl/test_i2vgenxl.py
new file mode 100644
index 0000000..aeda671
--- /dev/null
+++ b/tests/pipelines/i2vgen_xl/test_i2vgenxl.py
@@ -0,0 +1,265 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import (
+ CLIPImageProcessor,
+ CLIPTextConfig,
+ CLIPTextModel,
+ CLIPTokenizer,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ I2VGenXLPipeline,
+)
+from diffusers.models.unets import I2VGenXLUNet
+from diffusers.utils import is_xformers_available, load_image
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ numpy_cosine_similarity_distance,
+ print_tensor_test,
+ require_torch_gpu,
+ skip_mps,
+ slow,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, SDFunctionTesterMixin
+
+
+enable_full_determinism()
+
+
+@skip_mps
+class I2VGenXLPipelineFastTests(SDFunctionTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = I2VGenXLPipeline
+ params = frozenset(["prompt", "negative_prompt", "image"])
+ batch_params = frozenset(["prompt", "negative_prompt", "image", "generator"])
+ # No `output_type`.
+ required_optional_params = frozenset(["num_inference_steps", "generator", "latents", "return_dict"])
+
+ def get_dummy_components(self):
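+ # Tiny UNet/VAE/CLIP stack so the I2VGen-XL fast tests stay lightweight.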
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+
+ torch.manual_seed(0)
+ unet = I2VGenXLUNet(
+ block_out_channels=(4, 8),
+ layers_per_block=1,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock3D", "DownBlock3D"),
+ up_block_types=("UpBlock3D", "CrossAttnUpBlock3D"),
+ cross_attention_dim=4,
+ attention_head_dim=4,
+ num_attention_heads=None,
+ norm_num_groups=2,
+ )
+
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=(8,),
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=32,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=4,
+ intermediate_size=16,
+ layer_norm_eps=1e-05,
+ num_attention_heads=2,
+ num_hidden_layers=2,
+ pad_token_id=1,
+ vocab_size=1000,
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ torch.manual_seed(0)
+ vision_encoder_config = CLIPVisionConfig(
+ hidden_size=4,
+ projection_dim=4,
+ num_hidden_layers=2,
+ num_attention_heads=2,
+ image_size=32,
+ intermediate_size=16,
+ patch_size=1,
+ )
+ image_encoder = CLIPVisionModelWithProjection(vision_encoder_config)
+
+ torch.manual_seed(0)
+ feature_extractor = CLIPImageProcessor(crop_size=32, size=32)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "image_encoder": image_encoder,
+ "tokenizer": tokenizer,
+ "feature_extractor": feature_extractor,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ input_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": input_image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "pt",
+ "num_frames": 4,
+ "width": 32,
+ "height": 32,
+ }
+ return inputs
+
+ def test_text_to_video_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["output_type"] = "np"
+ frames = pipe(**inputs).frames
+
+ image_slice = frames[0][0][-3:, -3:, -1]
+
+ assert frames[0][0].shape == (32, 32, 3)
+ expected_slice = np.array([0.5146, 0.6525, 0.6032, 0.5204, 0.5675, 0.4125, 0.3016, 0.5172, 0.4095])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=0.006)
+
+ def test_sequential_cpu_offload_forward_pass(self):
+ super().test_sequential_cpu_offload_forward_pass(expected_max_diff=0.008)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=0.008)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=0.008)
+
+ @unittest.skip("Deprecated functionality")
+ def test_attention_slicing_forward_pass(self):
+ pass
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(test_mean_pixel_difference=False, expected_max_diff=1e-2)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(batch_size=2, expected_max_diff=0.008)
+
+ def test_model_cpu_offload_forward_pass(self):
+ super().test_model_cpu_offload_forward_pass(expected_max_diff=0.008)
+
+ def test_num_videos_per_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["output_type"] = "np"
+ frames = pipe(**inputs, num_videos_per_prompt=2).frames
+
+ assert frames.shape == (2, 4, 32, 32, 3)
+ assert frames[0][0].shape == (32, 32, 3)
+
+ image_slice = frames[0][0][-3:, -3:, -1]
+ expected_slice = np.array([0.5146, 0.6525, 0.6032, 0.5204, 0.5675, 0.4125, 0.3016, 0.5172, 0.4095])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+
+@slow
+@require_torch_gpu
+class I2VGenXLPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_i2vgen_xl(self):
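+ # Short three-frame run at the default 704x1280 resolution, checked against a reference slice.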
+ pipe = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
+ pipe = pipe.to(torch_device)
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
+ )
+
+ generator = torch.Generator("cpu").manual_seed(0)
+ num_frames = 3
+
+ output = pipe(
+ image=image,
+ prompt="my cat",
+ num_frames=num_frames,
+ generator=generator,
+ num_inference_steps=3,
+ output_type="np",
+ )
+
+ image = output.frames[0]
+ assert image.shape == (num_frames, 704, 1280, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ print_tensor_test(image_slice.flatten())
+ expected_slice = np.array([0.5482, 0.6244, 0.6274, 0.4584, 0.5935, 0.5937, 0.4579, 0.5767, 0.5892])
+ assert numpy_cosine_similarity_distance(image_slice.flatten(), expected_slice.flatten()) < 1e-3
diff --git a/tests/pipelines/ip_adapters/test_ip_adapter_stable_diffusion.py b/tests/pipelines/ip_adapters/test_ip_adapter_stable_diffusion.py
new file mode 100644
index 0000000..6289ee8
--- /dev/null
+++ b/tests/pipelines/ip_adapters/test_ip_adapter_stable_diffusion.py
@@ -0,0 +1,539 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import (
+ CLIPImageProcessor,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import (
+ StableDiffusionImg2ImgPipeline,
+ StableDiffusionInpaintPipeline,
+ StableDiffusionPipeline,
+ StableDiffusionXLImg2ImgPipeline,
+ StableDiffusionXLInpaintPipeline,
+ StableDiffusionXLPipeline,
+)
+from diffusers.image_processor import IPAdapterMaskProcessor
+from diffusers.models.attention_processor import AttnProcessor, AttnProcessor2_0
+from diffusers.utils import load_image
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ is_flaky,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+
+enable_full_determinism()
+
+
+class IPAdapterNightlyTestsMixin(unittest.TestCase):
+ dtype = torch.float16
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_image_encoder(self, repo_id, subfolder):
+ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+ repo_id, subfolder=subfolder, torch_dtype=self.dtype
+ ).to(torch_device)
+ return image_encoder
+
+ def get_image_processor(self, repo_id):
+ image_processor = CLIPImageProcessor.from_pretrained(repo_id)
+ return image_processor
+
+ def get_dummy_inputs(self, for_image_to_image=False, for_inpainting=False, for_sdxl=False, for_masks=False):
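+ # Defaults target text-to-image; the flags swap in image-to-image, inpainting, or masked IP-Adapter inputs.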
+ image = load_image(
+ "https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png"
+ )
+ if for_sdxl:
+ image = image.resize((1024, 1024))
+
+ input_kwargs = {
+ "prompt": "best quality, high quality",
+ "negative_prompt": "monochrome, lowres, bad anatomy, worst quality, low quality",
+ "num_inference_steps": 5,
+ "generator": torch.Generator(device="cpu").manual_seed(33),
+ "ip_adapter_image": image,
+ "output_type": "np",
+ }
+ if for_image_to_image:
+ image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/vermeer.jpg")
+ ip_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/river.png")
+
+ if for_sdxl:
+ image = image.resize((1024, 1024))
+ ip_image = ip_image.resize((1024, 1024))
+
+ input_kwargs.update({"image": image, "ip_adapter_image": ip_image})
+
+ elif for_inpainting:
+ image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/inpaint_image.png")
+ mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/mask.png")
+ ip_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/girl.png")
+
+ if for_sdxl:
+ image = image.resize((1024, 1024))
+ mask = mask.resize((1024, 1024))
+ ip_image = ip_image.resize((1024, 1024))
+
+ input_kwargs.update({"image": image, "mask_image": mask, "ip_adapter_image": ip_image})
+
+ elif for_masks:
+ face_image1 = load_image(
+ "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl1.png"
+ )
+ face_image2 = load_image(
+ "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl2.png"
+ )
+ mask1 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask1.png")
+ mask2 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask2.png")
+ input_kwargs.update(
+ {
+ "ip_adapter_image": [[face_image1], [face_image2]],
+ "cross_attention_kwargs": {"ip_adapter_masks": [mask1, mask2]},
+ }
+ )
+
+ return input_kwargs
+
+
+@slow
+@require_torch_gpu
+class IPAdapterSDIntegrationTests(IPAdapterNightlyTestsMixin):
+ def test_text_to_image(self):
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ pipeline = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, safety_checker=None, torch_dtype=self.dtype
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+
+ inputs = self.get_dummy_inputs()
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array([0.80810547, 0.88183594, 0.9296875, 0.9189453, 0.9848633, 1.0, 0.97021484, 1.0, 1.0])
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-plus_sd15.bin")
+
+ inputs = self.get_dummy_inputs()
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array(
+ [0.30444336, 0.26513672, 0.22436523, 0.2758789, 0.25585938, 0.20751953, 0.25390625, 0.24633789, 0.21923828]
+ )
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ def test_image_to_image(self):
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ pipeline = StableDiffusionImg2ImgPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, safety_checker=None, torch_dtype=self.dtype
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+
+ inputs = self.get_dummy_inputs(for_image_to_image=True)
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array(
+ [0.22167969, 0.21875, 0.21728516, 0.22607422, 0.21948242, 0.23925781, 0.22387695, 0.25268555, 0.2722168]
+ )
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-plus_sd15.bin")
+
+ inputs = self.get_dummy_inputs(for_image_to_image=True)
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array(
+ [0.35913086, 0.265625, 0.26367188, 0.24658203, 0.19750977, 0.39990234, 0.15258789, 0.20336914, 0.5517578]
+ )
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ def test_inpainting(self):
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ pipeline = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, safety_checker=None, torch_dtype=self.dtype
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+
+ inputs = self.get_dummy_inputs(for_inpainting=True)
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array(
+ [0.27148438, 0.24047852, 0.22167969, 0.23217773, 0.21118164, 0.21142578, 0.21875, 0.20751953, 0.20019531]
+ )
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-plus_sd15.bin")
+
+ inputs = self.get_dummy_inputs(for_inpainting=True)
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ def test_text_to_image_model_cpu_offload(self):
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ pipeline = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, safety_checker=None, torch_dtype=self.dtype
+ )
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+ pipeline.to(torch_device)
+
+ inputs = self.get_dummy_inputs()
+ output_without_offload = pipeline(**inputs).images
+
+ pipeline.enable_model_cpu_offload()
+ inputs = self.get_dummy_inputs()
+ output_with_offload = pipeline(**inputs).images
+ max_diff = np.abs(output_with_offload - output_without_offload).max()
+ self.assertLess(max_diff, 1e-3, "CPU offloading should not affect the inference results")
+
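+ # Every offloadable sub-module should be resident on CPU once model offloading is enabled.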
+ offloaded_modules = [
+ v
+ for k, v in pipeline.components.items()
+ if isinstance(v, torch.nn.Module) and k not in pipeline._exclude_from_cpu_offload
+ ]
+ self.assertTrue(
+ all(v.device.type == "cpu" for v in offloaded_modules),
+ f"Not offloaded: {[v for v in offloaded_modules if v.device.type != 'cpu']}",
+ )
+
+ def test_text_to_image_full_face(self):
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ pipeline = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, safety_checker=None, torch_dtype=self.dtype
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.bin")
+ pipeline.set_ip_adapter_scale(0.7)
+
+ inputs = self.get_dummy_inputs()
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+ expected_slice = np.array([0.1704, 0.1296, 0.1272, 0.2212, 0.1514, 0.1479, 0.4172, 0.4263, 0.4360])
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ def test_unload(self):
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ pipeline = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, safety_checker=None, torch_dtype=self.dtype
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
+ pipeline.set_ip_adapter_scale(0.7)
+
+ pipeline.unload_ip_adapter()
+
+ assert getattr(pipeline, "image_encoder") is None
+ assert getattr(pipeline, "feature_extractor") is not None
+ processors = [
+ isinstance(attn_proc, (AttnProcessor, AttnProcessor2_0))
+ for name, attn_proc in pipeline.unet.attn_processors.items()
+ ]
+ assert processors == [True] * len(processors)
+
+ @is_flaky
+ def test_multi(self):
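+ # Load two IP-Adapters with separate scales; the first gets a single reference image, the second a pair.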
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ pipeline = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, safety_checker=None, torch_dtype=self.dtype
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter(
+ "h94/IP-Adapter", subfolder="models", weight_name=["ip-adapter_sd15.bin", "ip-adapter-plus_sd15.bin"]
+ )
+ pipeline.set_ip_adapter_scale([0.7, 0.3])
+
+ inputs = self.get_dummy_inputs()
+ ip_adapter_image = inputs["ip_adapter_image"]
+ inputs["ip_adapter_image"] = [ip_adapter_image, [ip_adapter_image] * 2]
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+ expected_slice = np.array([0.5234, 0.5352, 0.5625, 0.5713, 0.5947, 0.6206, 0.5786, 0.6187, 0.6494])
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+
+@slow
+@require_torch_gpu
+class IPAdapterSDXLIntegrationTests(IPAdapterNightlyTestsMixin):
+ def test_text_to_image_sdxl(self):
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="sdxl_models/image_encoder")
+ feature_extractor = self.get_image_processor("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
+
+ pipeline = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ feature_extractor=feature_extractor,
+ torch_dtype=self.dtype,
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+
+ inputs = self.get_dummy_inputs()
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array(
+ [
+ 0.09630299,
+ 0.09551358,
+ 0.08480701,
+ 0.09070173,
+ 0.09437338,
+ 0.09264627,
+ 0.08883232,
+ 0.09287417,
+ 0.09197289,
+ ]
+ )
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+
+ pipeline = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ feature_extractor=feature_extractor,
+ torch_dtype=self.dtype,
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name="ip-adapter-plus_sdxl_vit-h.bin",
+ )
+
+ inputs = self.get_dummy_inputs()
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array(
+ [0.0576596, 0.05600825, 0.04479006, 0.05288461, 0.05461192, 0.05137569, 0.04867965, 0.05301541, 0.04939842]
+ )
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ def test_image_to_image_sdxl(self):
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="sdxl_models/image_encoder")
+ feature_extractor = self.get_image_processor("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
+
+ pipeline = StableDiffusionXLImg2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ feature_extractor=feature_extractor,
+ torch_dtype=self.dtype,
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+
+ inputs = self.get_dummy_inputs(for_image_to_image=True)
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array(
+ [
+ 0.06513795,
+ 0.07009393,
+ 0.07234055,
+ 0.07426041,
+ 0.07002589,
+ 0.06415862,
+ 0.07827643,
+ 0.07962808,
+ 0.07411247,
+ ]
+ )
+
+ assert np.allclose(image_slice, expected_slice, atol=1e-3)
+
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ feature_extractor = self.get_image_processor("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
+
+ pipeline = StableDiffusionXLImg2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ feature_extractor=feature_extractor,
+ torch_dtype=self.dtype,
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name="ip-adapter-plus_sdxl_vit-h.bin",
+ )
+
+ inputs = self.get_dummy_inputs(for_image_to_image=True)
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array(
+ [
+ 0.07126552,
+ 0.07025367,
+ 0.07348302,
+ 0.07580167,
+ 0.07467338,
+ 0.06918576,
+ 0.07480252,
+ 0.08279955,
+ 0.08547315,
+ ]
+ )
+
+ assert np.allclose(image_slice, expected_slice, atol=1e-3)
+
+ def test_inpainting_sdxl(self):
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="sdxl_models/image_encoder")
+ feature_extractor = self.get_image_processor("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
+
+ pipeline = StableDiffusionXLInpaintPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ feature_extractor=feature_extractor,
+ torch_dtype=self.dtype,
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+
+ inputs = self.get_dummy_inputs(for_inpainting=True)
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array(
+ [0.14181179, 0.1493012, 0.14283323, 0.14602411, 0.14915377, 0.15015268, 0.14725655, 0.15009224, 0.15164584]
+ )
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ feature_extractor = self.get_image_processor("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
+
+ pipeline = StableDiffusionXLInpaintPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ feature_extractor=feature_extractor,
+ torch_dtype=self.dtype,
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter(
+ "h94/IP-Adapter",
+ subfolder="sdxl_models",
+ weight_name="ip-adapter-plus_sdxl_vit-h.bin",
+ )
+
+ inputs = self.get_dummy_inputs(for_inpainting=True)
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+
+ expected_slice = np.array([0.1398, 0.1476, 0.1407, 0.1442, 0.1470, 0.1480, 0.1449, 0.1481, 0.1494])
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ def test_ip_adapter_single_mask(self):
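+ # A single face adapter constrained to one region via an IPAdapterMaskProcessor-preprocessed mask.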
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ pipeline = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ torch_dtype=self.dtype,
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter(
+ "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus-face_sdxl_vit-h.safetensors"
+ )
+ pipeline.set_ip_adapter_scale(0.7)
+
+ inputs = self.get_dummy_inputs(for_masks=True)
+ mask = inputs["cross_attention_kwargs"]["ip_adapter_masks"][0]
+ processor = IPAdapterMaskProcessor()
+ mask = processor.preprocess(mask)
+ inputs["cross_attention_kwargs"]["ip_adapter_masks"] = mask
+ inputs["ip_adapter_image"] = inputs["ip_adapter_image"][0]
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+ expected_slice = np.array(
+ [0.7307304, 0.73450166, 0.73731124, 0.7377061, 0.7318013, 0.73720926, 0.74746597, 0.7409929, 0.74074936]
+ )
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
+
+ def test_ip_adapter_multiple_masks(self):
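+ # Two face adapters, each constrained to its own region through preprocessed attention masks.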
+ image_encoder = self.get_image_encoder(repo_id="h94/IP-Adapter", subfolder="models/image_encoder")
+ pipeline = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ image_encoder=image_encoder,
+ torch_dtype=self.dtype,
+ )
+ pipeline.to(torch_device)
+ pipeline.load_ip_adapter(
+ "h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"] * 2
+ )
+ pipeline.set_ip_adapter_scale([0.7] * 2)
+
+ inputs = self.get_dummy_inputs(for_masks=True)
+ masks = inputs["cross_attention_kwargs"]["ip_adapter_masks"]
+ processor = IPAdapterMaskProcessor()
+ masks = processor.preprocess(masks)
+ inputs["cross_attention_kwargs"]["ip_adapter_masks"] = masks
+ images = pipeline(**inputs).images
+ image_slice = images[0, :3, :3, -1].flatten()
+ expected_slice = np.array(
+ [0.79474676, 0.7977683, 0.8013954, 0.7988008, 0.7970615, 0.8029355, 0.80614823, 0.8050743, 0.80627424]
+ )
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 5e-4
diff --git a/tests/pipelines/kandinsky/__init__.py b/tests/pipelines/kandinsky/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/kandinsky/test_kandinsky.py b/tests/pipelines/kandinsky/test_kandinsky.py
new file mode 100644
index 0000000..48b4520
--- /dev/null
+++ b/tests/pipelines/kandinsky/test_kandinsky.py
@@ -0,0 +1,323 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import XLMRobertaTokenizerFast
+
+from diffusers import DDIMScheduler, KandinskyPipeline, KandinskyPriorPipeline, UNet2DConditionModel, VQModel
+from diffusers.pipelines.kandinsky.text_encoder import MCLIPConfig, MultilingualCLIP
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_numpy,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
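+# Bundles the miniature Kandinsky components and inputs shared by the fast tests in this module and the combined-pipeline tests.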
+class Dummies:
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 32
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = XLMRobertaTokenizerFast.from_pretrained("YiYiXu/tiny-random-mclip-base")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = MCLIPConfig(
+ numDims=self.cross_attention_dim,
+ transformerDimensions=self.text_embedder_hidden_size,
+ hidden_size=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ vocab_size=1005,
+ )
+
+ text_encoder = MultilingualCLIP(config)
+ text_encoder = text_encoder.eval()
+
+ return text_encoder
+
+ @property
+ def dummy_unet(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "in_channels": 4,
+ # Out channels is double in channels because predicts mean and variance
+ "out_channels": 8,
+ "addition_embed_type": "text_image",
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "encoder_hid_dim": self.text_embedder_hidden_size,
+ "encoder_hid_dim_type": "text_image_proj",
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": None,
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 64],
+ "down_block_types": ["DownEncoderBlock2D", "AttnDownEncoderBlock2D"],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": [
+ "AttnUpDecoderBlock2D",
+ "UpDecoderBlock2D",
+ ],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ unet = self.dummy_unet
+ movq = self.dummy_movq
+
+ scheduler = DDIMScheduler(
+ num_train_timesteps=1000,
+ beta_schedule="linear",
+ beta_start=0.00085,
+ beta_end=0.012,
+ clip_sample=False,
+ set_alpha_to_one=False,
+ steps_offset=1,
+ prediction_type="epsilon",
+ thresholding=False,
+ )
+
+ components = {
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image_embeds = floats_tensor((1, self.cross_attention_dim), rng=random.Random(seed)).to(device)
+ negative_image_embeds = floats_tensor((1, self.cross_attention_dim), rng=random.Random(seed + 1)).to(device)
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "image_embeds": image_embeds,
+ "negative_image_embeds": negative_image_embeds,
+ "generator": generator,
+ "height": 64,
+ "width": 64,
+ "guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+
+class KandinskyPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyPipeline
+ params = [
+ "prompt",
+ "image_embeds",
+ "negative_image_embeds",
+ ]
+ batch_params = ["prompt", "negative_prompt", "image_embeds", "negative_image_embeds"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = False
+
+ def get_dummy_components(self):
+ dummy = Dummies()
+ return dummy.get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ dummy = Dummies()
+ return dummy.get_dummy_inputs(device=device, seed=seed)
+
+ def test_kandinsky(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([1.0000, 1.0000, 0.2766, 1.0000, 0.5447, 0.1737, 1.0000, 0.4316, 0.9024])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
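+ # The same inputs through a plain GPU pipeline, model CPU offload, and sequential CPU offload should give near-identical images.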
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+
+@slow
+@require_torch_gpu
+class KandinskyPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinsky_text2img(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinsky/kandinsky_text2img_cat_fp16.npy"
+ )
+
+ pipe_prior = KandinskyPriorPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
+ )
+ pipe_prior.to(torch_device)
+
+ pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+ pipeline = pipeline.to(torch_device)
+ pipeline.set_progress_bar_config(disable=None)
+
+ prompt = "red cat, 4k photo"
+
+ generator = torch.Generator(device="cuda").manual_seed(0)
+ image_emb, zero_image_emb = pipe_prior(
+ prompt,
+ generator=generator,
+ num_inference_steps=5,
+ negative_prompt="",
+ ).to_tuple()
+
+ generator = torch.Generator(device="cuda").manual_seed(0)
+ output = pipeline(
+ prompt,
+ image_embeds=image_emb,
+ negative_image_embeds=zero_image_emb,
+ generator=generator,
+ num_inference_steps=100,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
diff --git a/tests/pipelines/kandinsky/test_kandinsky_combined.py b/tests/pipelines/kandinsky/test_kandinsky_combined.py
new file mode 100644
index 0000000..480e129
--- /dev/null
+++ b/tests/pipelines/kandinsky/test_kandinsky_combined.py
@@ -0,0 +1,361 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+
+from diffusers import KandinskyCombinedPipeline, KandinskyImg2ImgCombinedPipeline, KandinskyInpaintCombinedPipeline
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, torch_device
+
+from ..test_pipelines_common import PipelineTesterMixin
+from .test_kandinsky import Dummies
+from .test_kandinsky_img2img import Dummies as Img2ImgDummies
+from .test_kandinsky_inpaint import Dummies as InpaintDummies
+from .test_kandinsky_prior import Dummies as PriorDummies
+
+
+enable_full_determinism()
+
+
+class KandinskyPipelineCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyCombinedPipeline
+ params = [
+ "prompt",
+ ]
+ batch_params = ["prompt", "negative_prompt"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = True
+
+ def get_dummy_components(self):
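+ # Combined pipelines take the prior components under a "prior_" key prefix alongside the decoder components.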
+ dummy = Dummies()
+ prior_dummy = PriorDummies()
+ components = dummy.get_dummy_components()
+
+ components.update({f"prior_{k}": v for k, v in prior_dummy.get_dummy_components().items()})
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ prior_dummy = PriorDummies()
+ inputs = prior_dummy.get_dummy_inputs(device=device, seed=seed)
+ inputs.update(
+ {
+ "height": 64,
+ "width": 64,
+ }
+ )
+ return inputs
+
+ def test_kandinsky(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.0000, 0.0000, 0.6777, 0.1363, 0.3624, 0.7868, 0.3869, 0.3395, 0.5068])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=2e-1)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=5e-4)
+
+
+class KandinskyPipelineImg2ImgCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyImg2ImgCombinedPipeline
+ params = ["prompt", "image"]
+ batch_params = ["prompt", "negative_prompt", "image"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = False
+
+ def get_dummy_components(self):
+ dummy = Img2ImgDummies()
+ prior_dummy = PriorDummies()
+ components = dummy.get_dummy_components()
+
+ components.update({f"prior_{k}": v for k, v in prior_dummy.get_dummy_components().items()})
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ prior_dummy = PriorDummies()
+ dummy = Img2ImgDummies()
+ inputs = prior_dummy.get_dummy_inputs(device=device, seed=seed)
+ inputs.update(dummy.get_dummy_inputs(device=device, seed=seed))
+ inputs.pop("image_embeds")
+ inputs.pop("negative_image_embeds")
+ return inputs
+
+ def test_kandinsky(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.4260, 0.3596, 0.4571, 0.3890, 0.4087, 0.5137, 0.4819, 0.4116, 0.5053])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=5e-4)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=5e-4)
+
+
+class KandinskyPipelineInpaintCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyInpaintCombinedPipeline
+ params = ["prompt", "image", "mask_image"]
+ batch_params = ["prompt", "negative_prompt", "image", "mask_image"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = False
+
+ def get_dummy_components(self):
+ dummy = InpaintDummies()
+ prior_dummy = PriorDummies()
+ components = dummy.get_dummy_components()
+
+ components.update({f"prior_{k}": v for k, v in prior_dummy.get_dummy_components().items()})
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ prior_dummy = PriorDummies()
+ dummy = InpaintDummies()
+ inputs = prior_dummy.get_dummy_inputs(device=device, seed=seed)
+ inputs.update(dummy.get_dummy_inputs(device=device, seed=seed))
+ inputs.pop("image_embeds")
+ inputs.pop("negative_image_embeds")
+ return inputs
+
+ def test_kandinsky(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.0477, 0.0808, 0.2972, 0.2705, 0.3620, 0.6247, 0.4464, 0.2870, 0.3530])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=5e-4)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=5e-4)
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=5e-3)
diff --git a/tests/pipelines/kandinsky/test_kandinsky_img2img.py b/tests/pipelines/kandinsky/test_kandinsky_img2img.py
new file mode 100644
index 0000000..484d9d9
--- /dev/null
+++ b/tests/pipelines/kandinsky/test_kandinsky_img2img.py
@@ -0,0 +1,417 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import XLMRobertaTokenizerFast
+
+from diffusers import (
+ DDIMScheduler,
+ DDPMScheduler,
+ KandinskyImg2ImgPipeline,
+ KandinskyPriorPipeline,
+ UNet2DConditionModel,
+ VQModel,
+)
+from diffusers.pipelines.kandinsky.text_encoder import MCLIPConfig, MultilingualCLIP
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class Dummies:
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 32
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = XLMRobertaTokenizerFast.from_pretrained("YiYiXu/tiny-random-mclip-base")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = MCLIPConfig(
+ numDims=self.cross_attention_dim,
+ transformerDimensions=self.text_embedder_hidden_size,
+ hidden_size=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ vocab_size=1005,
+ )
+
+ text_encoder = MultilingualCLIP(config)
+ text_encoder = text_encoder.eval()
+
+ return text_encoder
+
+ @property
+ def dummy_unet(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "in_channels": 4,
+ # Out channels is double in channels because predicts mean and variance
+ "out_channels": 8,
+ "addition_embed_type": "text_image",
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "encoder_hid_dim": self.text_embedder_hidden_size,
+ "encoder_hid_dim_type": "text_image_proj",
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": None,
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 64],
+ "down_block_types": ["DownEncoderBlock2D", "AttnDownEncoderBlock2D"],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": [
+ "AttnUpDecoderBlock2D",
+ "UpDecoderBlock2D",
+ ],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ unet = self.dummy_unet
+ movq = self.dummy_movq
+
+ ddim_config = {
+ "num_train_timesteps": 1000,
+ "beta_schedule": "linear",
+ "beta_start": 0.00085,
+ "beta_end": 0.012,
+ "clip_sample": False,
+ "set_alpha_to_one": False,
+ "steps_offset": 0,
+ "prediction_type": "epsilon",
+ "thresholding": False,
+ }
+
+ scheduler = DDIMScheduler(**ddim_config)
+
+ components = {
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image_embeds = floats_tensor((1, self.cross_attention_dim), rng=random.Random(seed)).to(device)
+ negative_image_embeds = floats_tensor((1, self.cross_attention_dim), rng=random.Random(seed + 1)).to(device)
+ # create init_image
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((256, 256))
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "image": init_image,
+ "image_embeds": image_embeds,
+ "negative_image_embeds": negative_image_embeds,
+ "generator": generator,
+ "height": 64,
+ "width": 64,
+ "num_inference_steps": 10,
+ "guidance_scale": 7.0,
+ "strength": 0.2,
+ "output_type": "np",
+ }
+ return inputs
+
+
+class KandinskyImg2ImgPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyImg2ImgPipeline
+ params = ["prompt", "image_embeds", "negative_image_embeds", "image"]
+ batch_params = [
+ "prompt",
+ "negative_prompt",
+ "image_embeds",
+ "negative_image_embeds",
+ "image",
+ ]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "strength",
+ "guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = False
+
+ def get_dummy_components(self):
+ dummies = Dummies()
+ return dummies.get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ dummies = Dummies()
+ return dummies.get_dummy_inputs(device=device, seed=seed)
+
+ def test_kandinsky_img2img(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.5816, 0.5872, 0.4634, 0.5982, 0.4767, 0.4710, 0.4669, 0.4717, 0.4966])
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=5e-4)
+
+
+@slow
+@require_torch_gpu
+class KandinskyImg2ImgPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinsky_img2img(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinsky/kandinsky_img2img_frog.npy"
+ )
+
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
+ )
+ prompt = "A red cartoon frog, 4k"
+
+ pipe_prior = KandinskyPriorPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
+ )
+ pipe_prior.to(torch_device)
+
+ pipeline = KandinskyImg2ImgPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
+ )
+ pipeline = pipeline.to(torch_device)
+
+ pipeline.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image_emb, zero_image_emb = pipe_prior(
+ prompt,
+ generator=generator,
+ num_inference_steps=5,
+ negative_prompt="",
+ ).to_tuple()
+
+ output = pipeline(
+ prompt,
+ image=init_image,
+ image_embeds=image_emb,
+ negative_image_embeds=zero_image_emb,
+ generator=generator,
+ num_inference_steps=100,
+ height=768,
+ width=768,
+ strength=0.2,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
+
+
+@nightly
+@require_torch_gpu
+class KandinskyImg2ImgPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinsky_img2img_ddpm(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinsky/kandinsky_img2img_ddpm_frog.npy"
+ )
+
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/frog.png"
+ )
+ prompt = "A red cartoon frog, 4k"
+
+ pipe_prior = KandinskyPriorPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
+ )
+ pipe_prior.to(torch_device)
+
+ scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
+ pipeline = KandinskyImg2ImgPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16
+ )
+ pipeline = pipeline.to(torch_device)
+
+ pipeline.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image_emb, zero_image_emb = pipe_prior(
+ prompt,
+ generator=generator,
+ num_inference_steps=5,
+ negative_prompt="",
+ ).to_tuple()
+
+ output = pipeline(
+ prompt,
+ image=init_image,
+ image_embeds=image_emb,
+ negative_image_embeds=zero_image_emb,
+ generator=generator,
+ num_inference_steps=100,
+ height=768,
+ width=768,
+ strength=0.2,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
diff --git a/tests/pipelines/kandinsky/test_kandinsky_inpaint.py b/tests/pipelines/kandinsky/test_kandinsky_inpaint.py
new file mode 100644
index 0000000..15b8c21
--- /dev/null
+++ b/tests/pipelines/kandinsky/test_kandinsky_inpaint.py
@@ -0,0 +1,356 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import XLMRobertaTokenizerFast
+
+from diffusers import DDIMScheduler, KandinskyInpaintPipeline, KandinskyPriorPipeline, UNet2DConditionModel, VQModel
+from diffusers.pipelines.kandinsky.text_encoder import MCLIPConfig, MultilingualCLIP
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class Dummies:
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 32
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = XLMRobertaTokenizerFast.from_pretrained("YiYiXu/tiny-random-mclip-base")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = MCLIPConfig(
+ numDims=self.cross_attention_dim,
+ transformerDimensions=self.text_embedder_hidden_size,
+ hidden_size=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ vocab_size=1005,
+ )
+
+ text_encoder = MultilingualCLIP(config)
+ text_encoder = text_encoder.eval()
+
+ return text_encoder
+
+ @property
+ def dummy_unet(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
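+ # 9 input channels: 4 noisy latents + 4 masked-image latents + 1 mask channel (assumed inpainting layout)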
+ "in_channels": 9,
+ # Out channels is double the latent channels because the model predicts both mean and variance
+ "out_channels": 8,
+ "addition_embed_type": "text_image",
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "encoder_hid_dim": self.text_embedder_hidden_size,
+ "encoder_hid_dim_type": "text_image_proj",
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": None,
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 64],
+ "down_block_types": ["DownEncoderBlock2D", "AttnDownEncoderBlock2D"],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": [
+ "AttnUpDecoderBlock2D",
+ "UpDecoderBlock2D",
+ ],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ unet = self.dummy_unet
+ movq = self.dummy_movq
+
+ scheduler = DDIMScheduler(
+ num_train_timesteps=1000,
+ beta_schedule="linear",
+ beta_start=0.00085,
+ beta_end=0.012,
+ clip_sample=False,
+ set_alpha_to_one=False,
+ steps_offset=1,
+ prediction_type="epsilon",
+ thresholding=False,
+ )
+
+ components = {
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image_embeds = floats_tensor((1, self.cross_attention_dim), rng=random.Random(seed)).to(device)
+ negative_image_embeds = floats_tensor((1, self.cross_attention_dim), rng=random.Random(seed + 1)).to(device)
+ # create init_image
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((256, 256))
+ # create mask
+ mask = np.zeros((64, 64), dtype=np.float32)
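+ # mark the top-left 32x32 quadrant of the 64x64 mask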
+ mask[:32, :32] = 1
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "image": init_image,
+ "mask_image": mask,
+ "image_embeds": image_embeds,
+ "negative_image_embeds": negative_image_embeds,
+ "generator": generator,
+ "height": 64,
+ "width": 64,
+ "num_inference_steps": 2,
+ "guidance_scale": 4.0,
+ "output_type": "np",
+ }
+ return inputs
+
+
+class KandinskyInpaintPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyInpaintPipeline
+ params = ["prompt", "image_embeds", "negative_image_embeds", "image", "mask_image"]
+ batch_params = [
+ "prompt",
+ "negative_prompt",
+ "image_embeds",
+ "negative_image_embeds",
+ "image",
+ "mask_image",
+ ]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = False
+
+ def get_dummy_components(self):
+ dummies = Dummies()
+ return dummies.get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ dummies = Dummies()
+ return dummies.get_dummy_inputs(device=device, seed=seed)
+
+ def test_kandinsky_inpaint(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.8222, 0.8896, 0.4373, 0.8088, 0.4905, 0.2609, 0.6816, 0.4291, 0.5129])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ @require_torch_gpu
+ def test_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
+
+
+@nightly
+@require_torch_gpu
+class KandinskyInpaintPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinsky_inpaint(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinsky/kandinsky_inpaint_cat_with_hat_fp16.npy"
+ )
+
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
+ )
+ mask = np.zeros((768, 768), dtype=np.float32)
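+ # mark a horizontal band across the top-center of the 768x768 mask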
+ mask[:250, 250:-250] = 1
+
+ prompt = "a hat"
+
+ pipe_prior = KandinskyPriorPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
+ )
+ pipe_prior.to(torch_device)
+
+ pipeline = KandinskyInpaintPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16
+ )
+ pipeline = pipeline.to(torch_device)
+ pipeline.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image_emb, zero_image_emb = pipe_prior(
+ prompt,
+ generator=generator,
+ num_inference_steps=5,
+ negative_prompt="",
+ ).to_tuple()
+
+ output = pipeline(
+ prompt,
+ image=init_image,
+ mask_image=mask,
+ image_embeds=image_emb,
+ negative_image_embeds=zero_image_emb,
+ generator=generator,
+ num_inference_steps=100,
+ height=768,
+ width=768,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
diff --git a/tests/pipelines/kandinsky/test_kandinsky_prior.py b/tests/pipelines/kandinsky/test_kandinsky_prior.py
new file mode 100644
index 0000000..8e5456b
--- /dev/null
+++ b/tests/pipelines/kandinsky/test_kandinsky_prior.py
@@ -0,0 +1,237 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+from torch import nn
+from transformers import (
+ CLIPImageProcessor,
+ CLIPTextConfig,
+ CLIPTextModelWithProjection,
+ CLIPTokenizer,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import KandinskyPriorPipeline, PriorTransformer, UnCLIPScheduler
+from diffusers.utils.testing_utils import enable_full_determinism, skip_mps, torch_device
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class Dummies:
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 100
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=self.text_embedder_hidden_size,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModelWithProjection(config)
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "num_attention_heads": 2,
+ "attention_head_dim": 12,
+ "embedding_dim": self.text_embedder_hidden_size,
+ "num_layers": 1,
+ }
+
+ model = PriorTransformer(**model_kwargs)
+ # clip_std and clip_mean are initialized to 0, so PriorTransformer.post_process_latents would always return 0; set clip_std to 1 so it does not
+ model.clip_std = nn.Parameter(torch.ones(model.clip_std.shape))
+ return model
+
+ @property
+ def dummy_image_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPVisionConfig(
+ hidden_size=self.text_embedder_hidden_size,
+ image_size=224,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ num_attention_heads=4,
+ num_channels=3,
+ num_hidden_layers=5,
+ patch_size=14,
+ )
+
+ model = CLIPVisionModelWithProjection(config)
+ return model
+
+ @property
+ def dummy_image_processor(self):
+ image_processor = CLIPImageProcessor(
+ crop_size=224,
+ do_center_crop=True,
+ do_normalize=True,
+ do_resize=True,
+ image_mean=[0.48145466, 0.4578275, 0.40821073],
+ image_std=[0.26862954, 0.26130258, 0.27577711],
+ resample=3,
+ size=224,
+ )
+
+ return image_processor
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+ image_encoder = self.dummy_image_encoder
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ image_processor = self.dummy_image_processor
+
+ scheduler = UnCLIPScheduler(
+ variance_type="fixed_small_log",
+ prediction_type="sample",
+ num_train_timesteps=1000,
+ clip_sample=True,
+ clip_sample_range=10.0,
+ )
+
+ components = {
+ "prior": prior,
+ "image_encoder": image_encoder,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "scheduler": scheduler,
+ "image_processor": image_processor,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "generator": generator,
+ "guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+
+class KandinskyPriorPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyPriorPipeline
+ params = ["prompt"]
+ batch_params = ["prompt", "negative_prompt"]
+ required_optional_params = [
+ "num_images_per_prompt",
+ "generator",
+ "num_inference_steps",
+ "latents",
+ "negative_prompt",
+ "guidance_scale",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+
+ def get_dummy_components(self):
+ dummy = Dummies()
+ return dummy.get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ dummy = Dummies()
+ return dummy.get_dummy_inputs(device=device, seed=seed)
+
+ def test_kandinsky_prior(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.image_embeds
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -10:]
+ image_from_tuple_slice = image_from_tuple[0, -10:]
+
+ assert image.shape == (1, 32)
+
+ expected_slice = np.array(
+ [-0.0532, 1.7120, 0.3656, -1.0852, -0.8946, -1.1756, 0.4348, 0.2482, 0.5146, -0.1156]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ @skip_mps
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+ test_mean_pixel_difference = False
+
+ self._test_attention_slicing_forward_pass(
+ test_max_difference=test_max_difference,
+ test_mean_pixel_difference=test_mean_pixel_difference,
+ )
diff --git a/tests/pipelines/kandinsky2_2/__init__.py b/tests/pipelines/kandinsky2_2/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/kandinsky2_2/test_kandinsky.py b/tests/pipelines/kandinsky2_2/test_kandinsky.py
new file mode 100644
index 0000000..91b54d8
--- /dev/null
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky.py
@@ -0,0 +1,272 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import DDIMScheduler, KandinskyV22Pipeline, KandinskyV22PriorPipeline, UNet2DConditionModel, VQModel
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_numpy,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class Dummies:
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 32
+
+ @property
+ def dummy_unet(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "in_channels": 4,
+ # Out channels is double the in channels because the model predicts both mean and variance
+ "out_channels": 8,
+ "addition_embed_type": "image",
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "encoder_hid_dim": self.text_embedder_hidden_size,
+ "encoder_hid_dim_type": "image_proj",
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": None,
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 64],
+ "down_block_types": ["DownEncoderBlock2D", "AttnDownEncoderBlock2D"],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": [
+ "AttnUpDecoderBlock2D",
+ "UpDecoderBlock2D",
+ ],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ unet = self.dummy_unet
+ movq = self.dummy_movq
+
+ scheduler = DDIMScheduler(
+ num_train_timesteps=1000,
+ beta_schedule="linear",
+ beta_start=0.00085,
+ beta_end=0.012,
+ clip_sample=False,
+ set_alpha_to_one=False,
+ steps_offset=1,
+ prediction_type="epsilon",
+ thresholding=False,
+ )
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed)).to(device)
+ negative_image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed + 1)).to(
+ device
+ )
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image_embeds": image_embeds,
+ "negative_image_embeds": negative_image_embeds,
+ "generator": generator,
+ "height": 64,
+ "width": 64,
+ "guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+
+class KandinskyV22PipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22Pipeline
+ params = [
+ "image_embeds",
+ "negative_image_embeds",
+ ]
+ batch_params = ["image_embeds", "negative_image_embeds"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ callback_cfg_params = ["image_embeds"]
+ test_xformers_attention = False
+
+ def get_dummy_inputs(self, device, seed=0):
+ dummies = Dummies()
+ return dummies.get_dummy_inputs(device=device, seed=seed)
+
+ def get_dummy_components(self):
+ dummies = Dummies()
+ return dummies.get_dummy_components()
+
+ def test_kandinsky(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.3420, 0.9505, 0.3919, 1.0000, 0.5188, 0.3109, 0.6139, 0.5624, 0.6811])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=1e-1)
+
+
+@slow
+@require_torch_gpu
+class KandinskyV22PipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinsky_text2img(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinskyv22/kandinskyv22_text2img_cat_fp16.npy"
+ )
+
+ pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
+ )
+ pipe_prior.to(torch_device)
+
+ pipeline = KandinskyV22Pipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
+ )
+ pipeline = pipeline.to(torch_device)
+ pipeline.set_progress_bar_config(disable=None)
+
+ prompt = "red cat, 4k photo"
+
+ generator = torch.Generator(device="cuda").manual_seed(0)
+ image_emb, zero_image_emb = pipe_prior(
+ prompt,
+ generator=generator,
+ num_inference_steps=5,
+ negative_prompt="",
+ ).to_tuple()
+
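+ # use a freshly seeded generator so the decoder run is reproducible independently of the prior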
+ generator = torch.Generator(device="cuda").manual_seed(0)
+ output = pipeline(
+ image_embeds=image_emb,
+ negative_image_embeds=zero_image_emb,
+ generator=generator,
+ num_inference_steps=100,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
diff --git a/tests/pipelines/kandinsky2_2/test_kandinsky_combined.py b/tests/pipelines/kandinsky2_2/test_kandinsky_combined.py
new file mode 100644
index 0000000..40bb3b0
--- /dev/null
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky_combined.py
@@ -0,0 +1,406 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+
+from diffusers import (
+ KandinskyV22CombinedPipeline,
+ KandinskyV22Img2ImgCombinedPipeline,
+ KandinskyV22InpaintCombinedPipeline,
+)
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, torch_device
+
+from ..test_pipelines_common import PipelineTesterMixin
+from .test_kandinsky import Dummies
+from .test_kandinsky_img2img import Dummies as Img2ImgDummies
+from .test_kandinsky_inpaint import Dummies as InpaintDummies
+from .test_kandinsky_prior import Dummies as PriorDummies
+
+
+enable_full_determinism()
+
+
+class KandinskyV22PipelineCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22CombinedPipeline
+ params = [
+ "prompt",
+ ]
+ batch_params = ["prompt", "negative_prompt"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = True
+ callback_cfg_params = ["image_embeds"]
+
+ def get_dummy_components(self):
+ dummy = Dummies()
+ prior_dummy = PriorDummies()
+ components = dummy.get_dummy_components()
+
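+ # the combined pipeline takes the prior sub-components under a "prior_" prefix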
+ components.update({f"prior_{k}": v for k, v in prior_dummy.get_dummy_components().items()})
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ prior_dummy = PriorDummies()
+ inputs = prior_dummy.get_dummy_inputs(device=device, seed=seed)
+ inputs.update(
+ {
+ "height": 64,
+ "width": 64,
+ }
+ )
+ return inputs
+
+ def test_kandinsky(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.3013, 0.0471, 0.5176, 0.1817, 0.2566, 0.7076, 0.6712, 0.4421, 0.7503])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=5e-4)
+
+ def test_model_cpu_offload_forward_pass(self):
+ super().test_model_cpu_offload_forward_pass(expected_max_diff=5e-4)
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=5e-3)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=5e-3)
+
+ def test_callback_inputs(self):
+ pass
+
+ def test_callback_cfg(self):
+ pass
+
+
+class KandinskyV22PipelineImg2ImgCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22Img2ImgCombinedPipeline
+ params = ["prompt", "image"]
+ batch_params = ["prompt", "negative_prompt", "image"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = False
+ callback_cfg_params = ["image_embeds"]
+
+ def get_dummy_components(self):
+ dummy = Img2ImgDummies()
+ prior_dummy = PriorDummies()
+ components = dummy.get_dummy_components()
+
+ components.update({f"prior_{k}": v for k, v in prior_dummy.get_dummy_components().items()})
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ prior_dummy = PriorDummies()
+ dummy = Img2ImgDummies()
+ inputs = prior_dummy.get_dummy_inputs(device=device, seed=seed)
+ inputs.update(dummy.get_dummy_inputs(device=device, seed=seed))
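+ # the combined pipeline computes the image embeddings with its own prior, so they are not passed in directly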
+ inputs.pop("image_embeds")
+ inputs.pop("negative_image_embeds")
+ return inputs
+
+ def test_kandinsky(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.4353, 0.4710, 0.5128, 0.4806, 0.5054, 0.5348, 0.5224, 0.4603, 0.5025])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=2e-1)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=5e-4)
+
+ def test_model_cpu_offload_forward_pass(self):
+ super().test_model_cpu_offload_forward_pass(expected_max_diff=5e-4)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=5e-4)
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=5e-3)
+
+ def test_callback_inputs(self):
+ pass
+
+ def test_callback_cfg(self):
+ pass
+
+
+class KandinskyV22PipelineInpaintCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22InpaintCombinedPipeline
+ params = ["prompt", "image", "mask_image"]
+ batch_params = ["prompt", "negative_prompt", "image", "mask_image"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = False
+
+ def get_dummy_components(self):
+ dummy = InpaintDummies()
+ prior_dummy = PriorDummies()
+ components = dummy.get_dummy_components()
+
+ components.update({f"prior_{k}": v for k, v in prior_dummy.get_dummy_components().items()})
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ prior_dummy = PriorDummies()
+ dummy = InpaintDummies()
+ inputs = prior_dummy.get_dummy_inputs(device=device, seed=seed)
+ inputs.update(dummy.get_dummy_inputs(device=device, seed=seed))
+ inputs.pop("image_embeds")
+ inputs.pop("negative_image_embeds")
+ return inputs
+
+ def test_kandinsky(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.5039, 0.4926, 0.4898, 0.4978, 0.4838, 0.4942, 0.4738, 0.4702, 0.4816])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=5e-4)
+
+ def test_model_cpu_offload_forward_pass(self):
+ super().test_model_cpu_offload_forward_pass(expected_max_diff=5e-4)
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=5e-3)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=5e-4)
+
+ def test_sequential_cpu_offload_forward_pass(self):
+ super().test_sequential_cpu_offload_forward_pass(expected_max_diff=5e-4)
+
+ def test_callback_inputs(self):
+ pass
+
+ def test_callback_cfg(self):
+ pass
diff --git a/tests/pipelines/kandinsky2_2/test_kandinsky_controlnet.py b/tests/pipelines/kandinsky2_2/test_kandinsky_controlnet.py
new file mode 100644
index 0000000..7deee83
--- /dev/null
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky_controlnet.py
@@ -0,0 +1,285 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import (
+ DDIMScheduler,
+ KandinskyV22ControlnetPipeline,
+ KandinskyV22PriorPipeline,
+ UNet2DConditionModel,
+ VQModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class KandinskyV22ControlnetPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22ControlnetPipeline
+ params = ["image_embeds", "negative_image_embeds", "hint"]
+ batch_params = ["image_embeds", "negative_image_embeds", "hint"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = False
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 100
+
+ @property
+ def dummy_unet(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "in_channels": 8,
+ # Out channels is double the latent channels because the model predicts both mean and variance
+ "out_channels": 8,
+ "addition_embed_type": "image_hint",
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "encoder_hid_dim": self.text_embedder_hidden_size,
+ "encoder_hid_dim_type": "image_proj",
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": None,
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 32, 64, 64],
+ "down_block_types": [
+ "DownEncoderBlock2D",
+ "DownEncoderBlock2D",
+ "DownEncoderBlock2D",
+ "AttnDownEncoderBlock2D",
+ ],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": ["AttnUpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D"],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ unet = self.dummy_unet
+ movq = self.dummy_movq
+
+ scheduler = DDIMScheduler(
+ num_train_timesteps=1000,
+ beta_schedule="linear",
+ beta_start=0.00085,
+ beta_end=0.012,
+ clip_sample=False,
+ set_alpha_to_one=False,
+ steps_offset=1,
+ prediction_type="epsilon",
+ thresholding=False,
+ )
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed)).to(device)
+ negative_image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed + 1)).to(
+ device
+ )
+
+ # create hint
+ hint = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image_embeds": image_embeds,
+ "negative_image_embeds": negative_image_embeds,
+ "hint": hint,
+ "generator": generator,
+ "height": 64,
+ "width": 64,
+ "guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_kandinsky_controlnet(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array(
+ [0.6959826, 0.868279, 0.7558092, 0.68769467, 0.85805804, 0.65977496, 0.44885302, 0.5959111, 0.4251595]
+ )
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=1e-1)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=5e-4)
+
+
+@nightly
+@require_torch_gpu
+class KandinskyV22ControlnetPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinsky_controlnet(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinskyv22/kandinskyv22_controlnet_robotcat_fp16.npy"
+ )
+
+ hint = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinskyv22/hint_image_cat.png"
+ )
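+ # convert the depth hint image to a float tensor in [0, 1] with shape (1, 3, H, W)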
+ hint = torch.from_numpy(np.array(hint)).float() / 255.0
+ hint = hint.permute(2, 0, 1).unsqueeze(0)
+
+ pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
+ )
+ pipe_prior.to(torch_device)
+
+ pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
+ )
+ pipeline = pipeline.to(torch_device)
+ pipeline.set_progress_bar_config(disable=None)
+
+ prompt = "A robot, 4k photo"
+
+ generator = torch.Generator(device="cuda").manual_seed(0)
+ image_emb, zero_image_emb = pipe_prior(
+ prompt,
+ generator=generator,
+ num_inference_steps=5,
+ negative_prompt="",
+ ).to_tuple()
+
+ generator = torch.Generator(device="cuda").manual_seed(0)
+ output = pipeline(
+ image_embeds=image_emb,
+ negative_image_embeds=zero_image_emb,
+ hint=hint,
+ generator=generator,
+ num_inference_steps=100,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
diff --git a/tests/pipelines/kandinsky2_2/test_kandinsky_controlnet_img2img.py b/tests/pipelines/kandinsky2_2/test_kandinsky_controlnet_img2img.py
new file mode 100644
index 0000000..c7d6af9
--- /dev/null
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky_controlnet_img2img.py
@@ -0,0 +1,303 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+
+from diffusers import (
+ DDIMScheduler,
+ KandinskyV22ControlnetImg2ImgPipeline,
+ KandinskyV22PriorEmb2EmbPipeline,
+ UNet2DConditionModel,
+ VQModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class KandinskyV22ControlnetImg2ImgPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22ControlnetImg2ImgPipeline
+ params = ["image_embeds", "negative_image_embeds", "image", "hint"]
+ batch_params = ["image_embeds", "negative_image_embeds", "image", "hint"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "strength",
+ "guidance_scale",
+ "num_inference_steps",
+ "return_dict",
+ "num_images_per_prompt",
+ "output_type",
+ ]
+ test_xformers_attention = False
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 100
+
+ @property
+ def dummy_unet(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "in_channels": 8,
+ # Out channels is double the latent channels because the model predicts both mean and variance
+ "out_channels": 8,
+ "addition_embed_type": "image_hint",
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "encoder_hid_dim": self.text_embedder_hidden_size,
+ "encoder_hid_dim_type": "image_proj",
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": None,
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 32, 64, 64],
+ "down_block_types": [
+ "DownEncoderBlock2D",
+ "DownEncoderBlock2D",
+ "DownEncoderBlock2D",
+ "AttnDownEncoderBlock2D",
+ ],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": ["AttnUpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D"],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ unet = self.dummy_unet
+ movq = self.dummy_movq
+
+ ddim_config = {
+ "num_train_timesteps": 1000,
+ "beta_schedule": "linear",
+ "beta_start": 0.00085,
+ "beta_end": 0.012,
+ "clip_sample": False,
+ "set_alpha_to_one": False,
+ "steps_offset": 0,
+ "prediction_type": "epsilon",
+ "thresholding": False,
+ }
+
+ scheduler = DDIMScheduler(**ddim_config)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed)).to(device)
+ negative_image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed + 1)).to(
+ device
+ )
+ # create init_image
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((256, 256))
+ # create hint
+ hint = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image": init_image,
+ "image_embeds": image_embeds,
+ "negative_image_embeds": negative_image_embeds,
+ "hint": hint,
+ "generator": generator,
+ "height": 64,
+ "width": 64,
+ "num_inference_steps": 10,
+ "guidance_scale": 7.0,
+ "strength": 0.2,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_kandinsky_controlnet_img2img(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array(
+ [0.54985034, 0.55509365, 0.52561504, 0.5570494, 0.5593818, 0.5263979, 0.50285643, 0.5069846, 0.51196736]
+ )
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1.75e-3)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=2e-1)
+
+
+@nightly
+@require_torch_gpu
+class KandinskyV22ControlnetImg2ImgPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinsky_controlnet_img2img(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinskyv22/kandinskyv22_controlnet_img2img_robotcat_fp16.npy"
+ )
+
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
+ )
+ init_image = init_image.resize((512, 512))
+
+ hint = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinskyv22/hint_image_cat.png"
+ )
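+ # normalize the hint to [0, 1] and rearrange it to NCHW with a batch dimension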
+ hint = torch.from_numpy(np.array(hint)).float() / 255.0
+ hint = hint.permute(2, 0, 1).unsqueeze(0)
+
+ prompt = "A robot, 4k photo"
+
+ pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
+ )
+ pipe_prior.to(torch_device)
+
+ pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
+ )
+ pipeline = pipeline.to(torch_device)
+
+ pipeline.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ image_emb, zero_image_emb = pipe_prior(
+ prompt,
+ image=init_image,
+ strength=0.85,
+ generator=generator,
+ negative_prompt="",
+ ).to_tuple()
+
+ output = pipeline(
+ image=init_image,
+ image_embeds=image_emb,
+ negative_image_embeds=zero_image_emb,
+ hint=hint,
+ generator=generator,
+ num_inference_steps=100,
+ height=512,
+ width=512,
+ strength=0.5,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
diff --git a/tests/pipelines/kandinsky2_2/test_kandinsky_img2img.py b/tests/pipelines/kandinsky2_2/test_kandinsky_img2img.py
new file mode 100644
index 0000000..07a362b
--- /dev/null
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky_img2img.py
@@ -0,0 +1,296 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+
+from diffusers import (
+ DDIMScheduler,
+ KandinskyV22Img2ImgPipeline,
+ KandinskyV22PriorPipeline,
+ UNet2DConditionModel,
+ VQModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class Dummies:
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 32
+
+ @property
+ def dummy_unet(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "in_channels": 4,
+ # out_channels is double in_channels because the model predicts both mean and variance
+ "out_channels": 8,
+ "addition_embed_type": "image",
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "encoder_hid_dim": self.text_embedder_hidden_size,
+ "encoder_hid_dim_type": "image_proj",
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": None,
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 64],
+ "down_block_types": ["DownEncoderBlock2D", "AttnDownEncoderBlock2D"],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": [
+ "AttnUpDecoderBlock2D",
+ "UpDecoderBlock2D",
+ ],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ unet = self.dummy_unet
+ movq = self.dummy_movq
+
+ ddim_config = {
+ "num_train_timesteps": 1000,
+ "beta_schedule": "linear",
+ "beta_start": 0.00085,
+ "beta_end": 0.012,
+ "clip_sample": False,
+ "set_alpha_to_one": False,
+ "steps_offset": 0,
+ "prediction_type": "epsilon",
+ "thresholding": False,
+ }
+
+ scheduler = DDIMScheduler(**ddim_config)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed)).to(device)
+ negative_image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed + 1)).to(
+ device
+ )
+ # create init_image
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((256, 256))
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image": init_image,
+ "image_embeds": image_embeds,
+ "negative_image_embeds": negative_image_embeds,
+ "generator": generator,
+ "height": 64,
+ "width": 64,
+ "num_inference_steps": 10,
+ "guidance_scale": 7.0,
+ "strength": 0.2,
+ "output_type": "np",
+ }
+ return inputs
+
+
+class KandinskyV22Img2ImgPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22Img2ImgPipeline
+ params = ["image_embeds", "negative_image_embeds", "image"]
+ batch_params = [
+ "image_embeds",
+ "negative_image_embeds",
+ "image",
+ ]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "strength",
+ "guidance_scale",
+ "num_inference_steps",
+ "num_images_per_prompt",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+ callback_cfg_params = ["image_embeds"]
+
+ def get_dummy_components(self):
+ dummies = Dummies()
+ return dummies.get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ dummies = Dummies()
+ return dummies.get_dummy_inputs(device=device, seed=seed)
+
+ def test_kandinsky_img2img(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.5712, 0.5443, 0.4725, 0.6195, 0.5184, 0.4651, 0.4473, 0.4590, 0.5016])
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=2e-1)
+
+
+@slow
+@require_torch_gpu
+class KandinskyV22Img2ImgPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinsky_img2img(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinskyv22/kandinskyv22_img2img_frog.npy"
+ )
+
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
+ )
+ prompt = "A red cartoon frog, 4k"
+
+ pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
+ )
+ pipe_prior.to(torch_device)
+
+ pipeline = KandinskyV22Img2ImgPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
+ )
+ pipeline = pipeline.to(torch_device)
+
+ pipeline.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image_emb, zero_image_emb = pipe_prior(
+ prompt,
+ generator=generator,
+ num_inference_steps=5,
+ negative_prompt="",
+ ).to_tuple()
+
+ output = pipeline(
+ image=init_image,
+ image_embeds=image_emb,
+ negative_image_embeds=zero_image_emb,
+ generator=generator,
+ num_inference_steps=100,
+ height=768,
+ width=768,
+ strength=0.2,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
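The img2img integration test exercises the usual two-stage Kandinsky 2.2 flow: the prior maps the prompt to CLIP image embeddings, and the decoder denoises on top of the init image at low `strength`. A condensed sketch of that flow, reusing the checkpoint names from the test and assuming a CUDA device; the local `cat.png` path is a placeholder:

```python
import torch
from diffusers import KandinskyV22Img2ImgPipeline, KandinskyV22PriorPipeline
from diffusers.utils import load_image

prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
decoder = KandinskyV22Img2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("cat.png")  # placeholder local file
image_emb, negative_emb = prior("A red cartoon frog, 4k", num_inference_steps=25).to_tuple()
image = decoder(
    image=init_image,
    image_embeds=image_emb,
    negative_image_embeds=negative_emb,
    strength=0.2,  # low strength keeps most of the original image
    num_inference_steps=50,
    height=768,
    width=768,
).images[0]
```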
diff --git a/tests/pipelines/kandinsky2_2/test_kandinsky_inpaint.py b/tests/pipelines/kandinsky2_2/test_kandinsky_inpaint.py
new file mode 100644
index 0000000..6ec812c
--- /dev/null
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky_inpaint.py
@@ -0,0 +1,351 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+
+from diffusers import (
+ DDIMScheduler,
+ KandinskyV22InpaintPipeline,
+ KandinskyV22PriorPipeline,
+ UNet2DConditionModel,
+ VQModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ is_flaky,
+ load_image,
+ load_numpy,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class Dummies:
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 32
+
+ @property
+ def dummy_unet(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "in_channels": 9,
+ # out_channels is double in_channels because the model predicts both mean and variance
+ "out_channels": 8,
+ "addition_embed_type": "image",
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "encoder_hid_dim": self.text_embedder_hidden_size,
+ "encoder_hid_dim_type": "image_proj",
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": None,
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 64],
+ "down_block_types": ["DownEncoderBlock2D", "AttnDownEncoderBlock2D"],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": [
+ "AttnUpDecoderBlock2D",
+ "UpDecoderBlock2D",
+ ],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ unet = self.dummy_unet
+ movq = self.dummy_movq
+
+ scheduler = DDIMScheduler(
+ num_train_timesteps=1000,
+ beta_schedule="linear",
+ beta_start=0.00085,
+ beta_end=0.012,
+ clip_sample=False,
+ set_alpha_to_one=False,
+ steps_offset=1,
+ prediction_type="epsilon",
+ thresholding=False,
+ )
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed)).to(device)
+ negative_image_embeds = floats_tensor((1, self.text_embedder_hidden_size), rng=random.Random(seed + 1)).to(
+ device
+ )
+ # create init_image
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((256, 256))
+ # create mask
+ mask = np.zeros((64, 64), dtype=np.float32)
+ mask[:32, :32] = 1
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image": init_image,
+ "mask_image": mask,
+ "image_embeds": image_embeds,
+ "negative_image_embeds": negative_image_embeds,
+ "generator": generator,
+ "height": 64,
+ "width": 64,
+ "num_inference_steps": 2,
+ "guidance_scale": 4.0,
+ "output_type": "np",
+ }
+ return inputs
+
+
+class KandinskyV22InpaintPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22InpaintPipeline
+ params = ["image_embeds", "negative_image_embeds", "image", "mask_image"]
+ batch_params = [
+ "image_embeds",
+ "negative_image_embeds",
+ "image",
+ "mask_image",
+ ]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "guidance_scale",
+ "num_inference_steps",
+ "num_images_per_prompt",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+ callback_cfg_params = ["image_embeds", "masked_image", "mask_image"]
+
+ def get_dummy_components(self):
+ dummies = Dummies()
+ return dummies.get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ dummies = Dummies()
+ return dummies.get_dummy_inputs(device=device, seed=seed)
+
+ def test_kandinsky_inpaint(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array(
+ [0.50775903, 0.49527195, 0.48824543, 0.50192237, 0.48644906, 0.49373814, 0.4780598, 0.47234827, 0.48327848]
+ )
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
+
+ @is_flaky()
+ def test_model_cpu_offload_forward_pass(self):
+ super().test_model_cpu_offload_forward_pass(expected_max_diff=8e-4)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=5e-4)
+
+ def test_sequential_cpu_offload_forward_pass(self):
+ super().test_sequential_cpu_offload_forward_pass(expected_max_diff=5e-4)
+
+ # override the default test: the mask must be zeroed out as well to make sure the final latent is all zero
+ def test_callback_inputs(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ self.assertTrue(
+ hasattr(pipe, "_callback_tensor_inputs"),
+ f" {self.pipeline_class} should have `_callback_tensor_inputs` that defines a list of tensor variables its callback function can use as inputs",
+ )
+
+ def callback_inputs_test(pipe, i, t, callback_kwargs):
+ missing_callback_inputs = set()
+ for v in pipe._callback_tensor_inputs:
+ if v not in callback_kwargs:
+ missing_callback_inputs.add(v)
+ self.assertTrue(
+ len(missing_callback_inputs) == 0, f"Missing callback tensor inputs: {missing_callback_inputs}"
+ )
+ last_i = pipe.num_timesteps - 1
+ if i == last_i:
+ callback_kwargs["latents"] = torch.zeros_like(callback_kwargs["latents"])
+ callback_kwargs["mask_image"] = torch.zeros_like(callback_kwargs["mask_image"])
+ return callback_kwargs
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["callback_on_step_end"] = callback_inputs_test
+ inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
+ inputs["output_type"] = "latent"
+
+ output = pipe(**inputs)[0]
+ assert output.abs().sum() == 0
+
+
+@slow
+@require_torch_gpu
+class KandinskyV22InpaintPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinsky_inpaint(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/kandinskyv22/kandinskyv22_inpaint_cat_with_hat_fp16.npy"
+ )
+
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
+ )
+ mask = np.zeros((768, 768), dtype=np.float32)
+ mask[:250, 250:-250] = 1
+
+ prompt = "a hat"
+
+ pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
+ )
+ pipe_prior.to(torch_device)
+
+ pipeline = KandinskyV22InpaintPipeline.from_pretrained(
+ "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
+ )
+ pipeline = pipeline.to(torch_device)
+ pipeline.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image_emb, zero_image_emb = pipe_prior(
+ prompt,
+ generator=generator,
+ num_inference_steps=5,
+ negative_prompt="",
+ ).to_tuple()
+
+ output = pipeline(
+ image=init_image,
+ mask_image=mask,
+ image_embeds=image_emb,
+ negative_image_embeds=zero_image_emb,
+ generator=generator,
+ num_inference_steps=100,
+ height=768,
+ width=768,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
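Both inpaint tests build the mask as a `float32` array the size of the image, where `1.0` marks pixels to repaint and `0.0` keeps the original content. A minimal sketch of the mask used in the integration test ("a hat" painted into a band across the top of the cat image):

```python
import numpy as np

height, width = 768, 768
mask = np.zeros((height, width), dtype=np.float32)
mask[:250, 250:-250] = 1.0  # repaint only this region; everything else is preserved
```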
diff --git a/tests/pipelines/kandinsky2_2/test_kandinsky_prior.py b/tests/pipelines/kandinsky2_2/test_kandinsky_prior.py
new file mode 100644
index 0000000..c19c574
--- /dev/null
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky_prior.py
@@ -0,0 +1,278 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import inspect
+import unittest
+
+import numpy as np
+import torch
+from torch import nn
+from transformers import (
+ CLIPImageProcessor,
+ CLIPTextConfig,
+ CLIPTextModelWithProjection,
+ CLIPTokenizer,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import KandinskyV22PriorPipeline, PriorTransformer, UnCLIPScheduler
+from diffusers.utils.testing_utils import enable_full_determinism, skip_mps, torch_device
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class Dummies:
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 100
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=self.text_embedder_hidden_size,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModelWithProjection(config)
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "num_attention_heads": 2,
+ "attention_head_dim": 12,
+ "embedding_dim": self.text_embedder_hidden_size,
+ "num_layers": 1,
+ }
+
+ model = PriorTransformer(**model_kwargs)
+ # clip_std and clip_mean are initialized to 0, so PriorTransformer.post_process_latents would always return 0 - set clip_std to 1 so it doesn't
+ model.clip_std = nn.Parameter(torch.ones(model.clip_std.shape))
+ return model
+
+ @property
+ def dummy_image_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPVisionConfig(
+ hidden_size=self.text_embedder_hidden_size,
+ image_size=224,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ num_attention_heads=4,
+ num_channels=3,
+ num_hidden_layers=5,
+ patch_size=14,
+ )
+
+ model = CLIPVisionModelWithProjection(config)
+ return model
+
+ @property
+ def dummy_image_processor(self):
+ image_processor = CLIPImageProcessor(
+ crop_size=224,
+ do_center_crop=True,
+ do_normalize=True,
+ do_resize=True,
+ image_mean=[0.48145466, 0.4578275, 0.40821073],
+ image_std=[0.26862954, 0.26130258, 0.27577711],
+ resample=3,
+ size=224,
+ )
+
+ return image_processor
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+ image_encoder = self.dummy_image_encoder
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ image_processor = self.dummy_image_processor
+
+ scheduler = UnCLIPScheduler(
+ variance_type="fixed_small_log",
+ prediction_type="sample",
+ num_train_timesteps=1000,
+ clip_sample=True,
+ clip_sample_range=10.0,
+ )
+
+ components = {
+ "prior": prior,
+ "image_encoder": image_encoder,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "scheduler": scheduler,
+ "image_processor": image_processor,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "generator": generator,
+ "guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+
+class KandinskyV22PriorPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22PriorPipeline
+ params = ["prompt"]
+ batch_params = ["prompt", "negative_prompt"]
+ required_optional_params = [
+ "num_images_per_prompt",
+ "generator",
+ "num_inference_steps",
+ "latents",
+ "negative_prompt",
+ "guidance_scale",
+ "output_type",
+ "return_dict",
+ ]
+ callback_cfg_params = ["prompt_embeds", "text_encoder_hidden_states", "text_mask"]
+ test_xformers_attention = False
+
+ def get_dummy_components(self):
+ dummies = Dummies()
+ return dummies.get_dummy_components()
+
+ def get_dummy_inputs(self, device, seed=0):
+ dummies = Dummies()
+ return dummies.get_dummy_inputs(device=device, seed=seed)
+
+ def test_kandinsky_prior(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.image_embeds
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -10:]
+ image_from_tuple_slice = image_from_tuple[0, -10:]
+
+ assert image.shape == (1, 32)
+
+ expected_slice = np.array(
+ [-0.0532, 1.7120, 0.3656, -1.0852, -0.8946, -1.1756, 0.4348, 0.2482, 0.5146, -0.1156]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ @skip_mps
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=1e-3)
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+ test_mean_pixel_difference = False
+
+ self._test_attention_slicing_forward_pass(
+ test_max_difference=test_max_difference,
+ test_mean_pixel_difference=test_mean_pixel_difference,
+ )
+
+ # override the default test because the prior has no "latent" output_type; use "pt" instead
+ def test_callback_inputs(self):
+ sig = inspect.signature(self.pipeline_class.__call__)
+
+ if not ("callback_on_step_end_tensor_inputs" in sig.parameters and "callback_on_step_end" in sig.parameters):
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ self.assertTrue(
+ hasattr(pipe, "_callback_tensor_inputs"),
+ f" {self.pipeline_class} should have `_callback_tensor_inputs` that defines a list of tensor variables its callback function can use as inputs",
+ )
+
+ def callback_inputs_test(pipe, i, t, callback_kwargs):
+ missing_callback_inputs = set()
+ for v in pipe._callback_tensor_inputs:
+ if v not in callback_kwargs:
+ missing_callback_inputs.add(v)
+ self.assertTrue(
+ len(missing_callback_inputs) == 0, f"Missing callback tensor inputs: {missing_callback_inputs}"
+ )
+ last_i = pipe.num_timesteps - 1
+ if i == last_i:
+ callback_kwargs["latents"] = torch.zeros_like(callback_kwargs["latents"])
+ return callback_kwargs
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["callback_on_step_end"] = callback_inputs_test
+ inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
+ inputs["num_inference_steps"] = 2
+ inputs["output_type"] = "pt"
+
+ output = pipe(**inputs)[0]
+ assert output.abs().sum() == 0
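The callback override above relies on the prior returning embeddings rather than images, so it requests `output_type="pt"` and zeroes the latents on the final step. A hedged sketch of the same hook outside the test harness, assuming `pipe` is a `KandinskyV22PriorPipeline` that exposes the `callback_on_step_end` arguments checked via `inspect.signature` above:

```python
import torch

def zero_last_step(pipe, i, t, callback_kwargs):
    # On the last denoising step, overwrite the prior latents with zeros so the
    # post-processed embeddings come out all zero (what the test asserts).
    if i == pipe.num_timesteps - 1:
        callback_kwargs["latents"] = torch.zeros_like(callback_kwargs["latents"])
    return callback_kwargs

embeds = pipe(
    "horse",
    num_inference_steps=2,
    output_type="pt",
    callback_on_step_end=zero_last_step,
    callback_on_step_end_tensor_inputs=["latents"],
)[0]
assert embeds.abs().sum() == 0
```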
diff --git a/tests/pipelines/kandinsky2_2/test_kandinsky_prior_emb2emb.py b/tests/pipelines/kandinsky2_2/test_kandinsky_prior_emb2emb.py
new file mode 100644
index 0000000..7d66fb9
--- /dev/null
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky_prior_emb2emb.py
@@ -0,0 +1,247 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from torch import nn
+from transformers import (
+ CLIPImageProcessor,
+ CLIPTextConfig,
+ CLIPTextModelWithProjection,
+ CLIPTokenizer,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import KandinskyV22PriorEmb2EmbPipeline, PriorTransformer, UnCLIPScheduler
+from diffusers.utils.testing_utils import enable_full_determinism, floats_tensor, skip_mps, torch_device
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class KandinskyV22PriorEmb2EmbPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = KandinskyV22PriorEmb2EmbPipeline
+ params = ["prompt", "image"]
+ batch_params = ["prompt", "image"]
+ required_optional_params = [
+ "num_images_per_prompt",
+ "strength",
+ "generator",
+ "num_inference_steps",
+ "negative_prompt",
+ "guidance_scale",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 100
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=self.text_embedder_hidden_size,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModelWithProjection(config)
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "num_attention_heads": 2,
+ "attention_head_dim": 12,
+ "embedding_dim": self.text_embedder_hidden_size,
+ "num_layers": 1,
+ }
+
+ model = PriorTransformer(**model_kwargs)
+ # clip_std and clip_mean are initialized to 0, so PriorTransformer.post_process_latents would always return 0 - set clip_std to 1 so it doesn't
+ model.clip_std = nn.Parameter(torch.ones(model.clip_std.shape))
+ return model
+
+ @property
+ def dummy_image_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPVisionConfig(
+ hidden_size=self.text_embedder_hidden_size,
+ image_size=224,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ num_attention_heads=4,
+ num_channels=3,
+ num_hidden_layers=5,
+ patch_size=14,
+ )
+
+ model = CLIPVisionModelWithProjection(config)
+ return model
+
+ @property
+ def dummy_image_processor(self):
+ image_processor = CLIPImageProcessor(
+ crop_size=224,
+ do_center_crop=True,
+ do_normalize=True,
+ do_resize=True,
+ image_mean=[0.48145466, 0.4578275, 0.40821073],
+ image_std=[0.26862954, 0.26130258, 0.27577711],
+ resample=3,
+ size=224,
+ )
+
+ return image_processor
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+ image_encoder = self.dummy_image_encoder
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ image_processor = self.dummy_image_processor
+
+ scheduler = UnCLIPScheduler(
+ variance_type="fixed_small_log",
+ prediction_type="sample",
+ num_train_timesteps=1000,
+ clip_sample=True,
+ clip_sample_range=10.0,
+ )
+
+ components = {
+ "prior": prior,
+ "image_encoder": image_encoder,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "scheduler": scheduler,
+ "image_processor": image_processor,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((256, 256))
+
+ inputs = {
+ "prompt": "horse",
+ "image": init_image,
+ "strength": 0.5,
+ "generator": generator,
+ "guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_kandinsky_prior_emb2emb(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.image_embeds
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -10:]
+ image_from_tuple_slice = image_from_tuple[0, -10:]
+
+ assert image.shape == (1, 32)
+
+ expected_slice = np.array(
+ [
+ 0.1071284,
+ 1.3330271,
+ 0.61260223,
+ -0.6691065,
+ -0.3846852,
+ -1.0303661,
+ 0.22716111,
+ 0.03348901,
+ 0.30040675,
+ -0.24805029,
+ ]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ @skip_mps
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+ test_mean_pixel_difference = False
+
+ self._test_attention_slicing_forward_pass(
+ test_max_difference=test_max_difference,
+ test_mean_pixel_difference=test_mean_pixel_difference,
+ )
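These fast tests build their dummy init images from random tensors rather than real files. The pattern, shown on its own below exactly as `get_dummy_inputs` uses it, converts a `(1, 3, 64, 64)` float tensor into a small PIL image:

```python
import random

import numpy as np
from PIL import Image

from diffusers.utils.testing_utils import floats_tensor

seed = 0
image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed))  # values in [0, 1)
image = image.cpu().permute(0, 2, 3, 1)[0]                      # CHW -> HWC, drop batch dim
# Casting [0, 1) floats to uint8 truncates to 0, so this is effectively a dark
# placeholder image, which is all the fast tests need.
init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((256, 256))
```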
diff --git a/tests/pipelines/kandinsky3/__init__.py b/tests/pipelines/kandinsky3/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/kandinsky3/test_kandinsky3.py b/tests/pipelines/kandinsky3/test_kandinsky3.py
new file mode 100644
index 0000000..63b4165
--- /dev/null
+++ b/tests/pipelines/kandinsky3/test_kandinsky3.py
@@ -0,0 +1,233 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import AutoTokenizer, T5EncoderModel
+
+from diffusers import (
+ AutoPipelineForImage2Image,
+ AutoPipelineForText2Image,
+ Kandinsky3Pipeline,
+ Kandinsky3UNet,
+ VQModel,
+)
+from diffusers.image_processor import VaeImageProcessor
+from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_image,
+ require_torch_gpu,
+ slow,
+)
+
+from ..pipeline_params import (
+ TEXT_TO_IMAGE_BATCH_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class Kandinsky3PipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = Kandinsky3Pipeline
+ params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs"}
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS
+ test_xformers_attention = False
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 64],
+ "down_block_types": ["DownEncoderBlock2D", "AttnDownEncoderBlock2D"],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": [
+ "AttnUpDecoderBlock2D",
+ "UpDecoderBlock2D",
+ ],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = Kandinsky3UNet(
+ in_channels=4,
+ time_embedding_dim=4,
+ groups=2,
+ attention_head_dim=4,
+ layers_per_block=3,
+ block_out_channels=(32, 64),
+ cross_attention_dim=4,
+ encoder_hid_dim=32,
+ )
+ scheduler = DDPMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="squaredcos_cap_v2",
+ clip_sample=True,
+ thresholding=False,
+ )
+ torch.manual_seed(0)
+ movq = self.dummy_movq
+ torch.manual_seed(0)
+ text_encoder = T5EncoderModel.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ torch.manual_seed(0)
+ tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ "width": 16,
+ "height": 16,
+ }
+ return inputs
+
+ def test_kandinsky3(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 16, 16, 3)
+
+ expected_slice = np.array([0.3768, 0.4373, 0.4865, 0.4890, 0.4299, 0.5122, 0.4921, 0.4924, 0.5599])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=1e-1)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+
+@slow
+@require_torch_gpu
+class Kandinsky3PipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinskyV3(self):
+ pipe = AutoPipelineForText2Image.from_pretrained(
+ "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]
+
+ assert image.size == (1024, 1024)
+
+ expected_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png"
+ )
+
+ image_processor = VaeImageProcessor()
+
+ image_np = image_processor.pil_to_numpy(image)
+ expected_image_np = image_processor.pil_to_numpy(expected_image)
+
+ self.assertTrue(np.allclose(image_np, expected_image_np, atol=5e-2))
+
+ def test_kandinskyV3_img2img(self):
+ pipe = AutoPipelineForImage2Image.from_pretrained(
+ "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png"
+ )
+ w, h = 512, 512
+ image = image.resize((w, h), resample=Image.BICUBIC, reducing_gap=1)
+ prompt = "A painting of the inside of a subway train with tiny raccoons."
+
+ image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]
+
+ assert image.size == (512, 512)
+
+ expected_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/i2i.png"
+ )
+
+ image_processor = VaeImageProcessor()
+
+ image_np = image_processor.pil_to_numpy(image)
+ expected_image_np = image_processor.pil_to_numpy(expected_image)
+
+ self.assertTrue(np.allclose(image_np, expected_image_np, atol=5e-2))
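The Kandinsky 3 integration tests compare generated and reference images by converting both to float arrays in `[0, 1]` and allowing a loose `5e-2` absolute tolerance. A small helper capturing that comparison, assuming two PIL images of the same size:

```python
import numpy as np
from diffusers.image_processor import VaeImageProcessor

def images_close(image, expected_image, atol=5e-2):
    # Convert both PIL images to (1, H, W, 3) float arrays in [0, 1] and compare.
    processor = VaeImageProcessor()
    return np.allclose(
        processor.pil_to_numpy(image),
        processor.pil_to_numpy(expected_image),
        atol=atol,
    )
```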
diff --git a/tests/pipelines/kandinsky3/test_kandinsky3_img2img.py b/tests/pipelines/kandinsky3/test_kandinsky3_img2img.py
new file mode 100644
index 0000000..fbaaaf1
--- /dev/null
+++ b/tests/pipelines/kandinsky3/test_kandinsky3_img2img.py
@@ -0,0 +1,225 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import AutoTokenizer, T5EncoderModel
+
+from diffusers import (
+ AutoPipelineForImage2Image,
+ Kandinsky3Img2ImgPipeline,
+ Kandinsky3UNet,
+ VQModel,
+)
+from diffusers.image_processor import VaeImageProcessor
+from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ require_torch_gpu,
+ slow,
+)
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class Kandinsky3Img2ImgPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = Kandinsky3Img2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS
+ test_xformers_attention = False
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "num_images_per_prompt",
+ "generator",
+ "output_type",
+ "return_dict",
+ ]
+ )
+
+ @property
+ def dummy_movq_kwargs(self):
+ return {
+ "block_out_channels": [32, 64],
+ "down_block_types": ["DownEncoderBlock2D", "AttnDownEncoderBlock2D"],
+ "in_channels": 3,
+ "latent_channels": 4,
+ "layers_per_block": 1,
+ "norm_num_groups": 8,
+ "norm_type": "spatial",
+ "num_vq_embeddings": 12,
+ "out_channels": 3,
+ "up_block_types": [
+ "AttnUpDecoderBlock2D",
+ "UpDecoderBlock2D",
+ ],
+ "vq_embed_dim": 4,
+ }
+
+ @property
+ def dummy_movq(self):
+ torch.manual_seed(0)
+ model = VQModel(**self.dummy_movq_kwargs)
+ return model
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = Kandinsky3UNet(
+ in_channels=4,
+ time_embedding_dim=4,
+ groups=2,
+ attention_head_dim=4,
+ layers_per_block=3,
+ block_out_channels=(32, 64),
+ cross_attention_dim=4,
+ encoder_hid_dim=32,
+ )
+ scheduler = DDPMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="squaredcos_cap_v2",
+ clip_sample=True,
+ thresholding=False,
+ )
+ torch.manual_seed(0)
+ movq = self.dummy_movq
+ torch.manual_seed(0)
+ text_encoder = T5EncoderModel.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ torch.manual_seed(0)
+ tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "movq": movq,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ # create init_image
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB")
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": init_image,
+ "generator": generator,
+ "strength": 0.75,
+ "num_inference_steps": 10,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_kandinsky3_img2img(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array(
+ [0.576259, 0.6132097, 0.41703486, 0.603196, 0.62062526, 0.4655338, 0.5434324, 0.5660727, 0.65433365]
+ )
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=1e-1)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+
+@slow
+@require_torch_gpu
+class Kandinsky3Img2ImgPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_kandinskyV3_img2img(self):
+ pipe = AutoPipelineForImage2Image.from_pretrained(
+ "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png"
+ )
+ w, h = 512, 512
+ image = image.resize((w, h), resample=Image.BICUBIC, reducing_gap=1)
+ prompt = "A painting of the inside of a subway train with tiny raccoons."
+
+ image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]
+
+ assert image.size == (512, 512)
+
+ expected_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/i2i.png"
+ )
+
+ image_processor = VaeImageProcessor()
+
+ image_np = image_processor.pil_to_numpy(image)
+ expected_image_np = image_processor.pil_to_numpy(expected_image)
+
+ self.assertTrue(np.allclose(image_np, expected_image_np, atol=5e-2))
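The integration test above runs Kandinsky 3 img2img through `AutoPipelineForImage2Image` with model CPU offload so the fp16 checkpoint fits on smaller GPUs. A condensed sketch of that flow; `"subway.png"` is a placeholder for the reference image the test downloads:

```python
import torch
from PIL import Image

from diffusers import AutoPipelineForImage2Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # move submodules to GPU only while they run

init_image = Image.open("subway.png").resize((512, 512), resample=Image.BICUBIC)  # placeholder
image = pipe(
    "A painting of the inside of a subway train with tiny raccoons.",
    image=init_image,
    strength=0.75,
    num_inference_steps=25,
).images[0]
```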
diff --git a/tests/pipelines/latent_consistency_models/__init__.py b/tests/pipelines/latent_consistency_models/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/latent_consistency_models/test_latent_consistency_models.py b/tests/pipelines/latent_consistency_models/test_latent_consistency_models.py
new file mode 100644
index 0000000..eaf8fa2
--- /dev/null
+++ b/tests/pipelines/latent_consistency_models/test_latent_consistency_models.py
@@ -0,0 +1,259 @@
+import gc
+import inspect
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ LatentConsistencyModelPipeline,
+ LCMScheduler,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import IPAdapterTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class LatentConsistencyModelPipelineFastTests(
+ IPAdapterTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = LatentConsistencyModelPipeline
+ params = TEXT_TO_IMAGE_PARAMS - {"negative_prompt", "negative_prompt_embeds"}
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS - {"negative_prompt"}
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=1,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=2,
+ time_cond_proj_dim=32,
+ )
+ scheduler = LCMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[4, 8],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=64,
+ layer_norm_eps=1e-05,
+ num_attention_heads=8,
+ num_hidden_layers=3,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ "requires_safety_checker": False,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_lcm_onestep(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ pipe = LatentConsistencyModelPipeline(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["num_inference_steps"] = 1
+ output = pipe(**inputs)
+ image = output.images
+ assert image.shape == (1, 64, 64, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.1441, 0.5304, 0.5452, 0.1361, 0.4011, 0.4370, 0.5326, 0.3492, 0.3637])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_lcm_multistep(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ pipe = LatentConsistencyModelPipeline(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = pipe(**inputs)
+ image = output.images
+ assert image.shape == (1, 64, 64, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.1403, 0.5072, 0.5316, 0.1202, 0.3865, 0.4211, 0.5363, 0.3557, 0.3645])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ pipe = LatentConsistencyModelPipeline(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ output = pipe(**inputs)
+ image = output.images
+ assert image.shape == (1, 64, 64, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.1403, 0.5072, 0.5316, 0.1202, 0.3865, 0.4211, 0.5363, 0.3557, 0.3645])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=5e-4)
+
+ # skip because the LCM pipeline applies CFG differently
+ def test_callback_cfg(self):
+ pass
+
+ # override default test because the final latent variable is "denoised" instead of "latents"
+ def test_callback_inputs(self):
+ sig = inspect.signature(self.pipeline_class.__call__)
+
+ if not ("callback_on_step_end_tensor_inputs" in sig.parameters and "callback_on_step_end" in sig.parameters):
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ self.assertTrue(
+ hasattr(pipe, "_callback_tensor_inputs"),
+ f" {self.pipeline_class} should have `_callback_tensor_inputs` that defines a list of tensor variables its callback function can use as inputs",
+ )
+
+ def callback_inputs_test(pipe, i, t, callback_kwargs):
+ missing_callback_inputs = set()
+ for v in pipe._callback_tensor_inputs:
+ if v not in callback_kwargs:
+ missing_callback_inputs.add(v)
+ self.assertTrue(
+ len(missing_callback_inputs) == 0, f"Missing callback tensor inputs: {missing_callback_inputs}"
+ )
+ last_i = pipe.num_timesteps - 1
+ if i == last_i:
+ callback_kwargs["denoised"] = torch.zeros_like(callback_kwargs["denoised"])
+ return callback_kwargs
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["callback_on_step_end"] = callback_inputs_test
+ inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
+ inputs["output_type"] = "latent"
+
+ output = pipe(**inputs)[0]
+ assert output.abs().sum() == 0
+
+
+@slow
+@require_torch_gpu
+class LatentConsistencyModelPipelineSlowTests(unittest.TestCase):
+ def setUp(self):
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "a photograph of an astronaut riding a horse",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_lcm_onestep(self):
+ pipe = LatentConsistencyModelPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", safety_checker=None)
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 1
+ image = pipe(**inputs).images
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+ expected_slice = np.array([0.1025, 0.0911, 0.0984, 0.0981, 0.0901, 0.0918, 0.1055, 0.0940, 0.0730])
+ assert np.abs(image_slice - expected_slice).max() < 1e-3
+
+ def test_lcm_multistep(self):
+ pipe = LatentConsistencyModelPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", safety_checker=None)
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+ expected_slice = np.array([0.01855, 0.01855, 0.01489, 0.01392, 0.01782, 0.01465, 0.01831, 0.02539, 0.0])
+ assert np.abs(image_slice - expected_slice).max() < 1e-3
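The slow LCM tests swap in `LCMScheduler` and show that the distilled model produces usable images in very few steps. A minimal sketch of the same setup with the checkpoint used in the tests, generating in four steps on an assumed CUDA device:

```python
import torch
from diffusers import LatentConsistencyModelPipeline, LCMScheduler

pipe = LatentConsistencyModelPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", safety_checker=None
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=4,
    guidance_scale=7.5,
).images[0]
```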
diff --git a/tests/pipelines/latent_consistency_models/test_latent_consistency_models_img2img.py b/tests/pipelines/latent_consistency_models/test_latent_consistency_models_img2img.py
new file mode 100644
index 0000000..cfd596d
--- /dev/null
+++ b/tests/pipelines/latent_consistency_models/test_latent_consistency_models_img2img.py
@@ -0,0 +1,277 @@
+import gc
+import inspect
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ LatentConsistencyModelImg2ImgPipeline,
+ LCMScheduler,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+)
+from ..test_pipelines_common import IPAdapterTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class LatentConsistencyModelImg2ImgPipelineFastTests(
+ IPAdapterTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = LatentConsistencyModelImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width", "negative_prompt", "negative_prompt_embeds"}
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents", "negative_prompt"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=1,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=2,
+ time_cond_proj_dim=32,
+ )
+ scheduler = LCMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[4, 8],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=64,
+ layer_norm_eps=1e-05,
+ num_attention_heads=8,
+ num_hidden_layers=3,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ "requires_safety_checker": False,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image / 2 + 0.5
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_lcm_onestep(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["num_inference_steps"] = 1
+ output = pipe(**inputs)
+ image = output.images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.4388, 0.3717, 0.2202, 0.7213, 0.6370, 0.3664, 0.5815, 0.6080, 0.4977])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_lcm_multistep(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = pipe(**inputs)
+ image = output.images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.4150, 0.3719, 0.2479, 0.6333, 0.6024, 0.3778, 0.5036, 0.5420, 0.4678])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ output = pipe(**inputs)
+ image = output.images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.3994, 0.3471, 0.2540, 0.7030, 0.6193, 0.3645, 0.5777, 0.5850, 0.4965])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=5e-4)
+
+ # override default test because the final latent variable is "denoised" instead of "latents"
+ def test_callback_inputs(self):
+ sig = inspect.signature(self.pipeline_class.__call__)
+
+ if not ("callback_on_step_end_tensor_inputs" in sig.parameters and "callback_on_step_end" in sig.parameters):
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ self.assertTrue(
+ hasattr(pipe, "_callback_tensor_inputs"),
+ f" {self.pipeline_class} should have `_callback_tensor_inputs` that defines a list of tensor variables its callback function can use as inputs",
+ )
+
+ def callback_inputs_test(pipe, i, t, callback_kwargs):
+ missing_callback_inputs = set()
+ for v in pipe._callback_tensor_inputs:
+ if v not in callback_kwargs:
+ missing_callback_inputs.add(v)
+ self.assertTrue(
+ len(missing_callback_inputs) == 0, f"Missing callback tensor inputs: {missing_callback_inputs}"
+ )
+ last_i = pipe.num_timesteps - 1
+ if i == last_i:
+ callback_kwargs["denoised"] = torch.zeros_like(callback_kwargs["denoised"])
+ return callback_kwargs
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["callback_on_step_end"] = callback_inputs_test
+ inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
+ inputs["output_type"] = "latent"
+
+ output = pipe(**inputs)[0]
+ assert output.abs().sum() == 0
+
+
+@slow
+@require_torch_gpu
+class LatentConsistencyModelImg2ImgPipelineSlowTests(unittest.TestCase):
+ def setUp(self):
+ super().setUp()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_img2img/sketch-mountains-input.png"
+ )
+ init_image = init_image.resize((512, 512))
+
+ inputs = {
+ "prompt": "a photograph of an astronaut riding a horse",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "np",
+ "image": init_image,
+ }
+ return inputs
+
+ def test_lcm_onestep(self):
+ pipe = LatentConsistencyModelImg2ImgPipeline.from_pretrained(
+ "SimianLuo/LCM_Dreamshaper_v7", safety_checker=None
+ )
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 1
+ image = pipe(**inputs).images
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+ expected_slice = np.array([0.1950, 0.1961, 0.2308, 0.1786, 0.1837, 0.2320, 0.1898, 0.1885, 0.2309])
+ assert np.abs(image_slice - expected_slice).max() < 1e-3
+
+ def test_lcm_multistep(self):
+ pipe = LatentConsistencyModelImg2ImgPipeline.from_pretrained(
+ "SimianLuo/LCM_Dreamshaper_v7", safety_checker=None
+ )
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, -3:, -3:, -1].flatten()
+ expected_slice = np.array([0.3756, 0.3816, 0.3767, 0.3718, 0.3739, 0.3735, 0.3863, 0.3803, 0.3563])
+ assert np.abs(image_slice - expected_slice).max() < 1e-3
diff --git a/tests/pipelines/latent_diffusion/__init__.py b/tests/pipelines/latent_diffusion/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/latent_diffusion/test_latent_diffusion.py b/tests/pipelines/latent_diffusion/test_latent_diffusion.py
new file mode 100644
index 0000000..4faa0e7
--- /dev/null
+++ b/tests/pipelines/latent_diffusion/test_latent_diffusion.py
@@ -0,0 +1,207 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import AutoencoderKL, DDIMScheduler, LDMTextToImagePipeline, UNet2DConditionModel
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class LDMTextToImagePipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = LDMTextToImagePipeline
+ params = TEXT_TO_IMAGE_PARAMS - {
+ "negative_prompt",
+ "negative_prompt_embeds",
+ "cross_attention_kwargs",
+ "prompt_embeds",
+ }
+ required_optional_params = PipelineTesterMixin.required_optional_params - {
+ "num_images_per_prompt",
+ "callback",
+ "callback_steps",
+ }
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=(32, 64),
+ in_channels=3,
+ out_channels=3,
+ down_block_types=("DownEncoderBlock2D", "DownEncoderBlock2D"),
+ up_block_types=("UpDecoderBlock2D", "UpDecoderBlock2D"),
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vqvae": vae,
+ "bert": text_encoder,
+ "tokenizer": tokenizer,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_inference_text2img(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ pipe = LDMTextToImagePipeline(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 16, 16, 3)
+ expected_slice = np.array([0.6101, 0.6156, 0.5622, 0.4895, 0.6661, 0.3804, 0.5748, 0.6136, 0.5014])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+
+@nightly
+@require_torch_gpu
+class LDMTextToImagePipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, dtype=torch.float32, seed=0):
+ generator = torch.manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 32, 32))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_ldm_default_ddim(self):
+ pipe = LDMTextToImagePipeline.from_pretrained("CompVis/ldm-text2im-large-256").to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 256, 256, 3)
+ expected_slice = np.array([0.51825, 0.52850, 0.52543, 0.54258, 0.52304, 0.52569, 0.54363, 0.55276, 0.56878])
+ max_diff = np.abs(expected_slice - image_slice).max()
+ assert max_diff < 1e-3
+
+
+@nightly
+@require_torch_gpu
+class LDMTextToImagePipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, dtype=torch.float32, seed=0):
+ generator = torch.manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 32, 32))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 50,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_ldm_default_ddim(self):
+ pipe = LDMTextToImagePipeline.from_pretrained("CompVis/ldm-text2im-large-256").to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main/ldm_text2img/ldm_large_256_ddim.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/latent_diffusion/test_latent_diffusion_superresolution.py b/tests/pipelines/latent_diffusion/test_latent_diffusion_superresolution.py
new file mode 100644
index 0000000..a9df2c1
--- /dev/null
+++ b/tests/pipelines/latent_diffusion/test_latent_diffusion_superresolution.py
@@ -0,0 +1,138 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import DDIMScheduler, LDMSuperResolutionPipeline, UNet2DModel, VQModel
+from diffusers.utils import PIL_INTERPOLATION
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ nightly,
+ require_torch,
+ torch_device,
+)
+
+
+enable_full_determinism()
+
+
+class LDMSuperResolutionPipelineFastTests(unittest.TestCase):
+ @property
+ def dummy_image(self):
+ batch_size = 1
+ num_channels = 3
+ sizes = (32, 32)
+
+ image = floats_tensor((batch_size, num_channels) + sizes, rng=random.Random(0)).to(torch_device)
+ return image
+
+ @property
+ def dummy_uncond_unet(self):
+ torch.manual_seed(0)
+ model = UNet2DModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=6,
+ out_channels=3,
+ down_block_types=("DownBlock2D", "AttnDownBlock2D"),
+ up_block_types=("AttnUpBlock2D", "UpBlock2D"),
+ )
+ return model
+
+ @property
+ def dummy_vq_model(self):
+ torch.manual_seed(0)
+ model = VQModel(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=3,
+ )
+ return model
+
+ def test_inference_superresolution(self):
+ device = "cpu"
+ unet = self.dummy_uncond_unet
+ scheduler = DDIMScheduler()
+ vqvae = self.dummy_vq_model
+
+ ldm = LDMSuperResolutionPipeline(unet=unet, vqvae=vqvae, scheduler=scheduler)
+ ldm.to(device)
+ ldm.set_progress_bar_config(disable=None)
+
+ init_image = self.dummy_image.to(device)
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image = ldm(image=init_image, generator=generator, num_inference_steps=2, output_type="numpy").images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.8678, 0.8245, 0.6381, 0.6830, 0.4385, 0.5599, 0.4641, 0.6201, 0.5150])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ @unittest.skipIf(torch_device != "cuda", "This test requires a GPU")
+ def test_inference_superresolution_fp16(self):
+ unet = self.dummy_uncond_unet
+ scheduler = DDIMScheduler()
+ vqvae = self.dummy_vq_model
+
+ # put models in fp16
+ unet = unet.half()
+ vqvae = vqvae.half()
+
+ ldm = LDMSuperResolutionPipeline(unet=unet, vqvae=vqvae, scheduler=scheduler)
+ ldm.to(torch_device)
+ ldm.set_progress_bar_config(disable=None)
+
+ init_image = self.dummy_image.to(torch_device)
+
+ image = ldm(init_image, num_inference_steps=2, output_type="numpy").images
+
+ assert image.shape == (1, 64, 64, 3)
+
+
+@nightly
+@require_torch
+class LDMSuperResolutionPipelineIntegrationTests(unittest.TestCase):
+ def test_inference_superresolution(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/vq_diffusion/teddy_bear_pool.png"
+ )
+ init_image = init_image.resize((64, 64), resample=PIL_INTERPOLATION["lanczos"])
+
+ ldm = LDMSuperResolutionPipeline.from_pretrained("duongna/ldm-super-resolution", device_map="auto")
+ ldm.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(0)
+ image = ldm(image=init_image, generator=generator, num_inference_steps=20, output_type="numpy").images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 256, 256, 3)
+ expected_slice = np.array([0.7644, 0.7679, 0.7642, 0.7633, 0.7666, 0.7560, 0.7425, 0.7257, 0.6907])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
diff --git a/tests/pipelines/musicldm/__init__.py b/tests/pipelines/musicldm/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/musicldm/test_musicldm.py b/tests/pipelines/musicldm/test_musicldm.py
new file mode 100644
index 0000000..779b0cb
--- /dev/null
+++ b/tests/pipelines/musicldm/test_musicldm.py
@@ -0,0 +1,465 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import (
+ ClapAudioConfig,
+ ClapConfig,
+ ClapFeatureExtractor,
+ ClapModel,
+ ClapTextConfig,
+ RobertaTokenizer,
+ SpeechT5HifiGan,
+ SpeechT5HifiGanConfig,
+)
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ LMSDiscreteScheduler,
+ MusicLDMPipeline,
+ PNDMScheduler,
+ UNet2DConditionModel,
+)
+from diffusers.utils import is_xformers_available
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, require_torch_gpu, torch_device
+
+from ..pipeline_params import TEXT_TO_AUDIO_BATCH_PARAMS, TEXT_TO_AUDIO_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class MusicLDMPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = MusicLDMPipeline
+ params = TEXT_TO_AUDIO_PARAMS
+ batch_params = TEXT_TO_AUDIO_BATCH_PARAMS
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "num_waveforms_per_prompt",
+ "generator",
+ "latents",
+ "output_type",
+ "return_dict",
+ "callback",
+ "callback_steps",
+ ]
+ )
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=(32, 64),
+ class_embed_type="simple_projection",
+ projection_class_embeddings_input_dim=32,
+ class_embeddings_concat=True,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=1,
+ out_channels=1,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_branch_config = ClapTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=16,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=2,
+ num_hidden_layers=2,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ audio_branch_config = ClapAudioConfig(
+ spec_size=64,
+ window_size=4,
+ num_mel_bins=64,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ depths=[2, 2],
+ num_attention_heads=[2, 2],
+ num_hidden_layers=2,
+ hidden_size=192,
+ patch_size=2,
+ patch_stride=2,
+ patch_embed_input_channels=4,
+ )
+ text_encoder_config = ClapConfig.from_text_audio_configs(
+ text_config=text_branch_config, audio_config=audio_branch_config, projection_dim=32
+ )
+ text_encoder = ClapModel(text_encoder_config)
+ tokenizer = RobertaTokenizer.from_pretrained("hf-internal-testing/tiny-random-roberta", model_max_length=77)
+ feature_extractor = ClapFeatureExtractor.from_pretrained(
+ "hf-internal-testing/tiny-random-ClapModel", hop_length=7900
+ )
+
+ torch.manual_seed(0)
+ vocoder_config = SpeechT5HifiGanConfig(
+ model_in_dim=8,
+ sampling_rate=16000,
+ upsample_initial_channel=16,
+ upsample_rates=[2, 2],
+ upsample_kernel_sizes=[4, 4],
+ resblock_kernel_sizes=[3, 7],
+ resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5]],
+ normalize_before=False,
+ )
+
+ vocoder = SpeechT5HifiGan(vocoder_config)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "feature_extractor": feature_extractor,
+ "vocoder": vocoder,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A hammer hitting a wooden surface",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ }
+ return inputs
+
+ def test_musicldm_ddim(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ musicldm_pipe = MusicLDMPipeline(**components)
+ musicldm_pipe = musicldm_pipe.to(torch_device)
+ musicldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = musicldm_pipe(**inputs)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 256
+
+ audio_slice = audio[:10]
+ expected_slice = np.array(
+ [-0.0027, -0.0036, -0.0037, -0.0020, -0.0035, -0.0019, -0.0037, -0.0020, -0.0038, -0.0019]
+ )
+
+ assert np.abs(audio_slice - expected_slice).max() < 1e-4
+
+ def test_musicldm_prompt_embeds(self):
+ components = self.get_dummy_components()
+ musicldm_pipe = MusicLDMPipeline(**components)
+ musicldm_pipe = musicldm_pipe.to(torch_device)
+ musicldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = musicldm_pipe(**inputs)
+ audio_1 = output.audios[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ text_inputs = musicldm_pipe.tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=musicldm_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ prompt_embeds = musicldm_pipe.text_encoder.get_text_features(text_inputs)
+
+ inputs["prompt_embeds"] = prompt_embeds
+
+ # forward
+ output = musicldm_pipe(**inputs)
+ audio_2 = output.audios[0]
+
+ assert np.abs(audio_1 - audio_2).max() < 1e-2
+
+ def test_musicldm_negative_prompt_embeds(self):
+ components = self.get_dummy_components()
+ musicldm_pipe = MusicLDMPipeline(**components)
+ musicldm_pipe = musicldm_pipe.to(torch_device)
+ musicldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = musicldm_pipe(**inputs)
+ audio_1 = output.audios[0]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ embeds = []
+ for p in [prompt, negative_prompt]:
+ text_inputs = musicldm_pipe.tokenizer(
+ p,
+ padding="max_length",
+ max_length=musicldm_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ text_embeds = musicldm_pipe.text_encoder.get_text_features(
+ text_inputs,
+ )
+ embeds.append(text_embeds)
+
+ inputs["prompt_embeds"], inputs["negative_prompt_embeds"] = embeds
+
+ # forward
+ output = musicldm_pipe(**inputs)
+ audio_2 = output.audios[0]
+
+ assert np.abs(audio_1 - audio_2).max() < 1e-2
+
+ def test_musicldm_negative_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+ musicldm_pipe = MusicLDMPipeline(**components)
+ musicldm_pipe = musicldm_pipe.to(device)
+ musicldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ negative_prompt = "egg cracking"
+ output = musicldm_pipe(**inputs, negative_prompt=negative_prompt)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 256
+
+ audio_slice = audio[:10]
+ expected_slice = np.array(
+ [-0.0027, -0.0036, -0.0037, -0.0019, -0.0035, -0.0018, -0.0037, -0.0021, -0.0038, -0.0018]
+ )
+
+ assert np.abs(audio_slice - expected_slice).max() < 1e-4
+
+ def test_musicldm_num_waveforms_per_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+ musicldm_pipe = MusicLDMPipeline(**components)
+ musicldm_pipe = musicldm_pipe.to(device)
+ musicldm_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A hammer hitting a wooden surface"
+
+ # test num_waveforms_per_prompt=1 (default)
+ audios = musicldm_pipe(prompt, num_inference_steps=2).audios
+
+ assert audios.shape == (1, 256)
+
+ # test num_waveforms_per_prompt=1 (default) for batch of prompts
+ batch_size = 2
+ audios = musicldm_pipe([prompt] * batch_size, num_inference_steps=2).audios
+
+ assert audios.shape == (batch_size, 256)
+
+ # test num_waveforms_per_prompt for single prompt
+ num_waveforms_per_prompt = 2
+ audios = musicldm_pipe(prompt, num_inference_steps=2, num_waveforms_per_prompt=num_waveforms_per_prompt).audios
+
+ assert audios.shape == (num_waveforms_per_prompt, 256)
+
+ # test num_waveforms_per_prompt for batch of prompts
+ batch_size = 2
+ audios = musicldm_pipe(
+ [prompt] * batch_size, num_inference_steps=2, num_waveforms_per_prompt=num_waveforms_per_prompt
+ ).audios
+
+ assert audios.shape == (batch_size * num_waveforms_per_prompt, 256)
+
+ def test_musicldm_audio_length_in_s(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ musicldm_pipe = MusicLDMPipeline(**components)
+ musicldm_pipe = musicldm_pipe.to(torch_device)
+ musicldm_pipe.set_progress_bar_config(disable=None)
+ vocoder_sampling_rate = musicldm_pipe.vocoder.config.sampling_rate
+
+ inputs = self.get_dummy_inputs(device)
+ output = musicldm_pipe(audio_length_in_s=0.016, **inputs)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) / vocoder_sampling_rate == 0.016
+
+ output = musicldm_pipe(audio_length_in_s=0.032, **inputs)
+ audio = output.audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) / vocoder_sampling_rate == 0.032
+
+ def test_musicldm_vocoder_model_in_dim(self):
+ components = self.get_dummy_components()
+ musicldm_pipe = MusicLDMPipeline(**components)
+ musicldm_pipe = musicldm_pipe.to(torch_device)
+ musicldm_pipe.set_progress_bar_config(disable=None)
+
+ prompt = ["hey"]
+
+ output = musicldm_pipe(prompt, num_inference_steps=1)
+ audio_shape = output.audios.shape
+ assert audio_shape == (1, 256)
+
+ config = musicldm_pipe.vocoder.config
+ config.model_in_dim *= 2
+ musicldm_pipe.vocoder = SpeechT5HifiGan(config).to(torch_device)
+ output = musicldm_pipe(prompt, num_inference_steps=1)
+ audio_shape = output.audios.shape
+ # waveform shape is unchanged, we just have 2x the number of mel channels in the spectrogram
+ assert audio_shape == (1, 256)
+
+ def test_attention_slicing_forward_pass(self):
+ self._test_attention_slicing_forward_pass(test_mean_pixel_difference=False)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical()
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(test_mean_pixel_difference=False)
+
+ def test_to_dtype(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ # The method component.dtype returns the dtype of the first parameter registered in the model, not the
+ # dtype of the entire model. In the case of CLAP, the first parameter is a float64 constant (logit scale)
+ model_dtypes = {key: component.dtype for key, component in components.items() if hasattr(component, "dtype")}
+
+ # Without the logit scale parameters, everything is float32
+ model_dtypes.pop("text_encoder")
+ self.assertTrue(all(dtype == torch.float32 for dtype in model_dtypes.values()))
+
+ # the CLAP sub-models are float32
+ model_dtypes["clap_text_branch"] = components["text_encoder"].text_model.dtype
+ self.assertTrue(all(dtype == torch.float32 for dtype in model_dtypes.values()))
+
+ # Once we send to fp16, all params are in half-precision, including the logit scale
+ pipe.to(dtype=torch.float16)
+ model_dtypes = {key: component.dtype for key, component in components.items() if hasattr(component, "dtype")}
+ self.assertTrue(all(dtype == torch.float16 for dtype in model_dtypes.values()))
+
+
+@nightly
+@require_torch_gpu
+class MusicLDMPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 8, 128, 16))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "A hammer hitting a wooden surface",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 2.5,
+ }
+ return inputs
+
+ def test_musicldm(self):
+ musicldm_pipe = MusicLDMPipeline.from_pretrained("cvssp/musicldm")
+ musicldm_pipe = musicldm_pipe.to(torch_device)
+ musicldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 25
+ audio = musicldm_pipe(**inputs).audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 81952
+
+ # check the portion of the generated audio with the largest dynamic range (reduces flakiness)
+ audio_slice = audio[8680:8690]
+ expected_slice = np.array(
+ [-0.1042, -0.1068, -0.1235, -0.1387, -0.1428, -0.136, -0.1213, -0.1097, -0.0967, -0.0945]
+ )
+ max_diff = np.abs(expected_slice - audio_slice).max()
+ assert max_diff < 1e-3
+
+ def test_musicldm_lms(self):
+ musicldm_pipe = MusicLDMPipeline.from_pretrained("cvssp/musicldm")
+ musicldm_pipe.scheduler = LMSDiscreteScheduler.from_config(musicldm_pipe.scheduler.config)
+ musicldm_pipe = musicldm_pipe.to(torch_device)
+ musicldm_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ audio = musicldm_pipe(**inputs).audios[0]
+
+ assert audio.ndim == 1
+ assert len(audio) == 81952
+
+ # check the portion of the generated audio with the largest dynamic range (reduces flakiness)
+ audio_slice = audio[58020:58030]
+ expected_slice = np.array([0.3592, 0.3477, 0.4084, 0.4665, 0.5048, 0.5891, 0.6461, 0.5579, 0.4595, 0.4403])
+ max_diff = np.abs(expected_slice - audio_slice).max()
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/paint_by_example/__init__.py b/tests/pipelines/paint_by_example/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/paint_by_example/test_paint_by_example.py b/tests/pipelines/paint_by_example/test_paint_by_example.py
new file mode 100644
index 0000000..7771977
--- /dev/null
+++ b/tests/pipelines/paint_by_example/test_paint_by_example.py
@@ -0,0 +1,220 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPImageProcessor, CLIPVisionConfig
+
+from diffusers import AutoencoderKL, PaintByExamplePipeline, PNDMScheduler, UNet2DConditionModel
+from diffusers.pipelines.paint_by_example import PaintByExampleImageEncoder
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ nightly,
+ require_torch_gpu,
+ torch_device,
+)
+
+from ..pipeline_params import IMAGE_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS, IMAGE_GUIDED_IMAGE_INPAINTING_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class PaintByExamplePipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = PaintByExamplePipeline
+ params = IMAGE_GUIDED_IMAGE_INPAINTING_PARAMS
+ batch_params = IMAGE_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ image_params = frozenset([]) # TODO: update image_params once VaeImageProcessor.preprocess is refactored
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=9,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ config = CLIPVisionConfig(
+ hidden_size=32,
+ projection_dim=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ image_size=32,
+ patch_size=4,
+ )
+ image_encoder = PaintByExampleImageEncoder(config, proj_size=32)
+ feature_extractor = CLIPImageProcessor(crop_size=32, size=32)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "image_encoder": image_encoder,
+ "safety_checker": None,
+ "feature_extractor": feature_extractor,
+ }
+ return components
+
+ def convert_to_pt(self, image):
+ image = np.array(image.convert("RGB"))
+ image = image[None].transpose(0, 3, 1, 2)
+ image = torch.from_numpy(image).to(dtype=torch.float32) / 127.5 - 1.0
+ return image
+
+ def get_dummy_inputs(self, device="cpu", seed=0):
+ # TODO: use tensor inputs instead of PIL; this is here just to leave the old expected_slices untouched
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+ mask_image = Image.fromarray(np.uint8(image + 4)).convert("RGB").resize((64, 64))
+ example_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((32, 32))
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "example_image": example_image,
+ "image": init_image,
+ "mask_image": mask_image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_paint_by_example_inpaint(self):
+ components = self.get_dummy_components()
+
+ # make sure the PNDM scheduler skips the PRK steps
+ pipe = PaintByExamplePipeline(**components)
+ pipe = pipe.to("cpu")
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ output = pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4686, 0.5687, 0.4007, 0.5218, 0.5741, 0.4482, 0.4940, 0.4629, 0.4503])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_paint_by_example_image_tensor(self):
+ device = "cpu"
+ inputs = self.get_dummy_inputs()
+ inputs.pop("mask_image")
+ image = self.convert_to_pt(inputs.pop("image"))
+ mask_image = image.clamp(0, 1) / 2
+
+ # make sure the PNDM scheduler skips the PRK steps
+ pipe = PaintByExamplePipeline(**self.get_dummy_components())
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(image=image, mask_image=mask_image[:, 0], **inputs)
+ out_1 = output.images
+
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ mask_image = mask_image.cpu().permute(0, 2, 3, 1)[0]
+
+ image = Image.fromarray(np.uint8(image)).convert("RGB")
+ mask_image = Image.fromarray(np.uint8(mask_image)).convert("RGB")
+
+ output = pipe(**self.get_dummy_inputs())
+ out_2 = output.images
+
+ assert out_1.shape == (1, 64, 64, 3)
+ assert np.abs(out_1.flatten() - out_2.flatten()).max() < 5e-2
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+
+@nightly
+@require_torch_gpu
+class PaintByExamplePipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_paint_by_example(self):
+ # make sure the PNDM scheduler skips the PRK steps
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/paint_by_example/dog_in_bucket.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/paint_by_example/mask.png"
+ )
+ example_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/paint_by_example/panda.jpg"
+ )
+
+ pipe = PaintByExamplePipeline.from_pretrained("Fantasy-Studio/Paint-by-Example")
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(321)
+ output = pipe(
+ image=init_image,
+ mask_image=mask_image,
+ example_image=example_image,
+ generator=generator,
+ guidance_scale=5.0,
+ num_inference_steps=50,
+ output_type="np",
+ )
+
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.4834, 0.4811, 0.4874, 0.5122, 0.5081, 0.5144, 0.5291, 0.5290, 0.5374])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
diff --git a/tests/pipelines/pia/__init__.py b/tests/pipelines/pia/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/pia/test_pia.py b/tests/pipelines/pia/test_pia.py
new file mode 100644
index 0000000..2813dc7
--- /dev/null
+++ b/tests/pipelines/pia/test_pia.py
@@ -0,0 +1,311 @@
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+import diffusers
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ MotionAdapter,
+ PIAPipeline,
+ UNet2DConditionModel,
+ UNetMotionModel,
+)
+from diffusers.utils import is_xformers_available, logging
+from diffusers.utils.testing_utils import floats_tensor, torch_device
+
+from ..test_pipelines_common import IPAdapterTesterMixin, PipelineTesterMixin
+
+
+def to_np(tensor):
+ if isinstance(tensor, torch.Tensor):
+ tensor = tensor.detach().cpu().numpy()
+
+ return tensor
+
+
+class PIAPipelineFastTests(IPAdapterTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = PIAPipeline
+ params = frozenset(
+ [
+ "prompt",
+ "height",
+ "width",
+ "guidance_scale",
+ "negative_prompt",
+ "prompt_embeds",
+ "negative_prompt_embeds",
+ "cross_attention_kwargs",
+ ]
+ )
+ batch_params = frozenset(["prompt", "image", "generator"])
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "generator",
+ "latents",
+ "return_dict",
+ "callback_on_step_end",
+ "callback_on_step_end_tensor_inputs",
+ ]
+ )
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=2,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="linear",
+ clip_sample=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ motion_adapter = MotionAdapter(
+ block_out_channels=(32, 64),
+ motion_layers_per_block=2,
+ motion_norm_num_groups=2,
+ motion_num_attention_heads=4,
+ conv_in_channels=9,
+ )
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "motion_adapter": motion_adapter,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ inputs = {
+ "image": image,
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 7.5,
+ "output_type": "pt",
+ }
+ return inputs
+
+ def test_motion_unet_loading(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+
+ assert isinstance(pipe.unet, UNetMotionModel)
+
+ @unittest.skip("Attention slicing is not enabled in this pipeline")
+ def test_attention_slicing_forward_pass(self):
+ pass
+
+ def test_inference_batch_single_identical(
+ self,
+ batch_size=2,
+ expected_max_diff=1e-4,
+ additional_params_copy_to_batched_inputs=["num_inference_steps"],
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for components in pipe.components.values():
+ if hasattr(components, "set_default_attn_processor"):
+ components.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_dummy_inputs(torch_device)
+ # Reset generator in case it has been used in self.get_dummy_inputs
+ inputs["generator"] = self.get_generator(0)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ batched_inputs = {}
+ batched_inputs.update(inputs)
+
+ for name in self.batch_params:
+ if name not in inputs:
+ continue
+
+ value = inputs[name]
+ if name == "prompt":
+ len_prompt = len(value)
+ batched_inputs[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+ batched_inputs[name][-1] = 100 * "very long"
+
+ else:
+ batched_inputs[name] = batch_size * [value]
+
+ if "generator" in inputs:
+ batched_inputs["generator"] = [self.get_generator(i) for i in range(batch_size)]
+
+ if "batch_size" in inputs:
+ batched_inputs["batch_size"] = batch_size
+
+ for arg in additional_params_copy_to_batched_inputs:
+ batched_inputs[arg] = inputs[arg]
+
+ output = pipe(**inputs)
+ output_batch = pipe(**batched_inputs)
+
+ assert output_batch[0].shape[0] == batch_size
+
+ max_diff = np.abs(to_np(output_batch[0][0]) - to_np(output[0][0])).max()
+ assert max_diff < expected_max_diff
+
+ @unittest.skipIf(torch_device != "cuda", reason="CUDA and CPU are required to switch devices")
+ def test_to_device(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe.to("cpu")
+ # pipeline creates a new motion UNet under the hood. So we need to check the device from pipe.components
+ model_devices = [
+ component.device.type for component in pipe.components.values() if hasattr(component, "device")
+ ]
+ self.assertTrue(all(device == "cpu" for device in model_devices))
+
+ output_cpu = pipe(**self.get_dummy_inputs("cpu"))[0]
+ self.assertTrue(np.isnan(output_cpu).sum() == 0)
+
+ pipe.to("cuda")
+ model_devices = [
+ component.device.type for component in pipe.components.values() if hasattr(component, "device")
+ ]
+ self.assertTrue(all(device == "cuda" for device in model_devices))
+
+ output_cuda = pipe(**self.get_dummy_inputs("cuda"))[0]
+ self.assertTrue(np.isnan(to_np(output_cuda)).sum() == 0)
+
+ def test_to_dtype(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ # pipeline creates a new motion UNet under the hood. So we need to check the dtype from pipe.components
+ model_dtypes = [component.dtype for component in pipe.components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float32 for dtype in model_dtypes))
+
+ pipe.to(dtype=torch.float16)
+ model_dtypes = [component.dtype for component in pipe.components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float16 for dtype in model_dtypes))
+
+ def test_prompt_embeds(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.to(torch_device)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs.pop("prompt")
+ inputs["prompt_embeds"] = torch.randn((1, 4, 32), device=torch_device)
+ pipe(**inputs)
+
+ def test_free_init(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.to(torch_device)
+
+ inputs_normal = self.get_dummy_inputs(torch_device)
+ frames_normal = pipe(**inputs_normal).frames[0]
+
+ pipe.enable_free_init(
+ num_iters=2,
+ use_fast_sampling=True,
+ method="butterworth",
+ order=4,
+ spatial_stop_frequency=0.25,
+ temporal_stop_frequency=0.25,
+ )
+ inputs_enable_free_init = self.get_dummy_inputs(torch_device)
+ frames_enable_free_init = pipe(**inputs_enable_free_init).frames[0]
+
+ pipe.disable_free_init()
+ inputs_disable_free_init = self.get_dummy_inputs(torch_device)
+ frames_disable_free_init = pipe(**inputs_disable_free_init).frames[0]
+
+ sum_enabled = np.abs(to_np(frames_normal) - to_np(frames_enable_free_init)).sum()
+ max_diff_disabled = np.abs(to_np(frames_normal) - to_np(frames_disable_free_init)).max()
+ self.assertGreater(
+ sum_enabled, 1e1, "Enabling of FreeInit should lead to results different from the default pipeline results"
+ )
+ self.assertLess(
+ max_diff_disabled,
+ 1e-4,
+ "Disabling of FreeInit should lead to results similar to the default pipeline results",
+ )
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_without_offload = pipe(**inputs).frames[0]
+ output_without_offload = (
+ output_without_offload.cpu() if torch.is_tensor(output_without_offload) else output_without_offload
+ )
+
+ pipe.enable_xformers_memory_efficient_attention()
+ inputs = self.get_dummy_inputs(torch_device)
+ output_with_offload = pipe(**inputs).frames[0]
+ output_with_offload = (
+ output_with_offload.cpu() if torch.is_tensor(output_with_offload) else output_with_offload
+ )
+
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, 1e-4, "XFormers attention should not affect the inference results")
diff --git a/tests/pipelines/pipeline_params.py b/tests/pipelines/pipeline_params.py
new file mode 100644
index 0000000..4e2c4dc
--- /dev/null
+++ b/tests/pipelines/pipeline_params.py
@@ -0,0 +1,129 @@
+# These are canonical sets of parameters for different types of pipelines.
+# They are set on subclasses of `PipelineTesterMixin` as `params` and
+# `batch_params`.
+#
+# If a pipeline's set of arguments has minor changes from one of the common sets
+# of arguments, do not make modifications to the existing common sets of arguments.
+# E.g. a text-to-image pipeline with non-configurable height and width arguments
+# should set its attribute as `params = TEXT_TO_IMAGE_PARAMS - {'height', 'width'}`.
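+#
+# As a minimal illustrative sketch (the pipeline and test class names below are
+# hypothetical and not part of this patch), a tester subclass typically composes
+# these sets like so:
+#
+#     class MyTextToImagePipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+#         pipeline_class = MyTextToImagePipeline  # hypothetical pipeline
+#         params = TEXT_TO_IMAGE_PARAMS - {"height", "width"}
+#         batch_params = TEXT_TO_IMAGE_BATCH_PARAMS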
+
+TEXT_TO_IMAGE_PARAMS = frozenset(
+ [
+ "prompt",
+ "height",
+ "width",
+ "guidance_scale",
+ "negative_prompt",
+ "prompt_embeds",
+ "negative_prompt_embeds",
+ "cross_attention_kwargs",
+ ]
+)
+
+TEXT_TO_IMAGE_BATCH_PARAMS = frozenset(["prompt", "negative_prompt"])
+
+TEXT_TO_IMAGE_IMAGE_PARAMS = frozenset([])
+
+IMAGE_TO_IMAGE_IMAGE_PARAMS = frozenset(["image"])
+
+IMAGE_VARIATION_PARAMS = frozenset(
+ [
+ "image",
+ "height",
+ "width",
+ "guidance_scale",
+ ]
+)
+
+IMAGE_VARIATION_BATCH_PARAMS = frozenset(["image"])
+
+TEXT_GUIDED_IMAGE_VARIATION_PARAMS = frozenset(
+ [
+ "prompt",
+ "image",
+ "height",
+ "width",
+ "guidance_scale",
+ "negative_prompt",
+ "prompt_embeds",
+ "negative_prompt_embeds",
+ ]
+)
+
+TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS = frozenset(["prompt", "image", "negative_prompt"])
+
+TEXT_GUIDED_IMAGE_INPAINTING_PARAMS = frozenset(
+ [
+ # Text guided image variation with an image mask
+ "prompt",
+ "image",
+ "mask_image",
+ "height",
+ "width",
+ "guidance_scale",
+ "negative_prompt",
+ "prompt_embeds",
+ "negative_prompt_embeds",
+ ]
+)
+
+TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS = frozenset(["prompt", "image", "mask_image", "negative_prompt"])
+
+IMAGE_INPAINTING_PARAMS = frozenset(
+ [
+ # image variation with an image mask
+ "image",
+ "mask_image",
+ "height",
+ "width",
+ "guidance_scale",
+ ]
+)
+
+IMAGE_INPAINTING_BATCH_PARAMS = frozenset(["image", "mask_image"])
+
+IMAGE_GUIDED_IMAGE_INPAINTING_PARAMS = frozenset(
+ [
+ "example_image",
+ "image",
+ "mask_image",
+ "height",
+ "width",
+ "guidance_scale",
+ ]
+)
+
+IMAGE_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS = frozenset(["example_image", "image", "mask_image"])
+
+CLASS_CONDITIONED_IMAGE_GENERATION_PARAMS = frozenset(["class_labels"])
+
+CLASS_CONDITIONED_IMAGE_GENERATION_BATCH_PARAMS = frozenset(["class_labels"])
+
+UNCONDITIONAL_IMAGE_GENERATION_PARAMS = frozenset(["batch_size"])
+
+UNCONDITIONAL_IMAGE_GENERATION_BATCH_PARAMS = frozenset([])
+
+UNCONDITIONAL_AUDIO_GENERATION_PARAMS = frozenset(["batch_size"])
+
+UNCONDITIONAL_AUDIO_GENERATION_BATCH_PARAMS = frozenset([])
+
+TEXT_TO_AUDIO_PARAMS = frozenset(
+ [
+ "prompt",
+ "audio_length_in_s",
+ "guidance_scale",
+ "negative_prompt",
+ "prompt_embeds",
+ "negative_prompt_embeds",
+ "cross_attention_kwargs",
+ ]
+)
+
+TEXT_TO_AUDIO_BATCH_PARAMS = frozenset(["prompt", "negative_prompt"])
+TOKENS_TO_AUDIO_GENERATION_PARAMS = frozenset(["input_tokens"])
+
+TOKENS_TO_AUDIO_GENERATION_BATCH_PARAMS = frozenset(["input_tokens"])
+
+TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS = frozenset(["prompt_embeds"])
+
+VIDEO_TO_VIDEO_BATCH_PARAMS = frozenset(["prompt", "negative_prompt", "video"])
diff --git a/tests/pipelines/pixart_alpha/__init__.py b/tests/pipelines/pixart_alpha/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/pixart_alpha/test_pixart.py b/tests/pipelines/pixart_alpha/test_pixart.py
new file mode 100644
index 0000000..3d6db5c
--- /dev/null
+++ b/tests/pipelines/pixart_alpha/test_pixart.py
@@ -0,0 +1,437 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from transformers import AutoTokenizer, T5EncoderModel
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ PixArtAlphaPipeline,
+ Transformer2DModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin, to_np
+
+
+enable_full_determinism()
+
+
+class PixArtAlphaPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = PixArtAlphaPipeline
+ params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs"}
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ required_optional_params = PipelineTesterMixin.required_optional_params
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ transformer = Transformer2DModel(
+ sample_size=8,
+ num_layers=2,
+ patch_size=2,
+ attention_head_dim=8,
+ num_attention_heads=3,
+ caption_channels=32,
+ in_channels=4,
+ cross_attention_dim=24,
+ out_channels=8,
+ attention_bias=True,
+ activation_fn="gelu-approximate",
+ num_embeds_ada_norm=1000,
+ norm_type="ada_norm_single",
+ norm_elementwise_affine=False,
+ norm_eps=1e-6,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL()
+
+ scheduler = DDIMScheduler()
+ text_encoder = T5EncoderModel.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-t5")
+
+ components = {
+ "transformer": transformer.eval(),
+ "vae": vae.eval(),
+ "scheduler": scheduler,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 5.0,
+ "use_resolution_binning": False,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_sequential_cpu_offload_forward_pass(self):
+ # TODO(PVP, Sayak) need to fix later
+ return
+
+ def test_save_load_optional_components(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ prompt = inputs["prompt"]
+ generator = inputs["generator"]
+ num_inference_steps = inputs["num_inference_steps"]
+ output_type = inputs["output_type"]
+
+ (
+ prompt_embeds,
+ prompt_attention_mask,
+ negative_prompt_embeds,
+ negative_prompt_attention_mask,
+ ) = pipe.encode_prompt(prompt)
+
+ # inputs with prompt converted to embeddings
+ inputs = {
+ "prompt_embeds": prompt_embeds,
+ "prompt_attention_mask": prompt_attention_mask,
+ "negative_prompt": None,
+ "negative_prompt_embeds": negative_prompt_embeds,
+ "negative_prompt_attention_mask": negative_prompt_attention_mask,
+ "generator": generator,
+ "num_inference_steps": num_inference_steps,
+ "output_type": output_type,
+ "use_resolution_binning": False,
+ }
+
+ # set all optional components to None
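+ # (the prompt is pre-encoded above, so text_encoder/tokenizer are not needed and must round-trip as None)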
+ for optional_component in pipe._optional_components:
+ setattr(pipe, optional_component, None)
+
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for optional_component in pipe._optional_components:
+ self.assertTrue(
+ getattr(pipe_loaded, optional_component) is None,
+ f"`{optional_component}` did not stay set to None after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ generator = inputs["generator"]
+ num_inference_steps = inputs["num_inference_steps"]
+ output_type = inputs["output_type"]
+
+ # inputs with prompt converted to embeddings
+ inputs = {
+ "prompt_embeds": prompt_embeds,
+ "prompt_attention_mask": prompt_attention_mask,
+ "negative_prompt": None,
+ "negative_prompt_embeds": negative_prompt_embeds,
+ "negative_prompt_attention_mask": negative_prompt_attention_mask,
+ "generator": generator,
+ "num_inference_steps": num_inference_steps,
+ "output_type": output_type,
+ "use_resolution_binning": False,
+ }
+
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(max_diff, 1e-4)
+
+ def test_inference(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = pipe(**inputs).images
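+ # check the bottom-right 3x3 patch of the last channel against reference values recorded for this seed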
+ image_slice = image[0, -3:, -3:, -1]
+
+ self.assertEqual(image.shape, (1, 8, 8, 3))
+ expected_slice = np.array([0.6319, 0.3526, 0.3806, 0.6327, 0.4639, 0.483, 0.2583, 0.5331, 0.4852])
+ max_diff = np.abs(image_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+
+ def test_inference_non_square_images(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = pipe(**inputs, height=32, width=48).images
+ image_slice = image[0, -3:, -3:, -1]
+ self.assertEqual(image.shape, (1, 32, 48, 3))
+
+ expected_slice = np.array([0.6493, 0.537, 0.4081, 0.4762, 0.3695, 0.4711, 0.3026, 0.5218, 0.5263])
+ max_diff = np.abs(image_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+
+ def test_inference_with_embeddings_and_multiple_images(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ prompt = inputs["prompt"]
+ generator = inputs["generator"]
+ num_inference_steps = inputs["num_inference_steps"]
+ output_type = inputs["output_type"]
+
+ prompt_embeds, prompt_attn_mask, negative_prompt_embeds, neg_prompt_attn_mask = pipe.encode_prompt(prompt)
+
+ # inputs with prompt converted to embeddings
+ inputs = {
+ "prompt_embeds": prompt_embeds,
+ "prompt_attention_mask": prompt_attn_mask,
+ "negative_prompt": None,
+ "negative_prompt_embeds": negative_prompt_embeds,
+ "negative_prompt_attention_mask": neg_prompt_attn_mask,
+ "generator": generator,
+ "num_inference_steps": num_inference_steps,
+ "output_type": output_type,
+ "num_images_per_prompt": 2,
+ "use_resolution_binning": False,
+ }
+
+ # set all optional components to None
+ for optional_component in pipe._optional_components:
+ setattr(pipe, optional_component, None)
+
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for optional_component in pipe._optional_components:
+ self.assertTrue(
+ getattr(pipe_loaded, optional_component) is None,
+ f"`{optional_component}` did not stay set to None after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ generator = inputs["generator"]
+ num_inference_steps = inputs["num_inference_steps"]
+ output_type = inputs["output_type"]
+
+ # inputs with prompt converted to embeddings
+ inputs = {
+ "prompt_embeds": prompt_embeds,
+ "prompt_attention_mask": prompt_attn_mask,
+ "negative_prompt": None,
+ "negative_prompt_embeds": negative_prompt_embeds,
+ "negative_prompt_attention_mask": neg_prompt_attn_mask,
+ "generator": generator,
+ "num_inference_steps": num_inference_steps,
+ "output_type": output_type,
+ "num_images_per_prompt": 2,
+ "use_resolution_binning": False,
+ }
+
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(max_diff, 1e-4)
+
+ def test_inference_with_multiple_images_per_prompt(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["num_images_per_prompt"] = 2
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ self.assertEqual(image.shape, (2, 8, 8, 3))
+ expected_slice = np.array([0.6319, 0.3526, 0.3806, 0.6327, 0.4639, 0.483, 0.2583, 0.5331, 0.4852])
+ max_diff = np.abs(image_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+
+ def test_raises_warning_for_mask_feature(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs.update({"mask_feature": True})
+
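+ # passing the deprecated `mask_feature` argument should only surface a FutureWarning, not an error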
+ with self.assertWarns(FutureWarning) as warning_ctx:
+ _ = pipe(**inputs).images
+
+ assert "mask_feature" in str(warning_ctx.warning)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=1e-3)
+
+
+@slow
+@require_torch_gpu
+class PixArtAlphaPipelineIntegrationTests(unittest.TestCase):
+ ckpt_id_1024 = "PixArt-alpha/PixArt-XL-2-1024-MS"
+ ckpt_id_512 = "PixArt-alpha/PixArt-XL-2-512x512"
+ prompt = "A small cactus with a happy face in the Sahara desert."
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_pixart_1024(self):
+ generator = torch.Generator("cpu").manual_seed(0)
+
+ pipe = PixArtAlphaPipeline.from_pretrained(self.ckpt_id_1024, torch_dtype=torch.float16)
+ pipe.enable_model_cpu_offload()
+ prompt = self.prompt
+
+ image = pipe(prompt, generator=generator, num_inference_steps=2, output_type="np").images
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.0742, 0.0835, 0.2114, 0.0295, 0.0784, 0.2361, 0.1738, 0.2251, 0.3589])
+
+ max_diff = numpy_cosine_similarity_distance(image_slice.flatten(), expected_slice)
+ self.assertLessEqual(max_diff, 1e-4)
+
+ def test_pixart_512(self):
+ generator = torch.Generator("cpu").manual_seed(0)
+
+ pipe = PixArtAlphaPipeline.from_pretrained(self.ckpt_id_512, torch_dtype=torch.float16)
+ pipe.enable_model_cpu_offload()
+
+ prompt = self.prompt
+
+ image = pipe(prompt, generator=generator, num_inference_steps=2, output_type="np").images
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.3477, 0.3882, 0.4541, 0.3413, 0.3821, 0.4463, 0.4001, 0.4409, 0.4958])
+
+ max_diff = numpy_cosine_similarity_distance(image_slice.flatten(), expected_slice)
+ self.assertLessEqual(max_diff, 1e-4)
+
+ def test_pixart_1024_without_resolution_binning(self):
+ generator = torch.manual_seed(0)
+
+ pipe = PixArtAlphaPipeline.from_pretrained(self.ckpt_id_1024, torch_dtype=torch.float16)
+ pipe.enable_model_cpu_offload()
+
+ prompt = self.prompt
+ height, width = 1024, 768
+ num_inference_steps = 2
+
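+ # resolution binning maps the requested size to the nearest supported aspect-ratio bin before
+ # generation, so the binned and un-binned outputs are expected to differ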
+ image = pipe(
+ prompt,
+ height=height,
+ width=width,
+ generator=generator,
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ ).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ generator = torch.manual_seed(0)
+ no_res_bin_image = pipe(
+ prompt,
+ height=height,
+ width=width,
+ generator=generator,
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ use_resolution_binning=False,
+ ).images
+ no_res_bin_image_slice = no_res_bin_image[0, -3:, -3:, -1]
+
+ assert not np.allclose(image_slice, no_res_bin_image_slice, atol=1e-4, rtol=1e-4)
+
+ def test_pixart_512_without_resolution_binning(self):
+ generator = torch.manual_seed(0)
+
+ pipe = PixArtAlphaPipeline.from_pretrained(self.ckpt_id_512, torch_dtype=torch.float16)
+ pipe.enable_model_cpu_offload()
+
+ prompt = self.prompt
+ height, width = 512, 768
+ num_inference_steps = 2
+
+ image = pipe(
+ prompt,
+ height=height,
+ width=width,
+ generator=generator,
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ ).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ generator = torch.manual_seed(0)
+ no_res_bin_image = pipe(
+ prompt,
+ height=height,
+ width=width,
+ generator=generator,
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ use_resolution_binning=False,
+ ).images
+ no_res_bin_image_slice = no_res_bin_image[0, -3:, -3:, -1]
+
+ assert not np.allclose(image_slice, no_res_bin_image_slice, atol=1e-4, rtol=1e-4)
diff --git a/tests/pipelines/pndm/__init__.py b/tests/pipelines/pndm/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/pndm/test_pndm.py b/tests/pipelines/pndm/test_pndm.py
new file mode 100644
index 0000000..d4cb6a5
--- /dev/null
+++ b/tests/pipelines/pndm/test_pndm.py
@@ -0,0 +1,87 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import PNDMPipeline, PNDMScheduler, UNet2DModel
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, require_torch, torch_device
+
+
+enable_full_determinism()
+
+
+class PNDMPipelineFastTests(unittest.TestCase):
+ @property
+ def dummy_uncond_unet(self):
+ torch.manual_seed(0)
+ model = UNet2DModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=3,
+ out_channels=3,
+ down_block_types=("DownBlock2D", "AttnDownBlock2D"),
+ up_block_types=("AttnUpBlock2D", "UpBlock2D"),
+ )
+ return model
+
+ def test_inference(self):
+ unet = self.dummy_uncond_unet
+ scheduler = PNDMScheduler()
+
+ pndm = PNDMPipeline(unet=unet, scheduler=scheduler)
+ pndm.to(torch_device)
+ pndm.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(0)
+ image = pndm(generator=generator, num_inference_steps=20, output_type="numpy").images
+
+ generator = torch.manual_seed(0)
+ image_from_tuple = pndm(generator=generator, num_inference_steps=20, output_type="numpy", return_dict=False)[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+
+@nightly
+@require_torch
+class PNDMPipelineIntegrationTests(unittest.TestCase):
+ def test_inference_cifar10(self):
+ model_id = "google/ddpm-cifar10-32"
+
+ unet = UNet2DModel.from_pretrained(model_id)
+ scheduler = PNDMScheduler()
+
+ pndm = PNDMPipeline(unet=unet, scheduler=scheduler)
+ pndm.to(torch_device)
+ pndm.set_progress_bar_config(disable=None)
+ generator = torch.manual_seed(0)
+ image = pndm(generator=generator, output_type="numpy").images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.1564, 0.14645, 0.1406, 0.14715, 0.12425, 0.14045, 0.13115, 0.12175, 0.125])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
diff --git a/tests/pipelines/semantic_stable_diffusion/__init__.py b/tests/pipelines/semantic_stable_diffusion/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/semantic_stable_diffusion/test_semantic_diffusion.py b/tests/pipelines/semantic_stable_diffusion/test_semantic_diffusion.py
new file mode 100644
index 0000000..0c9bb63
--- /dev/null
+++ b/tests/pipelines/semantic_stable_diffusion/test_semantic_diffusion.py
@@ -0,0 +1,606 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import AutoencoderKL, DDIMScheduler, LMSDiscreteScheduler, PNDMScheduler, UNet2DConditionModel
+from diffusers.pipelines.semantic_stable_diffusion import SemanticStableDiffusionPipeline as StableDiffusionPipeline
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ nightly,
+ require_torch_gpu,
+ torch_device,
+)
+
+
+enable_full_determinism()
+
+
+class SafeDiffusionPipelineFastTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ @property
+ def dummy_image(self):
+ batch_size = 1
+ num_channels = 3
+ sizes = (32, 32)
+
+ image = floats_tensor((batch_size, num_channels) + sizes, rng=random.Random(0)).to(torch_device)
+ return image
+
+ @property
+ def dummy_cond_unet(self):
+ torch.manual_seed(0)
+ model = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ return model
+
+ @property
+ def dummy_vae(self):
+ torch.manual_seed(0)
+ model = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ return model
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModel(config)
+
+ @property
+ def dummy_extractor(self):
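+ # minimal stand-in for the CLIP feature extractor: exposes empty pixel_values and a chainable .to()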
+ def extract(*args, **kwargs):
+ class Out:
+ def __init__(self):
+ self.pixel_values = torch.ones([0])
+
+ def to(self, device):
+ self.pixel_values = self.pixel_values.to(device)
+ return self
+
+ return Out()
+
+ return extract
+
+ def test_semantic_diffusion_ddim(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unet = self.dummy_cond_unet
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ # assemble the pipeline around the DDIM scheduler configured above
+ sd_pipe = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=6.0, num_inference_steps=2, output_type="np")
+ image = output.images
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_from_tuple = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=6.0,
+ num_inference_steps=2,
+ output_type="np",
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5753, 0.6114, 0.5001, 0.5034, 0.5470, 0.4729, 0.4971, 0.4867, 0.4867])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_semantic_diffusion_pndm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unet = self.dummy_cond_unet
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ # make sure here that pndm scheduler skips prk
+ sd_pipe = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=6.0, num_inference_steps=2, output_type="np")
+
+ image = output.images
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_from_tuple = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=6.0,
+ num_inference_steps=2,
+ output_type="np",
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5122, 0.5712, 0.4825, 0.5053, 0.5646, 0.4769, 0.5179, 0.4894, 0.4994])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_semantic_diffusion_no_safety_checker(self):
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-lms-pipe", safety_checker=None
+ )
+ assert isinstance(pipe, StableDiffusionPipeline)
+ assert isinstance(pipe.scheduler, LMSDiscreteScheduler)
+ assert pipe.safety_checker is None
+
+ image = pipe("example prompt", num_inference_steps=2).images[0]
+ assert image is not None
+
+ # check that there's no error when saving a pipeline with one of the models being None
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname)
+ pipe = StableDiffusionPipeline.from_pretrained(tmpdirname)
+
+ # sanity check that the pipeline still works
+ assert pipe.safety_checker is None
+ image = pipe("example prompt", num_inference_steps=2).images[0]
+ assert image is not None
+
+ @unittest.skipIf(torch_device != "cuda", "This test requires a GPU")
+ def test_semantic_diffusion_fp16(self):
+ """Test that stable diffusion works with fp16"""
+ unet = self.dummy_cond_unet
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ # put models in fp16
+ unet = unet.half()
+ vae = vae.half()
+ bert = bert.half()
+
+ # make sure here that pndm scheduler skips prk
+ sd_pipe = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ image = sd_pipe([prompt], num_inference_steps=2, output_type="np").images
+
+ assert image.shape == (1, 64, 64, 3)
+
+
+@nightly
+@require_torch_gpu
+class SemanticDiffusionPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_positive_guidance(self):
+ torch_device = "cuda"
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "a photo of a cat"
+ edit = {
+ "editing_prompt": ["sunglasses"],
+ "reverse_editing_direction": [False],
+ "edit_warmup_steps": 10,
+ "edit_guidance_scale": 6,
+ "edit_threshold": 0.95,
+ "edit_momentum_scale": 0.5,
+ "edit_mom_beta": 0.6,
+ }
+
+ seed = 3
+ guidance_scale = 7
+
+ # no sega enabled
+ generator = torch.Generator(torch_device)
+ generator.manual_seed(seed)
+ output = pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [
+ 0.34673113,
+ 0.38492733,
+ 0.37597352,
+ 0.34086335,
+ 0.35650748,
+ 0.35579205,
+ 0.3384763,
+ 0.34340236,
+ 0.3573271,
+ ]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ # with sega enabled
+ generator.manual_seed(seed)
+ output = pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ **edit,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [
+ 0.41887826,
+ 0.37728766,
+ 0.30138272,
+ 0.41416335,
+ 0.41664985,
+ 0.36283392,
+ 0.36191246,
+ 0.43364465,
+ 0.43001732,
+ ]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_negative_guidance(self):
+ torch_device = "cuda"
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "an image of a crowded boulevard, realistic, 4k"
+ edit = {
+ "editing_prompt": "crowd, crowded, people",
+ "reverse_editing_direction": True,
+ "edit_warmup_steps": 10,
+ "edit_guidance_scale": 8.3,
+ "edit_threshold": 0.9,
+ "edit_momentum_scale": 0.5,
+ "edit_mom_beta": 0.6,
+ }
+
+ seed = 9
+ guidance_scale = 7
+
+ # no sega enabled
+ generator = torch.Generator(torch_device)
+ generator.manual_seed(seed)
+ output = pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [
+ 0.43497998,
+ 0.91814065,
+ 0.7540739,
+ 0.55580205,
+ 0.8467265,
+ 0.5389691,
+ 0.62574506,
+ 0.58897763,
+ 0.50926757,
+ ]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ # with sega enabled
+ generator.manual_seed(seed)
+ output = pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ **edit,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [
+ 0.3089719,
+ 0.30500144,
+ 0.29016042,
+ 0.30630964,
+ 0.325687,
+ 0.29419225,
+ 0.2908091,
+ 0.28723598,
+ 0.27696294,
+ ]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_multi_cond_guidance(self):
+ torch_device = "cuda"
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "a castle next to a river"
+ edit = {
+ "editing_prompt": ["boat on a river, boat", "monet, impression, sunrise"],
+ "reverse_editing_direction": False,
+ "edit_warmup_steps": [15, 18],
+ "edit_guidance_scale": 6,
+ "edit_threshold": [0.9, 0.8],
+ "edit_momentum_scale": 0.5,
+ "edit_mom_beta": 0.6,
+ }
+
+ seed = 48
+ guidance_scale = 7
+
+ # no sega enabled
+ generator = torch.Generator(torch_device)
+ generator.manual_seed(seed)
+ output = pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [
+ 0.75163555,
+ 0.76037145,
+ 0.61785,
+ 0.9189673,
+ 0.8627701,
+ 0.85189694,
+ 0.8512813,
+ 0.87012076,
+ 0.8312857,
+ ]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ # with sega enabled
+ generator.manual_seed(seed)
+ output = pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ **edit,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [
+ 0.73553365,
+ 0.7537271,
+ 0.74341905,
+ 0.66480356,
+ 0.6472925,
+ 0.63039416,
+ 0.64812905,
+ 0.6749717,
+ 0.6517102,
+ ]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_guidance_fp16(self):
+ torch_device = "cuda"
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "a photo of a cat"
+ edit = {
+ "editing_prompt": ["sunglasses"],
+ "reverse_editing_direction": [False],
+ "edit_warmup_steps": 10,
+ "edit_guidance_scale": 6,
+ "edit_threshold": 0.95,
+ "edit_momentum_scale": 0.5,
+ "edit_mom_beta": 0.6,
+ }
+
+ seed = 3
+ guidance_scale = 7
+
+ # no sega enabled
+ generator = torch.Generator(torch_device)
+ generator.manual_seed(seed)
+ output = pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [
+ 0.34887695,
+ 0.3876953,
+ 0.375,
+ 0.34423828,
+ 0.3581543,
+ 0.35717773,
+ 0.3383789,
+ 0.34570312,
+ 0.359375,
+ ]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ # with sega enabled
+ generator.manual_seed(seed)
+ output = pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ **edit,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [
+ 0.42285156,
+ 0.36914062,
+ 0.29077148,
+ 0.42041016,
+ 0.41918945,
+ 0.35498047,
+ 0.3618164,
+ 0.4423828,
+ 0.43115234,
+ ]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
diff --git a/tests/pipelines/shap_e/__init__.py b/tests/pipelines/shap_e/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/shap_e/test_shap_e.py b/tests/pipelines/shap_e/test_shap_e.py
new file mode 100644
index 0000000..11595fe
--- /dev/null
+++ b/tests/pipelines/shap_e/test_shap_e.py
@@ -0,0 +1,255 @@
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import HeunDiscreteScheduler, PriorTransformer, ShapEPipeline
+from diffusers.pipelines.shap_e import ShapERenderer
+from diffusers.utils.testing_utils import load_numpy, nightly, require_torch_gpu, torch_device
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+class ShapEPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = ShapEPipeline
+ params = ["prompt"]
+ batch_params = ["prompt"]
+ required_optional_params = [
+ "num_images_per_prompt",
+ "num_inference_steps",
+ "generator",
+ "latents",
+ "guidance_scale",
+ "frame_size",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 16
+
+ @property
+ def time_input_dim(self):
+ return 16
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def renderer_dim(self):
+ return 8
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=self.text_embedder_hidden_size,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModelWithProjection(config)
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "num_attention_heads": 2,
+ "attention_head_dim": 16,
+ "embedding_dim": self.time_input_dim,
+ "num_embeddings": 32,
+ "embedding_proj_dim": self.text_embedder_hidden_size,
+ "time_embed_dim": self.time_embed_dim,
+ "num_layers": 1,
+ "clip_embed_dim": self.time_input_dim * 2,
+ "additional_embeddings": 0,
+ "time_embed_act_fn": "gelu",
+ "norm_in_type": "layer",
+ "encoder_hid_proj_type": None,
+ "added_emb_type": None,
+ }
+
+ model = PriorTransformer(**model_kwargs)
+ return model
+
+ @property
+ def dummy_renderer(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "param_shapes": (
+ (self.renderer_dim, 93),
+ (self.renderer_dim, 8),
+ (self.renderer_dim, 8),
+ (self.renderer_dim, 8),
+ ),
+ "d_latent": self.time_input_dim,
+ "d_hidden": self.renderer_dim,
+ "n_output": 12,
+ "background": (
+ 0.1,
+ 0.1,
+ 0.1,
+ ),
+ }
+ model = ShapERenderer(**model_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ shap_e_renderer = self.dummy_renderer
+
+ scheduler = HeunDiscreteScheduler(
+ beta_schedule="exp",
+ num_train_timesteps=1024,
+ prediction_type="sample",
+ use_karras_sigmas=True,
+ clip_sample=True,
+ clip_sample_range=1.0,
+ )
+ components = {
+ "prior": prior,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "shap_e_renderer": shap_e_renderer,
+ "scheduler": scheduler,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "generator": generator,
+ "num_inference_steps": 1,
+ "frame_size": 32,
+ "output_type": "latent",
+ }
+ return inputs
+
+ def test_shap_e(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
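+ # with output_type="latent" the pipeline returns raw latents (a 32x16 tensor for this tiny prior)
+ # instead of rendered frames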
+ image = output.images[0]
+ image = image.cpu().numpy()
+ image_slice = image[-3:, -3:]
+
+ assert image.shape == (32, 16)
+
+ expected_slice = np.array([-1.0000, -0.6241, 1.0000, -0.8978, -0.6866, 0.7876, -0.7473, -0.2874, 0.6103])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_inference_batch_consistent(self):
+ # NOTE: Larger batch sizes cause this test to timeout, only test on smaller batches
+ self._test_inference_batch_consistent(batch_sizes=[1, 2])
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(batch_size=2, expected_max_diff=6e-3)
+
+ def test_num_images_per_prompt(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ batch_size = 1
+ num_images_per_prompt = 2
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ for key in inputs.keys():
+ if key in self.batch_params:
+ inputs[key] = batch_size * [inputs[key]]
+
+ images = pipe(**inputs, num_images_per_prompt=num_images_per_prompt)[0]
+
+ assert images.shape[0] == batch_size * num_images_per_prompt
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=5e-3)
+
+ @unittest.skip("Key error is raised with accelerate")
+ def test_sequential_cpu_offload_forward_pass(self):
+ pass
+
+
+@nightly
+@require_torch_gpu
+class ShapEPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_shap_e(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/shap_e/test_shap_e_np_out.npy"
+ )
+ pipe = ShapEPipeline.from_pretrained("openai/shap-e")
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+
+ images = pipe(
+ "a shark",
+ generator=generator,
+ guidance_scale=15.0,
+ num_inference_steps=64,
+ frame_size=64,
+ output_type="np",
+ ).images[0]
+
+ assert images.shape == (20, 64, 64, 3)
+
+ assert_mean_pixel_difference(images, expected_image)
diff --git a/tests/pipelines/shap_e/test_shap_e_img2img.py b/tests/pipelines/shap_e/test_shap_e_img2img.py
new file mode 100644
index 0000000..c666b01
--- /dev/null
+++ b/tests/pipelines/shap_e/test_shap_e_img2img.py
@@ -0,0 +1,284 @@
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPImageProcessor, CLIPVisionConfig, CLIPVisionModel
+
+from diffusers import HeunDiscreteScheduler, PriorTransformer, ShapEImg2ImgPipeline
+from diffusers.pipelines.shap_e import ShapERenderer
+from diffusers.utils.testing_utils import (
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+class ShapEImg2ImgPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = ShapEImg2ImgPipeline
+ params = ["image"]
+ batch_params = ["image"]
+ required_optional_params = [
+ "num_images_per_prompt",
+ "num_inference_steps",
+ "generator",
+ "latents",
+ "guidance_scale",
+ "frame_size",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 16
+
+ @property
+ def time_input_dim(self):
+ return 16
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def renderer_dim(self):
+ return 8
+
+ @property
+ def dummy_image_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPVisionConfig(
+ hidden_size=self.text_embedder_hidden_size,
+ image_size=32,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=24,
+ num_attention_heads=2,
+ num_channels=3,
+ num_hidden_layers=5,
+ patch_size=1,
+ )
+
+ model = CLIPVisionModel(config)
+ return model
+
+ @property
+ def dummy_image_processor(self):
+ image_processor = CLIPImageProcessor(
+ crop_size=224,
+ do_center_crop=True,
+ do_normalize=True,
+ do_resize=True,
+ image_mean=[0.48145466, 0.4578275, 0.40821073],
+ image_std=[0.26862954, 0.26130258, 0.27577711],
+ resample=3,
+ size=224,
+ )
+
+ return image_processor
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "num_attention_heads": 2,
+ "attention_head_dim": 16,
+ "embedding_dim": self.time_input_dim,
+ "num_embeddings": 32,
+ "embedding_proj_dim": self.text_embedder_hidden_size,
+ "time_embed_dim": self.time_embed_dim,
+ "num_layers": 1,
+ "clip_embed_dim": self.time_input_dim * 2,
+ "additional_embeddings": 0,
+ "time_embed_act_fn": "gelu",
+ "norm_in_type": "layer",
+ "embedding_proj_norm_type": "layer",
+ "encoder_hid_proj_type": None,
+ "added_emb_type": None,
+ }
+
+ model = PriorTransformer(**model_kwargs)
+ return model
+
+ @property
+ def dummy_renderer(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "param_shapes": (
+ (self.renderer_dim, 93),
+ (self.renderer_dim, 8),
+ (self.renderer_dim, 8),
+ (self.renderer_dim, 8),
+ ),
+ "d_latent": self.time_input_dim,
+ "d_hidden": self.renderer_dim,
+ "n_output": 12,
+ "background": (
+ 0.1,
+ 0.1,
+ 0.1,
+ ),
+ }
+ model = ShapERenderer(**model_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+ image_encoder = self.dummy_image_encoder
+ image_processor = self.dummy_image_processor
+ shap_e_renderer = self.dummy_renderer
+
+ scheduler = HeunDiscreteScheduler(
+ beta_schedule="exp",
+ num_train_timesteps=1024,
+ prediction_type="sample",
+ use_karras_sigmas=True,
+ clip_sample=True,
+ clip_sample_range=1.0,
+ )
+ components = {
+ "prior": prior,
+ "image_encoder": image_encoder,
+ "image_processor": image_processor,
+ "shap_e_renderer": shap_e_renderer,
+ "scheduler": scheduler,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ input_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image": input_image,
+ "generator": generator,
+ "num_inference_steps": 1,
+ "frame_size": 32,
+ "output_type": "latent",
+ }
+ return inputs
+
+ def test_shap_e(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images[0]
+ image_slice = image[-3:, -3:].cpu().numpy()
+
+ assert image.shape == (32, 16)
+
+ expected_slice = np.array(
+ [-1.0, 0.40668195, 0.57322013, -0.9469888, 0.4283227, 0.30348337, -0.81094897, 0.74555075, 0.15342723]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_inference_batch_consistent(self):
+ # NOTE: Larger batch sizes cause this test to timeout, only test on smaller batches
+ self._test_inference_batch_consistent(batch_sizes=[2])
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(
+ batch_size=2,
+ expected_max_diff=6e-3,
+ )
+
+ def test_num_images_per_prompt(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ batch_size = 1
+ num_images_per_prompt = 2
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ for key in inputs.keys():
+ if key in self.batch_params:
+ inputs[key] = batch_size * [inputs[key]]
+
+ images = pipe(**inputs, num_images_per_prompt=num_images_per_prompt)[0]
+
+ assert images.shape[0] == batch_size * num_images_per_prompt
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=1e-1)
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=5e-3)
+
+ @unittest.skip("Key error is raised with accelerate")
+ def test_sequential_cpu_offload_forward_pass(self):
+ pass
+
+
+@nightly
+@require_torch_gpu
+class ShapEImg2ImgPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_shap_e_img2img(self):
+ input_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/shap_e/corgi.png"
+ )
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/shap_e/test_shap_e_img2img_out.npy"
+ )
+ pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img")
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+
+ images = pipe(
+ input_image,
+ generator=generator,
+ guidance_scale=3.0,
+ num_inference_steps=64,
+ frame_size=64,
+ output_type="np",
+ ).images[0]
+
+ assert images.shape == (20, 64, 64, 3)
+
+ assert_mean_pixel_difference(images, expected_image)
diff --git a/tests/pipelines/stable_cascade/__init__.py b/tests/pipelines/stable_cascade/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_cascade/test_stable_cascade_combined.py b/tests/pipelines/stable_cascade/test_stable_cascade_combined.py
new file mode 100644
index 0000000..e717c77
--- /dev/null
+++ b/tests/pipelines/stable_cascade/test_stable_cascade_combined.py
@@ -0,0 +1,246 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import DDPMWuerstchenScheduler, StableCascadeCombinedPipeline
+from diffusers.models import StableCascadeUNet
+from diffusers.pipelines.wuerstchen import PaellaVQModel
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, torch_device
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class StableCascadeCombinedPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = StableCascadeCombinedPipeline
+ params = ["prompt"]
+ batch_params = ["prompt", "negative_prompt"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "prior_guidance_scale",
+ "decoder_guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "prior_num_inference_steps",
+ "output_type",
+ ]
+ test_xformers_attention = True
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "conditioning_dim": 128,
+ "block_out_channels": (128, 128),
+ "num_attention_heads": (2, 2),
+ "down_num_layers_per_block": (1, 1),
+ "up_num_layers_per_block": (1, 1),
+ "clip_image_in_channels": 768,
+ "switch_level": (False,),
+ "clip_text_in_channels": self.text_embedder_hidden_size,
+ "clip_text_pooled_in_channels": self.text_embedder_hidden_size,
+ }
+
+ model = StableCascadeUNet(**model_kwargs)
+ return model.eval()
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ projection_dim=self.text_embedder_hidden_size,
+ hidden_size=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModelWithProjection(config).eval()
+
+ @property
+ def dummy_vqgan(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "bottleneck_blocks": 1,
+ "num_vq_embeddings": 2,
+ }
+ model = PaellaVQModel(**model_kwargs)
+ return model.eval()
+
+ @property
+ def dummy_decoder(self):
+ torch.manual_seed(0)
+ model_kwargs = {
+ "in_channels": 4,
+ "out_channels": 4,
+ "conditioning_dim": 128,
+ "block_out_channels": (16, 32, 64, 128),
+ "num_attention_heads": (-1, -1, 1, 2),
+ "down_num_layers_per_block": (1, 1, 1, 1),
+ "up_num_layers_per_block": (1, 1, 1, 1),
+ "down_blocks_repeat_mappers": (1, 1, 1, 1),
+ "up_blocks_repeat_mappers": (3, 3, 2, 2),
+ "block_types_per_layer": (
+ ("SDCascadeResBlock", "SDCascadeTimestepBlock"),
+ ("SDCascadeResBlock", "SDCascadeTimestepBlock"),
+ ("SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"),
+ ("SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"),
+ ),
+ "switch_level": None,
+ "clip_text_pooled_in_channels": 32,
+ "dropout": (0.1, 0.1, 0.1, 0.1),
+ }
+
+ model = StableCascadeUNet(**model_kwargs)
+ return model.eval()
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+
+ scheduler = DDPMWuerstchenScheduler()
+ tokenizer = self.dummy_tokenizer
+ text_encoder = self.dummy_text_encoder
+ decoder = self.dummy_decoder
+ vqgan = self.dummy_vqgan
+ prior_text_encoder = self.dummy_text_encoder
+ prior_tokenizer = self.dummy_tokenizer
+
+ components = {
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "decoder": decoder,
+ "scheduler": scheduler,
+ "vqgan": vqgan,
+ "prior_text_encoder": prior_text_encoder,
+ "prior_tokenizer": prior_tokenizer,
+ "prior_prior": prior,
+ "prior_scheduler": scheduler,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "generator": generator,
+ "prior_guidance_scale": 4.0,
+ "decoder_guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "prior_num_inference_steps": 2,
+ "output_type": "np",
+ "height": 128,
+ "width": 128,
+ }
+ return inputs
+
+ def test_stable_cascade(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(**self.get_dummy_inputs(device), return_dict=False)[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[-3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+
+ expected_slice = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0])
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
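+ # run the same dummy pipeline on the device, with sequential CPU offload, and with model CPU
+ # offload; all three variants should produce nearly identical images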
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=2e-2)
+
+ @unittest.skip(reason="fp16 not supported")
+ def test_float16_inference(self):
+ super().test_float16_inference()
+
+ @unittest.skip(reason="no callback test for combined pipeline")
+ def test_callback_inputs(self):
+ super().test_callback_inputs()
+
+ # def test_callback_cfg(self):
+ # pass
diff --git a/tests/pipelines/stable_cascade/test_stable_cascade_decoder.py b/tests/pipelines/stable_cascade/test_stable_cascade_decoder.py
new file mode 100644
index 0000000..7656744
--- /dev/null
+++ b/tests/pipelines/stable_cascade/test_stable_cascade_decoder.py
@@ -0,0 +1,249 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import DDPMWuerstchenScheduler, StableCascadeDecoderPipeline
+from diffusers.image_processor import VaeImageProcessor
+from diffusers.models import StableCascadeUNet
+from diffusers.pipelines.wuerstchen import PaellaVQModel
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_image,
+ load_pt,
+ require_torch_gpu,
+ skip_mps,
+ slow,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class StableCascadeDecoderPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = StableCascadeDecoderPipeline
+ params = ["prompt"]
+ batch_params = ["image_embeddings", "prompt", "negative_prompt"]
+ required_optional_params = [
+ "num_images_per_prompt",
+ "num_inference_steps",
+ "latents",
+ "negative_prompt",
+ "guidance_scale",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+ callback_cfg_params = ["image_embeddings", "text_encoder_hidden_states"]
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ projection_dim=self.text_embedder_hidden_size,
+ hidden_size=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModelWithProjection(config).eval()
+
+ @property
+ def dummy_vqgan(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "bottleneck_blocks": 1,
+ "num_vq_embeddings": 2,
+ }
+ model = PaellaVQModel(**model_kwargs)
+ return model.eval()
+
+ @property
+ def dummy_decoder(self):
+ torch.manual_seed(0)
+ model_kwargs = {
+ "in_channels": 4,
+ "out_channels": 4,
+ "conditioning_dim": 128,
+ "block_out_channels": [16, 32, 64, 128],
+ "num_attention_heads": [-1, -1, 1, 2],
+ "down_num_layers_per_block": [1, 1, 1, 1],
+ "up_num_layers_per_block": [1, 1, 1, 1],
+ "down_blocks_repeat_mappers": [1, 1, 1, 1],
+ "up_blocks_repeat_mappers": [3, 3, 2, 2],
+ "block_types_per_layer": [
+ ["SDCascadeResBlock", "SDCascadeTimestepBlock"],
+ ["SDCascadeResBlock", "SDCascadeTimestepBlock"],
+ ["SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"],
+ ["SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"],
+ ],
+ "switch_level": None,
+ "clip_text_pooled_in_channels": 32,
+ "dropout": [0.1, 0.1, 0.1, 0.1],
+ }
+ model = StableCascadeUNet(**model_kwargs)
+ return model.eval()
+
+ def get_dummy_components(self):
+ decoder = self.dummy_decoder
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ vqgan = self.dummy_vqgan
+
+ scheduler = DDPMWuerstchenScheduler()
+
+ components = {
+ "decoder": decoder,
+ "vqgan": vqgan,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "scheduler": scheduler,
+ "latent_dim_scale": 4.0,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
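+ # a tensor of ones stands in for the prior's image embeddings; it is enough to exercise the decoder path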
+ inputs = {
+ "image_embeddings": torch.ones((1, 4, 4, 4), device=device),
+ "prompt": "horse",
+ "generator": generator,
+ "guidance_scale": 2.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_wuerstchen_decoder(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(**self.get_dummy_inputs(device), return_dict=False)
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ @skip_mps
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+ test_mean_pixel_difference = False
+
+ self._test_attention_slicing_forward_pass(
+ test_max_difference=test_max_difference,
+ test_mean_pixel_difference=test_mean_pixel_difference,
+ )
+
+ @unittest.skip(reason="fp16 not supported")
+ def test_float16_inference(self):
+ super().test_float16_inference()
+
+
+@slow
+@require_torch_gpu
+class StableCascadeDecoderPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_cascade_decoder(self):
+ pipe = StableCascadeDecoderPipeline.from_pretrained(
+ "diffusers/StableCascade-decoder", torch_dtype=torch.bfloat16
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
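+ # Load a precomputed prior image embedding from the Hub instead of running the prior pipeline here.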
+ image_embedding = load_pt(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_cascade/image_embedding.pt"
+ )
+
+ image = pipe(
+ prompt=prompt, image_embeddings=image_embedding, num_inference_steps=10, generator=generator
+ ).images[0]
+
+ assert image.size == (1024, 1024)
+
+ expected_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_cascade/t2i.png"
+ )
+
+ image_processor = VaeImageProcessor()
+
+ image_np = image_processor.pil_to_numpy(image)
+ expected_image_np = image_processor.pil_to_numpy(expected_image)
+
+ self.assertTrue(np.allclose(image_np, expected_image_np, atol=53e-2))
diff --git a/tests/pipelines/stable_cascade/test_stable_cascade_prior.py b/tests/pipelines/stable_cascade/test_stable_cascade_prior.py
new file mode 100644
index 0000000..c0ee8cc
--- /dev/null
+++ b/tests/pipelines/stable_cascade/test_stable_cascade_prior.py
@@ -0,0 +1,308 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import DDPMWuerstchenScheduler, StableCascadePriorPipeline
+from diffusers.loaders import AttnProcsLayers
+from diffusers.models import StableCascadeUNet
+from diffusers.models.attention_processor import LoRAAttnProcessor, LoRAAttnProcessor2_0
+from diffusers.utils.import_utils import is_peft_available
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_pt,
+ require_peft_backend,
+ require_torch_gpu,
+ skip_mps,
+ slow,
+ torch_device,
+)
+
+
+if is_peft_available():
+ from peft import LoraConfig
+ from peft.tuners.tuners_utils import BaseTunerLayer
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+def create_prior_lora_layers(unet: nn.Module):
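+ # Prefer the PyTorch 2.0 LoRA attention processor when scaled_dot_product_attention is available,
+ # otherwise fall back to the generic LoRA attention processor.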
+ lora_attn_procs = {}
+ for name in unet.attn_processors.keys():
+ lora_attn_processor_class = (
+ LoRAAttnProcessor2_0 if hasattr(F, "scaled_dot_product_attention") else LoRAAttnProcessor
+ )
+ lora_attn_procs[name] = lora_attn_processor_class(
+ hidden_size=unet.config.c,
+ )
+ unet_lora_layers = AttnProcsLayers(lora_attn_procs)
+ return lora_attn_procs, unet_lora_layers
+
+
+class StableCascadePriorPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = StableCascadePriorPipeline
+ params = ["prompt"]
+ batch_params = ["prompt", "negative_prompt"]
+ required_optional_params = [
+ "num_images_per_prompt",
+ "generator",
+ "num_inference_steps",
+ "latents",
+ "negative_prompt",
+ "guidance_scale",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+ callback_cfg_params = ["text_encoder_hidden_states"]
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=self.text_embedder_hidden_size,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModelWithProjection(config).eval()
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "conditioning_dim": 128,
+ "block_out_channels": (128, 128),
+ "num_attention_heads": (2, 2),
+ "down_num_layers_per_block": (1, 1),
+ "up_num_layers_per_block": (1, 1),
+ "switch_level": (False,),
+ "clip_image_in_channels": 768,
+ "clip_text_in_channels": self.text_embedder_hidden_size,
+ "clip_text_pooled_in_channels": self.text_embedder_hidden_size,
+ "dropout": (0.1, 0.1),
+ }
+
+ model = StableCascadeUNet(**model_kwargs)
+ return model.eval()
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+
+ scheduler = DDPMWuerstchenScheduler()
+
+ components = {
+ "prior": prior,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "scheduler": scheduler,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "generator": generator,
+ "guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_wuerstchen_prior(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.image_embeddings
+
+ image_from_tuple = pipe(**self.get_dummy_inputs(device), return_dict=False)[0]
+
+ image_slice = image[0, 0, 0, -10:]
+ image_from_tuple_slice = image_from_tuple[0, 0, 0, -10:]
+ assert image.shape == (1, 16, 24, 24)
+
+ expected_slice = np.array(
+ [
+ 96.139565,
+ -20.213179,
+ -116.40341,
+ -191.57129,
+ 39.350136,
+ 74.80767,
+ 39.782352,
+ -184.67352,
+ -46.426907,
+ 168.41783,
+ ]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 5e-2
+
+ @skip_mps
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-1)
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+ test_mean_pixel_difference = False
+
+ self._test_attention_slicing_forward_pass(
+ test_max_difference=test_max_difference,
+ test_mean_pixel_difference=test_mean_pixel_difference,
+ )
+
+ @unittest.skip(reason="fp16 not supported")
+ def test_float16_inference(self):
+ super().test_float16_inference()
+
+ def check_if_lora_correctly_set(self, model) -> bool:
+ """
+ Checks if the LoRA layers are correctly set with peft
+ """
+ for module in model.modules():
+ if isinstance(module, BaseTunerLayer):
+ return True
+ return False
+
+ def get_lora_components(self):
+ prior = self.dummy_prior
+
+ prior_lora_config = LoraConfig(
+ r=4, lora_alpha=4, target_modules=["to_q", "to_k", "to_v", "to_out.0"], init_lora_weights=False
+ )
+
+ prior_lora_attn_procs, prior_lora_layers = create_prior_lora_layers(prior)
+
+ lora_components = {
+ "prior_lora_layers": prior_lora_layers,
+ "prior_lora_attn_procs": prior_lora_attn_procs,
+ }
+
+ return prior, prior_lora_config, lora_components
+
+ @require_peft_backend
+ @unittest.skip(reason="no lora support for now")
+ def test_inference_with_prior_lora(self):
+ _, prior_lora_config, _ = self.get_lora_components()
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output_no_lora = pipe(**self.get_dummy_inputs(device))
+ image_embed = output_no_lora.image_embeddings
+ self.assertTrue(image_embed.shape == (1, 16, 24, 24))
+
+ pipe.prior.add_adapter(prior_lora_config)
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.prior), "Lora not correctly set in prior")
+
+ output_lora = pipe(**self.get_dummy_inputs(device))
+ lora_image_embed = output_lora.image_embeddings
+
+ self.assertTrue(image_embed.shape == lora_image_embed.shape)
+
+
+@slow
+@require_torch_gpu
+class StableCascadePriorPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_cascade_prior(self):
+ pipe = StableCascadePriorPipeline.from_pretrained("diffusers/StableCascade-prior", torch_dtype=torch.bfloat16)
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ output = pipe(prompt, num_inference_steps=10, generator=generator)
+ image_embedding = output.image_embeddings
+
+ expected_image_embedding = load_pt(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_cascade/image_embedding.pt"
+ )
+
+ assert image_embedding.shape == (1, 16, 24, 24)
+
+ self.assertTrue(
+ np.allclose(
+ image_embedding.cpu().float().numpy(), expected_image_embedding.cpu().float().numpy(), atol=5e-2
+ )
+ )
diff --git a/tests/pipelines/stable_diffusion/__init__.py b/tests/pipelines/stable_diffusion/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion.py b/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion.py
new file mode 100644
index 0000000..229b166
--- /dev/null
+++ b/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion.py
@@ -0,0 +1,376 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import tempfile
+import unittest
+
+import numpy as np
+
+from diffusers import (
+ DDIMScheduler,
+ DPMSolverMultistepScheduler,
+ EulerAncestralDiscreteScheduler,
+ EulerDiscreteScheduler,
+ LMSDiscreteScheduler,
+ OnnxStableDiffusionPipeline,
+ PNDMScheduler,
+)
+from diffusers.utils.testing_utils import is_onnx_available, nightly, require_onnxruntime, require_torch_gpu
+
+from ..test_pipelines_onnx_common import OnnxPipelineTesterMixin
+
+
+if is_onnx_available():
+ import onnxruntime as ort
+
+
+class OnnxStableDiffusionPipelineFastTests(OnnxPipelineTesterMixin, unittest.TestCase):
+ hub_checkpoint = "hf-internal-testing/tiny-random-OnnxStableDiffusionPipeline"
+
+ def get_dummy_inputs(self, seed=0):
+ generator = np.random.RandomState(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_pipeline_default_ddim(self):
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.65072, 0.58492, 0.48219, 0.55521, 0.53180, 0.55939, 0.50697, 0.39800, 0.46455])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_pipeline_pndm(self):
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config, skip_prk_steps=True)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.65863, 0.59425, 0.49326, 0.56313, 0.53875, 0.56627, 0.51065, 0.39777, 0.46330])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_pipeline_lms(self):
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.53755, 0.60786, 0.47402, 0.49488, 0.51869, 0.49819, 0.47985, 0.38957, 0.44279])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_pipeline_euler(self):
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.53755, 0.60786, 0.47402, 0.49488, 0.51869, 0.49819, 0.47985, 0.38957, 0.44279])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_pipeline_euler_ancestral(self):
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.53817, 0.60812, 0.47384, 0.49530, 0.51894, 0.49814, 0.47984, 0.38958, 0.44271])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_pipeline_dpm_multistep(self):
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.53895, 0.60808, 0.47933, 0.49608, 0.51886, 0.49950, 0.48053, 0.38957, 0.44200])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_prompt_embeds(self):
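+ # Passing precomputed prompt embeddings should produce the same image as passing the raw prompt.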
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ inputs = self.get_dummy_inputs()
+ prompt = 3 * [inputs.pop("prompt")]
+
+ text_inputs = pipe.tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="np",
+ )
+ text_inputs = text_inputs["input_ids"]
+
+ prompt_embeds = pipe.text_encoder(input_ids=text_inputs.astype(np.int32))[0]
+
+ inputs["prompt_embeds"] = prompt_embeds
+
+ # forward
+ output = pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_stable_diffusion_negative_prompt_embeds(self):
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ inputs = self.get_dummy_inputs()
+ prompt = 3 * [inputs.pop("prompt")]
+
+ embeds = []
+ for p in [prompt, negative_prompt]:
+ text_inputs = pipe.tokenizer(
+ p,
+ padding="max_length",
+ max_length=pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="np",
+ )
+ text_inputs = text_inputs["input_ids"]
+
+ embeds.append(pipe.text_encoder(input_ids=text_inputs.astype(np.int32))[0])
+
+ inputs["prompt_embeds"], inputs["negative_prompt_embeds"] = embeds
+
+ # forward
+ output = pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+
+@nightly
+@require_onnxruntime
+@require_torch_gpu
+class OnnxStableDiffusionPipelineIntegrationTests(unittest.TestCase):
+ @property
+ def gpu_provider(self):
+ return (
+ "CUDAExecutionProvider",
+ {
+ "gpu_mem_limit": "15000000000", # 15GB
+ "arena_extend_strategy": "kSameAsRequested",
+ },
+ )
+
+ @property
+ def gpu_options(self):
+ options = ort.SessionOptions()
+ options.enable_mem_pattern = False
+ return options
+
+ def test_inference_default_pndm(self):
+ # using the PNDM scheduler by default
+ sd_pipe = OnnxStableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ revision="onnx",
+ safety_checker=None,
+ feature_extractor=None,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ np.random.seed(0)
+ output = sd_pipe([prompt], guidance_scale=6.0, num_inference_steps=10, output_type="np")
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0452, 0.0390, 0.0087, 0.0350, 0.0617, 0.0364, 0.0544, 0.0523, 0.0720])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_inference_ddim(self):
+ ddim_scheduler = DDIMScheduler.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", subfolder="scheduler", revision="onnx"
+ )
+ sd_pipe = OnnxStableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ revision="onnx",
+ scheduler=ddim_scheduler,
+ safety_checker=None,
+ feature_extractor=None,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "open neural network exchange"
+ generator = np.random.RandomState(0)
+ output = sd_pipe([prompt], guidance_scale=7.5, num_inference_steps=10, generator=generator, output_type="np")
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.2867, 0.1974, 0.1481, 0.7294, 0.7251, 0.6667, 0.4194, 0.5642, 0.6486])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_inference_k_lms(self):
+ lms_scheduler = LMSDiscreteScheduler.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", subfolder="scheduler", revision="onnx"
+ )
+ sd_pipe = OnnxStableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ revision="onnx",
+ scheduler=lms_scheduler,
+ safety_checker=None,
+ feature_extractor=None,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "open neural network exchange"
+ generator = np.random.RandomState(0)
+ output = sd_pipe([prompt], guidance_scale=7.5, num_inference_steps=10, generator=generator, output_type="np")
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.2306, 0.1959, 0.1593, 0.6549, 0.6394, 0.5408, 0.5065, 0.6010, 0.6161])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_intermediate_state(self):
+ number_of_steps = 0
+
+ def test_callback_fn(step: int, timestep: int, latents: np.ndarray) -> None:
+ test_callback_fn.has_been_called = True
+ nonlocal number_of_steps
+ number_of_steps += 1
+ if step == 0:
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [-0.6772, -0.3835, -1.2456, 0.1905, -1.0974, 0.6967, -1.9353, 0.0178, 1.0167]
+ )
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 1e-3
+ elif step == 5:
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [-0.3351, 0.2241, -0.1837, -0.2325, -0.6577, 0.3393, -0.0241, 0.5899, 1.3875]
+ )
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 1e-3
+
+ test_callback_fn.has_been_called = False
+
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ revision="onnx",
+ safety_checker=None,
+ feature_extractor=None,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "Andromeda galaxy in a bottle"
+
+ generator = np.random.RandomState(0)
+ pipe(
+ prompt=prompt,
+ num_inference_steps=5,
+ guidance_scale=7.5,
+ generator=generator,
+ callback=test_callback_fn,
+ callback_steps=1,
+ )
+ assert test_callback_fn.has_been_called
+ assert number_of_steps == 6
+
+ def test_stable_diffusion_no_safety_checker(self):
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ revision="onnx",
+ safety_checker=None,
+ feature_extractor=None,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ assert isinstance(pipe, OnnxStableDiffusionPipeline)
+ assert pipe.safety_checker is None
+
+ image = pipe("example prompt", num_inference_steps=2).images[0]
+ assert image is not None
+
+ # check that there's no error when saving a pipeline with one of the models being None
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname)
+ pipe = OnnxStableDiffusionPipeline.from_pretrained(tmpdirname)
+
+ # sanity check that the pipeline still works
+ assert pipe.safety_checker is None
+ image = pipe("example prompt", num_inference_steps=2).images[0]
+ assert image is not None
diff --git a/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_img2img.py b/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_img2img.py
new file mode 100644
index 0000000..33b461b
--- /dev/null
+++ b/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_img2img.py
@@ -0,0 +1,245 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import unittest
+
+import numpy as np
+
+from diffusers import (
+ DPMSolverMultistepScheduler,
+ EulerAncestralDiscreteScheduler,
+ EulerDiscreteScheduler,
+ LMSDiscreteScheduler,
+ OnnxStableDiffusionImg2ImgPipeline,
+ PNDMScheduler,
+)
+from diffusers.utils.testing_utils import (
+ floats_tensor,
+ is_onnx_available,
+ load_image,
+ nightly,
+ require_onnxruntime,
+ require_torch_gpu,
+)
+
+from ..test_pipelines_onnx_common import OnnxPipelineTesterMixin
+
+
+if is_onnx_available():
+ import onnxruntime as ort
+
+
+class OnnxStableDiffusionImg2ImgPipelineFastTests(OnnxPipelineTesterMixin, unittest.TestCase):
+ hub_checkpoint = "hf-internal-testing/tiny-random-OnnxStableDiffusionPipeline"
+
+ def get_dummy_inputs(self, seed=0):
+ image = floats_tensor((1, 3, 128, 128), rng=random.Random(seed))
+ generator = np.random.RandomState(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "strength": 0.75,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_pipeline_default_ddim(self):
+ pipe = OnnxStableDiffusionImg2ImgPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.69643, 0.58484, 0.50314, 0.58760, 0.55368, 0.59643, 0.51529, 0.41217, 0.49087])
+ assert np.abs(image_slice - expected_slice).max() < 1e-1
+
+ def test_pipeline_pndm(self):
+ pipe = OnnxStableDiffusionImg2ImgPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config, skip_prk_steps=True)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.61737, 0.54642, 0.53183, 0.54465, 0.52742, 0.60525, 0.49969, 0.40655, 0.48154])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+ def test_pipeline_lms(self):
+ pipe = OnnxStableDiffusionImg2ImgPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ # warmup pass to apply optimizations
+ _ = pipe(**self.get_dummy_inputs())
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.52761, 0.59977, 0.49033, 0.49619, 0.54282, 0.50311, 0.47600, 0.40918, 0.45203])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+ def test_pipeline_euler(self):
+ pipe = OnnxStableDiffusionImg2ImgPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.52911, 0.60004, 0.49229, 0.49805, 0.54502, 0.50680, 0.47777, 0.41028, 0.45304])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+ def test_pipeline_euler_ancestral(self):
+ pipe = OnnxStableDiffusionImg2ImgPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.52911, 0.60004, 0.49229, 0.49805, 0.54502, 0.50680, 0.47777, 0.41028, 0.45304])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+ def test_pipeline_dpm_multistep(self):
+ pipe = OnnxStableDiffusionImg2ImgPipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+ expected_slice = np.array([0.65331, 0.58277, 0.48204, 0.56059, 0.53665, 0.56235, 0.50969, 0.40009, 0.46552])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+
+@nightly
+@require_onnxruntime
+@require_torch_gpu
+class OnnxStableDiffusionImg2ImgPipelineIntegrationTests(unittest.TestCase):
+ @property
+ def gpu_provider(self):
+ return (
+ "CUDAExecutionProvider",
+ {
+ "gpu_mem_limit": "15000000000", # 15GB
+ "arena_extend_strategy": "kSameAsRequested",
+ },
+ )
+
+ @property
+ def gpu_options(self):
+ options = ort.SessionOptions()
+ options.enable_mem_pattern = False
+ return options
+
+ def test_inference_default_pndm(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/img2img/sketch-mountains-input.jpg"
+ )
+ init_image = init_image.resize((768, 512))
+ # using the PNDM scheduler by default
+ pipe = OnnxStableDiffusionImg2ImgPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ revision="onnx",
+ safety_checker=None,
+ feature_extractor=None,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A fantasy landscape, trending on artstation"
+
+ generator = np.random.RandomState(0)
+ output = pipe(
+ prompt=prompt,
+ image=init_image,
+ strength=0.75,
+ guidance_scale=7.5,
+ num_inference_steps=10,
+ generator=generator,
+ output_type="np",
+ )
+ images = output.images
+ image_slice = images[0, 255:258, 383:386, -1]
+
+ assert images.shape == (1, 512, 768, 3)
+ expected_slice = np.array([0.4909, 0.5059, 0.5372, 0.4623, 0.4876, 0.5049, 0.4820, 0.4956, 0.5019])
+ # TODO: lower the tolerance after finding the cause of onnxruntime reproducibility issues
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 2e-2
+
+ def test_inference_k_lms(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/img2img/sketch-mountains-input.jpg"
+ )
+ init_image = init_image.resize((768, 512))
+ lms_scheduler = LMSDiscreteScheduler.from_pretrained(
+ "runwayml/stable-diffusion-v1-5", subfolder="scheduler", revision="onnx"
+ )
+ pipe = OnnxStableDiffusionImg2ImgPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ revision="onnx",
+ scheduler=lms_scheduler,
+ safety_checker=None,
+ feature_extractor=None,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A fantasy landscape, trending on artstation"
+
+ generator = np.random.RandomState(0)
+ output = pipe(
+ prompt=prompt,
+ image=init_image,
+ strength=0.75,
+ guidance_scale=7.5,
+ num_inference_steps=20,
+ generator=generator,
+ output_type="np",
+ )
+ images = output.images
+ image_slice = images[0, 255:258, 383:386, -1]
+
+ assert images.shape == (1, 512, 768, 3)
+ expected_slice = np.array([0.8043, 0.926, 0.9581, 0.8119, 0.8954, 0.913, 0.7209, 0.7463, 0.7431])
+ # TODO: lower the tolerance after finding the cause of onnxruntime reproducibility issues
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 2e-2
diff --git a/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_inpaint.py b/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_inpaint.py
new file mode 100644
index 0000000..6426547
--- /dev/null
+++ b/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_inpaint.py
@@ -0,0 +1,141 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+
+from diffusers import LMSDiscreteScheduler, OnnxStableDiffusionInpaintPipeline
+from diffusers.utils.testing_utils import (
+ is_onnx_available,
+ load_image,
+ nightly,
+ require_onnxruntime,
+ require_torch_gpu,
+)
+
+from ..test_pipelines_onnx_common import OnnxPipelineTesterMixin
+
+
+if is_onnx_available():
+ import onnxruntime as ort
+
+
+class OnnxStableDiffusionInpaintPipelineFastTests(OnnxPipelineTesterMixin, unittest.TestCase):
+ # FIXME: add fast tests
+ pass
+
+
+@nightly
+@require_onnxruntime
+@require_torch_gpu
+class OnnxStableDiffusionInpaintPipelineIntegrationTests(unittest.TestCase):
+ @property
+ def gpu_provider(self):
+ return (
+ "CUDAExecutionProvider",
+ {
+ "gpu_mem_limit": "15000000000", # 15GB
+ "arena_extend_strategy": "kSameAsRequested",
+ },
+ )
+
+ @property
+ def gpu_options(self):
+ options = ort.SessionOptions()
+ options.enable_mem_pattern = False
+ return options
+
+ def test_inference_default_pndm(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/in_paint/overture-creations-5sI6fQgYIuo.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/in_paint/overture-creations-5sI6fQgYIuo_mask.png"
+ )
+ pipe = OnnxStableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting",
+ revision="onnx",
+ safety_checker=None,
+ feature_extractor=None,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A red cat sitting on a park bench"
+
+ generator = np.random.RandomState(0)
+ output = pipe(
+ prompt=prompt,
+ image=init_image,
+ mask_image=mask_image,
+ guidance_scale=7.5,
+ num_inference_steps=10,
+ generator=generator,
+ output_type="np",
+ )
+ images = output.images
+ image_slice = images[0, 255:258, 255:258, -1]
+
+ assert images.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.2514, 0.3007, 0.3517, 0.1790, 0.2382, 0.3167, 0.1944, 0.2273, 0.2464])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_inference_k_lms(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/in_paint/overture-creations-5sI6fQgYIuo.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/in_paint/overture-creations-5sI6fQgYIuo_mask.png"
+ )
+ lms_scheduler = LMSDiscreteScheduler.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", subfolder="scheduler", revision="onnx"
+ )
+ pipe = OnnxStableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting",
+ revision="onnx",
+ scheduler=lms_scheduler,
+ safety_checker=None,
+ feature_extractor=None,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A red cat sitting on a park bench"
+
+ generator = np.random.RandomState(0)
+ output = pipe(
+ prompt=prompt,
+ image=init_image,
+ mask_image=mask_image,
+ guidance_scale=7.5,
+ num_inference_steps=20,
+ generator=generator,
+ output_type="np",
+ )
+ images = output.images
+ image_slice = images[0, 255:258, 255:258, -1]
+
+ assert images.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0086, 0.0077, 0.0083, 0.0093, 0.0107, 0.0139, 0.0094, 0.0097, 0.0125])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
diff --git a/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_upscale.py b/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_upscale.py
new file mode 100644
index 0000000..56c10ad
--- /dev/null
+++ b/tests/pipelines/stable_diffusion/test_onnx_stable_diffusion_upscale.py
@@ -0,0 +1,227 @@
+# coding=utf-8
+# Copyright 2022 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import unittest
+
+import numpy as np
+
+from diffusers import (
+ DPMSolverMultistepScheduler,
+ EulerAncestralDiscreteScheduler,
+ EulerDiscreteScheduler,
+ LMSDiscreteScheduler,
+ OnnxStableDiffusionUpscalePipeline,
+ PNDMScheduler,
+)
+from diffusers.utils.testing_utils import (
+ floats_tensor,
+ is_onnx_available,
+ load_image,
+ nightly,
+ require_onnxruntime,
+ require_torch_gpu,
+)
+
+from ..test_pipelines_onnx_common import OnnxPipelineTesterMixin
+
+
+if is_onnx_available():
+ import onnxruntime as ort
+
+
+class OnnxStableDiffusionUpscalePipelineFastTests(OnnxPipelineTesterMixin, unittest.TestCase):
+ # TODO: is there an appropriate internal test set?
+ hub_checkpoint = "ssube/stable-diffusion-x4-upscaler-onnx"
+
+ def get_dummy_inputs(self, seed=0):
+ image = floats_tensor((1, 3, 128, 128), rng=random.Random(seed))
+ generator = np.random.RandomState(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_pipeline_default_ddpm(self):
+ pipe = OnnxStableDiffusionUpscalePipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ # started as 128, should now be 512
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.6957, 0.7002, 0.7186, 0.6881, 0.6693, 0.6910, 0.7445, 0.7274, 0.7056])
+ assert np.abs(image_slice - expected_slice).max() < 1e-1
+
+ def test_pipeline_pndm(self):
+ pipe = OnnxStableDiffusionUpscalePipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config, skip_prk_steps=True)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.7349, 0.7347, 0.7034, 0.7696, 0.7876, 0.7597, 0.7916, 0.8085, 0.8036])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+ def test_pipeline_dpm_multistep(self):
+ pipe = OnnxStableDiffusionUpscalePipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array(
+ [0.7659278, 0.76437664, 0.75579107, 0.7691116, 0.77666986, 0.7727672, 0.7758664, 0.7812226, 0.76942515]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+ def test_pipeline_euler(self):
+ pipe = OnnxStableDiffusionUpscalePipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array(
+ [0.6974782, 0.68902093, 0.70135885, 0.7583618, 0.7804545, 0.7854912, 0.78667426, 0.78743863, 0.78070223]
+ )
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+ def test_pipeline_euler_ancestral(self):
+ pipe = OnnxStableDiffusionUpscalePipeline.from_pretrained(self.hub_checkpoint, provider="CPUExecutionProvider")
+ pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array(
+ [0.77424496, 0.773601, 0.7645288, 0.7769598, 0.7772739, 0.7738688, 0.78187233, 0.77879584, 0.767043]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+
+@nightly
+@require_onnxruntime
+@require_torch_gpu
+class OnnxStableDiffusionUpscalePipelineIntegrationTests(unittest.TestCase):
+ @property
+ def gpu_provider(self):
+ return (
+ "CUDAExecutionProvider",
+ {
+ "gpu_mem_limit": "15000000000", # 15GB
+ "arena_extend_strategy": "kSameAsRequested",
+ },
+ )
+
+ @property
+ def gpu_options(self):
+ options = ort.SessionOptions()
+ options.enable_mem_pattern = False
+ return options
+
+ def test_inference_default_ddpm(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/img2img/sketch-mountains-input.jpg"
+ )
+ init_image = init_image.resize((128, 128))
+ # using the default scheduler shipped with the checkpoint
+ pipe = OnnxStableDiffusionUpscalePipeline.from_pretrained(
+ "ssube/stable-diffusion-x4-upscaler-onnx",
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A fantasy landscape, trending on artstation"
+
+ generator = np.random.RandomState(0)
+ output = pipe(
+ prompt=prompt,
+ image=init_image,
+ guidance_scale=7.5,
+ num_inference_steps=10,
+ generator=generator,
+ output_type="np",
+ )
+ images = output.images
+ image_slice = images[0, 255:258, 383:386, -1]
+
+ assert images.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.4883, 0.4947, 0.4980, 0.4975, 0.4982, 0.4980, 0.5000, 0.5006, 0.4972])
+ # TODO: lower the tolerance after finding the cause of onnxruntime reproducibility issues
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 2e-2
+
+ def test_inference_k_lms(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/img2img/sketch-mountains-input.jpg"
+ )
+ init_image = init_image.resize((128, 128))
+ lms_scheduler = LMSDiscreteScheduler.from_pretrained(
+ "ssube/stable-diffusion-x4-upscaler-onnx", subfolder="scheduler"
+ )
+ pipe = OnnxStableDiffusionUpscalePipeline.from_pretrained(
+ "ssube/stable-diffusion-x4-upscaler-onnx",
+ scheduler=lms_scheduler,
+ provider=self.gpu_provider,
+ sess_options=self.gpu_options,
+ )
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A fantasy landscape, trending on artstation"
+
+ generator = np.random.RandomState(0)
+ output = pipe(
+ prompt=prompt,
+ image=init_image,
+ guidance_scale=7.5,
+ num_inference_steps=20,
+ generator=generator,
+ output_type="np",
+ )
+ images = output.images
+ image_slice = images[0, 255:258, 383:386, -1]
+
+ assert images.shape == (1, 512, 512, 3)
+ expected_slice = np.array(
+ [0.50173753, 0.50223356, 0.502039, 0.50233036, 0.5023725, 0.5022601, 0.5018758, 0.50234085, 0.50241566]
+ )
+ # TODO: lower the tolerance after finding the cause of onnxruntime reproducibility issues
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 2e-2
diff --git a/tests/pipelines/stable_diffusion/test_stable_diffusion.py b/tests/pipelines/stable_diffusion/test_stable_diffusion.py
new file mode 100644
index 0000000..82afaca
--- /dev/null
+++ b/tests/pipelines/stable_diffusion/test_stable_diffusion.py
@@ -0,0 +1,1425 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import gc
+import tempfile
+import time
+import traceback
+import unittest
+
+import numpy as np
+import torch
+from huggingface_hub import hf_hub_download
+from transformers import (
+ CLIPTextConfig,
+ CLIPTextModel,
+ CLIPTokenizer,
+)
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ DPMSolverMultistepScheduler,
+ EulerAncestralDiscreteScheduler,
+ EulerDiscreteScheduler,
+ LCMScheduler,
+ LMSDiscreteScheduler,
+ PNDMScheduler,
+ StableDiffusionPipeline,
+ UNet2DConditionModel,
+ logging,
+)
+from diffusers.models.attention_processor import AttnProcessor
+from diffusers.utils.testing_utils import (
+ CaptureLogger,
+ enable_full_determinism,
+ load_image,
+ load_numpy,
+ nightly,
+ numpy_cosine_similarity_distance,
+ require_python39_or_higher,
+ require_torch_2,
+ require_torch_gpu,
+ run_test_in_subprocess,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ TEXT_TO_IMAGE_BATCH_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+# Will be run via run_test_in_subprocess
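+# (presumably to keep torch.compile's global state from leaking into the rest of the test session)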
+def _test_stable_diffusion_compile(in_queue, out_queue, timeout):
+ error = None
+ try:
+ inputs = in_queue.get(timeout=timeout)
+ torch_device = inputs.pop("torch_device")
+ seed = inputs.pop("seed")
+ inputs["generator"] = torch.Generator(device=torch_device).manual_seed(seed)
+
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", safety_checker=None)
+ sd_pipe.scheduler = DDIMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+
+ sd_pipe.unet.to(memory_format=torch.channels_last)
+ sd_pipe.unet = torch.compile(sd_pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.38019, 0.28647, 0.27321, 0.40377, 0.38290, 0.35446, 0.39218, 0.38165, 0.42239])
+
+ assert np.abs(image_slice - expected_slice).max() < 5e-3
+ except Exception:
+ error = f"{traceback.format_exc()}"
+
+ results = {"error": error}
+ out_queue.put(results, timeout=timeout)
+ out_queue.join()
+
+
+class StableDiffusionPipelineFastTests(
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=1,
+ sample_size=32,
+ time_cond_proj_dim=time_cond_proj_dim,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ norm_num_groups=2,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[4, 8],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=64,
+ layer_norm_eps=1e-05,
+ num_attention_heads=8,
+ num_hidden_layers=3,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_stable_diffusion_ddim(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.3203, 0.4555, 0.4711, 0.3505, 0.3973, 0.4650, 0.5137, 0.3392, 0.4045])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.3454, 0.5349, 0.5185, 0.2808, 0.4509, 0.4612, 0.4655, 0.3601, 0.4315])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
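+ # Pass an explicit timesteps list instead of num_inference_steps to exercise the LCM scheduler's custom-timestep path.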
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.3454, 0.5349, 0.5185, 0.2808, 0.4509, 0.4612, 0.4655, 0.3601, 0.4315])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_prompt_embeds(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ text_inputs = sd_pipe.tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=sd_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ prompt_embeds = sd_pipe.text_encoder(text_inputs)[0]
+
+ inputs["prompt_embeds"] = prompt_embeds
+
+ # forward
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_stable_diffusion_negative_prompt_embeds(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ embeds = []
+ for p in [prompt, negative_prompt]:
+ text_inputs = sd_pipe.tokenizer(
+ p,
+ padding="max_length",
+ max_length=sd_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ embeds.append(sd_pipe.text_encoder(text_inputs)[0])
+
+ inputs["prompt_embeds"], inputs["negative_prompt_embeds"] = embeds
+
+ # forward
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_stable_diffusion_prompt_embeds_with_plain_negative_prompt_list(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = negative_prompt
+ prompt = 3 * [inputs.pop("prompt")]
+
+ text_inputs = sd_pipe.tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=sd_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ prompt_embeds = sd_pipe.text_encoder(text_inputs)[0]
+
+ inputs["prompt_embeds"] = prompt_embeds
+
+ # forward
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_stable_diffusion_ddim_factor_8(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
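+ # Use a height/width that is a multiple of 8 but not of 64 to check that such sizes are handled.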
+ output = sd_pipe(**inputs, height=136, width=136)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 136, 136, 3)
+ expected_slice = np.array([0.4346, 0.5621, 0.5016, 0.3926, 0.4533, 0.4134, 0.5625, 0.5632, 0.5265])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_pndm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe.scheduler = PNDMScheduler(skip_prk_steps=True)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.3411, 0.5032, 0.4704, 0.3135, 0.4323, 0.4740, 0.5150, 0.3498, 0.4022])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_no_safety_checker(self):
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-lms-pipe", safety_checker=None
+ )
+ assert isinstance(pipe, StableDiffusionPipeline)
+ assert isinstance(pipe.scheduler, LMSDiscreteScheduler)
+ assert pipe.safety_checker is None
+
+ image = pipe("example prompt", num_inference_steps=2).images[0]
+ assert image is not None
+
+ # check that there's no error when saving a pipeline with one of the models being None
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname)
+ pipe = StableDiffusionPipeline.from_pretrained(tmpdirname)
+
+ # sanity check that the pipeline still works
+ assert pipe.safety_checker is None
+ image = pipe("example prompt", num_inference_steps=2).images[0]
+ assert image is not None
+
+ def test_stable_diffusion_k_lms(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe.scheduler = LMSDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.3149, 0.5246, 0.4796, 0.3218, 0.4469, 0.4729, 0.5151, 0.3597, 0.3954])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_k_euler_ancestral(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.3151, 0.5243, 0.4794, 0.3217, 0.4468, 0.4728, 0.5152, 0.3598, 0.3954])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_k_euler(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe.scheduler = EulerDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.3149, 0.5246, 0.4796, 0.3218, 0.4469, 0.4729, 0.5151, 0.3597, 0.3954])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
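+ # Descriptive note (added): sliced VAE decoding (one image at a time) should match the
+ # batched decode up to small differences at the image borders.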
+ def test_stable_diffusion_vae_slicing(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = LMSDiscreteScheduler.from_config(components["scheduler"].config)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ image_count = 4
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["prompt"] = [inputs["prompt"]] * image_count
+ output_1 = sd_pipe(**inputs)
+
+ # make sure sliced vae decode yields the same result
+ sd_pipe.enable_vae_slicing()
+ inputs = self.get_dummy_inputs(device)
+ inputs["prompt"] = [inputs["prompt"]] * image_count
+ output_2 = sd_pipe(**inputs)
+
+ # there is a small discrepancy at image borders vs. full batch decode
+ assert np.abs(output_2.images.flatten() - output_1.images.flatten()).max() < 3e-3
+
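+ # Descriptive note (added): tiled VAE decoding should roughly match the non-tiled result
+ # and accept latents of various (non-square, odd-sized) shapes.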
+ def test_stable_diffusion_vae_tiling(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+
+ # disable the safety checker since it is not needed for this comparison
+ components["safety_checker"] = None
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+
+ # Test that the tiled decode yields roughly the same result as the non-tiled decode at the default resolution
+ generator = torch.Generator(device=device).manual_seed(0)
+ output_1 = sd_pipe([prompt], generator=generator, guidance_scale=6.0, num_inference_steps=2, output_type="np")
+
+ # make sure tiled vae decode yields the same result
+ sd_pipe.enable_vae_tiling()
+ generator = torch.Generator(device=device).manual_seed(0)
+ output_2 = sd_pipe([prompt], generator=generator, guidance_scale=6.0, num_inference_steps=2, output_type="np")
+
+ assert np.abs(output_2.images.flatten() - output_1.images.flatten()).max() < 5e-1
+
+ # test that tiled decode works with various shapes
+ shapes = [(1, 4, 73, 97), (1, 4, 97, 73), (1, 4, 49, 65), (1, 4, 65, 49)]
+ for shape in shapes:
+ zeros = torch.zeros(shape).to(device)
+ sd_pipe.vae.decode(zeros)
+
+ def test_stable_diffusion_negative_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ negative_prompt = "french fries"
+ output = sd_pipe(**inputs, negative_prompt=negative_prompt)
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.3458, 0.5120, 0.4800, 0.3116, 0.4348, 0.4802, 0.5237, 0.3467, 0.3991])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
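+ # Descriptive note (added): prompts longer than the tokenizer limit are truncated, and the
+ # warning should list exactly the removed tokens.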
+ def test_stable_diffusion_long_prompt(self):
+ components = self.get_dummy_components()
+ components["scheduler"] = LMSDiscreteScheduler.from_config(components["scheduler"].config)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ do_classifier_free_guidance = True
+ negative_prompt = None
+ num_images_per_prompt = 1
+ logger = logging.get_logger("diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion")
+ logger.setLevel(logging.WARNING)
+
+ prompt = 100 * "@"
+ with CaptureLogger(logger) as cap_logger:
+ negative_text_embeddings, text_embeddings = sd_pipe.encode_prompt(
+ prompt, torch_device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt
+ )
+ if negative_text_embeddings is not None:
+ text_embeddings = torch.cat([negative_text_embeddings, text_embeddings])
+
+ # 100 - 77 + 1 (BOS token) + 1 (EOS token) = 25
+ assert cap_logger.out.count("@") == 25
+
+ negative_prompt = "Hello"
+ with CaptureLogger(logger) as cap_logger_2:
+ negative_text_embeddings_2, text_embeddings_2 = sd_pipe.encode_prompt(
+ prompt, torch_device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt
+ )
+ if negative_text_embeddings_2 is not None:
+ text_embeddings_2 = torch.cat([negative_text_embeddings_2, text_embeddings_2])
+
+ assert cap_logger.out == cap_logger_2.out
+
+ prompt = 25 * "@"
+ with CaptureLogger(logger) as cap_logger_3:
+ negative_text_embeddings_3, text_embeddings_3 = sd_pipe.encode_prompt(
+ prompt, torch_device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt
+ )
+ if negative_text_embeddings_3 is not None:
+ text_embeddings_3 = torch.cat([negative_text_embeddings_3, text_embeddings_3])
+
+ assert text_embeddings_3.shape == text_embeddings_2.shape == text_embeddings.shape
+ assert text_embeddings.shape[1] == 77
+ assert cap_logger_3.out == ""
+
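+ # Descriptive note (added): without explicit height/width, the output resolution defaults to
+ # unet.config.sample_size scaled by the VAE scale factor (2x for these dummy components).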
+ def test_stable_diffusion_height_width_opt(self):
+ components = self.get_dummy_components()
+ components["scheduler"] = LMSDiscreteScheduler.from_config(components["scheduler"].config)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "hey"
+
+ output = sd_pipe(prompt, num_inference_steps=1, output_type="np")
+ image_shape = output.images[0].shape[:2]
+ assert image_shape == (64, 64)
+
+ output = sd_pipe(prompt, num_inference_steps=1, height=96, width=96, output_type="np")
+ image_shape = output.images[0].shape[:2]
+ assert image_shape == (96, 96)
+
+ config = dict(sd_pipe.unet.config)
+ config["sample_size"] = 96
+ sd_pipe.unet = UNet2DConditionModel.from_config(config).to(torch_device)
+ output = sd_pipe(prompt, num_inference_steps=1, output_type="np")
+ image_shape = output.images[0].shape[:2]
+ assert image_shape == (192, 192)
+
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
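+ # Descriptive note (added): enabling FreeU should visibly change the output compared to the
+ # default pipeline.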
+ def test_freeu_enabled(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "hey"
+ output = sd_pipe(prompt, num_inference_steps=1, output_type="np", generator=torch.manual_seed(0)).images
+
+ sd_pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)
+ output_freeu = sd_pipe(prompt, num_inference_steps=1, output_type="np", generator=torch.manual_seed(0)).images
+
+ assert not np.allclose(
+ output[0, -3:, -3:, -1], output_freeu[0, -3:, -3:, -1]
+ ), "Enabling of FreeU should lead to different results."
+
+ def test_freeu_disabled(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "hey"
+ output = sd_pipe(prompt, num_inference_steps=1, output_type="np", generator=torch.manual_seed(0)).images
+
+ sd_pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)
+ sd_pipe.disable_freeu()
+
+ freeu_keys = {"s1", "s2", "b1", "b2"}
+ for upsample_block in sd_pipe.unet.up_blocks:
+ for key in freeu_keys:
+ assert getattr(upsample_block, key) is None, f"Disabling of FreeU should have set {key} to None."
+
+ output_no_freeu = sd_pipe(
+ prompt, num_inference_steps=1, output_type="np", generator=torch.manual_seed(0)
+ ).images
+
+ assert np.allclose(
+ output[0, -3:, -3:, -1], output_no_freeu[0, -3:, -3:, -1]
+ ), "Disabling of FreeU should lead to results similar to the default pipeline results."
+
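+ # Descriptive note (added): fusing and later unfusing the QKV projections should not change
+ # the generated images.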
+ def test_fused_qkv_projections(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ original_image_slice = image[0, -3:, -3:, -1]
+
+ sd_pipe.fuse_qkv_projections()
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice_fused = image[0, -3:, -3:, -1]
+
+ sd_pipe.unfuse_qkv_projections()
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice_disabled = image[0, -3:, -3:, -1]
+
+ assert np.allclose(
+ original_image_slice, image_slice_fused, atol=1e-2, rtol=1e-2
+ ), "Fusion of QKV projections shouldn't affect the outputs."
+ assert np.allclose(
+ image_slice_fused, image_slice_disabled, atol=1e-2, rtol=1e-2
+ ), "Outputs, with QKV projection fusion enabled, shouldn't change when fused QKV projections are disabled."
+ assert np.allclose(
+ original_image_slice, image_slice_disabled, atol=1e-2, rtol=1e-2
+ ), "Original outputs should match when fused QKV projections are disabled."
+
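+ # Descriptive note (added): setting pipe._interrupt from a callback should stop the denoising
+ # loop, so the returned latents equal the intermediate latents recorded at that step.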
+ def test_pipeline_interrupt(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "hey"
+ num_inference_steps = 3
+
+ # store intermediate latents from the generation process
+ class PipelineState:
+ def __init__(self):
+ self.state = []
+
+ def apply(self, pipe, i, t, callback_kwargs):
+ self.state.append(callback_kwargs["latents"])
+ return callback_kwargs
+
+ pipe_state = PipelineState()
+ sd_pipe(
+ prompt,
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=pipe_state.apply,
+ ).images
+
+ # interrupt generation at step index
+ interrupt_step_idx = 1
+
+ def callback_on_step_end(pipe, i, t, callback_kwargs):
+ if i == interrupt_step_idx:
+ pipe._interrupt = True
+
+ return callback_kwargs
+
+ output_interrupted = sd_pipe(
+ prompt,
+ num_inference_steps=num_inference_steps,
+ output_type="latent",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=callback_on_step_end,
+ ).images
+
+ # fetch intermediate latents at the interrupted step
+ # from the completed generation process
+ intermediate_latent = pipe_state.state[interrupt_step_idx]
+
+ # compare the intermediate latent to the output of the interrupted process
+ # they should be the same
+ assert torch.allclose(intermediate_latent, output_interrupted, atol=1e-4)
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionPipelineSlowTests(unittest.TestCase):
+ def setUp(self):
+ gc.collect()
+ torch.cuda.empty_cache()
+
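+ # Descriptive note (added): fixed, NumPy-seeded latents keep these slow tests reproducible.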
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "a photograph of an astronaut riding a horse",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_1_1_pndm(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-1")
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.4363, 0.4355, 0.3667, 0.4066, 0.3970, 0.3866, 0.4394, 0.4356, 0.4059])
+ assert np.abs(image_slice - expected_slice).max() < 3e-3
+
+ def test_stable_diffusion_v1_4_with_freeu(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 25
+
+ sd_pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)
+ image = sd_pipe(**inputs).images
+ image = image[0, -3:, -3:, -1].flatten()
+ expected_image = [0.0721, 0.0588, 0.0268, 0.0384, 0.0636, 0.0, 0.0429, 0.0344, 0.0309]
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_1_4_pndm(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.5740, 0.4784, 0.3162, 0.6358, 0.5831, 0.5505, 0.5082, 0.5631, 0.5575])
+ assert np.abs(image_slice - expected_slice).max() < 3e-3
+
+ def test_stable_diffusion_ddim(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", safety_checker=None)
+ sd_pipe.scheduler = DDIMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.38019, 0.28647, 0.27321, 0.40377, 0.38290, 0.35446, 0.39218, 0.38165, 0.42239])
+ assert np.abs(image_slice - expected_slice).max() < 1e-4
+
+ def test_stable_diffusion_lms(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", safety_checker=None)
+ sd_pipe.scheduler = LMSDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.10542, 0.09620, 0.07332, 0.09015, 0.09382, 0.07597, 0.08496, 0.07806, 0.06455])
+ assert np.abs(image_slice - expected_slice).max() < 3e-3
+
+ def test_stable_diffusion_dpm(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", safety_checker=None)
+ sd_pipe.scheduler = DPMSolverMultistepScheduler.from_config(
+ sd_pipe.scheduler.config,
+ final_sigmas_type="sigma_min",
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.03503, 0.03494, 0.01087, 0.03128, 0.02552, 0.00803, 0.00742, 0.00372, 0.00000])
+ assert np.abs(image_slice - expected_slice).max() < 3e-3
+
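+ # Descriptive note (added): attention slicing should cap peak GPU memory (here under 3.75 GB)
+ # while producing virtually the same image as the unsliced run.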
+ def test_stable_diffusion_attention_slicing(self):
+ torch.cuda.reset_peak_memory_stats()
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
+ pipe.unet.set_default_attn_processor()
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ # enable attention slicing
+ pipe.enable_attention_slicing()
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ image_sliced = pipe(**inputs).images
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+ # make sure that less than 3.75 GB is allocated
+ assert mem_bytes < 3.75 * 10**9
+
+ # disable slicing
+ pipe.disable_attention_slicing()
+ pipe.unet.set_default_attn_processor()
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ image = pipe(**inputs).images
+
+ # make sure that more than 3.75 GB is allocated
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes > 3.75 * 10**9
+ max_diff = numpy_cosine_similarity_distance(image_sliced.flatten(), image.flatten())
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_vae_slicing(self):
+ torch.cuda.reset_peak_memory_stats()
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ # enable vae slicing
+ pipe.enable_vae_slicing()
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ inputs["prompt"] = [inputs["prompt"]] * 4
+ inputs["latents"] = torch.cat([inputs["latents"]] * 4)
+ image_sliced = pipe(**inputs).images
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+ # make sure that less than 4 GB is allocated
+ assert mem_bytes < 4e9
+
+ # disable vae slicing
+ pipe.disable_vae_slicing()
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ inputs["prompt"] = [inputs["prompt"]] * 4
+ inputs["latents"] = torch.cat([inputs["latents"]] * 4)
+ image = pipe(**inputs).images
+
+ # make sure that more than 4 GB is allocated
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes > 4e9
+ # There is a small discrepancy at the image borders vs. a fully batched version.
+ max_diff = numpy_cosine_similarity_distance(image_sliced.flatten(), image.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_vae_tiling(self):
+ torch.cuda.reset_peak_memory_stats()
+ model_id = "CompVis/stable-diffusion-v1-4"
+ pipe = StableDiffusionPipeline.from_pretrained(
+ model_id, revision="fp16", torch_dtype=torch.float16, safety_checker=None
+ )
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+ pipe.unet = pipe.unet.to(memory_format=torch.channels_last)
+ pipe.vae = pipe.vae.to(memory_format=torch.channels_last)
+
+ prompt = "a photograph of an astronaut riding a horse"
+
+ # enable vae tiling
+ pipe.enable_vae_tiling()
+ pipe.enable_model_cpu_offload()
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output_chunked = pipe(
+ [prompt],
+ width=1024,
+ height=1024,
+ generator=generator,
+ guidance_scale=7.5,
+ num_inference_steps=2,
+ output_type="numpy",
+ )
+ image_chunked = output_chunked.images
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+
+ # disable vae tiling
+ pipe.disable_vae_tiling()
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe(
+ [prompt],
+ width=1024,
+ height=1024,
+ generator=generator,
+ guidance_scale=7.5,
+ num_inference_steps=2,
+ output_type="numpy",
+ )
+ image = output.images
+
+ assert mem_bytes < 1e10
+ max_diff = numpy_cosine_similarity_distance(image_chunked.flatten(), image.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_fp16_vs_autocast(self):
+ # this test makes sure that the original model with autocast
+ # and the model loaded in fp16 yield very similar results
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ image_fp16 = pipe(**inputs).images
+
+ with torch.autocast(torch_device):
+ inputs = self.get_inputs(torch_device)
+ image_autocast = pipe(**inputs).images
+
+ # Make sure results are close enough
+ diff = np.abs(image_fp16.flatten() - image_autocast.flatten())
+ # They ARE different since ops are not always run at the same precision;
+ # however, they should be extremely close.
+ assert diff.mean() < 2e-2
+
+ def test_stable_diffusion_intermediate_state(self):
+ number_of_steps = 0
+
+ def callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> None:
+ callback_fn.has_been_called = True
+ nonlocal number_of_steps
+ number_of_steps += 1
+ if step == 1:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [-0.5693, -0.3018, -0.9746, 0.0518, -0.8770, 0.7559, -1.7402, 0.1022, 1.1582]
+ )
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+ elif step == 2:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [-0.1958, -0.2993, -1.0166, -0.5005, -0.4810, 0.6162, -0.9492, 0.6621, 1.4492]
+ )
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+
+ callback_fn.has_been_called = False
+
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ pipe(**inputs, callback=callback_fn, callback_steps=1)
+ assert callback_fn.has_been_called
+ assert number_of_steps == inputs["num_inference_steps"]
+
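+ # Descriptive note (added): loading with low_cpu_mem_usage (the default) should be at least
+ # twice as fast as loading with low_cpu_mem_usage=False.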
+ def test_stable_diffusion_low_cpu_mem_usage(self):
+ pipeline_id = "CompVis/stable-diffusion-v1-4"
+
+ start_time = time.time()
+ pipeline_low_cpu_mem_usage = StableDiffusionPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16)
+ pipeline_low_cpu_mem_usage.to(torch_device)
+ low_cpu_mem_usage_time = time.time() - start_time
+
+ start_time = time.time()
+ _ = StableDiffusionPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16, low_cpu_mem_usage=False)
+ normal_load_time = time.time() - start_time
+
+ assert 2 * low_cpu_mem_usage_time < normal_load_time
+
+ def test_stable_diffusion_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ _ = pipe(**inputs)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.8 GB is allocated
+ assert mem_bytes < 2.8 * 10**9
+
+ def test_stable_diffusion_pipeline_with_model_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+
+ # Normal inference
+
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ torch_dtype=torch.float16,
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ outputs = pipe(**inputs)
+ mem_bytes = torch.cuda.max_memory_allocated()
+
+ # With model offloading
+
+ # Reload but don't move to cuda
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ torch_dtype=torch.float16,
+ )
+ pipe.unet.set_default_attn_processor()
+
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+
+ outputs_offloaded = pipe(**inputs)
+ mem_bytes_offloaded = torch.cuda.max_memory_allocated()
+
+ images = outputs.images
+ offloaded_images = outputs_offloaded.images
+
+ max_diff = numpy_cosine_similarity_distance(images.flatten(), offloaded_images.flatten())
+ assert max_diff < 1e-3
+ assert mem_bytes_offloaded < mem_bytes
+ assert mem_bytes_offloaded < 3.5 * 10**9
+ for module in pipe.text_encoder, pipe.unet, pipe.vae:
+ assert module.device == torch.device("cpu")
+
+ # With attention slicing
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe.enable_attention_slicing()
+ _ = pipe(**inputs)
+ mem_bytes_slicing = torch.cuda.max_memory_allocated()
+
+ assert mem_bytes_slicing < mem_bytes_offloaded
+ assert mem_bytes_slicing < 3 * 10**9
+
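+ # Descriptive note (added): textual inversion embeddings in both the diffusers and A1111
+ # formats should load and steer generation toward the reference image.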
+ def test_stable_diffusion_textual_inversion(self):
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+ pipe.load_textual_inversion("sd-concepts-library/low-poly-hd-logos-icons")
+
+ a111_file = hf_hub_download("hf-internal-testing/text_inv_embedding_a1111_format", "winter_style.pt")
+ a111_file_neg = hf_hub_download(
+ "hf-internal-testing/text_inv_embedding_a1111_format", "winter_style_negative.pt"
+ )
+ pipe.load_textual_inversion(a111_file)
+ pipe.load_textual_inversion(a111_file_neg)
+ pipe.to("cuda")
+
+ generator = torch.Generator(device="cpu").manual_seed(1)
+
+ prompt = "An logo of a turtle in strong Style-Winter with "
+ neg_prompt = "Style-Winter-neg"
+
+ image = pipe(prompt=prompt, negative_prompt=neg_prompt, generator=generator, output_type="np").images[0]
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/text_inv/winter_logo_style.npy"
+ )
+
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 8e-1
+
+ def test_stable_diffusion_textual_inversion_with_model_cpu_offload(self):
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+ pipe.enable_model_cpu_offload()
+ pipe.load_textual_inversion("sd-concepts-library/low-poly-hd-logos-icons")
+
+ a111_file = hf_hub_download("hf-internal-testing/text_inv_embedding_a1111_format", "winter_style.pt")
+ a111_file_neg = hf_hub_download(
+ "hf-internal-testing/text_inv_embedding_a1111_format", "winter_style_negative.pt"
+ )
+ pipe.load_textual_inversion(a111_file)
+ pipe.load_textual_inversion(a111_file_neg)
+
+ generator = torch.Generator(device="cpu").manual_seed(1)
+
+ prompt = "An logo of a turtle in strong Style-Winter with "
+ neg_prompt = "Style-Winter-neg"
+
+ image = pipe(prompt=prompt, negative_prompt=neg_prompt, generator=generator, output_type="np").images[0]
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/text_inv/winter_logo_style.npy"
+ )
+
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 8e-1
+
+ def test_stable_diffusion_textual_inversion_with_sequential_cpu_offload(self):
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+ pipe.enable_sequential_cpu_offload()
+ pipe.load_textual_inversion("sd-concepts-library/low-poly-hd-logos-icons")
+
+ a111_file = hf_hub_download("hf-internal-testing/text_inv_embedding_a1111_format", "winter_style.pt")
+ a111_file_neg = hf_hub_download(
+ "hf-internal-testing/text_inv_embedding_a1111_format", "winter_style_negative.pt"
+ )
+ pipe.load_textual_inversion(a111_file)
+ pipe.load_textual_inversion(a111_file_neg)
+
+ generator = torch.Generator(device="cpu").manual_seed(1)
+
+ prompt = "An logo of a turtle in strong Style-Winter with "
+ neg_prompt = "Style-Winter-neg"
+
+ image = pipe(prompt=prompt, negative_prompt=neg_prompt, generator=generator, output_type="np").images[0]
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/text_inv/winter_logo_style.npy"
+ )
+
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 8e-1
+
+ @require_python39_or_higher
+ @require_torch_2
+ def test_stable_diffusion_compile(self):
+ seed = 0
+ inputs = self.get_inputs(torch_device, seed=seed)
+ # Can't pickle a Generator object
+ del inputs["generator"]
+ inputs["torch_device"] = torch_device
+ inputs["seed"] = seed
+ run_test_in_subprocess(test_case=self, target_func=_test_stable_diffusion_compile, inputs=inputs)
+
+ def test_stable_diffusion_lcm(self):
+ unet = UNet2DConditionModel.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", subfolder="unet")
+ sd_pipe = StableDiffusionPipeline.from_pretrained("Lykon/dreamshaper-7", unet=unet).to(torch_device)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 6
+ inputs["output_type"] = "pil"
+
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/lcm_full/stable_diffusion_lcm.png"
+ )
+
+ image = sd_pipe.image_processor.pil_to_numpy(image)
+ expected_image = sd_pipe.image_processor.pil_to_numpy(expected_image)
+
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), expected_image.flatten())
+
+ assert max_diff < 1e-2
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionPipelineCkptTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_download_from_hub(self):
+ ckpt_paths = [
+ "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt",
+ "https://huggingface.co/WarriorMama777/OrangeMixs/blob/main/Models/AbyssOrangeMix/AbyssOrangeMix_base.ckpt",
+ ]
+
+ for ckpt_path in ckpt_paths:
+ pipe = StableDiffusionPipeline.from_single_file(ckpt_path, torch_dtype=torch.float16)
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.to("cuda")
+
+ image_out = pipe("test", num_inference_steps=1, output_type="np").images[0]
+
+ assert image_out.shape == (512, 512, 3)
+
+ def test_download_local(self):
+ ckpt_filename = hf_hub_download("runwayml/stable-diffusion-v1-5", filename="v1-5-pruned-emaonly.ckpt")
+ config_filename = hf_hub_download("runwayml/stable-diffusion-v1-5", filename="v1-inference.yaml")
+
+ pipe = StableDiffusionPipeline.from_single_file(
+ ckpt_filename, config_files={"v1": config_filename}, torch_dtype=torch.float16
+ )
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.to("cuda")
+
+ image_out = pipe("test", num_inference_steps=1, output_type="np").images[0]
+
+ assert image_out.shape == (512, 512, 3)
+
+ def test_download_ckpt_diff_format_is_same(self):
+ ckpt_path = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt"
+
+ sf_pipe = StableDiffusionPipeline.from_single_file(ckpt_path)
+ sf_pipe.scheduler = DDIMScheduler.from_config(sf_pipe.scheduler.config)
+ sf_pipe.unet.set_attn_processor(AttnProcessor())
+ sf_pipe.to("cuda")
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image_single_file = sf_pipe("a turtle", num_inference_steps=2, generator=generator, output_type="np").images[0]
+
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_attn_processor(AttnProcessor())
+ pipe.to("cuda")
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image = pipe("a turtle", num_inference_steps=2, generator=generator, output_type="np").images[0]
+
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), image_single_file.flatten())
+
+ assert max_diff < 1e-3
+
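+ # Descriptive note (added): component configs obtained via from_single_file should match the
+ # ones from from_pretrained, apart from bookkeeping keys.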
+ def test_single_file_component_configs(self):
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+
+ ckpt_path = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt"
+ single_file_pipe = StableDiffusionPipeline.from_single_file(ckpt_path, load_safety_checker=True)
+
+ for param_name, param_value in single_file_pipe.text_encoder.config.to_dict().items():
+ if param_name in ["torch_dtype", "architectures", "_name_or_path"]:
+ continue
+ assert pipe.text_encoder.config.to_dict()[param_name] == param_value
+
+ PARAMS_TO_IGNORE = ["torch_dtype", "_name_or_path", "architectures", "_use_default_values"]
+ for param_name, param_value in single_file_pipe.unet.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.unet.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.vae.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.vae.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.safety_checker.config.to_dict().items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.safety_checker.config.to_dict()[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "a photograph of an astronaut riding a horse",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 50,
+ "guidance_scale": 7.5,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_stable_diffusion_1_4_pndm(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_text2img/stable_diffusion_1_4_pndm.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_1_5_pndm(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_text2img/stable_diffusion_1_5_pndm.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_ddim(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(torch_device)
+ sd_pipe.scheduler = DDIMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_text2img/stable_diffusion_1_4_ddim.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 3e-3
+
+ def test_stable_diffusion_lms(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(torch_device)
+ sd_pipe.scheduler = LMSDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_text2img/stable_diffusion_1_4_lms.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_euler(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(torch_device)
+ sd_pipe.scheduler = EulerDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_text2img/stable_diffusion_1_4_euler.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/stable_diffusion/test_stable_diffusion_img2img.py b/tests/pipelines/stable_diffusion/test_stable_diffusion_img2img.py
new file mode 100644
index 0000000..4483fd8
--- /dev/null
+++ b/tests/pipelines/stable_diffusion/test_stable_diffusion_img2img.py
@@ -0,0 +1,735 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import traceback
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ AutoencoderTiny,
+ DDIMScheduler,
+ DPMSolverMultistepScheduler,
+ HeunDiscreteScheduler,
+ LCMScheduler,
+ LMSDiscreteScheduler,
+ PNDMScheduler,
+ StableDiffusionImg2ImgPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ require_python39_or_higher,
+ require_torch_2,
+ require_torch_gpu,
+ run_test_in_subprocess,
+ skip_mps,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+)
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+# Will be run via run_test_in_subprocess
+def _test_img2img_compile(in_queue, out_queue, timeout):
+ error = None
+ try:
+ inputs = in_queue.get(timeout=timeout)
+ torch_device = inputs.pop("torch_device")
+ seed = inputs.pop("seed")
+ inputs["generator"] = torch.Generator(device=torch_device).manual_seed(seed)
+
+ pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", safety_checker=None)
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.unet.to(memory_format=torch.channels_last)
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 768, 3)
+ expected_slice = np.array([0.0606, 0.0570, 0.0805, 0.0579, 0.0628, 0.0623, 0.0843, 0.1115, 0.0806])
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+ except Exception:
+ error = f"{traceback.format_exc()}"
+
+ results = {"error": error}
+ out_queue.put(results, timeout=timeout)
+ out_queue.join()
+
+
+class StableDiffusionImg2ImgPipelineFastTests(
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width"}
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ time_cond_proj_dim=time_cond_proj_dim,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_tiny_autoencoder(self):
+ return AutoencoderTiny(in_channels=3, out_channels=3, latent_channels=4)
+
+ def get_dummy_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image / 2 + 0.5
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_img2img_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.4555, 0.3216, 0.4049, 0.4620, 0.4618, 0.4126, 0.4122, 0.4629, 0.4579])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_img2img_default_case_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionImg2ImgPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.5709, 0.4614, 0.4587, 0.5978, 0.5298, 0.6910, 0.6240, 0.5212, 0.5454])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_img2img_default_case_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionImg2ImgPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.5709, 0.4614, 0.4587, 0.5978, 0.5298, 0.6910, 0.6240, 0.5212, 0.5454])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_img2img_negative_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ negative_prompt = "french fries"
+ output = sd_pipe(**inputs, negative_prompt=negative_prompt)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.4593, 0.3408, 0.4232, 0.4749, 0.4476, 0.4115, 0.4357, 0.4733, 0.4663])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_img2img_multiple_init_images(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["prompt"] = [inputs["prompt"]] * 2
+ inputs["image"] = inputs["image"].repeat(2, 1, 1, 1)
+ image = sd_pipe(**inputs).images
+ image_slice = image[-1, -3:, -3:, -1]
+
+ assert image.shape == (2, 32, 32, 3)
+ expected_slice = np.array([0.4241, 0.5576, 0.5711, 0.4792, 0.4311, 0.5952, 0.5827, 0.5138, 0.5109])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_img2img_k_lms(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = LMSDiscreteScheduler(
+ beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear"
+ )
+ sd_pipe = StableDiffusionImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.4398, 0.4949, 0.4337, 0.6580, 0.5555, 0.4338, 0.5769, 0.5955, 0.5175])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
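+ # Descriptive note (added): swapping the VAE for AutoencoderTiny should still produce images
+ # of the expected shape and values.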
+ def test_stable_diffusion_img2img_tiny_autoencoder(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionImg2ImgPipeline(**components)
+ sd_pipe.vae = self.get_dummy_tiny_autoencoder()
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.00669, 0.00669, 0.0, 0.00693, 0.00858, 0.0, 0.00567, 0.00515, 0.00125])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ @skip_mps
+ def test_save_load_local(self):
+ return super().test_save_load_local()
+
+ @skip_mps
+ def test_dict_tuple_outputs_equivalent(self):
+ return super().test_dict_tuple_outputs_equivalent()
+
+ @skip_mps
+ def test_save_load_optional_components(self):
+ return super().test_save_load_optional_components()
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ return super().test_attention_slicing_forward_pass(expected_max_diff=5e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
+
+ def test_pipeline_interrupt(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ prompt = "hey"
+ num_inference_steps = 3
+
+ # store intermediate latents from the generation process
+ class PipelineState:
+ def __init__(self):
+ self.state = []
+
+ def apply(self, pipe, i, t, callback_kwargs):
+ self.state.append(callback_kwargs["latents"])
+ return callback_kwargs
+
+ pipe_state = PipelineState()
+ sd_pipe(
+ prompt,
+ image=inputs["image"],
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=pipe_state.apply,
+ ).images
+
+ # interrupt generation at step index
+ interrupt_step_idx = 1
+
+ def callback_on_step_end(pipe, i, t, callback_kwargs):
+ if i == interrupt_step_idx:
+ pipe._interrupt = True
+
+ return callback_kwargs
+
+ output_interrupted = sd_pipe(
+ prompt,
+ image=inputs["image"],
+ num_inference_steps=num_inference_steps,
+ output_type="latent",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=callback_on_step_end,
+ ).images
+
+ # fetch intermediate latents at the interrupted step
+ # from the completed generation process
+ intermediate_latent = pipe_state.state[interrupt_step_idx]
+
+ # compare the intermediate latent to the output of the interrupted process
+ # they should be the same
+ assert torch.allclose(intermediate_latent, output_interrupted, atol=1e-4)
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionImg2ImgPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_img2img/sketch-mountains-input.png"
+ )
+ inputs = {
+ "prompt": "a fantasy landscape, concept art, high resolution",
+ "image": init_image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "strength": 0.75,
+ "guidance_scale": 7.5,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_stable_diffusion_img2img_default(self):
+ pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 768, 3)
+ expected_slice = np.array([0.4300, 0.4662, 0.4930, 0.3990, 0.4307, 0.4525, 0.3719, 0.4064, 0.3923])
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_stable_diffusion_img2img_k_lms(self):
+ pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", safety_checker=None)
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 768, 3)
+ expected_slice = np.array([0.0389, 0.0346, 0.0415, 0.0290, 0.0218, 0.0210, 0.0408, 0.0567, 0.0271])
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_stable_diffusion_img2img_ddim(self):
+ pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", safety_checker=None)
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 768, 3)
+ expected_slice = np.array([0.0593, 0.0607, 0.0851, 0.0582, 0.0636, 0.0721, 0.0751, 0.0981, 0.0781])
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_stable_diffusion_img2img_intermediate_state(self):
+ number_of_steps = 0
+
+ def callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> None:
+ callback_fn.has_been_called = True
+ nonlocal number_of_steps
+ number_of_steps += 1
+ if step == 1:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 96)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array([-0.4958, 0.5107, 1.1045, 2.7539, 4.6680, 3.8320, 1.5049, 1.8633, 2.6523])
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+ elif step == 2:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 96)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array([-0.4956, 0.5078, 1.0918, 2.7520, 4.6484, 3.8125, 1.5146, 1.8633, 2.6367])
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+
+ callback_fn.has_been_called = False
+
+ pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ pipe(**inputs, callback=callback_fn, callback_steps=1)
+ assert callback_fn.has_been_called
+ assert number_of_steps == 2
+
+ def test_stable_diffusion_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
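+ # sequential offload moves each submodule to the GPU only while it executes, minimizing peak VRAM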
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ _ = pipe(**inputs)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.2 GB is allocated
+ assert mem_bytes < 2.2 * 10**9
+
+ def test_stable_diffusion_pipeline_with_model_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+
+ # Normal inference
+
+ pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ safety_checker=None,
+ torch_dtype=torch.float16,
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe(**inputs)
+ mem_bytes = torch.cuda.max_memory_allocated()
+
+ # With model offloading
+
+ # Reload but don't move to cuda
+ pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ safety_checker=None,
+ torch_dtype=torch.float16,
+ )
+
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe.enable_model_cpu_offload()
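+ # model offload keeps only the currently active component (text encoder, UNet, or VAE) on the GPU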
+ pipe.set_progress_bar_config(disable=None)
+ _ = pipe(**inputs)
+ mem_bytes_offloaded = torch.cuda.max_memory_allocated()
+
+ assert mem_bytes_offloaded < mem_bytes
+ for module in pipe.text_encoder, pipe.unet, pipe.vae:
+ assert module.device == torch.device("cpu")
+
+ def test_img2img_2nd_order(self):
+ sd_pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
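+ # Heun is a second-order scheduler (two model evaluations per step); the runs below compare
+ # 10 and 11 inference steps to check that results stay consistent across step counts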
+ sd_pipe.scheduler = HeunDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 10
+ inputs["strength"] = 0.75
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/img2img/img2img_heun.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 5e-2
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 11
+ inputs["strength"] = 0.75
+ image_other = sd_pipe(**inputs).images[0]
+
+ mean_diff = np.abs(image - image_other).mean()
+
+ # images should be very similar
+ assert mean_diff < 5e-2
+
+ def test_stable_diffusion_img2img_pipeline_multiple_of_8(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/img2img/sketch-mountains-input.jpg"
+ )
+ # resize to a resolution that is divisible by 8 but not by 16 or 32
+ init_image = init_image.resize((760, 504))
+
+ model_id = "CompVis/stable-diffusion-v1-4"
+ pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
+ model_id,
+ safety_checker=None,
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ prompt = "A fantasy landscape, trending on artstation"
+
+ generator = torch.manual_seed(0)
+ output = pipe(
+ prompt=prompt,
+ image=init_image,
+ strength=0.75,
+ guidance_scale=7.5,
+ generator=generator,
+ output_type="np",
+ )
+ image = output.images[0]
+
+ image_slice = image[255:258, 383:386, -1]
+
+ assert image.shape == (504, 760, 3)
+ expected_slice = np.array([0.9393, 0.9500, 0.9399, 0.9438, 0.9458, 0.9400, 0.9455, 0.9414, 0.9423])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-3
+
+ def test_img2img_safety_checker_works(self):
+ sd_pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 20
+ # make sure the safety checker is activated
+ inputs["prompt"] = "naked, sex, porn"
+ out = sd_pipe(**inputs)
+
+ assert out.nsfw_content_detected[0], f"Safety checker should work for prompt: {inputs['prompt']}"
+ assert np.abs(out.images[0]).sum() < 1e-5 # should be all zeros
+
+ @require_python39_or_higher
+ @require_torch_2
+ def test_img2img_compile(self):
+ seed = 0
+ inputs = self.get_inputs(torch_device, seed=seed)
+ # Can't pickle a Generator object
+ del inputs["generator"]
+ inputs["torch_device"] = torch_device
+ inputs["seed"] = seed
+ run_test_in_subprocess(test_case=self, target_func=_test_img2img_compile, inputs=inputs)
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionImg2ImgPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_img2img/sketch-mountains-input.png"
+ )
+ inputs = {
+ "prompt": "a fantasy landscape, concept art, high resolution",
+ "image": init_image,
+ "generator": generator,
+ "num_inference_steps": 50,
+ "strength": 0.75,
+ "guidance_scale": 7.5,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_img2img_pndm(self):
+ sd_pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_img2img/stable_diffusion_1_5_pndm.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_img2img_ddim(self):
+ sd_pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ sd_pipe.scheduler = DDIMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_img2img/stable_diffusion_1_5_ddim.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_img2img_lms(self):
+ sd_pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ sd_pipe.scheduler = LMSDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_img2img/stable_diffusion_1_5_lms.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_img2img_dpm(self):
+ sd_pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ sd_pipe.scheduler = DPMSolverMultistepScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 30
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_img2img/stable_diffusion_1_5_dpm.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint.py b/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint.py
new file mode 100644
index 0000000..218ac3e
--- /dev/null
+++ b/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint.py
@@ -0,0 +1,1668 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import traceback
+import unittest
+
+import numpy as np
+import torch
+from huggingface_hub import hf_hub_download
+from PIL import Image
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AsymmetricAutoencoderKL,
+ AutoencoderKL,
+ DDIMScheduler,
+ DPMSolverMultistepScheduler,
+ LCMScheduler,
+ LMSDiscreteScheduler,
+ PNDMScheduler,
+ StableDiffusionInpaintPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.models.attention_processor import AttnProcessor
+from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_inpaint import prepare_mask_and_masked_image
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ numpy_cosine_similarity_distance,
+ require_python39_or_higher,
+ require_torch_2,
+ require_torch_gpu,
+ run_test_in_subprocess,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_INPAINTING_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+)
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+)
+
+
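+# make CUDA/cuDNN ops deterministic so the expected slices below are reproducible across runs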
+enable_full_determinism()
+
+
+# Will be run via run_test_in_subprocess
+def _test_inpaint_compile(in_queue, out_queue, timeout):
+ error = None
+ try:
+ inputs = in_queue.get(timeout=timeout)
+ torch_device = inputs.pop("torch_device")
+ seed = inputs.pop("seed")
+ inputs["generator"] = torch.Generator(device=torch_device).manual_seed(seed)
+
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe.unet.to(memory_format=torch.channels_last)
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0689, 0.0699, 0.0790, 0.0536, 0.0470, 0.0488, 0.041, 0.0508, 0.04179])
+ assert np.abs(expected_slice - image_slice).max() < 3e-3
+ except Exception:
+ error = f"{traceback.format_exc()}"
+
+ results = {"error": error}
+ out_queue.put(results, timeout=timeout)
+ out_queue.join()
+
+
+class StableDiffusionInpaintPipelineFastTests(
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionInpaintPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ image_params = frozenset([])
+ # TODO: update image_params once the pipeline is refactored with VaeImageProcessor.preprocess
+ image_latents_params = frozenset([])
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS.union({"mask", "masked_image_latents"})
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
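+ # the dedicated inpainting UNet takes 9 input channels: 4 noisy latents + 4 masked-image latents + 1 mask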
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ time_cond_proj_dim=time_cond_proj_dim,
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=9,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0, img_res=64, output_pil=True):
+ # TODO: use tensor inputs instead of PIL; this is here just to leave the old expected_slices untouched
+ if output_pil:
+ # Get random floats in [0, 1] as image
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ mask_image = torch.ones_like(image)
+ # Convert image and mask_image to [0, 255]
+ image = 255 * image
+ mask_image = 255 * mask_image
+ # Convert to PIL image
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((img_res, img_res))
+ mask_image = Image.fromarray(np.uint8(mask_image)).convert("RGB").resize((img_res, img_res))
+ else:
+ # Get random floats in [0, 1] as image with spatial size (img_res, img_res)
+ image = floats_tensor((1, 3, img_res, img_res), rng=random.Random(seed)).to(device)
+ # Convert image to [-1, 1]
+ init_image = 2.0 * image - 1.0
+ mask_image = torch.ones((1, 1, img_res, img_res), device=device)
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": init_image,
+ "mask_image": mask_image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_inpaint(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4703, 0.5697, 0.3879, 0.5470, 0.6042, 0.4413, 0.5078, 0.4728, 0.4469])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_inpaint_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4931, 0.5988, 0.4569, 0.5556, 0.6650, 0.5087, 0.5966, 0.5358, 0.5269])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_inpaint_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4931, 0.5988, 0.4569, 0.5556, 0.6650, 0.5087, 0.5966, 0.5358, 0.5269])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_inpaint_image_tensor(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ out_pil = output.images
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["image"] = torch.tensor(np.array(inputs["image"]) / 127.5 - 1).permute(2, 0, 1).unsqueeze(0)
+ inputs["mask_image"] = torch.tensor(np.array(inputs["mask_image"]) / 255).permute(2, 0, 1)[:1].unsqueeze(0)
+ output = sd_pipe(**inputs)
+ out_tensor = output.images
+
+ assert out_pil.shape == (1, 64, 64, 3)
+ assert np.abs(out_pil.flatten() - out_tensor.flatten()).max() < 5e-2
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ def test_stable_diffusion_inpaint_strength_zero_test(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+
+ # check that the pipeline raises a ValueError when strength is so low that the effective num_inference_steps drops below 1
+ inputs["strength"] = 0.01
+ with self.assertRaises(ValueError):
+ sd_pipe(**inputs).images
+
+ def test_stable_diffusion_inpaint_mask_latents(self):
+ device = "cpu"
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # normal mask + normal image
+ ## `image`: PIL, `mask_image`: PIL, `masked_image_latents`: None
+ inputs = self.get_dummy_inputs(device)
+ inputs["strength"] = 0.9
+ out_0 = sd_pipe(**inputs).images
+
+ # image latents + mask latents
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe.image_processor.preprocess(inputs["image"]).to(sd_pipe.device)
+ mask = sd_pipe.mask_processor.preprocess(inputs["mask_image"]).to(sd_pipe.device)
+ masked_image = image * (mask < 0.5)
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_latents = (
+ sd_pipe.vae.encode(image).latent_dist.sample(generator=generator) * sd_pipe.vae.config.scaling_factor
+ )
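+ # this call only advances the generator state; the sampled tensor itself is discarded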
+ torch.randn((1, 4, 32, 32), generator=generator)
+ mask_latents = (
+ sd_pipe.vae.encode(masked_image).latent_dist.sample(generator=generator)
+ * sd_pipe.vae.config.scaling_factor
+ )
+ inputs["image"] = image_latents
+ inputs["masked_image_latents"] = mask_latents
+ inputs["mask_image"] = mask
+ inputs["strength"] = 0.9
+ generator = torch.Generator(device=device).manual_seed(0)
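+ # draw (and discard) one sample to advance the fresh generator before handing it to the pipeline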
+ torch.randn((1, 4, 32, 32), generator=generator)
+ inputs["generator"] = generator
+ out_1 = sd_pipe(**inputs).images
+ assert np.abs(out_0 - out_1).max() < 1e-2
+
+ def test_pipeline_interrupt(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ prompt = "hey"
+ num_inference_steps = 3
+
+ # store intermediate latents from the generation process
+ class PipelineState:
+ def __init__(self):
+ self.state = []
+
+ def apply(self, pipe, i, t, callback_kwargs):
+ self.state.append(callback_kwargs["latents"])
+ return callback_kwargs
+
+ pipe_state = PipelineState()
+ sd_pipe(
+ prompt,
+ image=inputs["image"],
+ mask_image=inputs["mask_image"],
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=pipe_state.apply,
+ ).images
+
+ # interrupt generation at this step index
+ interrupt_step_idx = 1
+
+ def callback_on_step_end(pipe, i, t, callback_kwargs):
+ if i == interrupt_step_idx:
+ pipe._interrupt = True
+
+ return callback_kwargs
+
+ output_interrupted = sd_pipe(
+ prompt,
+ image=inputs["image"],
+ mask_image=inputs["mask_image"],
+ num_inference_steps=num_inference_steps,
+ output_type="latent",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=callback_on_step_end,
+ ).images
+
+ # fetch intermediate latents at the interrupted step
+ # from the completed generation process
+ intermediate_latent = pipe_state.state[interrupt_step_idx]
+
+ # compare the intermediate latent to the output of the interrupted process
+ # they should be the same
+ assert torch.allclose(intermediate_latent, output_interrupted, atol=1e-4)
+
+
+class StableDiffusionSimpleInpaintPipelineFastTests(StableDiffusionInpaintPipelineFastTests):
+ pipeline_class = StableDiffusionInpaintPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ image_params = frozenset([])
+ # TODO: update image_params once the pipeline is refactored with VaeImageProcessor.preprocess
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
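+ # a plain text-to-image UNet (in_channels=4): the pipeline then inpaints by blending latents with
+ # the mask at each step instead of relying on a dedicated 9-channel inpainting UNet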
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ time_cond_proj_dim=time_cond_proj_dim,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs_2images(self, device, seed=0, img_res=64):
+ # Get random floats in [0, 1] as image with spatial size (img_res, img_res)
+ image1 = floats_tensor((1, 3, img_res, img_res), rng=random.Random(seed)).to(device)
+ image2 = floats_tensor((1, 3, img_res, img_res), rng=random.Random(seed + 22)).to(device)
+ # Convert images to [-1, 1]
+ init_image1 = 2.0 * image1 - 1.0
+ init_image2 = 2.0 * image2 - 1.0
+
+ # empty mask
+ mask_image = torch.zeros((1, 1, img_res, img_res), device=device)
+
+ if str(device).startswith("mps"):
+ generator1 = torch.manual_seed(seed)
+ generator2 = torch.manual_seed(seed)
+ else:
+ generator1 = torch.Generator(device=device).manual_seed(seed)
+ generator2 = torch.Generator(device=device).manual_seed(seed)
+
+ inputs = {
+ "prompt": ["A painting of a squirrel eating a burger"] * 2,
+ "image": [init_image1, init_image2],
+ "mask_image": [mask_image] * 2,
+ "generator": [generator1, generator2],
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_inpaint(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.6584, 0.5424, 0.5649, 0.5449, 0.5897, 0.6111, 0.5404, 0.5463, 0.5214])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_inpaint_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.6240, 0.5355, 0.5649, 0.5378, 0.5374, 0.6242, 0.5132, 0.5347, 0.5396])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_inpaint_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.6240, 0.5355, 0.5649, 0.5378, 0.5374, 0.6242, 0.5132, 0.5347, 0.5396])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_inpaint_2_images(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # test to confirm that passing the same image twice produces identical outputs
+ inputs = self.get_dummy_inputs(device)
+ gen1 = torch.Generator(device=device).manual_seed(0)
+ gen2 = torch.Generator(device=device).manual_seed(0)
+ for name in ["prompt", "image", "mask_image"]:
+ inputs[name] = [inputs[name]] * 2
+ inputs["generator"] = [gen1, gen2]
+ images = sd_pipe(**inputs).images
+
+ assert images.shape == (2, 64, 64, 3)
+
+ image_slice1 = images[0, -3:, -3:, -1]
+ image_slice2 = images[1, -3:, -3:, -1]
+ assert np.abs(image_slice1.flatten() - image_slice2.flatten()).max() < 1e-4
+
+ # test to confirm that passing two different images produces different outputs
+ inputs = self.get_dummy_inputs_2images(device)
+ images = sd_pipe(**inputs).images
+ assert images.shape == (2, 64, 64, 3)
+
+ image_slice1 = images[0, -3:, -3:, -1]
+ image_slice2 = images[1, -3:, -3:, -1]
+ assert np.abs(image_slice1.flatten() - image_slice2.flatten()).max() > 1e-2
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionInpaintPipelineSlowTests(unittest.TestCase):
+ def setUp(self):
+ super().setUp()
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/input_bench_image.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/input_bench_mask.png"
+ )
+ inputs = {
+ "prompt": "Face of a yellow cat, high resolution, sitting on a park bench",
+ "image": init_image,
+ "mask_image": mask_image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_inpaint_ddim(self):
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0427, 0.0460, 0.0483, 0.0460, 0.0584, 0.0521, 0.1549, 0.1695, 0.1794])
+
+ assert np.abs(expected_slice - image_slice).max() < 6e-4
+
+ def test_stable_diffusion_inpaint_fp16(self):
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, safety_checker=None
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.1509, 0.1245, 0.1672, 0.1655, 0.1519, 0.1226, 0.1462, 0.1567, 0.2451])
+ assert np.abs(expected_slice - image_slice).max() < 1e-1
+
+ def test_stable_diffusion_inpaint_pndm(self):
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0425, 0.0273, 0.0344, 0.1694, 0.1727, 0.1812, 0.3256, 0.3311, 0.3272])
+
+ assert np.abs(expected_slice - image_slice).max() < 5e-3
+
+ def test_stable_diffusion_inpaint_k_lms(self):
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.9314, 0.7575, 0.9432, 0.8885, 0.9028, 0.7298, 0.9811, 0.9667, 0.7633])
+
+ assert np.abs(expected_slice - image_slice).max() < 6e-3
+
+ def test_stable_diffusion_inpaint_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ _ = pipe(**inputs)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.2 GB is allocated
+ assert mem_bytes < 2.2 * 10**9
+
+ @require_python39_or_higher
+ @require_torch_2
+ def test_inpaint_compile(self):
+ seed = 0
+ inputs = self.get_inputs(torch_device, seed=seed)
+ # Can't pickle a Generator object
+ del inputs["generator"]
+ inputs["torch_device"] = torch_device
+ inputs["seed"] = seed
+ run_test_in_subprocess(test_case=self, target_func=_test_inpaint_compile, inputs=inputs)
+
+ def test_stable_diffusion_inpaint_pil_input_resolution_test(self):
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ # change the input image to a size (127x127) that is not a multiple of 8 and would otherwise cause a tensor mismatch error
+ inputs["image"] = inputs["image"].resize((127, 127))
+ inputs["mask_image"] = inputs["mask_image"].resize((127, 127))
+ inputs["height"] = 128
+ inputs["width"] = 128
+ image = pipe(**inputs).images
+ # verify that the returned image matches the requested height and width
+ assert image.shape == (1, inputs["height"], inputs["width"], 3)
+
+ def test_stable_diffusion_inpaint_strength_test(self):
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ # change input strength
+ inputs["strength"] = 0.75
+ image = pipe(**inputs).images
+ # verify that the returned image has the expected 512x512 resolution
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+ expected_slice = np.array([0.2728, 0.2803, 0.2665, 0.2511, 0.2774, 0.2586, 0.2391, 0.2392, 0.2582])
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_stable_diffusion_simple_inpaint_ddim(self):
+ pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", safety_checker=None)
+ pipe.unet.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.3757, 0.3875, 0.4445, 0.4353, 0.3780, 0.4513, 0.3965, 0.3984, 0.4362])
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_download_local(self):
+ filename = hf_hub_download("runwayml/stable-diffusion-inpainting", filename="sd-v1-5-inpainting.ckpt")
+
+ pipe = StableDiffusionInpaintPipeline.from_single_file(filename, torch_dtype=torch.float16)
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.to("cuda")
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 1
+ image_out = pipe(**inputs).images[0]
+
+ assert image_out.shape == (512, 512, 3)
+
+ def test_download_ckpt_diff_format_is_same(self):
+ ckpt_path = "https://huggingface.co/runwayml/stable-diffusion-inpainting/blob/main/sd-v1-5-inpainting.ckpt"
+
+ pipe = StableDiffusionInpaintPipeline.from_single_file(ckpt_path)
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_attn_processor(AttnProcessor())
+ pipe.to("cuda")
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 5
+ image_ckpt = pipe(**inputs).images[0]
+
+ pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_attn_processor(AttnProcessor())
+ pipe.to("cuda")
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 5
+ image = pipe(**inputs).images[0]
+
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), image_ckpt.flatten())
+
+ assert max_diff < 1e-4
+
+ def test_single_file_component_configs(self):
+ pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting", variant="fp16")
+
+ ckpt_path = "https://huggingface.co/runwayml/stable-diffusion-inpainting/blob/main/sd-v1-5-inpainting.ckpt"
+ single_file_pipe = StableDiffusionInpaintPipeline.from_single_file(ckpt_path, load_safety_checker=True)
+
+ for param_name, param_value in single_file_pipe.text_encoder.config.to_dict().items():
+ if param_name in ["torch_dtype", "architectures", "_name_or_path"]:
+ continue
+ assert pipe.text_encoder.config.to_dict()[param_name] == param_value
+
+ PARAMS_TO_IGNORE = ["torch_dtype", "_name_or_path", "architectures", "_use_default_values"]
+ for param_name, param_value in single_file_pipe.unet.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.unet.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.vae.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.vae.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.safety_checker.config.to_dict().items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.safety_checker.config.to_dict()[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionInpaintPipelineAsymmetricAutoencoderKLSlowTests(unittest.TestCase):
+ def setUp(self):
+ super().setUp()
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/input_bench_image.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/input_bench_mask.png"
+ )
+ inputs = {
+ "prompt": "Face of a yellow cat, high resolution, sitting on a park bench",
+ "image": init_image,
+ "mask_image": mask_image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_inpaint_ddim(self):
+ vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.vae = vae
+ pipe.unet.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0522, 0.0604, 0.0596, 0.0449, 0.0493, 0.0427, 0.1186, 0.1289, 0.1442])
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_stable_diffusion_inpaint_fp16(self):
+ vae = AsymmetricAutoencoderKL.from_pretrained(
+ "cross-attention/asymmetric-autoencoder-kl-x-1-5", torch_dtype=torch.float16
+ )
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, safety_checker=None
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.vae = vae
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.1343, 0.1406, 0.1440, 0.1504, 0.1729, 0.0989, 0.1807, 0.2822, 0.1179])
+
+ assert np.abs(expected_slice - image_slice).max() < 5e-2
+
+ def test_stable_diffusion_inpaint_pndm(self):
+ vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.vae = vae
+ pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0966, 0.1083, 0.1148, 0.1422, 0.1318, 0.1197, 0.3702, 0.3537, 0.3288])
+
+ assert np.abs(expected_slice - image_slice).max() < 5e-3
+
+ def test_stable_diffusion_inpaint_k_lms(self):
+ vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.vae = vae
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.8931, 0.8683, 0.8965, 0.8501, 0.8592, 0.9118, 0.8734, 0.7463, 0.8990])
+ assert np.abs(expected_slice - image_slice).max() < 6e-3
+
+ def test_stable_diffusion_inpaint_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ vae = AsymmetricAutoencoderKL.from_pretrained(
+ "cross-attention/asymmetric-autoencoder-kl-x-1-5", torch_dtype=torch.float16
+ )
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe.vae = vae
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ _ = pipe(**inputs)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.45 GB is allocated
+ assert mem_bytes < 2.45 * 10**9
+
+ @require_python39_or_higher
+ @require_torch_2
+ def test_inpaint_compile(self):
+ pass
+
+ def test_stable_diffusion_inpaint_pil_input_resolution_test(self):
+ vae = AsymmetricAutoencoderKL.from_pretrained(
+ "cross-attention/asymmetric-autoencoder-kl-x-1-5",
+ )
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.vae = vae
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ # change the input image to a size (127x127) that is not a multiple of 8 and would otherwise cause a tensor mismatch error
+ inputs["image"] = inputs["image"].resize((127, 127))
+ inputs["mask_image"] = inputs["mask_image"].resize((127, 127))
+ inputs["height"] = 128
+ inputs["width"] = 128
+ image = pipe(**inputs).images
+ # verify that the returned image matches the requested height and width
+ assert image.shape == (1, inputs["height"], inputs["width"], 3)
+
+ def test_stable_diffusion_inpaint_strength_test(self):
+ vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ "runwayml/stable-diffusion-inpainting", safety_checker=None
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.vae = vae
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ # change input strength
+ inputs["strength"] = 0.75
+ image = pipe(**inputs).images
+ # verify that the returned image has the expected 512x512 resolution
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+ expected_slice = np.array([0.2458, 0.2576, 0.3124, 0.2679, 0.2669, 0.2796, 0.2872, 0.2975, 0.2661])
+ assert np.abs(expected_slice - image_slice).max() < 3e-3
+
+ def test_stable_diffusion_simple_inpaint_ddim(self):
+ vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
+ pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", safety_checker=None)
+ pipe.vae = vae
+ pipe.unet.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.3296, 0.4041, 0.4097, 0.4145, 0.4342, 0.4152, 0.4927, 0.4931, 0.4430])
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_download_local(self):
+ vae = AsymmetricAutoencoderKL.from_pretrained(
+ "cross-attention/asymmetric-autoencoder-kl-x-1-5", torch_dtype=torch.float16
+ )
+ filename = hf_hub_download("runwayml/stable-diffusion-inpainting", filename="sd-v1-5-inpainting.ckpt")
+
+ pipe = StableDiffusionInpaintPipeline.from_single_file(filename, torch_dtype=torch.float16)
+ pipe.vae = vae
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.to("cuda")
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 1
+ image_out = pipe(**inputs).images[0]
+
+ assert image_out.shape == (512, 512, 3)
+
+ def test_download_ckpt_diff_format_is_same(self):
+ pass
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionInpaintPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/input_bench_image.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/input_bench_mask.png"
+ )
+ inputs = {
+ "prompt": "Face of a yellow cat, high resolution, sitting on a park bench",
+ "image": init_image,
+ "mask_image": mask_image,
+ "generator": generator,
+ "num_inference_steps": 50,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_inpaint_ddim(self):
+ sd_pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/stable_diffusion_inpaint_ddim.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_inpaint_pndm(self):
+ sd_pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
+ sd_pipe.scheduler = PNDMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/stable_diffusion_inpaint_pndm.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_inpaint_lms(self):
+ sd_pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
+ sd_pipe.scheduler = LMSDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/stable_diffusion_inpaint_lms.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_inpaint_dpm(self):
+ sd_pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
+ sd_pipe.scheduler = DPMSolverMultistepScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 30
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_inpaint/stable_diffusion_inpaint_dpm_multi.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+
+class StableDiffusionInpaintingPrepareMaskAndMaskedImageTests(unittest.TestCase):
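+ # prepare_mask_and_masked_image normalizes PIL, numpy, or torch inputs in various layouts into 4D float
+ # tensors: the mask lies in [0, 1], the image and masked image in [-1, 1], with masked pixels zeroed out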
+ def test_pil_inputs(self):
+ height, width = 32, 32
+ im = np.random.randint(0, 255, (height, width, 3), dtype=np.uint8)
+ im = Image.fromarray(im)
+ mask = np.random.randint(0, 255, (height, width), dtype=np.uint8) > 127.5
+ mask = Image.fromarray((mask * 255).astype(np.uint8))
+
+ t_mask, t_masked, t_image = prepare_mask_and_masked_image(im, mask, height, width, return_image=True)
+
+ self.assertTrue(isinstance(t_mask, torch.Tensor))
+ self.assertTrue(isinstance(t_masked, torch.Tensor))
+ self.assertTrue(isinstance(t_image, torch.Tensor))
+
+ self.assertEqual(t_mask.ndim, 4)
+ self.assertEqual(t_masked.ndim, 4)
+ self.assertEqual(t_image.ndim, 4)
+
+ self.assertEqual(t_mask.shape, (1, 1, height, width))
+ self.assertEqual(t_masked.shape, (1, 3, height, width))
+ self.assertEqual(t_image.shape, (1, 3, height, width))
+
+ self.assertTrue(t_mask.dtype == torch.float32)
+ self.assertTrue(t_masked.dtype == torch.float32)
+ self.assertTrue(t_image.dtype == torch.float32)
+
+ self.assertTrue(t_mask.min() >= 0.0)
+ self.assertTrue(t_mask.max() <= 1.0)
+ self.assertTrue(t_masked.min() >= -1.0)
+ self.assertTrue(t_masked.max() <= 1.0)
+ self.assertTrue(t_image.min() >= -1.0)
+ self.assertTrue(t_image.max() <= 1.0)
+
+ self.assertTrue(t_mask.sum() > 0.0)
+
+ def test_np_inputs(self):
+ height, width = 32, 32
+
+ im_np = np.random.randint(0, 255, (height, width, 3), dtype=np.uint8)
+ im_pil = Image.fromarray(im_np)
+ mask_np = (
+ np.random.randint(
+ 0,
+ 255,
+ (
+ height,
+ width,
+ ),
+ dtype=np.uint8,
+ )
+ > 127.5
+ )
+ mask_pil = Image.fromarray((mask_np * 255).astype(np.uint8))
+
+ t_mask_np, t_masked_np, t_image_np = prepare_mask_and_masked_image(
+ im_np, mask_np, height, width, return_image=True
+ )
+ t_mask_pil, t_masked_pil, t_image_pil = prepare_mask_and_masked_image(
+ im_pil, mask_pil, height, width, return_image=True
+ )
+
+ self.assertTrue((t_mask_np == t_mask_pil).all())
+ self.assertTrue((t_masked_np == t_masked_pil).all())
+ self.assertTrue((t_image_np == t_image_pil).all())
+
+ def test_torch_3D_2D_inputs(self):
+ height, width = 32, 32
+
+ im_tensor = torch.randint(
+ 0,
+ 255,
+ (
+ 3,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ mask_tensor = (
+ torch.randint(
+ 0,
+ 255,
+ (
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ > 127.5
+ )
+ im_np = im_tensor.numpy().transpose(1, 2, 0)
+ mask_np = mask_tensor.numpy()
+
+ t_mask_tensor, t_masked_tensor, t_image_tensor = prepare_mask_and_masked_image(
+ im_tensor / 127.5 - 1, mask_tensor, height, width, return_image=True
+ )
+ t_mask_np, t_masked_np, t_image_np = prepare_mask_and_masked_image(
+ im_np, mask_np, height, width, return_image=True
+ )
+
+ self.assertTrue((t_mask_tensor == t_mask_np).all())
+ self.assertTrue((t_masked_tensor == t_masked_np).all())
+ self.assertTrue((t_image_tensor == t_image_np).all())
+
+ def test_torch_3D_3D_inputs(self):
+ height, width = 32, 32
+
+ im_tensor = torch.randint(
+ 0,
+ 255,
+ (
+ 3,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ mask_tensor = (
+ torch.randint(
+ 0,
+ 255,
+ (
+ 1,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ > 127.5
+ )
+ im_np = im_tensor.numpy().transpose(1, 2, 0)
+ mask_np = mask_tensor.numpy()[0]
+
+ t_mask_tensor, t_masked_tensor, t_image_tensor = prepare_mask_and_masked_image(
+ im_tensor / 127.5 - 1, mask_tensor, height, width, return_image=True
+ )
+ t_mask_np, t_masked_np, t_image_np = prepare_mask_and_masked_image(
+ im_np, mask_np, height, width, return_image=True
+ )
+
+ self.assertTrue((t_mask_tensor == t_mask_np).all())
+ self.assertTrue((t_masked_tensor == t_masked_np).all())
+ self.assertTrue((t_image_tensor == t_image_np).all())
+
+ def test_torch_4D_2D_inputs(self):
+ height, width = 32, 32
+
+ im_tensor = torch.randint(
+ 0,
+ 255,
+ (
+ 1,
+ 3,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ mask_tensor = (
+ torch.randint(
+ 0,
+ 255,
+ (
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ > 127.5
+ )
+ im_np = im_tensor.numpy()[0].transpose(1, 2, 0)
+ mask_np = mask_tensor.numpy()
+
+ t_mask_tensor, t_masked_tensor, t_image_tensor = prepare_mask_and_masked_image(
+ im_tensor / 127.5 - 1, mask_tensor, height, width, return_image=True
+ )
+ t_mask_np, t_masked_np, t_image_np = prepare_mask_and_masked_image(
+ im_np, mask_np, height, width, return_image=True
+ )
+
+ self.assertTrue((t_mask_tensor == t_mask_np).all())
+ self.assertTrue((t_masked_tensor == t_masked_np).all())
+ self.assertTrue((t_image_tensor == t_image_np).all())
+
+ def test_torch_4D_3D_inputs(self):
+ height, width = 32, 32
+
+ im_tensor = torch.randint(
+ 0,
+ 255,
+ (
+ 1,
+ 3,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ mask_tensor = (
+ torch.randint(
+ 0,
+ 255,
+ (
+ 1,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ > 127.5
+ )
+ im_np = im_tensor.numpy()[0].transpose(1, 2, 0)
+ mask_np = mask_tensor.numpy()[0]
+
+ t_mask_tensor, t_masked_tensor, t_image_tensor = prepare_mask_and_masked_image(
+ im_tensor / 127.5 - 1, mask_tensor, height, width, return_image=True
+ )
+ t_mask_np, t_masked_np, t_image_np = prepare_mask_and_masked_image(
+ im_np, mask_np, height, width, return_image=True
+ )
+
+ self.assertTrue((t_mask_tensor == t_mask_np).all())
+ self.assertTrue((t_masked_tensor == t_masked_np).all())
+ self.assertTrue((t_image_tensor == t_image_np).all())
+
+ def test_torch_4D_4D_inputs(self):
+ height, width = 32, 32
+
+ im_tensor = torch.randint(
+ 0,
+ 255,
+ (
+ 1,
+ 3,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ mask_tensor = (
+ torch.randint(
+ 0,
+ 255,
+ (
+ 1,
+ 1,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ > 127.5
+ )
+ im_np = im_tensor.numpy()[0].transpose(1, 2, 0)
+ mask_np = mask_tensor.numpy()[0][0]
+
+ t_mask_tensor, t_masked_tensor, t_image_tensor = prepare_mask_and_masked_image(
+ im_tensor / 127.5 - 1, mask_tensor, height, width, return_image=True
+ )
+ t_mask_np, t_masked_np, t_image_np = prepare_mask_and_masked_image(
+ im_np, mask_np, height, width, return_image=True
+ )
+
+ self.assertTrue((t_mask_tensor == t_mask_np).all())
+ self.assertTrue((t_masked_tensor == t_masked_np).all())
+ self.assertTrue((t_image_tensor == t_image_np).all())
+
+ def test_torch_batch_4D_3D(self):
+ height, width = 32, 32
+
+ im_tensor = torch.randint(
+ 0,
+ 255,
+ (
+ 2,
+ 3,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ mask_tensor = (
+ torch.randint(
+ 0,
+ 255,
+ (
+ 2,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ > 127.5
+ )
+
+ im_nps = [im.numpy().transpose(1, 2, 0) for im in im_tensor]
+ mask_nps = [mask.numpy() for mask in mask_tensor]
+
+ t_mask_tensor, t_masked_tensor, t_image_tensor = prepare_mask_and_masked_image(
+ im_tensor / 127.5 - 1, mask_tensor, height, width, return_image=True
+ )
+ nps = [prepare_mask_and_masked_image(i, m, height, width, return_image=True) for i, m in zip(im_nps, mask_nps)]
+ t_mask_np = torch.cat([n[0] for n in nps])
+ t_masked_np = torch.cat([n[1] for n in nps])
+ t_image_np = torch.cat([n[2] for n in nps])
+
+ self.assertTrue((t_mask_tensor == t_mask_np).all())
+ self.assertTrue((t_masked_tensor == t_masked_np).all())
+ self.assertTrue((t_image_tensor == t_image_np).all())
+
+ def test_torch_batch_4D_4D(self):
+ height, width = 32, 32
+
+ im_tensor = torch.randint(
+ 0,
+ 255,
+ (
+ 2,
+ 3,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ mask_tensor = (
+ torch.randint(
+ 0,
+ 255,
+ (
+ 2,
+ 1,
+ height,
+ width,
+ ),
+ dtype=torch.uint8,
+ )
+ > 127.5
+ )
+
+ im_nps = [im.numpy().transpose(1, 2, 0) for im in im_tensor]
+ mask_nps = [mask.numpy()[0] for mask in mask_tensor]
+
+ t_mask_tensor, t_masked_tensor, t_image_tensor = prepare_mask_and_masked_image(
+ im_tensor / 127.5 - 1, mask_tensor, height, width, return_image=True
+ )
+ nps = [prepare_mask_and_masked_image(i, m, height, width, return_image=True) for i, m in zip(im_nps, mask_nps)]
+ t_mask_np = torch.cat([n[0] for n in nps])
+ t_masked_np = torch.cat([n[1] for n in nps])
+ t_image_np = torch.cat([n[2] for n in nps])
+
+ self.assertTrue((t_mask_tensor == t_mask_np).all())
+ self.assertTrue((t_masked_tensor == t_masked_np).all())
+ self.assertTrue((t_image_tensor == t_image_np).all())
+
+ def test_shape_mismatch(self):
+ height, width = 32, 32
+
+ # test height and width
+ with self.assertRaises(AssertionError):
+ prepare_mask_and_masked_image(
+ torch.randn(
+ 3,
+ height,
+ width,
+ ),
+ torch.randn(64, 64),
+ height,
+ width,
+ return_image=True,
+ )
+ # test batch dim
+ with self.assertRaises(AssertionError):
+ prepare_mask_and_masked_image(
+ torch.randn(
+ 2,
+ 3,
+ height,
+ width,
+ ),
+ torch.randn(4, 64, 64),
+ height,
+ width,
+ return_image=True,
+ )
+ # test batch dim with a 4D mask
+ with self.assertRaises(AssertionError):
+ prepare_mask_and_masked_image(
+ torch.randn(
+ 2,
+ 3,
+ height,
+ width,
+ ),
+ torch.randn(4, 1, 64, 64),
+ height,
+ width,
+ return_image=True,
+ )
+
+ def test_type_mismatch(self):
+ height, width = 32, 32
+
+ # test tensors-only
+ with self.assertRaises(TypeError):
+ prepare_mask_and_masked_image(
+ torch.rand(
+ 3,
+ height,
+ width,
+ ),
+ torch.rand(
+ 3,
+ height,
+ width,
+ ).numpy(),
+ height,
+ width,
+ return_image=True,
+ )
+ # test tensors-only
+ with self.assertRaises(TypeError):
+ prepare_mask_and_masked_image(
+ torch.rand(
+ 3,
+ height,
+ width,
+ ).numpy(),
+ torch.rand(
+ 3,
+ height,
+ width,
+ ),
+ height,
+ width,
+ return_image=True,
+ )
+
+ def test_channels_first(self):
+ height, width = 32, 32
+
+ # test channels first for 3D tensors
+ with self.assertRaises(AssertionError):
+ prepare_mask_and_masked_image(
+ torch.rand(height, width, 3),
+ torch.rand(
+ 3,
+ height,
+ width,
+ ),
+ height,
+ width,
+ return_image=True,
+ )
+
+ def test_tensor_range(self):
+ height, width = 32, 32
+
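+ # Tensor images must lie in [-1, 1] and masks in [0, 1]; out-of-range values are expected to raise a ValueError.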
+ # test im <= 1
+ with self.assertRaises(ValueError):
+ prepare_mask_and_masked_image(
+ torch.ones(
+ 3,
+ height,
+ width,
+ )
+ * 2,
+ torch.rand(
+ height,
+ width,
+ ),
+ height,
+ width,
+ return_image=True,
+ )
+ # test im >= -1
+ with self.assertRaises(ValueError):
+ prepare_mask_and_masked_image(
+ torch.ones(
+ 3,
+ height,
+ width,
+ )
+ * (-2),
+ torch.rand(
+ height,
+ width,
+ ),
+ height,
+ width,
+ return_image=True,
+ )
+ # test mask <= 1
+ with self.assertRaises(ValueError):
+ prepare_mask_and_masked_image(
+ torch.rand(
+ 3,
+ height,
+ width,
+ ),
+ torch.ones(
+ height,
+ width,
+ )
+ * 2,
+ height,
+ width,
+ return_image=True,
+ )
+ # test mask >= 0
+ with self.assertRaises(ValueError):
+ prepare_mask_and_masked_image(
+ torch.rand(
+ 3,
+ height,
+ width,
+ ),
+ torch.ones(
+ height,
+ width,
+ )
+ * -1,
+ height,
+ width,
+ return_image=True,
+ )
diff --git a/tests/pipelines/stable_diffusion/test_stable_diffusion_instruction_pix2pix.py b/tests/pipelines/stable_diffusion/test_stable_diffusion_instruction_pix2pix.py
new file mode 100644
index 0000000..0986f02
--- /dev/null
+++ b/tests/pipelines/stable_diffusion/test_stable_diffusion_instruction_pix2pix.py
@@ -0,0 +1,426 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ EulerAncestralDiscreteScheduler,
+ LMSDiscreteScheduler,
+ PNDMScheduler,
+ StableDiffusionInstructPix2PixPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.image_processor import VaeImageProcessor
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+)
+from ..test_pipelines_common import (
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+class StableDiffusionInstructPix2PixPipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionInstructPix2PixPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width", "cross_attention_kwargs"}
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
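+ # pix2pix exposes "image_latents" to step-end callbacks and, unlike the default pipelines, builds its CFG inputs without "negative_prompt_embeds" (see the overridden test_callback_cfg below).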
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS.union({"image_latents"}) - {"negative_prompt_embeds"}
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
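+ # The pix2pix UNet takes 8 input channels: 4 noisy latents concatenated with 4 channels of encoded conditioning-image latents.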
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=8,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ image = Image.fromarray(np.uint8(image)).convert("RGB")
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "image_guidance_scale": 1,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_pix2pix_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionInstructPix2PixPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.7526, 0.3750, 0.4547, 0.6117, 0.5866, 0.5016, 0.4327, 0.5642, 0.4815])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_pix2pix_negative_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionInstructPix2PixPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ negative_prompt = "french fries"
+ output = sd_pipe(**inputs, negative_prompt=negative_prompt)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.7511, 0.3642, 0.4553, 0.6236, 0.5797, 0.5013, 0.4343, 0.5611, 0.4831])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_pix2pix_multiple_init_images(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionInstructPix2PixPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["prompt"] = [inputs["prompt"]] * 2
+
+ image = np.array(inputs["image"]).astype(np.float32) / 255.0
+ image = torch.from_numpy(image).unsqueeze(0).to(device)
+ image = image / 2 + 0.5
+ image = image.permute(0, 3, 1, 2)
+ inputs["image"] = image.repeat(2, 1, 1, 1)
+
+ image = sd_pipe(**inputs).images
+ image_slice = image[-1, -3:, -3:, -1]
+
+ assert image.shape == (2, 32, 32, 3)
+ expected_slice = np.array([0.5812, 0.5748, 0.5222, 0.5908, 0.5695, 0.7174, 0.6804, 0.5523, 0.5579])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_pix2pix_euler(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = EulerAncestralDiscreteScheduler(
+ beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear"
+ )
+ sd_pipe = StableDiffusionInstructPix2PixPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.7417, 0.3842, 0.4732, 0.5776, 0.5891, 0.5139, 0.4052, 0.5673, 0.4986])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ # Override the default test_latents_input because pix2pix encodes the image differently
+ def test_latents_input(self):
+ components = self.get_dummy_components()
+ pipe = StableDiffusionInstructPix2PixPipeline(**components)
+ pipe.image_processor = VaeImageProcessor(do_resize=False, do_normalize=False)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ out = pipe(**self.get_dummy_inputs_by_type(torch_device, input_image_type="pt"))[0]
+
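+ # Re-run the pipeline with the image pre-encoded into VAE latents; the output should match the plain-image run above.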
+ vae = components["vae"]
+ inputs = self.get_dummy_inputs_by_type(torch_device, input_image_type="pt")
+
+ for image_param in self.image_latents_params:
+ if image_param in inputs.keys():
+ inputs[image_param] = vae.encode(inputs[image_param]).latent_dist.mode()
+
+ out_latents_inputs = pipe(**inputs)[0]
+
+ max_diff = np.abs(out - out_latents_inputs).max()
+ self.assertLess(max_diff, 1e-4, "passing latents as the image input generates a different result from passing an image")
+
+ # Override the default test_callback_cfg because pix2pix creates its CFG inputs differently
+ def test_callback_cfg(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
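+ # InstructPix2Pix uses three-way classifier-free guidance (text, image and unconditional), so CFG tensors arrive in three chunks; once guidance is switched off, only the first chunk is kept.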
+ def callback_no_cfg(pipe, i, t, callback_kwargs):
+ if i == 1:
+ for k in callback_kwargs:
+ if k in self.callback_cfg_params:
+ callback_kwargs[k] = callback_kwargs[k].chunk(3)[0]
+ pipe._guidance_scale = 1.0
+
+ return callback_kwargs
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["guidance_scale"] = 1.0
+ inputs["num_inference_steps"] = 2
+ out_no_cfg = pipe(**inputs)[0]
+
+ inputs["guidance_scale"] = 7.5
+ inputs["callback_on_step_end"] = callback_no_cfg
+ inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
+ out_callback_no_cfg = pipe(**inputs)[0]
+
+ assert out_no_cfg.shape == out_callback_no_cfg.shape
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionInstructPix2PixPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, seed=0):
+ generator = torch.manual_seed(seed)
+ image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main/stable_diffusion_pix2pix/example.jpg"
+ )
+ inputs = {
+ "prompt": "turn him into a cyborg",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "image_guidance_scale": 1.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_pix2pix_default(self):
+ pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
+ "timbrooks/instruct-pix2pix", safety_checker=None
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.5902, 0.6015, 0.6027, 0.5983, 0.6092, 0.6061, 0.5765, 0.5785, 0.5555])
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_stable_diffusion_pix2pix_k_lms(self):
+ pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
+ "timbrooks/instruct-pix2pix", safety_checker=None
+ )
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.6578, 0.6817, 0.6972, 0.6761, 0.6856, 0.6916, 0.6428, 0.6516, 0.6301])
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_stable_diffusion_pix2pix_ddim(self):
+ pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
+ "timbrooks/instruct-pix2pix", safety_checker=None
+ )
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.3828, 0.3834, 0.3818, 0.3792, 0.3865, 0.3752, 0.3792, 0.3847, 0.3753])
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-3
+
+ def test_stable_diffusion_pix2pix_intermediate_state(self):
+ number_of_steps = 0
+
+ def callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> None:
+ callback_fn.has_been_called = True
+ nonlocal number_of_steps
+ number_of_steps += 1
+ if step == 1:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array([-0.2463, -0.4644, -0.9756, 1.5176, 1.4414, 0.7866, 0.9897, 0.8521, 0.7983])
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+ elif step == 2:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array([-0.2644, -0.4626, -0.9653, 1.5176, 1.4551, 0.7686, 0.9805, 0.8452, 0.8115])
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+
+ callback_fn.has_been_called = False
+
+ pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
+ "timbrooks/instruct-pix2pix", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ pipe(**inputs, callback=callback_fn, callback_steps=1)
+ assert callback_fn.has_been_called
+ assert number_of_steps == 3
+
+ def test_stable_diffusion_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
+ "timbrooks/instruct-pix2pix", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_inputs()
+ _ = pipe(**inputs)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.2 GB is allocated
+ assert mem_bytes < 2.2 * 10**9
+
+ def test_stable_diffusion_pix2pix_pipeline_multiple_of_8(self):
+ inputs = self.get_inputs()
+ # resize to a resolution that is divisible by 8 but not by 16 or 32
+ inputs["image"] = inputs["image"].resize((504, 504))
+
+ model_id = "timbrooks/instruct-pix2pix"
+ pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
+ model_id,
+ safety_checker=None,
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ output = pipe(**inputs)
+ image = output.images[0]
+
+ image_slice = image[255:258, 383:386, -1]
+
+ assert image.shape == (504, 504, 3)
+ expected_slice = np.array([0.2726, 0.2529, 0.2664, 0.2655, 0.2641, 0.2642, 0.2591, 0.2649, 0.2590])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-3
diff --git a/tests/pipelines/stable_diffusion_2/__init__.py b/tests/pipelines/stable_diffusion_2/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion.py
new file mode 100644
index 0000000..7aef098
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion.py
@@ -0,0 +1,653 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ DPMSolverMultistepScheduler,
+ EulerAncestralDiscreteScheduler,
+ EulerDiscreteScheduler,
+ LMSDiscreteScheduler,
+ PNDMScheduler,
+ StableDiffusionPipeline,
+ UNet2DConditionModel,
+ logging,
+)
+from diffusers.utils.testing_utils import (
+ CaptureLogger,
+ backend_empty_cache,
+ enable_full_determinism,
+ load_numpy,
+ nightly,
+ numpy_cosine_similarity_distance,
+ require_torch_accelerator,
+ require_torch_gpu,
+ skip_mps,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ TEXT_TO_IMAGE_BATCH_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import (
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+ SDFunctionTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+class StableDiffusion2PipelineFastTests(
+ SDFunctionTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=512,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ generator_device = "cpu" if not device.startswith("cuda") else "cuda"
+ if not str(device).startswith("mps"):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ else:
+ generator = torch.manual_seed(seed)
+
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_ddim(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5753, 0.6113, 0.5005, 0.5036, 0.5464, 0.4725, 0.4982, 0.4865, 0.4861])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_pndm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5121, 0.5714, 0.4827, 0.5057, 0.5646, 0.4766, 0.5189, 0.4895, 0.4990])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_k_lms(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = LMSDiscreteScheduler.from_config(components["scheduler"].config)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4865, 0.5439, 0.4840, 0.4995, 0.5543, 0.4846, 0.5199, 0.4942, 0.5061])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_k_euler_ancestral(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = EulerAncestralDiscreteScheduler.from_config(components["scheduler"].config)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4864, 0.5440, 0.4842, 0.4994, 0.5543, 0.4846, 0.5196, 0.4942, 0.5063])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_k_euler(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = EulerDiscreteScheduler.from_config(components["scheduler"].config)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4865, 0.5439, 0.4840, 0.4995, 0.5543, 0.4846, 0.5199, 0.4942, 0.5061])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_unflawed(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = DDIMScheduler.from_config(
+ components["scheduler"].config, timestep_spacing="trailing"
+ )
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["guidance_rescale"] = 0.7
+ inputs["num_inference_steps"] = 10
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4736, 0.5405, 0.4705, 0.4955, 0.5675, 0.4812, 0.5310, 0.4967, 0.5064])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_long_prompt(self):
+ components = self.get_dummy_components()
+ components["scheduler"] = LMSDiscreteScheduler.from_config(components["scheduler"].config)
+ sd_pipe = StableDiffusionPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ do_classifier_free_guidance = True
+ negative_prompt = None
+ num_images_per_prompt = 1
+ logger = logging.get_logger("diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion")
+ logger.setLevel(logging.WARNING)
+
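+ # Prompts longer than CLIP's 77-token limit should trigger a truncation warning; the short 25-character prompt should log nothing.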
+ prompt = 25 * "@"
+ with CaptureLogger(logger) as cap_logger_3:
+ text_embeddings_3, negative_text_embeddings_3 = sd_pipe.encode_prompt(
+ prompt, torch_device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt
+ )
+ if negative_text_embeddings_3 is not None:
+ text_embeddings_3 = torch.cat([negative_text_embeddings_3, text_embeddings_3])
+
+ prompt = 100 * "@"
+ with CaptureLogger(logger) as cap_logger:
+ text_embeddings, negative_embeddings = sd_pipe.encode_prompt(
+ prompt, torch_device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt
+ )
+ if negative_embeddings is not None:
+ text_embeddings = torch.cat([negative_embeddings, text_embeddings])
+
+ negative_prompt = "Hello"
+ with CaptureLogger(logger) as cap_logger_2:
+ text_embeddings_2, negative_text_embeddings_2 = sd_pipe.encode_prompt(
+ prompt, torch_device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt
+ )
+ if negative_text_embeddings_2 is not None:
+ text_embeddings_2 = torch.cat([negative_text_embeddings_2, text_embeddings_2])
+
+ assert text_embeddings_3.shape == text_embeddings_2.shape == text_embeddings.shape
+ assert text_embeddings.shape[1] == 77
+
+ assert cap_logger.out == cap_logger_2.out
+ # 100 - 77 + 1 (BOS token) + 1 (EOS token) = 25
+ assert cap_logger.out.count("@") == 25
+ assert cap_logger_3.out == ""
+
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+
+@slow
+@require_torch_accelerator
+@skip_mps
+class StableDiffusion2PipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ backend_empty_cache(torch_device)
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ _generator_device = "cpu" if not generator_device.startswith("cuda") else "cuda"
+ if not str(device).startswith("mps"):
+ generator = torch.Generator(device=_generator_device).manual_seed(seed)
+ else:
+ generator = torch.manual_seed(seed)
+
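+ # Seeded NumPy latents pin the starting noise so the reference slices below stay comparable across schedulers.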
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "a photograph of an astronaut riding a horse",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_default_ddim(self):
+ pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base")
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.49493, 0.47896, 0.40798, 0.54214, 0.53212, 0.48202, 0.47656, 0.46329, 0.48506])
+ assert np.abs(image_slice - expected_slice).max() < 7e-3
+
+ def test_stable_diffusion_pndm(self):
+ pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base")
+ pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.49493, 0.47896, 0.40798, 0.54214, 0.53212, 0.48202, 0.47656, 0.46329, 0.48506])
+ assert np.abs(image_slice - expected_slice).max() < 7e-3
+
+ def test_stable_diffusion_k_lms(self):
+ pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base")
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.10440, 0.13115, 0.11100, 0.10141, 0.11440, 0.07215, 0.11332, 0.09693, 0.10006])
+ assert np.abs(image_slice - expected_slice).max() < 3e-3
+
+ @require_torch_gpu
+ def test_stable_diffusion_attention_slicing(self):
+ torch.cuda.reset_peak_memory_stats()
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ # enable attention slicing
+ pipe.enable_attention_slicing()
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ image_sliced = pipe(**inputs).images
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+ # make sure that less than 3.3 GB is allocated
+ assert mem_bytes < 3.3 * 10**9
+
+ # disable slicing
+ pipe.disable_attention_slicing()
+ pipe.unet.set_default_attn_processor()
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ image = pipe(**inputs).images
+
+ # make sure that more than 3.3 GB is allocated
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes > 3.3 * 10**9
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), image_sliced.flatten())
+ assert max_diff < 5e-3
+
+ def test_stable_diffusion_text2img_intermediate_state(self):
+ number_of_steps = 0
+
+ def callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> None:
+ callback_fn.has_been_called = True
+ nonlocal number_of_steps
+ number_of_steps += 1
+ if step == 1:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [-0.3862, -0.4507, -1.1729, 0.0686, -1.1045, 0.7124, -1.8301, 0.1903, 1.2773]
+ )
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+ elif step == 2:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [0.2720, -0.1863, -0.7383, -0.5029, -0.7534, 0.3970, -0.7646, 0.4468, 1.2686]
+ )
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+
+ callback_fn.has_been_called = False
+
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ pipe(**inputs, callback=callback_fn, callback_steps=1)
+ assert callback_fn.has_been_called
+ assert number_of_steps == inputs["num_inference_steps"]
+
+ @require_torch_gpu
+ def test_stable_diffusion_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ _ = pipe(**inputs)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.8 GB is allocated
+ assert mem_bytes < 2.8 * 10**9
+
+ @require_torch_gpu
+ def test_stable_diffusion_pipeline_with_model_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+
+ # Normal inference
+
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-base",
+ torch_dtype=torch.float16,
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ outputs = pipe(**inputs)
+ mem_bytes = torch.cuda.max_memory_allocated()
+
+ # With model offloading
+
+ # Reload but don't move to cuda
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-base",
+ torch_dtype=torch.float16,
+ )
+ pipe.unet.set_default_attn_processor()
+
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ outputs_offloaded = pipe(**inputs)
+ mem_bytes_offloaded = torch.cuda.max_memory_allocated()
+
+ images = outputs.images
+ images_offloaded = outputs_offloaded.images
+ max_diff = numpy_cosine_similarity_distance(images.flatten(), images_offloaded.flatten())
+ assert max_diff < 1e-3
+ assert mem_bytes_offloaded < mem_bytes
+ assert mem_bytes_offloaded < 3 * 10**9
+ for module in pipe.text_encoder, pipe.unet, pipe.vae:
+ assert module.device == torch.device("cpu")
+
+ # With attention slicing
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe.enable_attention_slicing()
+ _ = pipe(**inputs)
+ mem_bytes_slicing = torch.cuda.max_memory_allocated()
+ assert mem_bytes_slicing < mem_bytes_offloaded
+
+
+@nightly
+@require_torch_accelerator
+@skip_mps
+class StableDiffusion2PipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ backend_empty_cache(torch_device)
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ _generator_device = "cpu" if not generator_device.startswith("cuda") else "cuda"
+ if not str(device).startswith("mps"):
+ generator = torch.Generator(device=_generator_device).manual_seed(seed)
+ else:
+ generator = torch.manual_seed(seed)
+
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "a photograph of an astronaut riding a horse",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 50,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_2_0_default_ddim(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base").to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_2_text2img/stable_diffusion_2_0_base_ddim.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_2_1_default_pndm(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base").to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_2_text2img/stable_diffusion_2_1_base_pndm.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_ddim(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base").to(torch_device)
+ sd_pipe.scheduler = DDIMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_2_text2img/stable_diffusion_2_1_base_ddim.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_lms(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base").to(torch_device)
+ sd_pipe.scheduler = LMSDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_2_text2img/stable_diffusion_2_1_base_lms.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_euler(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base").to(torch_device)
+ sd_pipe.scheduler = EulerDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_2_text2img/stable_diffusion_2_1_base_euler.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_dpm(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base").to(torch_device)
+ sd_pipe.scheduler = DPMSolverMultistepScheduler.from_config(
+ sd_pipe.scheduler.config, final_sigmas_type="sigma_min"
+ )
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 25
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_2_text2img/stable_diffusion_2_1_base_dpm_multi.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion_attend_and_excite.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_attend_and_excite.py
new file mode 100644
index 0000000..fdc41a2
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_attend_and_excite.py
@@ -0,0 +1,235 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ StableDiffusionAttendAndExcitePipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import (
+ load_numpy,
+ nightly,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ skip_mps,
+)
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
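+# TF32 matmuls are disabled here, presumably to keep GPU results numerically close to the reference slices below.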
+torch.backends.cuda.matmul.allow_tf32 = False
+
+
+@skip_mps
+class StableDiffusionAttendAndExcitePipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionAttendAndExcitePipeline
+ test_attention_slicing = False
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS.union({"token_indices"})
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ # Attend and excite requires being able to run a backward pass at
+ # inference time. There's no deterministic backward operator for pad
+
+ @classmethod
+ def setUpClass(cls):
+ super().setUpClass()
+ torch.use_deterministic_algorithms(False)
+
+ @classmethod
+ def tearDownClass(cls):
+ super().tearDownClass()
+ torch.use_deterministic_algorithms(True)
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=1,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=512,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "a cat and a frog",
+ "token_indices": [2, 5],
+ "generator": generator,
+ "num_inference_steps": 1,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "max_iter_to_alter": 2,
+ "thresholds": {0: 0.7},
+ }
+ return inputs
+
+ def test_inference(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ self.assertEqual(image.shape, (1, 64, 64, 3))
+ expected_slice = np.array(
+ [0.63905364, 0.62897307, 0.48599017, 0.5133624, 0.5550048, 0.45769516, 0.50326973, 0.5023139, 0.45384496]
+ )
+ max_diff = np.abs(image_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+
+ def test_sequential_cpu_offload_forward_pass(self):
+ super().test_sequential_cpu_offload_forward_pass(expected_max_diff=5e-4)
+
+ def test_inference_batch_consistent(self):
+ # NOTE: Larger batch sizes cause this test to time out, so only test on smaller batches
+ self._test_inference_batch_consistent(batch_sizes=[1, 2])
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(batch_size=2, expected_max_diff=7e-4)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=3e-3)
+
+ def test_pt_np_pil_outputs_equivalent(self):
+ super().test_pt_np_pil_outputs_equivalent(expected_max_diff=5e-4)
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=5e-4)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=4e-4)
+
+
+@require_torch_gpu
+@nightly
+class StableDiffusionAttendAndExcitePipelineIntegrationTests(unittest.TestCase):
+ # Attend and excite requires being able to run a backward pass at
+ # inference time. There's no deterministic backward operator for pad
+
+ @classmethod
+ def setUpClass(cls):
+ super().setUpClass()
+ torch.use_deterministic_algorithms(False)
+
+ @classmethod
+ def tearDownClass(cls):
+ super().tearDownClass()
+ torch.use_deterministic_algorithms(True)
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_attend_and_excite_fp16(self):
+ generator = torch.manual_seed(51)
+
+ pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe.to("cuda")
+
+ prompt = "a painting of an elephant with glasses"
+ token_indices = [5, 7]
+
+ image = pipe(
+ prompt=prompt,
+ token_indices=token_indices,
+ guidance_scale=7.5,
+ generator=generator,
+ num_inference_steps=5,
+ max_iter_to_alter=5,
+ output_type="numpy",
+ ).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/attend-and-excite/elephant_glasses.npy"
+ )
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), expected_image.flatten())
+ assert max_diff < 5e-1
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion_depth.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_depth.py
new file mode 100644
index 0000000..76d480e
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_depth.py
@@ -0,0 +1,603 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import (
+ CLIPTextConfig,
+ CLIPTextModel,
+ CLIPTokenizer,
+ DPTConfig,
+ DPTFeatureExtractor,
+ DPTForDepthEstimation,
+)
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ DPMSolverMultistepScheduler,
+ LMSDiscreteScheduler,
+ PNDMScheduler,
+ StableDiffusionDepth2ImgPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils import is_accelerate_available, is_accelerate_version
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ skip_mps,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+@skip_mps
+class StableDiffusionDepth2ImgPipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionDepth2ImgPipeline
+ test_save_load_optional_components = False
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width"}
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS.union({"depth_mask"})
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
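+ # The depth2img UNet takes 5 input channels: 4 noisy latents plus a single-channel depth map.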
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=5,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ backbone_config = {
+ "global_padding": "same",
+ "layer_type": "bottleneck",
+ "depths": [3, 4, 9],
+ "out_features": ["stage1", "stage2", "stage3"],
+ "embedding_dynamic_padding": True,
+ "hidden_sizes": [96, 192, 384, 768],
+ "num_groups": 2,
+ }
+ depth_estimator_config = DPTConfig(
+ image_size=32,
+ patch_size=16,
+ num_channels=3,
+ hidden_size=32,
+ num_hidden_layers=4,
+ backbone_out_indices=(0, 1, 2, 3),
+ num_attention_heads=4,
+ intermediate_size=37,
+ hidden_act="gelu",
+ hidden_dropout_prob=0.1,
+ attention_probs_dropout_prob=0.1,
+ is_decoder=False,
+ initializer_range=0.02,
+ is_hybrid=True,
+ backbone_config=backbone_config,
+ backbone_featmap_shape=[1, 384, 24, 24],
+ )
+ depth_estimator = DPTForDepthEstimation(depth_estimator_config).eval()
+ feature_extractor = DPTFeatureExtractor.from_pretrained(
+ "hf-internal-testing/tiny-random-DPTForDepthEstimation"
+ )
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "depth_estimator": depth_estimator,
+ "feature_extractor": feature_extractor,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed))
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ image = Image.fromarray(np.uint8(image)).convert("RGB").resize((32, 32))
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_save_load_local(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(output - output_loaded).max()
+ self.assertLess(max_diff, 1e-4)
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self):
+ components = self.get_dummy_components()
+ for name, module in components.items():
+ if hasattr(module, "half"):
+ components[name] = module.to(torch_device).half()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir, torch_dtype=torch.float16)
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for name, component in pipe_loaded.components.items():
+ if hasattr(component, "dtype"):
+ self.assertTrue(
+ component.dtype == torch.float16,
+ f"`{name}.dtype` switched from `float16` to {component.dtype} after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(output - output_loaded).max()
+ self.assertLess(max_diff, 2e-2, "The output of the fp16 pipeline changed after saving and loading.")
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_float16_inference(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ for name, module in components.items():
+ if hasattr(module, "half"):
+ components[name] = module.half()
+ pipe_fp16 = self.pipeline_class(**components)
+ pipe_fp16.to(torch_device)
+ pipe_fp16.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(torch_device))[0]
+ output_fp16 = pipe_fp16(**self.get_dummy_inputs(torch_device))[0]
+
+ max_diff = np.abs(output - output_fp16).max()
+ self.assertLess(max_diff, 1.3e-2, "The outputs of the fp16 and fp32 pipelines are too different.")
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_accelerate_available() or is_accelerate_version("<", "0.14.0"),
+ reason="CPU offload is only available with CUDA and `accelerate v0.14.0` or higher",
+ )
+ def test_cpu_offload_forward_pass(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_without_offload = pipe(**inputs)[0]
+
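+        # run the same inputs again with sequential CPU offload enabled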
+ pipe.enable_sequential_cpu_offload()
+ inputs = self.get_dummy_inputs(torch_device)
+ output_with_offload = pipe(**inputs)[0]
+
+ max_diff = np.abs(output_with_offload - output_without_offload).max()
+ self.assertLess(max_diff, 1e-4, "CPU offloading should not affect the inference results")
+
+ def test_dict_tuple_outputs_equivalent(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(torch_device))[0]
+ output_tuple = pipe(**self.get_dummy_inputs(torch_device), return_dict=False)[0]
+
+ max_diff = np.abs(output - output_tuple).max()
+ self.assertLess(max_diff, 1e-4)
+
+ def test_progress_bar(self):
+ super().test_progress_bar()
+
+ def test_stable_diffusion_depth2img_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = StableDiffusionDepth2ImgPipeline(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ if torch_device == "mps":
+ expected_slice = np.array([0.6071, 0.5035, 0.4378, 0.5776, 0.5753, 0.4316, 0.4513, 0.5263, 0.4546])
+ else:
+ expected_slice = np.array([0.5435, 0.4992, 0.3783, 0.4411, 0.5842, 0.4654, 0.3786, 0.5077, 0.4655])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_depth2img_negative_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = StableDiffusionDepth2ImgPipeline(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ negative_prompt = "french fries"
+ output = pipe(**inputs, negative_prompt=negative_prompt)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ if torch_device == "mps":
+ expected_slice = np.array([0.6296, 0.5125, 0.3890, 0.4456, 0.5955, 0.4621, 0.3810, 0.5310, 0.4626])
+ else:
+ expected_slice = np.array([0.6012, 0.4507, 0.3769, 0.4121, 0.5566, 0.4585, 0.3803, 0.5045, 0.4631])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_depth2img_multiple_init_images(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = StableDiffusionDepth2ImgPipeline(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["prompt"] = [inputs["prompt"]] * 2
+ inputs["image"] = 2 * [inputs["image"]]
+ image = pipe(**inputs).images
+ image_slice = image[-1, -3:, -3:, -1]
+
+ assert image.shape == (2, 32, 32, 3)
+
+ if torch_device == "mps":
+ expected_slice = np.array([0.6501, 0.5150, 0.4939, 0.6688, 0.5437, 0.5758, 0.5115, 0.4406, 0.4551])
+ else:
+ expected_slice = np.array([0.6557, 0.6214, 0.6254, 0.5775, 0.4785, 0.5949, 0.5904, 0.4785, 0.4730])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_depth2img_pil(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = StableDiffusionDepth2ImgPipeline(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ if torch_device == "mps":
+ expected_slice = np.array([0.53232, 0.47015, 0.40868, 0.45651, 0.4891, 0.4668, 0.4287, 0.48822, 0.47439])
+ else:
+ expected_slice = np.array([0.5435, 0.4992, 0.3783, 0.4411, 0.5842, 0.4654, 0.3786, 0.5077, 0.4655])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ return super().test_attention_slicing_forward_pass()
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=7e-3)
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionDepth2ImgPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=device).manual_seed(seed)
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/depth2img/two_cats.png"
+ )
+ inputs = {
+ "prompt": "two tigers",
+ "image": init_image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "strength": 0.75,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_depth2img_pipeline_default(self):
+ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-depth", safety_checker=None
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 480, 640, 3)
+ expected_slice = np.array([0.5435, 0.4992, 0.3783, 0.4411, 0.5842, 0.4654, 0.3786, 0.5077, 0.4655])
+
+ assert np.abs(expected_slice - image_slice).max() < 6e-1
+
+ def test_stable_diffusion_depth2img_pipeline_k_lms(self):
+ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-depth", safety_checker=None
+ )
+ pipe.unet.set_default_attn_processor()
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 480, 640, 3)
+ expected_slice = np.array([0.6363, 0.6274, 0.6309, 0.6370, 0.6226, 0.6286, 0.6213, 0.6453, 0.6306])
+
+ assert np.abs(expected_slice - image_slice).max() < 8e-4
+
+ def test_stable_diffusion_depth2img_pipeline_ddim(self):
+ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-depth", safety_checker=None
+ )
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, 253:256, 253:256, -1].flatten()
+
+ assert image.shape == (1, 480, 640, 3)
+ expected_slice = np.array([0.6424, 0.6524, 0.6249, 0.6041, 0.6634, 0.6420, 0.6522, 0.6555, 0.6436])
+
+ assert np.abs(expected_slice - image_slice).max() < 5e-4
+
+ def test_stable_diffusion_depth2img_intermediate_state(self):
+ number_of_steps = 0
+
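+        # the callback checks the intermediate latents at steps 1 and 2 of the denoising loop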
+ def callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> None:
+ callback_fn.has_been_called = True
+ nonlocal number_of_steps
+ number_of_steps += 1
+ if step == 1:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 60, 80)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [-0.7168, -1.5137, -0.1418, -2.9219, -2.7266, -2.4414, -2.1035, -3.0078, -1.7051]
+ )
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+ elif step == 2:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 60, 80)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array(
+ [-0.7109, -1.5068, -0.1403, -2.9160, -2.7207, -2.4414, -2.1035, -3.0059, -1.7090]
+ )
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+
+ callback_fn.has_been_called = False
+
+ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-depth", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(dtype=torch.float16)
+ pipe(**inputs, callback=callback_fn, callback_steps=1)
+ assert callback_fn.has_been_called
+ assert number_of_steps == 2
+
+ def test_stable_diffusion_pipeline_with_sequential_cpu_offloading(self):
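+        # reset CUDA memory statistics so the peak allocation below reflects this test only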
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-depth", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_inputs(dtype=torch.float16)
+ _ = pipe(**inputs)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.9 GB is allocated
+ assert mem_bytes < 2.9 * 10**9
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionDepth2ImgPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=device).manual_seed(seed)
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/depth2img/two_cats.png"
+ )
+ inputs = {
+ "prompt": "two tigers",
+ "image": init_image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "strength": 0.75,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_depth2img_pndm(self):
+ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-depth")
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_depth2img/stable_diffusion_2_0_pndm.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_depth2img_ddim(self):
+ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-depth")
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_depth2img/stable_diffusion_2_0_ddim.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+    def test_depth2img_lms(self):
+ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-depth")
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_depth2img/stable_diffusion_2_0_lms.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+    def test_depth2img_dpm(self):
+ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-depth")
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs()
+ inputs["num_inference_steps"] = 30
+ image = pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_depth2img/stable_diffusion_2_0_dpm_multi.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion_diffedit.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_diffedit.py
new file mode 100644
index 0000000..7634303
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_diffedit.py
@@ -0,0 +1,432 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMInverseScheduler,
+ DDIMScheduler,
+ DPMSolverMultistepInverseScheduler,
+ DPMSolverMultistepScheduler,
+ StableDiffusionDiffEditPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ nightly,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS, TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
+from ..test_pipelines_common import PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class StableDiffusionDiffEditPipelineFastTests(PipelineLatentTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = StableDiffusionDiffEditPipeline
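+    # DiffEdit consumes pre-computed image latents, so the shared "image" parameter is replaced with "image_latents"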
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS - {"height", "width", "image"} | {"image_latents"}
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS - {"image"} | {"image_latents"}
+ image_params = frozenset(
+ []
+ ) # TO-DO: update image_params once pipeline is refactored with VaeImageProcessor.preprocess
+ image_latents_params = frozenset([])
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ inverse_scheduler = DDIMInverseScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_zero=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=512,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "inverse_scheduler": inverse_scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ mask = floats_tensor((1, 16, 16), rng=random.Random(seed)).to(device)
+ latents = floats_tensor((1, 2, 4, 16, 16), rng=random.Random(seed)).to(device)
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "a dog and a newt",
+ "mask_image": mask,
+ "image_latents": latents,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "inpaint_strength": 1.0,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+
+ return inputs
+
+ def get_dummy_mask_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ image = Image.fromarray(np.uint8(image)).convert("RGB")
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image": image,
+ "source_prompt": "a cat and a frog",
+ "target_prompt": "a dog and a newt",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "num_maps_per_mask": 2,
+ "mask_encode_strength": 1.0,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+
+ return inputs
+
+ def get_dummy_inversion_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ image = Image.fromarray(np.uint8(image)).convert("RGB")
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image": image,
+ "prompt": "a cat and a frog",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "inpaint_strength": 1.0,
+ "guidance_scale": 6.0,
+ "decode_latents": True,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_save_load_optional_components(self):
+ if not hasattr(self.pipeline_class, "_optional_components"):
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ # set all optional components to None and update pipeline config accordingly
+ for optional_component in pipe._optional_components:
+ setattr(pipe, optional_component, None)
+ pipe.register_modules(**{optional_component: None for optional_component in pipe._optional_components})
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for optional_component in pipe._optional_components:
+ self.assertTrue(
+ getattr(pipe_loaded, optional_component) is None,
+ f"`{optional_component}` did not stay set to None after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(output - output_loaded).max()
+ self.assertLess(max_diff, 1e-4)
+
+ def test_mask(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_mask_inputs(device)
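+        # generate_mask contrasts the model's predictions for the source and target prompts to produce a binary edit mask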
+ mask = pipe.generate_mask(**inputs)
+ mask_slice = mask[0, -3:, -3:]
+
+ self.assertEqual(mask.shape, (1, 16, 16))
+ expected_slice = np.array([0] * 9)
+ max_diff = np.abs(mask_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+ self.assertEqual(mask[0, -3, -4], 0)
+
+ def test_inversion(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inversion_inputs(device)
+ image = pipe.invert(**inputs).images
+ image_slice = image[0, -1, -3:, -3:]
+
+ self.assertEqual(image.shape, (2, 32, 32, 3))
+ expected_slice = np.array(
+ [0.5160, 0.5115, 0.5060, 0.5456, 0.4704, 0.5060, 0.5019, 0.4405, 0.4726],
+ )
+ max_diff = np.abs(image_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=5e-3)
+
+ def test_inversion_dpm(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ scheduler_args = {"beta_start": 0.00085, "beta_end": 0.012, "beta_schedule": "scaled_linear"}
+ components["scheduler"] = DPMSolverMultistepScheduler(**scheduler_args)
+ components["inverse_scheduler"] = DPMSolverMultistepInverseScheduler(**scheduler_args)
+
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inversion_inputs(device)
+ image = pipe.invert(**inputs).images
+ image_slice = image[0, -1, -3:, -3:]
+
+ self.assertEqual(image.shape, (2, 32, 32, 3))
+ expected_slice = np.array(
+ [0.5305, 0.4673, 0.5314, 0.5308, 0.4886, 0.5279, 0.5142, 0.4724, 0.4892],
+ )
+ max_diff = np.abs(image_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+
+
+@require_torch_gpu
+@nightly
+class StableDiffusionDiffEditPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ @classmethod
+ def setUpClass(cls):
+ raw_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/diffedit/fruit.png"
+ )
+ raw_image = raw_image.convert("RGB").resize((256, 256))
+
+ cls.raw_image = raw_image
+
+ def test_stable_diffusion_diffedit_full(self):
+ generator = torch.manual_seed(0)
+
+ pipe = StableDiffusionDiffEditPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-1-base", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.scheduler.clip_sample = True
+
+ pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ source_prompt = "a bowl of fruit"
+ target_prompt = "a bowl of pears"
+
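+        # DiffEdit runs in three stages: generate a mask from the two prompts, invert the image into latents, then denoise with the mask applied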
+ mask_image = pipe.generate_mask(
+ image=self.raw_image,
+ source_prompt=source_prompt,
+ target_prompt=target_prompt,
+ generator=generator,
+ )
+
+ inv_latents = pipe.invert(
+ prompt=source_prompt,
+ image=self.raw_image,
+ inpaint_strength=0.7,
+ generator=generator,
+ num_inference_steps=5,
+ ).latents
+
+ image = pipe(
+ prompt=target_prompt,
+ mask_image=mask_image,
+ image_latents=inv_latents,
+ generator=generator,
+ negative_prompt=source_prompt,
+ inpaint_strength=0.7,
+ num_inference_steps=5,
+ output_type="np",
+ ).images[0]
+
+ expected_image = (
+ np.array(
+ load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/diffedit/pears.png"
+ ).resize((256, 256))
+ )
+ / 255
+ )
+
+ assert numpy_cosine_similarity_distance(expected_image.flatten(), image.flatten()) < 2e-1
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionDiffEditPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ @classmethod
+ def setUpClass(cls):
+ raw_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/diffedit/fruit.png"
+ )
+
+ raw_image = raw_image.convert("RGB").resize((768, 768))
+
+ cls.raw_image = raw_image
+
+ def test_stable_diffusion_diffedit_dpm(self):
+ generator = torch.manual_seed(0)
+
+ pipe = StableDiffusionDiffEditPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-1", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+ pipe.inverse_scheduler = DPMSolverMultistepInverseScheduler.from_config(pipe.scheduler.config)
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ source_prompt = "a bowl of fruit"
+ target_prompt = "a bowl of pears"
+
+ mask_image = pipe.generate_mask(
+ image=self.raw_image,
+ source_prompt=source_prompt,
+ target_prompt=target_prompt,
+ generator=generator,
+ )
+
+ inv_latents = pipe.invert(
+ prompt=source_prompt,
+ image=self.raw_image,
+ inpaint_strength=0.7,
+ generator=generator,
+ num_inference_steps=25,
+ ).latents
+
+ image = pipe(
+ prompt=target_prompt,
+ mask_image=mask_image,
+ image_latents=inv_latents,
+ generator=generator,
+ negative_prompt=source_prompt,
+ inpaint_strength=0.7,
+ num_inference_steps=25,
+ output_type="numpy",
+ ).images[0]
+
+ expected_image = (
+ np.array(
+ load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/diffedit/pears.png"
+ ).resize((768, 768))
+ )
+ / 255
+ )
+        assert np.abs(expected_image - image).max() < 5e-1
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion_flax.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_flax.py
new file mode 100644
index 0000000..afc2c86
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_flax.py
@@ -0,0 +1,108 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+from diffusers import FlaxDPMSolverMultistepScheduler, FlaxStableDiffusionPipeline
+from diffusers.utils import is_flax_available
+from diffusers.utils.testing_utils import nightly, require_flax
+
+
+if is_flax_available():
+ import jax
+ import jax.numpy as jnp
+ from flax.jax_utils import replicate
+ from flax.training.common_utils import shard
+
+
+@nightly
+@require_flax
+class FlaxStableDiffusion2PipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+
+ def test_stable_diffusion_flax(self):
+ sd_pipe, params = FlaxStableDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2",
+ revision="bf16",
+ dtype=jnp.bfloat16,
+ )
+
+ prompt = "A painting of a squirrel eating a burger"
+ num_samples = jax.device_count()
+ prompt = num_samples * [prompt]
+ prompt_ids = sd_pipe.prepare_inputs(prompt)
+
+ params = replicate(params)
+ prompt_ids = shard(prompt_ids)
+
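+        # one PRNG key per device so the pmapped (jit=True) pipeline draws independent noise on each device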
+ prng_seed = jax.random.PRNGKey(0)
+ prng_seed = jax.random.split(prng_seed, jax.device_count())
+
+ images = sd_pipe(prompt_ids, params, prng_seed, num_inference_steps=25, jit=True)[0]
+ assert images.shape == (jax.device_count(), 1, 768, 768, 3)
+
+ images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
+ image_slice = images[0, 253:256, 253:256, -1]
+
+ output_slice = jnp.asarray(jax.device_get(image_slice.flatten()))
+ expected_slice = jnp.array([0.4238, 0.4414, 0.4395, 0.4453, 0.4629, 0.4590, 0.4531, 0.45508, 0.4512])
+ print(f"output_slice: {output_slice}")
+ assert jnp.abs(output_slice - expected_slice).max() < 1e-2
+
+
+@nightly
+@require_flax
+class FlaxStableDiffusion2PipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+
+ def test_stable_diffusion_dpm_flax(self):
+ model_id = "stabilityai/stable-diffusion-2"
+ scheduler, scheduler_params = FlaxDPMSolverMultistepScheduler.from_pretrained(model_id, subfolder="scheduler")
+ sd_pipe, params = FlaxStableDiffusionPipeline.from_pretrained(
+ model_id,
+ scheduler=scheduler,
+ revision="bf16",
+ dtype=jnp.bfloat16,
+ )
+ params["scheduler"] = scheduler_params
+
+ prompt = "A painting of a squirrel eating a burger"
+ num_samples = jax.device_count()
+ prompt = num_samples * [prompt]
+ prompt_ids = sd_pipe.prepare_inputs(prompt)
+
+ params = replicate(params)
+ prompt_ids = shard(prompt_ids)
+
+ prng_seed = jax.random.PRNGKey(0)
+ prng_seed = jax.random.split(prng_seed, jax.device_count())
+
+ images = sd_pipe(prompt_ids, params, prng_seed, num_inference_steps=25, jit=True)[0]
+ assert images.shape == (jax.device_count(), 1, 768, 768, 3)
+
+ images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
+ image_slice = images[0, 253:256, 253:256, -1]
+
+ output_slice = jnp.asarray(jax.device_get(image_slice.flatten()))
+ expected_slice = jnp.array([0.4336, 0.42969, 0.4453, 0.4199, 0.4297, 0.4531, 0.4434, 0.4434, 0.4297])
+ print(f"output_slice: {output_slice}")
+ assert jnp.abs(output_slice - expected_slice).max() < 1e-2
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion_flax_inpaint.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_flax_inpaint.py
new file mode 100644
index 0000000..8f03998
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_flax_inpaint.py
@@ -0,0 +1,82 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+from diffusers import FlaxStableDiffusionInpaintPipeline
+from diffusers.utils import is_flax_available, load_image
+from diffusers.utils.testing_utils import require_flax, slow
+
+
+if is_flax_available():
+ import jax
+ import jax.numpy as jnp
+ from flax.jax_utils import replicate
+ from flax.training.common_utils import shard
+
+
+@slow
+@require_flax
+class FlaxStableDiffusionInpaintPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+
+ def test_stable_diffusion_inpaint_pipeline(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/sd2-inpaint/init_image.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-inpaint/mask.png"
+ )
+
+ model_id = "xvjiarui/stable-diffusion-2-inpainting"
+ pipeline, params = FlaxStableDiffusionInpaintPipeline.from_pretrained(model_id, safety_checker=None)
+
+ prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
+
+ prng_seed = jax.random.PRNGKey(0)
+ num_inference_steps = 50
+
+ num_samples = jax.device_count()
+ prompt = num_samples * [prompt]
+ init_image = num_samples * [init_image]
+ mask_image = num_samples * [mask_image]
+ prompt_ids, processed_masked_images, processed_masks = pipeline.prepare_inputs(prompt, init_image, mask_image)
+
+ # shard inputs and rng
+ params = replicate(params)
+ prng_seed = jax.random.split(prng_seed, jax.device_count())
+ prompt_ids = shard(prompt_ids)
+ processed_masked_images = shard(processed_masked_images)
+ processed_masks = shard(processed_masks)
+
+ output = pipeline(
+ prompt_ids, processed_masks, processed_masked_images, params, prng_seed, num_inference_steps, jit=True
+ )
+
+ images = output.images.reshape(num_samples, 512, 512, 3)
+
+ image_slice = images[0, 253:256, 253:256, -1]
+
+ output_slice = jnp.asarray(jax.device_get(image_slice.flatten()))
+ expected_slice = jnp.array(
+ [0.3611307, 0.37649736, 0.3757408, 0.38213953, 0.39295167, 0.3841631, 0.41554978, 0.4137475, 0.4217084]
+ )
+ print(f"output_slice: {output_slice}")
+ assert jnp.abs(output_slice - expected_slice).max() < 1e-2
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion_inpaint.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_inpaint.py
new file mode 100644
index 0000000..6157b32
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_inpaint.py
@@ -0,0 +1,277 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import AutoencoderKL, PNDMScheduler, StableDiffusionInpaintPipeline, UNet2DConditionModel
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_INPAINTING_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+)
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class StableDiffusion2InpaintPipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionInpaintPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ image_params = frozenset(
+ []
+ ) # TO-DO: update image_params once pipeline is refactored with VaeImageProcessor.preprocess
+ image_latents_params = frozenset([])
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS.union({"mask", "masked_image_latents"})
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=9,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=512,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ # TODO: use tensor inputs instead of PIL, this is here just to leave the old expected_slices untouched
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+ mask_image = Image.fromarray(np.uint8(image + 4)).convert("RGB").resize((64, 64))
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": init_image,
+ "mask_image": mask_image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_inpaint(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionInpaintPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4727, 0.5735, 0.3941, 0.5446, 0.5926, 0.4394, 0.5062, 0.4654, 0.4476])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionInpaintPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_diffusion_inpaint_pipeline(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/sd2-inpaint/init_image.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-inpaint/mask.png"
+ )
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-inpaint"
+ "/yellow_cat_sitting_on_a_park_bench.npy"
+ )
+
+ model_id = "stabilityai/stable-diffusion-2-inpainting"
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(model_id, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
+
+ generator = torch.manual_seed(0)
+ output = pipe(
+ prompt=prompt,
+ image=init_image,
+ mask_image=mask_image,
+ generator=generator,
+ output_type="np",
+ )
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+ assert np.abs(expected_image - image).max() < 9e-3
+
+ def test_stable_diffusion_inpaint_pipeline_fp16(self):
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/sd2-inpaint/init_image.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-inpaint/mask.png"
+ )
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-inpaint"
+ "/yellow_cat_sitting_on_a_park_bench_fp16.npy"
+ )
+
+ model_id = "stabilityai/stable-diffusion-2-inpainting"
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ model_id,
+ torch_dtype=torch.float16,
+ safety_checker=None,
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
+
+ generator = torch.manual_seed(0)
+ output = pipe(
+ prompt=prompt,
+ image=init_image,
+ mask_image=mask_image,
+ generator=generator,
+ output_type="np",
+ )
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+ assert np.abs(expected_image - image).max() < 5e-1
+
+ def test_stable_diffusion_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ init_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/sd2-inpaint/init_image.png"
+ )
+ mask_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-inpaint/mask.png"
+ )
+
+ model_id = "stabilityai/stable-diffusion-2-inpainting"
+ pndm = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+ model_id,
+ safety_checker=None,
+ scheduler=pndm,
+ torch_dtype=torch.float16,
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
+
+ generator = torch.manual_seed(0)
+ _ = pipe(
+ prompt=prompt,
+ image=init_image,
+ mask_image=mask_image,
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ )
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.65 GB is allocated
+ assert mem_bytes < 2.65 * 10**9
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion_latent_upscale.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_latent_upscale.py
new file mode 100644
index 0000000..04721b4
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_latent_upscale.py
@@ -0,0 +1,306 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+import diffusers
+from diffusers import (
+ AutoencoderKL,
+ EulerDiscreteScheduler,
+ StableDiffusionLatentUpscalePipeline,
+ StableDiffusionPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.schedulers import KarrasDiffusionSchedulers
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS, TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+def check_same_shape(tensor_list):
+ shapes = [tensor.shape for tensor in tensor_list]
+ return all(shape == shapes[0] for shape in shapes[1:])
+
+
+class StableDiffusionLatentUpscalePipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionLatentUpscalePipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {
+ "height",
+ "width",
+ "cross_attention_kwargs",
+ "negative_prompt_embeds",
+ "prompt_embeds",
+ }
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"num_images_per_prompt"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = frozenset(
+ []
+ ) # TO-DO: update image_params once pipeline is refactored with VaeImageProcessor.preprocess
+ image_latents_params = frozenset([])
+
+ @property
+ def dummy_image(self):
+ batch_size = 1
+ num_channels = 4
+ sizes = (16, 16)
+
+ image = floats_tensor((batch_size, num_channels) + sizes, rng=random.Random(0)).to(torch_device)
+ return image
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ model = UNet2DConditionModel(
+ act_fn="gelu",
+ attention_head_dim=8,
+ norm_num_groups=None,
+ block_out_channels=[32, 32, 64, 64],
+ time_cond_proj_dim=160,
+ conv_in_kernel=1,
+ conv_out_kernel=1,
+ cross_attention_dim=32,
+ down_block_types=(
+ "KDownBlock2D",
+ "KCrossAttnDownBlock2D",
+ "KCrossAttnDownBlock2D",
+ "KCrossAttnDownBlock2D",
+ ),
+ in_channels=8,
+ mid_block_type=None,
+ only_cross_attention=False,
+ out_channels=5,
+ resnet_time_scale_shift="scale_shift",
+ time_embedding_type="fourier",
+ timestep_post_act="gelu",
+ up_block_types=("KCrossAttnUpBlock2D", "KCrossAttnUpBlock2D", "KCrossAttnUpBlock2D", "KUpBlock2D"),
+ )
+ vae = AutoencoderKL(
+ block_out_channels=[32, 32, 64, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=[
+ "DownEncoderBlock2D",
+ "DownEncoderBlock2D",
+ "DownEncoderBlock2D",
+ "DownEncoderBlock2D",
+ ],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ scheduler = EulerDiscreteScheduler(prediction_type="sample")
+ text_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ hidden_act="quick_gelu",
+ projection_dim=512,
+ )
+ text_encoder = CLIPTextModel(text_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": model.eval(),
+ "vae": vae.eval(),
+ "scheduler": scheduler,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": self.dummy_image.cpu(),
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_inference(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ self.assertEqual(image.shape, (1, 256, 256, 3))
+ expected_slice = np.array(
+ [0.47222412, 0.41921633, 0.44717434, 0.46874192, 0.42588258, 0.46150726, 0.4677534, 0.45583832, 0.48579055]
+ )
+ max_diff = np.abs(image_slice.flatten() - expected_slice).max()
+ self.assertLessEqual(max_diff, 1e-3)
+
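+    # several shared mixin tests are overridden below, mostly to relax their numerical tolerances for this pipeline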
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=7e-3)
+
+ def test_sequential_cpu_offload_forward_pass(self):
+ super().test_sequential_cpu_offload_forward_pass(expected_max_diff=3e-3)
+
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent(expected_max_difference=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=7e-3)
+
+ def test_pt_np_pil_outputs_equivalent(self):
+ super().test_pt_np_pil_outputs_equivalent(expected_max_diff=3e-3)
+
+ def test_save_load_local(self):
+ super().test_save_load_local(expected_max_difference=3e-3)
+
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=3e-3)
+
+ def test_karras_schedulers_shape(self):
+ skip_schedulers = [
+ "DDIMScheduler",
+ "DDPMScheduler",
+ "PNDMScheduler",
+ "HeunDiscreteScheduler",
+ "EulerAncestralDiscreteScheduler",
+ "KDPM2DiscreteScheduler",
+ "KDPM2AncestralDiscreteScheduler",
+ "DPMSolverSDEScheduler",
+ "EDMEulerScheduler",
+ ]
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+
+ # make sure that PNDM does not need warm-up
+ pipe.scheduler.register_to_config(skip_prk_steps=True)
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = 2
+
+ outputs = []
+ for scheduler_enum in KarrasDiffusionSchedulers:
+ if scheduler_enum.name in skip_schedulers:
+                # skip schedulers that the latent upscale pipeline does not support
+ continue
+
+ scheduler_cls = getattr(diffusers, scheduler_enum.name)
+ pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
+ output = pipe(**inputs)[0]
+ outputs.append(output)
+
+ assert check_same_shape(outputs)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=5e-1)
+
+
+@require_torch_gpu
+@slow
+class StableDiffusionLatentUpscalePipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_latent_upscaler_fp16(self):
+ generator = torch.manual_seed(33)
+
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
+ pipe.to("cuda")
+
+ upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
+ "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
+ )
+ upscaler.to("cuda")
+
+ prompt = "a photo of an astronaut high resolution, unreal engine, ultra realistic"
+
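+        # keep the text-to-image output in latent space so it can be passed directly to the latent upscaler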
+ low_res_latents = pipe(prompt, generator=generator, output_type="latent").images
+
+ image = upscaler(
+ prompt=prompt,
+ image=low_res_latents,
+ num_inference_steps=20,
+ guidance_scale=0,
+ generator=generator,
+ output_type="np",
+ ).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/latent-upscaler/astronaut_1024.npy"
+ )
+        assert np.abs(expected_image - image).mean() < 5e-2
+
+ def test_latent_upscaler_fp16_image(self):
+ generator = torch.manual_seed(33)
+
+ upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
+ "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
+ )
+ upscaler.to("cuda")
+
+ prompt = "the temple of fire by Ross Tran and Gerardo Dottori, oil on canvas"
+
+ low_res_img = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/latent-upscaler/fire_temple_512.png"
+ )
+
+ image = upscaler(
+ prompt=prompt,
+ image=low_res_img,
+ num_inference_steps=20,
+ guidance_scale=0,
+ generator=generator,
+ output_type="np",
+ ).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/latent-upscaler/fire_temple_1024.npy"
+ )
+        assert np.abs(expected_image - image).max() < 5e-2
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion_upscale.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_upscale.py
new file mode 100644
index 0000000..4dd6121
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_upscale.py
@@ -0,0 +1,552 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import AutoencoderKL, DDIMScheduler, DDPMScheduler, StableDiffusionUpscalePipeline, UNet2DConditionModel
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+
+enable_full_determinism()
+
+
+class StableDiffusionUpscalePipelineFastTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ @property
+ def dummy_image(self):
+ batch_size = 1
+ num_channels = 3
+ sizes = (32, 32)
+
+ image = floats_tensor((batch_size, num_channels) + sizes, rng=random.Random(0)).to(torch_device)
+ return image
+
+ @property
+ def dummy_cond_unet_upscale(self):
+ torch.manual_seed(0)
+ model = UNet2DConditionModel(
+ block_out_channels=(32, 32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=7,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ # SD2-specific config below
+ attention_head_dim=8,
+ use_linear_projection=True,
+ only_cross_attention=(True, True, False),
+ num_class_embeds=100,
+ )
+ return model
+
+ @property
+ def dummy_vae(self):
+ torch.manual_seed(0)
+ model = AutoencoderKL(
+ block_out_channels=[32, 32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ return model
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=512,
+ )
+ return CLIPTextModel(config)
+
+ def test_stable_diffusion_upscale(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unet = self.dummy_cond_unet_upscale
+ low_res_scheduler = DDPMScheduler()
+ scheduler = DDIMScheduler(prediction_type="v_prediction")
+ vae = self.dummy_vae
+ text_encoder = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ image = self.dummy_image.cpu().permute(0, 2, 3, 1)[0]
+ low_res_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+
+        # assemble the upscale pipeline from the dummy components
+ sd_pipe = StableDiffusionUpscalePipeline(
+ unet=unet,
+ low_res_scheduler=low_res_scheduler,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=text_encoder,
+ tokenizer=tokenizer,
+ max_noise_level=350,
+ )
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = sd_pipe(
+ [prompt],
+ image=low_res_image,
+ generator=generator,
+ guidance_scale=6.0,
+ noise_level=20,
+ num_inference_steps=2,
+ output_type="np",
+ )
+
+ image = output.images
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_from_tuple = sd_pipe(
+ [prompt],
+ image=low_res_image,
+ generator=generator,
+ guidance_scale=6.0,
+ noise_level=20,
+ num_inference_steps=2,
+ output_type="np",
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
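+        # the upscale pipeline produces an output four times the size of the low-resolution input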
+ expected_height_width = low_res_image.size[0] * 4
+ assert image.shape == (1, expected_height_width, expected_height_width, 3)
+ expected_slice = np.array([0.3113, 0.3910, 0.4272, 0.4859, 0.5061, 0.4652, 0.5362, 0.5715, 0.5661])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_upscale_batch(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unet = self.dummy_cond_unet_upscale
+ low_res_scheduler = DDPMScheduler()
+ scheduler = DDIMScheduler(prediction_type="v_prediction")
+ vae = self.dummy_vae
+ text_encoder = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ image = self.dummy_image.cpu().permute(0, 2, 3, 1)[0]
+ low_res_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+
+ # assemble the upscale pipeline from the dummy components
+ sd_pipe = StableDiffusionUpscalePipeline(
+ unet=unet,
+ low_res_scheduler=low_res_scheduler,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=text_encoder,
+ tokenizer=tokenizer,
+ max_noise_level=350,
+ )
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ output = sd_pipe(
+ 2 * [prompt],
+ image=2 * [low_res_image],
+ guidance_scale=6.0,
+ noise_level=20,
+ num_inference_steps=2,
+ output_type="np",
+ )
+ image = output.images
+ assert image.shape[0] == 2
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = sd_pipe(
+ [prompt],
+ image=low_res_image,
+ generator=generator,
+ num_images_per_prompt=2,
+ guidance_scale=6.0,
+ noise_level=20,
+ num_inference_steps=2,
+ output_type="np",
+ )
+ image = output.images
+ assert image.shape[0] == 2
+
+ def test_stable_diffusion_upscale_prompt_embeds(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unet = self.dummy_cond_unet_upscale
+ low_res_scheduler = DDPMScheduler()
+ scheduler = DDIMScheduler(prediction_type="v_prediction")
+ vae = self.dummy_vae
+ text_encoder = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ image = self.dummy_image.cpu().permute(0, 2, 3, 1)[0]
+ low_res_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+
+ # assemble the upscale pipeline from the dummy components
+ sd_pipe = StableDiffusionUpscalePipeline(
+ unet=unet,
+ low_res_scheduler=low_res_scheduler,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=text_encoder,
+ tokenizer=tokenizer,
+ max_noise_level=350,
+ )
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = sd_pipe(
+ [prompt],
+ image=low_res_image,
+ generator=generator,
+ guidance_scale=6.0,
+ noise_level=20,
+ num_inference_steps=2,
+ output_type="np",
+ )
+
+ image = output.images
+
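+ # encode the prompt outside of the pipeline and pass the resulting embeddings directly;
+ # the output should match the run above that used the plain text prompt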
+ generator = torch.Generator(device=device).manual_seed(0)
+ prompt_embeds, negative_prompt_embeds = sd_pipe.encode_prompt(prompt, device, 1, False)
+ if negative_prompt_embeds is not None:
+ prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
+
+ image_from_prompt_embeds = sd_pipe(
+ prompt_embeds=prompt_embeds,
+ image=[low_res_image],
+ generator=generator,
+ guidance_scale=6.0,
+ noise_level=20,
+ num_inference_steps=2,
+ output_type="np",
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_prompt_embeds_slice = image_from_prompt_embeds[0, -3:, -3:, -1]
+
+ expected_height_width = low_res_image.size[0] * 4
+ assert image.shape == (1, expected_height_width, expected_height_width, 3)
+ expected_slice = np.array([0.3113, 0.3910, 0.4272, 0.4859, 0.5061, 0.4652, 0.5362, 0.5715, 0.5661])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_prompt_embeds_slice.flatten() - expected_slice).max() < 1e-2
+
+ @unittest.skipIf(torch_device != "cuda", "This test requires a GPU")
+ def test_stable_diffusion_upscale_fp16(self):
+ """Test that stable diffusion upscale works with fp16"""
+ unet = self.dummy_cond_unet_upscale
+ low_res_scheduler = DDPMScheduler()
+ scheduler = DDIMScheduler(prediction_type="v_prediction")
+ vae = self.dummy_vae
+ text_encoder = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ image = self.dummy_image.cpu().permute(0, 2, 3, 1)[0]
+ low_res_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+
+ # put models in fp16, except the VAE, which overflows in fp16
+ unet = unet.half()
+ text_encoder = text_encoder.half()
+
+ # assemble the upscale pipeline from the dummy components
+ sd_pipe = StableDiffusionUpscalePipeline(
+ unet=unet,
+ low_res_scheduler=low_res_scheduler,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=text_encoder,
+ tokenizer=tokenizer,
+ max_noise_level=350,
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+ image = sd_pipe(
+ [prompt],
+ image=low_res_image,
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ expected_height_width = low_res_image.size[0] * 4
+ assert image.shape == (1, expected_height_width, expected_height_width, 3)
+
+ def test_stable_diffusion_upscale_from_save_pretrained(self):
+ pipes = []
+
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ low_res_scheduler = DDPMScheduler()
+ scheduler = DDIMScheduler(prediction_type="v_prediction")
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ # assemble the upscale pipeline from the dummy components
+ sd_pipe = StableDiffusionUpscalePipeline(
+ unet=self.dummy_cond_unet_upscale,
+ low_res_scheduler=low_res_scheduler,
+ scheduler=scheduler,
+ vae=self.dummy_vae,
+ text_encoder=self.dummy_text_encoder,
+ tokenizer=tokenizer,
+ max_noise_level=350,
+ )
+ sd_pipe = sd_pipe.to(device)
+ pipes.append(sd_pipe)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ sd_pipe.save_pretrained(tmpdirname)
+ sd_pipe = StableDiffusionUpscalePipeline.from_pretrained(tmpdirname).to(device)
+ pipes.append(sd_pipe)
+
+ prompt = "A painting of a squirrel eating a burger"
+ image = self.dummy_image.cpu().permute(0, 2, 3, 1)[0]
+ low_res_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+
+ image_slices = []
+ for pipe in pipes:
+ generator = torch.Generator(device=device).manual_seed(0)
+ image = pipe(
+ [prompt],
+ image=low_res_image,
+ generator=generator,
+ guidance_scale=6.0,
+ noise_level=20,
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionUpscalePipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_diffusion_upscale_pipeline(self):
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/sd2-upscale/low_res_cat.png"
+ )
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale"
+ "/upsampled_cat.npy"
+ )
+
+ model_id = "stabilityai/stable-diffusion-x4-upscaler"
+ pipe = StableDiffusionUpscalePipeline.from_pretrained(model_id)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ prompt = "a cat sitting on a park bench"
+
+ generator = torch.manual_seed(0)
+ output = pipe(
+ prompt=prompt,
+ image=image,
+ generator=generator,
+ output_type="np",
+ )
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+ assert np.abs(expected_image - image).max() < 1e-3
+
+ def test_stable_diffusion_upscale_pipeline_fp16(self):
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/sd2-upscale/low_res_cat.png"
+ )
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale"
+ "/upsampled_cat_fp16.npy"
+ )
+
+ model_id = "stabilityai/stable-diffusion-x4-upscaler"
+ pipe = StableDiffusionUpscalePipeline.from_pretrained(
+ model_id,
+ torch_dtype=torch.float16,
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ prompt = "a cat sitting on a park bench"
+
+ generator = torch.manual_seed(0)
+ output = pipe(
+ prompt=prompt,
+ image=image,
+ generator=generator,
+ output_type="np",
+ )
+ image = output.images[0]
+
+ assert image.shape == (512, 512, 3)
+ assert np.abs(expected_image - image).max() < 5e-1
+
+ def test_stable_diffusion_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/sd2-upscale/low_res_cat.png"
+ )
+
+ model_id = "stabilityai/stable-diffusion-x4-upscaler"
+ pipe = StableDiffusionUpscalePipeline.from_pretrained(
+ model_id,
+ torch_dtype=torch.float16,
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
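+ # aggressive attention slicing combined with sequential CPU offload keeps peak
+ # VRAM usage low enough for the memory assertion below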
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ prompt = "a cat sitting on a park bench"
+
+ generator = torch.manual_seed(0)
+ _ = pipe(
+ prompt=prompt,
+ image=image,
+ generator=generator,
+ num_inference_steps=5,
+ output_type="np",
+ )
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.9 GB is allocated
+ assert mem_bytes < 2.9 * 10**9
+
+ def test_download_ckpt_diff_format_is_same(self):
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/sd2-upscale/low_res_cat.png"
+ )
+
+ prompt = "a cat sitting on a park bench"
+ model_id = "stabilityai/stable-diffusion-x4-upscaler"
+ pipe = StableDiffusionUpscalePipeline.from_pretrained(model_id)
+ pipe.enable_model_cpu_offload()
+
+ generator = torch.Generator("cpu").manual_seed(0)
+ output = pipe(prompt=prompt, image=image, generator=generator, output_type="np", num_inference_steps=3)
+ image_from_pretrained = output.images[0]
+
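+ # load the same weights from the single-file checkpoint and compare against the
+ # folder-based `from_pretrained` output above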
+ single_file_path = (
+ "https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler/blob/main/x4-upscaler-ema.safetensors"
+ )
+ pipe_from_single_file = StableDiffusionUpscalePipeline.from_single_file(single_file_path)
+ pipe_from_single_file.enable_model_cpu_offload()
+
+ generator = torch.Generator("cpu").manual_seed(0)
+ output_from_single_file = pipe_from_single_file(
+ prompt=prompt, image=image, generator=generator, output_type="np", num_inference_steps=3
+ )
+ image_from_single_file = output_from_single_file.images[0]
+
+ assert image_from_pretrained.shape == (512, 512, 3)
+ assert image_from_single_file.shape == (512, 512, 3)
+ assert (
+ numpy_cosine_similarity_distance(image_from_pretrained.flatten(), image_from_single_file.flatten()) < 1e-3
+ )
+
+ def test_single_file_component_configs(self):
+ pipe = StableDiffusionUpscalePipeline.from_pretrained(
+ "stabilityai/stable-diffusion-x4-upscaler", variant="fp16"
+ )
+
+ ckpt_path = (
+ "https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler/blob/main/x4-upscaler-ema.safetensors"
+ )
+ single_file_pipe = StableDiffusionUpscalePipeline.from_single_file(ckpt_path, load_safety_checker=True)
+
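+ # the component configs should match regardless of whether the pipeline was loaded
+ # from the Hub folder layout or from the single-file checkpoint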
+ for param_name, param_value in single_file_pipe.text_encoder.config.to_dict().items():
+ if param_name in ["torch_dtype", "architectures", "_name_or_path"]:
+ continue
+ assert pipe.text_encoder.config.to_dict()[param_name] == param_value
+
+ PARAMS_TO_IGNORE = ["torch_dtype", "_name_or_path", "architectures", "_use_default_values"]
+ for param_name, param_value in single_file_pipe.unet.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.unet.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.vae.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.vae.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.safety_checker.config.to_dict().items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.safety_checker.config.to_dict()[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
diff --git a/tests/pipelines/stable_diffusion_2/test_stable_diffusion_v_pred.py b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_v_pred.py
new file mode 100644
index 0000000..be5b639
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_2/test_stable_diffusion_v_pred.py
@@ -0,0 +1,563 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import time
+import unittest
+
+import numpy as np
+import torch
+from huggingface_hub import hf_hub_download
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ DPMSolverMultistepScheduler,
+ EulerDiscreteScheduler,
+ StableDiffusionPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.models.attention_processor import AttnProcessor
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_numpy,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+
+enable_full_determinism()
+
+
+class StableDiffusion2VPredictionPipelineFastTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ @property
+ def dummy_cond_unet(self):
+ torch.manual_seed(0)
+ model = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ )
+ return model
+
+ @property
+ def dummy_vae(self):
+ torch.manual_seed(0)
+ model = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ return model
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=64,
+ )
+ return CLIPTextModel(config)
+
+ def test_stable_diffusion_v_pred_ddim(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unet = self.dummy_cond_unet
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ prediction_type="v_prediction",
+ )
+
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ # assemble the pipeline from the dummy components
+ sd_pipe = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=None,
+ image_encoder=None,
+ requires_safety_checker=False,
+ )
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=6.0, num_inference_steps=2, output_type="np")
+ image = output.images
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_from_tuple = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=6.0,
+ num_inference_steps=2,
+ output_type="np",
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.6569, 0.6525, 0.5142, 0.4968, 0.4923, 0.4601, 0.4996, 0.5041, 0.4544])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_v_pred_k_euler(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unet = self.dummy_cond_unet
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", prediction_type="v_prediction"
+ )
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ # assemble the pipeline from the dummy components
+ sd_pipe = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=None,
+ image_encoder=None,
+ requires_safety_checker=False,
+ )
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=6.0, num_inference_steps=2, output_type="np")
+
+ image = output.images
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_from_tuple = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=6.0,
+ num_inference_steps=2,
+ output_type="np",
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5644, 0.6514, 0.5190, 0.5663, 0.5287, 0.4953, 0.5430, 0.5243, 0.4778])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ @unittest.skipIf(torch_device != "cuda", "This test requires a GPU")
+ def test_stable_diffusion_v_pred_fp16(self):
+ """Test that stable diffusion v-prediction works with fp16"""
+ unet = self.dummy_cond_unet
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ prediction_type="v_prediction",
+ )
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ # put models in fp16
+ unet = unet.half()
+ vae = vae.half()
+ bert = bert.half()
+
+ # assemble the pipeline from the dummy components
+ sd_pipe = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=None,
+ image_encoder=None,
+ requires_safety_checker=False,
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+ image = sd_pipe([prompt], generator=generator, num_inference_steps=2, output_type="np").images
+
+ assert image.shape == (1, 64, 64, 3)
+
+
+@slow
+@require_torch_gpu
+class StableDiffusion2VPredictionPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_diffusion_v_pred_default(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2")
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.enable_attention_slicing()
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=7.5, num_inference_steps=20, output_type="np")
+
+ image = output.images
+ image_slice = image[0, 253:256, 253:256, -1]
+
+ assert image.shape == (1, 768, 768, 3)
+ expected_slice = np.array([0.1868, 0.1922, 0.1527, 0.1921, 0.1908, 0.1624, 0.1779, 0.1652, 0.1734])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_v_pred_upcast_attention(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.enable_attention_slicing()
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=7.5, num_inference_steps=20, output_type="np")
+
+ image = output.images
+ image_slice = image[0, 253:256, 253:256, -1]
+
+ assert image.shape == (1, 768, 768, 3)
+ expected_slice = np.array([0.4209, 0.4087, 0.4097, 0.4209, 0.3860, 0.4329, 0.4280, 0.4324, 0.4187])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-2
+
+ def test_stable_diffusion_v_pred_euler(self):
+ scheduler = EulerDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-2", subfolder="scheduler")
+ sd_pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2", scheduler=scheduler)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.enable_attention_slicing()
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+
+ output = sd_pipe([prompt], generator=generator, num_inference_steps=5, output_type="numpy")
+ image = output.images
+
+ image_slice = image[0, 253:256, 253:256, -1]
+
+ assert image.shape == (1, 768, 768, 3)
+ expected_slice = np.array([0.1781, 0.1695, 0.1661, 0.1705, 0.1588, 0.1699, 0.2005, 0.1589, 0.1677])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_v_pred_dpm(self):
+ """
+ TODO: update this test after making DPM compatible with V-prediction!
+ """
+ scheduler = DPMSolverMultistepScheduler.from_pretrained(
+ "stabilityai/stable-diffusion-2",
+ subfolder="scheduler",
+ final_sigmas_type="sigma_min",
+ )
+ sd_pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2", scheduler=scheduler)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.enable_attention_slicing()
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "a photograph of an astronaut riding a horse"
+ generator = torch.manual_seed(0)
+ image = sd_pipe(
+ [prompt], generator=generator, guidance_scale=7.5, num_inference_steps=5, output_type="numpy"
+ ).images
+
+ image_slice = image[0, 253:256, 253:256, -1]
+ assert image.shape == (1, 768, 768, 3)
+ expected_slice = np.array([0.3303, 0.3184, 0.3291, 0.3300, 0.3256, 0.3113, 0.2965, 0.3134, 0.3192])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_attention_slicing_v_pred(self):
+ torch.cuda.reset_peak_memory_stats()
+ model_id = "stabilityai/stable-diffusion-2"
+ pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "a photograph of an astronaut riding a horse"
+
+ # make attention efficient
+ pipe.enable_attention_slicing()
+ generator = torch.manual_seed(0)
+ output_chunked = pipe(
+ [prompt], generator=generator, guidance_scale=7.5, num_inference_steps=10, output_type="numpy"
+ )
+ image_chunked = output_chunked.images
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+ # make sure that less than 5.5 GB is allocated
+ assert mem_bytes < 5.5 * 10**9
+
+ # disable slicing
+ pipe.disable_attention_slicing()
+ generator = torch.manual_seed(0)
+ output = pipe([prompt], generator=generator, guidance_scale=7.5, num_inference_steps=10, output_type="numpy")
+ image = output.images
+
+ # make sure that more than 3.0 GB is allocated
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes > 3 * 10**9
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), image_chunked.flatten())
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_text2img_pipeline_v_pred_default(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/"
+ "sd2-text2img/astronaut_riding_a_horse_v_pred.npy"
+ )
+
+ pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2")
+ pipe.to(torch_device)
+ pipe.enable_attention_slicing()
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "astronaut riding a horse"
+
+ generator = torch.manual_seed(0)
+ output = pipe(prompt=prompt, guidance_scale=7.5, generator=generator, output_type="np")
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), expected_image.flatten())
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_text2img_pipeline_unflawed(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/"
+ "sd2-text2img/lion_galaxy.npy"
+ )
+
+ pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
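+ # "trailing" timestep spacing and zero-terminal-SNR beta rescaling (together with
+ # guidance_rescale below) follow the fixes proposed in "Common Diffusion Noise
+ # Schedules and Sample Steps are Flawed"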
+ pipe.scheduler = DDIMScheduler.from_config(
+ pipe.scheduler.config, timestep_spacing="trailing", rescale_betas_zero_snr=True
+ )
+ pipe.to(torch_device)
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
+
+ generator = torch.Generator("cpu").manual_seed(0)
+ output = pipe(
+ prompt=prompt,
+ guidance_scale=7.5,
+ num_inference_steps=10,
+ guidance_rescale=0.7,
+ generator=generator,
+ output_type="np",
+ )
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), expected_image.flatten())
+ assert max_diff < 5e-2
+
+ def test_stable_diffusion_text2img_pipeline_v_pred_fp16(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/"
+ "sd2-text2img/astronaut_riding_a_horse_v_pred_fp16.npy"
+ )
+
+ pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "astronaut riding a horse"
+
+ generator = torch.manual_seed(0)
+ output = pipe(prompt=prompt, guidance_scale=7.5, generator=generator, output_type="np")
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), expected_image.flatten())
+ assert max_diff < 1e-3
+
+ def test_download_local(self):
+ filename = hf_hub_download("stabilityai/stable-diffusion-2-1", filename="v2-1_768-ema-pruned.safetensors")
+
+ pipe = StableDiffusionPipeline.from_single_file(filename, torch_dtype=torch.float16)
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.enable_model_cpu_offload()
+
+ image_out = pipe("test", num_inference_steps=1, output_type="np").images[0]
+
+ assert image_out.shape == (768, 768, 3)
+
+ def test_download_ckpt_diff_format_is_same(self):
+ single_file_path = (
+ "https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/v2-1_768-ema-pruned.safetensors"
+ )
+
+ pipe_single = StableDiffusionPipeline.from_single_file(single_file_path)
+ pipe_single.scheduler = DDIMScheduler.from_config(pipe_single.scheduler.config)
+ pipe_single.unet.set_attn_processor(AttnProcessor())
+ pipe_single.enable_model_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image_ckpt = pipe_single("a turtle", num_inference_steps=2, generator=generator, output_type="np").images[0]
+
+ pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_attn_processor(AttnProcessor())
+ pipe.enable_model_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image = pipe("a turtle", num_inference_steps=2, generator=generator, output_type="np").images[0]
+
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), image_ckpt.flatten())
+ assert max_diff < 1e-3
+
+ def test_stable_diffusion_text2img_intermediate_state_v_pred(self):
+ number_of_steps = 0
+
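+ # the callback inspects the intermediate latents at the first and the last denoising step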
+ def test_callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> None:
+ test_callback_fn.has_been_called = True
+ nonlocal number_of_steps
+ number_of_steps += 1
+ if step == 0:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 96, 96)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array([0.7749, 0.0325, 0.5088, 0.1619, 0.3372, 0.3667, -0.5186, 0.6860, 1.4326])
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+ elif step == 19:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 96, 96)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array([1.3887, 1.0273, 1.7266, 0.0726, 0.6611, 0.1598, -1.0547, 0.1522, 0.0227])
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+
+ test_callback_fn.has_been_called = False
+
+ pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2", torch_dtype=torch.float16)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ prompt = "Andromeda galaxy in a bottle"
+
+ generator = torch.manual_seed(0)
+ pipe(
+ prompt=prompt,
+ num_inference_steps=20,
+ guidance_scale=7.5,
+ generator=generator,
+ callback=test_callback_fn,
+ callback_steps=1,
+ )
+ assert test_callback_fn.has_been_called
+ assert number_of_steps == 20
+
+ def test_stable_diffusion_low_cpu_mem_usage_v_pred(self):
+ pipeline_id = "stabilityai/stable-diffusion-2"
+
+ start_time = time.time()
+ pipeline_low_cpu_mem_usage = StableDiffusionPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16)
+ pipeline_low_cpu_mem_usage.to(torch_device)
+ low_cpu_mem_usage_time = time.time() - start_time
+
+ start_time = time.time()
+ _ = StableDiffusionPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16, low_cpu_mem_usage=False)
+ normal_load_time = time.time() - start_time
+
+ assert 2 * low_cpu_mem_usage_time < normal_load_time
+
+ def test_stable_diffusion_pipeline_with_sequential_cpu_offloading_v_pred(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipeline_id = "stabilityai/stable-diffusion-2"
+ prompt = "Andromeda galaxy in a bottle"
+
+ pipeline = StableDiffusionPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16)
+ pipeline = pipeline.to(torch_device)
+ pipeline.enable_attention_slicing(1)
+ pipeline.enable_sequential_cpu_offload()
+
+ generator = torch.manual_seed(0)
+ _ = pipeline(prompt, generator=generator, num_inference_steps=5)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.8 GB is allocated
+ assert mem_bytes < 2.8 * 10**9
diff --git a/tests/pipelines/stable_diffusion_adapter/__init__.py b/tests/pipelines/stable_diffusion_adapter/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_adapter/test_stable_diffusion_adapter.py b/tests/pipelines/stable_diffusion_adapter/test_stable_diffusion_adapter.py
new file mode 100644
index 0000000..f1b61c3
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_adapter/test_stable_diffusion_adapter.py
@@ -0,0 +1,950 @@
+# coding=utf-8
+# Copyright 2022 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from parameterized import parameterized
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+import diffusers
+from diffusers import (
+ AutoencoderKL,
+ LCMScheduler,
+ MultiAdapter,
+ PNDMScheduler,
+ StableDiffusionAdapterPipeline,
+ T2IAdapter,
+ UNet2DConditionModel,
+)
+from diffusers.utils import logging
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS, TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class AdapterTests:
+ pipeline_class = StableDiffusionAdapterPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+
+ def get_dummy_components(self, adapter_type, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ time_cond_proj_dim=time_cond_proj_dim,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ torch.manual_seed(0)
+
+ if adapter_type in ("full_adapter", "light_adapter"):
+ adapter = T2IAdapter(
+ in_channels=3,
+ channels=[32, 64],
+ num_res_blocks=2,
+ downscale_factor=2,
+ adapter_type=adapter_type,
+ )
+ elif adapter_type == "multi_adapter":
+ adapter = MultiAdapter(
+ [
+ T2IAdapter(
+ in_channels=3,
+ channels=[32, 64],
+ num_res_blocks=2,
+ downscale_factor=2,
+ adapter_type="full_adapter",
+ ),
+ T2IAdapter(
+ in_channels=3,
+ channels=[32, 64],
+ num_res_blocks=2,
+ downscale_factor=2,
+ adapter_type="full_adapter",
+ ),
+ ]
+ )
+ else:
+ raise ValueError(
+ f"Unknown adapter type: {adapter_type}, must be one of 'full_adapter', 'light_adapter', or 'multi_adapter''"
+ )
+
+ components = {
+ "adapter": adapter,
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ }
+ return components
+
+ def get_dummy_components_with_full_downscaling(self, adapter_type):
+ """Get dummy components with x8 VAE downscaling and 4 UNet down blocks.
+ These dummy components are intended to fully-exercise the T2I-Adapter
+ downscaling behavior.
+ """
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 32, 32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D"),
+ up_block_types=("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 32, 32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ torch.manual_seed(0)
+
+ if adapter_type in ("full_adapter", "light_adapter"):
+ adapter = T2IAdapter(
+ in_channels=3,
+ channels=[32, 32, 32, 64],
+ num_res_blocks=2,
+ downscale_factor=8,
+ adapter_type=adapter_type,
+ )
+ elif adapter_type == "multi_adapter":
+ adapter = MultiAdapter(
+ [
+ T2IAdapter(
+ in_channels=3,
+ channels=[32, 32, 32, 64],
+ num_res_blocks=2,
+ downscale_factor=8,
+ adapter_type="full_adapter",
+ ),
+ T2IAdapter(
+ in_channels=3,
+ channels=[32, 32, 32, 64],
+ num_res_blocks=2,
+ downscale_factor=8,
+ adapter_type="full_adapter",
+ ),
+ ]
+ )
+ else:
+ raise ValueError(
+ f"Unknown adapter type: {adapter_type}, must be one of 'full_adapter', 'light_adapter', or 'multi_adapter''"
+ )
+
+ components = {
+ "adapter": adapter,
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0, height=64, width=64, num_images=1):
+ if num_images == 1:
+ image = floats_tensor((1, 3, height, width), rng=random.Random(seed)).to(device)
+ else:
+ image = [
+ floats_tensor((1, 3, height, width), rng=random.Random(seed)).to(device) for _ in range(num_images)
+ ]
+
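+ # device-specific generators are not reliably supported on mps, so use the global RNG seed there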
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_attention_slicing_forward_pass(self):
+ return self._test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(expected_max_diff=2e-3)
+
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=2e-3)
+
+ @parameterized.expand(
+ [
+ # (dim=264) The internal feature map will be 33x33 after initial pixel unshuffling (downscaled x8).
+ (((4 * 8 + 1) * 8),),
+ # (dim=272) The internal feature map will be 17x17 after the first T2I down block (downscaled x16).
+ (((4 * 4 + 1) * 16),),
+ # (dim=288) The internal feature map will be 9x9 after the second T2I down block (downscaled x32).
+ (((4 * 2 + 1) * 32),),
+ # (dim=320) The internal feature map will be 5x5 after the third T2I down block (downscaled x64).
+ (((4 * 1 + 1) * 64),),
+ ]
+ )
+ def test_multiple_image_dimensions(self, dim):
+ """Test that the T2I-Adapter pipeline supports any input dimension that
+ is divisible by the adapter's `downscale_factor`. This test was added in
+ response to an issue where the T2I Adapter's downscaling padding
+ behavior did not match the UNet's behavior.
+
+ Note that we have selected `dim` values to produce odd resolutions at
+ each downscaling level.
+ """
+ components = self.get_dummy_components_with_full_downscaling()
+ sd_pipe = StableDiffusionAdapterPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device, height=dim, width=dim)
+ image = sd_pipe(**inputs).images
+
+ assert image.shape == (1, dim, dim, 3)
+
+ def test_adapter_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionAdapterPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4535, 0.5493, 0.4359, 0.5452, 0.6086, 0.4441, 0.5544, 0.501, 0.4859])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_adapter_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionAdapterPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
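+ # drop num_inference_steps and pass explicit LCM timesteps instead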
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4535, 0.5493, 0.4359, 0.5452, 0.6086, 0.4441, 0.5544, 0.501, 0.4859])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+
+class StableDiffusionFullAdapterPipelineFastTests(AdapterTests, PipelineTesterMixin, unittest.TestCase):
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ return super().get_dummy_components("full_adapter", time_cond_proj_dim=time_cond_proj_dim)
+
+ def get_dummy_components_with_full_downscaling(self):
+ return super().get_dummy_components_with_full_downscaling("full_adapter")
+
+ def test_stable_diffusion_adapter_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionAdapterPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4858, 0.5500, 0.4278, 0.4669, 0.6184, 0.4322, 0.5010, 0.5033, 0.4746])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-3
+
+
+class StableDiffusionLightAdapterPipelineFastTests(AdapterTests, PipelineTesterMixin, unittest.TestCase):
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ return super().get_dummy_components("light_adapter", time_cond_proj_dim=time_cond_proj_dim)
+
+ def get_dummy_components_with_full_downscaling(self):
+ return super().get_dummy_components_with_full_downscaling("light_adapter")
+
+ def test_stable_diffusion_adapter_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionAdapterPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4965, 0.5548, 0.4330, 0.4771, 0.6226, 0.4382, 0.5037, 0.5071, 0.4782])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-3
+
+
+class StableDiffusionMultiAdapterPipelineFastTests(AdapterTests, PipelineTesterMixin, unittest.TestCase):
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ return super().get_dummy_components("multi_adapter", time_cond_proj_dim=time_cond_proj_dim)
+
+ def get_dummy_components_with_full_downscaling(self):
+ return super().get_dummy_components_with_full_downscaling("multi_adapter")
+
+ def get_dummy_inputs(self, device, height=64, width=64, seed=0):
+ inputs = super().get_dummy_inputs(device, seed, height=height, width=width, num_images=2)
+ inputs["adapter_conditioning_scale"] = [0.5, 0.5]
+ return inputs
+
+ def test_stable_diffusion_adapter_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionAdapterPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4902, 0.5539, 0.4317, 0.4682, 0.6190, 0.4351, 0.5018, 0.5046, 0.4772])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-3
+
+ def test_inference_batch_consistent(
+ self, batch_sizes=[2, 4, 13], additional_params_copy_to_batched_inputs=["num_inference_steps"]
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ for batch_size in batch_sizes:
+ batched_inputs = {}
+ for name, value in inputs.items():
+ if name in self.batch_params:
+ # prompt is string
+ if name == "prompt":
+ len_prompt = len(value)
+ # make unequal batch sizes
+ batched_inputs[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+
+ # make last batch super long
+ batched_inputs[name][-1] = 100 * "very long"
+ elif name == "image":
+ batched_images = []
+
+ for image in value:
+ batched_images.append(batch_size * [image])
+
+ batched_inputs[name] = batched_images
+ else:
+ batched_inputs[name] = batch_size * [value]
+
+ elif name == "batch_size":
+ batched_inputs[name] = batch_size
+ else:
+ batched_inputs[name] = value
+
+ for arg in additional_params_copy_to_batched_inputs:
+ batched_inputs[arg] = inputs[arg]
+
+ batched_inputs["output_type"] = "np"
+
+ if self.pipeline_class.__name__ == "DanceDiffusionPipeline":
+ batched_inputs.pop("output_type")
+
+ output = pipe(**batched_inputs)
+
+ assert len(output[0]) == batch_size
+
+ batched_inputs["output_type"] = "np"
+
+ if self.pipeline_class.__name__ == "DanceDiffusionPipeline":
+ batched_inputs.pop("output_type")
+
+ output = pipe(**batched_inputs)[0]
+
+ assert output.shape[0] == batch_size
+
+ logger.setLevel(level=diffusers.logging.WARNING)
+
+ def test_num_images_per_prompt(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ batch_sizes = [1, 2]
+ num_images_per_prompts = [1, 2]
+
+ for batch_size in batch_sizes:
+ for num_images_per_prompt in num_images_per_prompts:
+ inputs = self.get_dummy_inputs(torch_device)
+
+ for key in inputs.keys():
+ if key in self.batch_params:
+ if key == "image":
+ batched_images = []
+
+ for image in inputs[key]:
+ batched_images.append(batch_size * [image])
+
+ inputs[key] = batched_images
+ else:
+ inputs[key] = batch_size * [inputs[key]]
+
+ images = pipe(**inputs, num_images_per_prompt=num_images_per_prompt)[0]
+
+ assert images.shape[0] == batch_size * num_images_per_prompt
+
+ def test_inference_batch_single_identical(
+ self,
+ batch_size=3,
+ test_max_difference=None,
+ test_mean_pixel_difference=None,
+ relax_max_difference=False,
+ expected_max_diff=2e-3,
+ additional_params_copy_to_batched_inputs=["num_inference_steps"],
+ ):
+ if test_max_difference is None:
+ # TODO(Pedro): for reasons not yet understood, results are not reproducible on mps,
+ # so only verify that batched and non-batched outputs are identical on other devices
+ test_max_difference = torch_device != "mps"
+
+ if test_mean_pixel_difference is None:
+ # TODO same as above
+ test_mean_pixel_difference = torch_device != "mps"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ batched_inputs = {}
+ for name, value in inputs.items():
+ if name in self.batch_params:
+ # prompt is string
+ if name == "prompt":
+ len_prompt = len(value)
+ # make unequal batch sizes
+ batched_inputs[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+
+ # make last batch super long
+ batched_inputs[name][-1] = 100 * "very long"
+ elif name == "image":
+ batched_images = []
+
+ for image in value:
+ batched_images.append(batch_size * [image])
+
+ batched_inputs[name] = batched_images
+ else:
+ batched_inputs[name] = batch_size * [value]
+ elif name == "batch_size":
+ batched_inputs[name] = batch_size
+ elif name == "generator":
+ batched_inputs[name] = [self.get_generator(i) for i in range(batch_size)]
+ else:
+ batched_inputs[name] = value
+
+ for arg in additional_params_copy_to_batched_inputs:
+ batched_inputs[arg] = inputs[arg]
+
+ if self.pipeline_class.__name__ != "DanceDiffusionPipeline":
+ batched_inputs["output_type"] = "np"
+
+ output_batch = pipe(**batched_inputs)
+ assert output_batch[0].shape[0] == batch_size
+
+ inputs["generator"] = self.get_generator(0)
+
+ output = pipe(**inputs)
+
+ logger.setLevel(level=diffusers.logging.WARNING)
+ if test_max_difference:
+ if relax_max_difference:
+ # Taking the median of the largest differences
+ # is resilient to outliers
+ diff = np.abs(output_batch[0][0] - output[0][0])
+ diff = diff.flatten()
+ diff.sort()
+ max_diff = np.median(diff[-5:])
+ else:
+ max_diff = np.abs(output_batch[0][0] - output[0][0]).max()
+ assert max_diff < expected_max_diff
+
+ if test_mean_pixel_difference:
+ assert_mean_pixel_difference(output_batch[0][0], output[0][0])
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionAdapterPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_diffusion_adapter_color(self):
+ adapter_model = "TencentARC/t2iadapter_color_sd14v1"
+ sd_model = "CompVis/stable-diffusion-v1-4"
+ prompt = "snail"
+ image_url = (
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/color.png"
+ )
+ input_channels = 3
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_color_sd14v1.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_depth(self):
+ adapter_model = "TencentARC/t2iadapter_depth_sd14v1"
+ sd_model = "CompVis/stable-diffusion-v1-4"
+ prompt = "snail"
+ image_url = (
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/color.png"
+ )
+ input_channels = 3
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_color_sd14v1.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_depth_sd_v14(self):
+ adapter_model = "TencentARC/t2iadapter_depth_sd14v1"
+ sd_model = "CompVis/stable-diffusion-v1-4"
+ prompt = "desk"
+ image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/desk_depth.png"
+ input_channels = 3
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_depth_sd14v1.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_depth_sd_v15(self):
+ adapter_model = "TencentARC/t2iadapter_depth_sd15v2"
+ sd_model = "runwayml/stable-diffusion-v1-5"
+ prompt = "desk"
+ image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/desk_depth.png"
+ input_channels = 3
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_depth_sd15v2.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_keypose_sd_v14(self):
+ adapter_model = "TencentARC/t2iadapter_keypose_sd14v1"
+ sd_model = "CompVis/stable-diffusion-v1-4"
+ prompt = "person"
+ image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/person_keypose.png"
+ input_channels = 3
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_keypose_sd14v1.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_openpose_sd_v14(self):
+ adapter_model = "TencentARC/t2iadapter_openpose_sd14v1"
+ sd_model = "CompVis/stable-diffusion-v1-4"
+ prompt = "person"
+ image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/iron_man_pose.png"
+ input_channels = 3
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_openpose_sd14v1.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_seg_sd_v14(self):
+ adapter_model = "TencentARC/t2iadapter_seg_sd14v1"
+ sd_model = "CompVis/stable-diffusion-v1-4"
+ prompt = "motorcycle"
+ image_url = (
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/motor.png"
+ )
+ input_channels = 3
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_seg_sd14v1.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_zoedepth_sd_v15(self):
+ adapter_model = "TencentARC/t2iadapter_zoedepth_sd15v1"
+ sd_model = "runwayml/stable-diffusion-v1-5"
+ prompt = "motorcycle"
+ image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/motorcycle.png"
+ input_channels = 3
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_zoedepth_sd15v1.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_model_cpu_offload()
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_canny_sd_v14(self):
+ adapter_model = "TencentARC/t2iadapter_canny_sd14v1"
+ sd_model = "CompVis/stable-diffusion-v1-4"
+ prompt = "toy"
+ image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/toy_canny.png"
+ input_channels = 1
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_canny_sd14v1.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_canny_sd_v15(self):
+ adapter_model = "TencentARC/t2iadapter_canny_sd15v2"
+ sd_model = "runwayml/stable-diffusion-v1-5"
+ prompt = "toy"
+ image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/toy_canny.png"
+ input_channels = 1
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_canny_sd15v2.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_sketch_sd14(self):
+ adapter_model = "TencentARC/t2iadapter_sketch_sd14v1"
+ sd_model = "CompVis/stable-diffusion-v1-4"
+ prompt = "cat"
+ image_url = (
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/edge.png"
+ )
+ input_channels = 1
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_sketch_sd14v1.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_sketch_sd15(self):
+ adapter_model = "TencentARC/t2iadapter_sketch_sd15v2"
+ sd_model = "runwayml/stable-diffusion-v1-5"
+ prompt = "cat"
+ image_url = (
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/edge.png"
+ )
+ input_channels = 1
+ out_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/t2iadapter_sketch_sd15v2.npy"
+
+ image = load_image(image_url)
+ expected_out = load_numpy(out_url)
+ if input_channels == 1:
+ image = image.convert("L")
+
+ adapter = T2IAdapter.from_pretrained(adapter_model, torch_dtype=torch.float16)
+
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(sd_model, adapter=adapter, safety_checker=None)
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ out = pipe(prompt=prompt, image=image, generator=generator, num_inference_steps=2, output_type="np").images
+
+ max_diff = numpy_cosine_similarity_distance(out.flatten(), expected_out.flatten())
+ assert max_diff < 1e-2
+
+ def test_stable_diffusion_adapter_pipeline_with_sequential_cpu_offloading(self):
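+ # Measure peak GPU memory with attention slicing and sequential CPU offload enabled.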
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_seg_sd14v1")
+ pipe = StableDiffusionAdapterPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", adapter=adapter, safety_checker=None
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/motor.png"
+ )
+
+ pipe(prompt="foo", image=image, num_inference_steps=2)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ assert mem_bytes < 5 * 10**9
diff --git a/tests/pipelines/stable_diffusion_gligen/__init__.py b/tests/pipelines/stable_diffusion_gligen/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_gligen/test_stable_diffusion_gligen.py b/tests/pipelines/stable_diffusion_gligen/test_stable_diffusion_gligen.py
new file mode 100644
index 0000000..3b8383b
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_gligen/test_stable_diffusion_gligen.py
@@ -0,0 +1,162 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ EulerAncestralDiscreteScheduler,
+ StableDiffusionGLIGENPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import enable_full_determinism
+
+from ..pipeline_params import (
+ TEXT_TO_IMAGE_BATCH_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class GligenPipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionGLIGENPipeline
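+ # GLIGEN grounds generation with phrase/box pairs, so those keys are added to the usual text-to-image params.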
+ params = TEXT_TO_IMAGE_PARAMS | {"gligen_phrases", "gligen_boxes"}
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ attention_type="gated",
+ )
+ # unet.position_net = PositionNet(32,32)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
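+ # gligen_boxes are normalized [x0, y0, x1, y1] coordinates, paired one-to-one with gligen_phrases.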
+ inputs = {
+ "prompt": "A modern livingroom",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "gligen_phrases": ["a birthday cake"],
+ "gligen_boxes": [[0.2676, 0.6088, 0.4773, 0.7183]],
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_stable_diffusion_gligen_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionGLIGENPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5069, 0.5561, 0.4577, 0.4792, 0.5203, 0.4089, 0.5039, 0.4919, 0.4499])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_gligen_k_euler_ancestral(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionGLIGENPipeline(**components)
+ sd_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.425, 0.494, 0.429, 0.469, 0.525, 0.417, 0.533, 0.5, 0.47])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(batch_size=3, expected_max_diff=3e-3)
diff --git a/tests/pipelines/stable_diffusion_gligen_text_image/__init__.py b/tests/pipelines/stable_diffusion_gligen_text_image/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_gligen_text_image/test_stable_diffusion_gligen_text_image.py b/tests/pipelines/stable_diffusion_gligen_text_image/test_stable_diffusion_gligen_text_image.py
new file mode 100644
index 0000000..111e8d8
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_gligen_text_image/test_stable_diffusion_gligen_text_image.py
@@ -0,0 +1,192 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+from transformers import (
+ CLIPProcessor,
+ CLIPTextConfig,
+ CLIPTextModel,
+ CLIPTokenizer,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ EulerAncestralDiscreteScheduler,
+ StableDiffusionGLIGENTextImagePipeline,
+ UNet2DConditionModel,
+)
+from diffusers.pipelines.stable_diffusion import CLIPImageProjection
+from diffusers.utils import load_image
+from diffusers.utils.testing_utils import enable_full_determinism
+
+from ..pipeline_params import (
+ TEXT_TO_IMAGE_BATCH_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class GligenTextImagePipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionGLIGENTextImagePipeline
+ params = TEXT_TO_IMAGE_PARAMS | {"gligen_phrases", "gligen_images", "gligen_boxes"}
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ attention_type="gated-text-image",
+ )
+ # unet.position_net = PositionNet(32,32)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ image_encoder_config = CLIPVisionConfig(
+ hidden_size=32,
+ projection_dim=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ )
+ image_encoder = CLIPVisionModelWithProjection(image_encoder_config)
+ processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
+
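+ # CLIPImageProjection maps the CLIP vision embeddings into the text-embedding space used for the grounding tokens.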
+ image_project = CLIPImageProjection(hidden_size=32)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": image_encoder,
+ "image_project": image_project,
+ "processor": processor,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ gligen_images = load_image(
+ "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
+ )
+ inputs = {
+ "prompt": "A modern livingroom",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "gligen_phrases": ["a birthday cake"],
+ "gligen_images": [gligen_images],
+ "gligen_boxes": [[0.2676, 0.6088, 0.4773, 0.7183]],
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_stable_diffusion_gligen_text_image_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionGLIGENTextImagePipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5069, 0.5561, 0.4577, 0.4792, 0.5203, 0.4089, 0.5039, 0.4919, 0.4499])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_gligen_k_euler_ancestral(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionGLIGENTextImagePipeline(**components)
+ sd_pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.425, 0.494, 0.429, 0.469, 0.525, 0.417, 0.533, 0.5, 0.47])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(batch_size=3, expected_max_diff=3e-3)
diff --git a/tests/pipelines/stable_diffusion_image_variation/__init__.py b/tests/pipelines/stable_diffusion_image_variation/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_image_variation/test_stable_diffusion_image_variation.py b/tests/pipelines/stable_diffusion_image_variation/test_stable_diffusion_image_variation.py
new file mode 100644
index 0000000..4dd7de7
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_image_variation/test_stable_diffusion_image_variation.py
@@ -0,0 +1,330 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import CLIPImageProcessor, CLIPVisionConfig, CLIPVisionModelWithProjection
+
+from diffusers import (
+ AutoencoderKL,
+ DPMSolverMultistepScheduler,
+ PNDMScheduler,
+ StableDiffusionImageVariationPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import IMAGE_VARIATION_BATCH_PARAMS, IMAGE_VARIATION_PARAMS
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class StableDiffusionImageVariationPipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionImageVariationPipeline
+ params = IMAGE_VARIATION_PARAMS
+ batch_params = IMAGE_VARIATION_BATCH_PARAMS
+ image_params = frozenset([])
+ # TO-DO: update image_params once pipeline is refactored with VaeImageProcessor.preprocess
+ image_latents_params = frozenset([])
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ image_encoder_config = CLIPVisionConfig(
+ hidden_size=32,
+ projection_dim=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ image_size=32,
+ patch_size=4,
+ )
+ image_encoder = CLIPVisionModelWithProjection(image_encoder_config)
+ feature_extractor = CLIPImageProcessor(crop_size=32, size=32)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "image_encoder": image_encoder,
+ "feature_extractor": feature_extractor,
+ "safety_checker": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
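+ # Build a small random PIL image to use as the conditioning input.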
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed))
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ image = Image.fromarray(np.uint8(image)).convert("RGB").resize((32, 32))
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_img_variation_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionImageVariationPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5239, 0.5723, 0.4796, 0.5049, 0.5550, 0.4685, 0.5329, 0.4891, 0.4921])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_stable_diffusion_img_variation_multiple_images(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionImageVariationPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["image"] = 2 * [inputs["image"]]
+ output = sd_pipe(**inputs)
+
+ image = output.images
+
+ image_slice = image[-1, -3:, -3:, -1]
+
+ assert image.shape == (2, 64, 64, 3)
+ expected_slice = np.array([0.6892, 0.5637, 0.5836, 0.5771, 0.6254, 0.6409, 0.5580, 0.5569, 0.5289])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionImageVariationPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_imgvar/input_image_vermeer.png"
+ )
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "image": init_image,
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_stable_diffusion_img_variation_pipeline_default(self):
+ sd_pipe = StableDiffusionImageVariationPipeline.from_pretrained(
+ "lambdalabs/sd-image-variations-diffusers", safety_checker=None
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ generator_device = "cpu"
+ inputs = self.get_inputs(generator_device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.8449, 0.9079, 0.7571, 0.7873, 0.8348, 0.7010, 0.6694, 0.6873, 0.6138])
+
+ max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
+ assert max_diff < 1e-4
+
+ def test_stable_diffusion_img_variation_intermediate_state(self):
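+ # The callback inspects intermediate latents at steps 1 and 2 and counts how often it is invoked.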
+ number_of_steps = 0
+
+ def callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> None:
+ callback_fn.has_been_called = True
+ nonlocal number_of_steps
+ number_of_steps += 1
+ if step == 1:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array([-0.7974, -0.4343, -1.087, 0.04785, -1.327, 0.855, -2.148, -0.1725, 1.439])
+ max_diff = numpy_cosine_similarity_distance(latents_slice.flatten(), expected_slice)
+
+ assert max_diff < 1e-3
+
+ elif step == 2:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 64)
+ latents_slice = latents[0, -3:, -3:, -1]
+ expected_slice = np.array([0.3232, 0.004883, 0.913, -1.084, 0.6143, -1.6875, -2.463, -0.439, -0.419])
+ max_diff = numpy_cosine_similarity_distance(latents_slice.flatten(), expected_slice)
+
+ assert max_diff < 1e-3
+
+ callback_fn.has_been_called = False
+
+ pipe = StableDiffusionImageVariationPipeline.from_pretrained(
+ "lambdalabs/sd-image-variations-diffusers",
+ safety_checker=None,
+ torch_dtype=torch.float16,
+ )
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ generator_device = "cpu"
+ inputs = self.get_inputs(generator_device, dtype=torch.float16)
+ pipe(**inputs, callback=callback_fn, callback_steps=1)
+ assert callback_fn.has_been_called
+ assert number_of_steps == inputs["num_inference_steps"]
+
+ def test_stable_diffusion_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = StableDiffusionImageVariationPipeline.from_pretrained(
+ "lambdalabs/sd-image-variations-diffusers", safety_checker=None, torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_inputs(torch_device, dtype=torch.float16)
+ _ = pipe(**inputs)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 2.6 GB is allocated
+ assert mem_bytes < 2.6 * 10**9
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionImageVariationPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_imgvar/input_image_vermeer.png"
+ )
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "image": init_image,
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 50,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_img_variation_pndm(self):
+ sd_pipe = StableDiffusionImageVariationPipeline.from_pretrained("fusing/sd-image-variations-diffusers")
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_imgvar/lambdalabs_variations_pndm.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
+
+ def test_img_variation_dpm(self):
+ sd_pipe = StableDiffusionImageVariationPipeline.from_pretrained("fusing/sd-image-variations-diffusers")
+ sd_pipe.scheduler = DPMSolverMultistepScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ inputs["num_inference_steps"] = 25
+ image = sd_pipe(**inputs).images[0]
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_imgvar/lambdalabs_variations_dpm_multi.npy"
+ )
+ max_diff = np.abs(expected_image - image).max()
+ assert max_diff < 1e-3
diff --git a/tests/pipelines/stable_diffusion_k_diffusion/__init__.py b/tests/pipelines/stable_diffusion_k_diffusion/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_k_diffusion/test_stable_diffusion_k_diffusion.py b/tests/pipelines/stable_diffusion_k_diffusion/test_stable_diffusion_k_diffusion.py
new file mode 100644
index 0000000..65b4f23
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_k_diffusion/test_stable_diffusion_k_diffusion.py
@@ -0,0 +1,135 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import StableDiffusionKDiffusionPipeline
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, require_torch_gpu, torch_device
+
+
+enable_full_determinism()
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_diffusion_1(self):
+ sd_pipe = StableDiffusionKDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
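+ # Select the k-diffusion sampler by its k-diffusion name (here, Euler).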
+ sd_pipe.set_scheduler("sample_euler")
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=9.0, num_inference_steps=20, output_type="np")
+
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.0447, 0.0492, 0.0468, 0.0408, 0.0383, 0.0408, 0.0354, 0.0380, 0.0339])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_2(self):
+ sd_pipe = StableDiffusionKDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ sd_pipe.set_scheduler("sample_euler")
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=9.0, num_inference_steps=20, output_type="np")
+
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.1237, 0.1320, 0.1438, 0.1359, 0.1390, 0.1132, 0.1277, 0.1175, 0.1112])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-1
+
+ def test_stable_diffusion_karras_sigmas(self):
+ sd_pipe = StableDiffusionKDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ sd_pipe.set_scheduler("sample_dpmpp_2m")
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+ output = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=7.5,
+ num_inference_steps=15,
+ output_type="np",
+ use_karras_sigmas=True,
+ )
+
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array(
+ [0.11381689, 0.12112921, 0.1389457, 0.12549606, 0.1244964, 0.10831517, 0.11562866, 0.10867816, 0.10499048]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_noise_sampler_seed(self):
+ sd_pipe = StableDiffusionKDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ sd_pipe.set_scheduler("sample_dpmpp_sde")
+
+ prompt = "A painting of a squirrel eating a burger"
+ seed = 0
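+ # The DPM++ SDE sampler draws extra noise during sampling; seeding both the generator and the noise sampler should make the two runs identical.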
+ images1 = sd_pipe(
+ [prompt],
+ generator=torch.manual_seed(seed),
+ noise_sampler_seed=seed,
+ guidance_scale=9.0,
+ num_inference_steps=20,
+ output_type="np",
+ ).images
+ images2 = sd_pipe(
+ [prompt],
+ generator=torch.manual_seed(seed),
+ noise_sampler_seed=seed,
+ guidance_scale=9.0,
+ num_inference_steps=20,
+ output_type="np",
+ ).images
+
+ assert images1.shape == (1, 512, 512, 3)
+ assert images2.shape == (1, 512, 512, 3)
+ assert np.abs(images1.flatten() - images2.flatten()).max() < 1e-2
diff --git a/tests/pipelines/stable_diffusion_ldm3d/__init__.py b/tests/pipelines/stable_diffusion_ldm3d/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_ldm3d/test_stable_diffusion_ldm3d.py b/tests/pipelines/stable_diffusion_ldm3d/test_stable_diffusion_ldm3d.py
new file mode 100644
index 0000000..9ac69c8
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_ldm3d/test_stable_diffusion_ldm3d.py
@@ -0,0 +1,310 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ PNDMScheduler,
+ StableDiffusionLDM3DPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, require_torch_gpu, torch_device
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
+
+
+enable_full_determinism()
+
+
+class StableDiffusionLDM3DPipelineFastTests(unittest.TestCase):
+ pipeline_class = StableDiffusionLDM3DPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
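+ # LDM3D's VAE works on 6 channels: 3 for RGB and 3 for the depth map.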
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=6,
+ out_channels=6,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_ddim(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components()
+ ldm3d_pipe = StableDiffusionLDM3DPipeline(**components)
+ ldm3d_pipe = ldm3d_pipe.to(torch_device)
+ ldm3d_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = ldm3d_pipe(**inputs)
+ rgb, depth = output.rgb, output.depth
+
+ image_slice_rgb = rgb[0, -3:, -3:, -1]
+ image_slice_depth = depth[0, -3:, -1]
+
+ assert rgb.shape == (1, 64, 64, 3)
+ assert depth.shape == (1, 64, 64)
+
+ expected_slice_rgb = np.array(
+ [0.37338176, 0.70247, 0.74203193, 0.51643604, 0.58256793, 0.60932136, 0.4181095, 0.48355877, 0.46535262]
+ )
+ expected_slice_depth = np.array([103.46727, 85.812004, 87.849236])
+
+ assert np.abs(image_slice_rgb.flatten() - expected_slice_rgb).max() < 1e-2
+ assert np.abs(image_slice_depth.flatten() - expected_slice_depth).max() < 1e-2
+
+ def test_stable_diffusion_prompt_embeds(self):
+ components = self.get_dummy_components()
+ ldm3d_pipe = StableDiffusionLDM3DPipeline(**components)
+ ldm3d_pipe = ldm3d_pipe.to(torch_device)
+ ldm3d_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ # forward pass with the plain text prompt
+ output = ldm3d_pipe(**inputs)
+ rgb_slice_1, depth_slice_1 = output.rgb, output.depth
+ rgb_slice_1 = rgb_slice_1[0, -3:, -3:, -1]
+ depth_slice_1 = depth_slice_1[0, -3:, -1]
+
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ text_inputs = ldm3d_pipe.tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=ldm3d_pipe.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_inputs = text_inputs["input_ids"].to(torch_device)
+
+ prompt_embeds = ldm3d_pipe.text_encoder(text_inputs)[0]
+
+ inputs["prompt_embeds"] = prompt_embeds
+
+ # forward pass with precomputed prompt embeddings
+ output = ldm3d_pipe(**inputs)
+ rgb_slice_2, depth_slice_2 = output.rgb, output.depth
+ rgb_slice_2 = rgb_slice_2[0, -3:, -3:, -1]
+ depth_slice_2 = depth_slice_2[0, -3:, -1]
+
+ assert np.abs(rgb_slice_1.flatten() - rgb_slice_2.flatten()).max() < 1e-4
+ assert np.abs(depth_slice_1.flatten() - depth_slice_2.flatten()).max() < 1e-4
+
+ def test_stable_diffusion_negative_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+ ldm3d_pipe = StableDiffusionLDM3DPipeline(**components)
+ ldm3d_pipe = ldm3d_pipe.to(device)
+ ldm3d_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ negative_prompt = "french fries"
+ output = ldm3d_pipe(**inputs, negative_prompt=negative_prompt)
+
+ rgb, depth = output.rgb, output.depth
+ rgb_slice = rgb[0, -3:, -3:, -1]
+ depth_slice = depth[0, -3:, -1]
+
+ assert rgb.shape == (1, 64, 64, 3)
+ assert depth.shape == (1, 64, 64)
+
+ expected_slice_rgb = np.array(
+ [0.37044, 0.71811503, 0.7223251, 0.48603675, 0.5638391, 0.6364948, 0.42833704, 0.4901315, 0.47926217]
+ )
+ expected_slice_depth = np.array([107.84738, 84.62802, 89.962135])
+ assert np.abs(rgb_slice.flatten() - expected_slice_rgb).max() < 1e-2
+ assert np.abs(depth_slice.flatten() - expected_slice_depth).max() < 1e-2
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionLDM3DPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "a photograph of an astronaut riding a horse",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_ldm3d_stable_diffusion(self):
+ ldm3d_pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d")
+ ldm3d_pipe = ldm3d_pipe.to(torch_device)
+ ldm3d_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ output = ldm3d_pipe(**inputs)
+ rgb, depth = output.rgb, output.depth
+ rgb_slice = rgb[0, -3:, -3:, -1].flatten()
+ depth_slice = rgb[0, -3:, -1].flatten()
+
+ assert rgb.shape == (1, 512, 512, 3)
+ assert depth.shape == (1, 512, 512)
+
+ expected_slice_rgb = np.array(
+ [0.53805465, 0.56707305, 0.5486515, 0.57012236, 0.5814511, 0.56253487, 0.54843014, 0.55092263, 0.6459706]
+ )
+ expected_slice_depth = np.array(
+ [0.9263781, 0.6678672, 0.5486515, 0.92202145, 0.67831135, 0.56253487, 0.9241694, 0.7551478, 0.6459706]
+ )
+ assert np.abs(rgb_slice - expected_slice_rgb).max() < 3e-3
+ assert np.abs(depth_slice - expected_slice_depth).max() < 3e-3
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+ generator = torch.Generator(device=generator_device).manual_seed(seed)
+ latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+ latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+ inputs = {
+ "prompt": "a photograph of an astronaut riding a horse",
+ "latents": latents,
+ "generator": generator,
+ "num_inference_steps": 50,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_ldm3d(self):
+ ldm3d_pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d").to(torch_device)
+ ldm3d_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ output = ldm3d_pipe(**inputs)
+ rgb, depth = output.rgb, output.depth
+
+ expected_rgb_mean = 0.495586
+ expected_rgb_std = 0.33795515
+ expected_depth_mean = 112.48518
+ expected_depth_std = 98.489746
+ assert np.abs(expected_rgb_mean - rgb.mean()) < 1e-3
+ assert np.abs(expected_rgb_std - rgb.std()) < 1e-3
+ assert np.abs(expected_depth_mean - depth.mean()) < 1e-3
+ assert np.abs(expected_depth_std - depth.std()) < 1e-3
+
+ def test_ldm3d_v2(self):
+ ldm3d_pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-4c").to(torch_device)
+ ldm3d_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_inputs(torch_device)
+ output = ldm3d_pipe(**inputs)
+ rgb, depth = output.rgb, output.depth
+
+ expected_rgb_mean = 0.4194127
+ expected_rgb_std = 0.35375586
+ expected_depth_mean = 0.5638502
+ expected_depth_std = 0.34686103
+
+ assert rgb.shape == (1, 512, 512, 3)
+ assert depth.shape == (1, 512, 512, 1)
+ assert np.abs(expected_rgb_mean - rgb.mean()) < 1e-3
+ assert np.abs(expected_rgb_std - rgb.std()) < 1e-3
+ assert np.abs(expected_depth_mean - depth.mean()) < 1e-3
+ assert np.abs(expected_depth_std - depth.std()) < 1e-3
diff --git a/tests/pipelines/stable_diffusion_panorama/__init__.py b/tests/pipelines/stable_diffusion_panorama/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_panorama/test_stable_diffusion_panorama.py b/tests/pipelines/stable_diffusion_panorama/test_stable_diffusion_panorama.py
new file mode 100644
index 0000000..aa7212b
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_panorama/test_stable_diffusion_panorama.py
@@ -0,0 +1,412 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ EulerAncestralDiscreteScheduler,
+ LMSDiscreteScheduler,
+ PNDMScheduler,
+ StableDiffusionPanoramaPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, require_torch_gpu, skip_mps, torch_device
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+@skip_mps
+class StableDiffusionPanoramaPipelineFastTests(PipelineLatentTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = StableDiffusionPanoramaPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=1,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ scheduler = DDIMScheduler()
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ generator = torch.manual_seed(seed)
+ inputs = {
+ "prompt": "a photo of the dolomites",
+ "generator": generator,
+ # Setting height and width to None to prevent OOMs on CPU.
+ "height": None,
+ "width": None,
+ "num_inference_steps": 1,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_panorama_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPanoramaPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.6186, 0.5374, 0.4915, 0.4135, 0.4114, 0.4563, 0.5128, 0.4977, 0.4757])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_panorama_circular_padding_case(self):
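+ # circular_padding wraps the latent horizontally so the panorama's left and right edges line up.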
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPanoramaPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs, circular_padding=True).images
+ image_slice = image[0, -3:, -3:, -1]
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.6127, 0.6299, 0.4595, 0.4051, 0.4543, 0.3925, 0.5510, 0.5693, 0.5031])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ # Overridden to speed up the overall test run.
+ def test_inference_batch_consistent(self):
+ super().test_inference_batch_consistent(batch_sizes=[1, 2])
+
+ # Overridden to speed up the overall test run.
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(batch_size=2, expected_max_diff=5.0e-3)
+
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=1e-1)
+
+ def test_stable_diffusion_panorama_negative_prompt(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPanoramaPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ negative_prompt = "french fries"
+ output = sd_pipe(**inputs, negative_prompt=negative_prompt)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.6187, 0.5375, 0.4915, 0.4136, 0.4114, 0.4563, 0.5128, 0.4976, 0.4757])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_panorama_views_batch(self):
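+ # view_batch_size controls how many panorama views are denoised per forward pass; the output should match the default case.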
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPanoramaPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs, view_batch_size=2)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.6187, 0.5375, 0.4915, 0.4136, 0.4114, 0.4563, 0.5128, 0.4976, 0.4757])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_panorama_views_batch_circular_padding(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionPanoramaPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs, circular_padding=True, view_batch_size=2)
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.6127, 0.6299, 0.4595, 0.4051, 0.4543, 0.3925, 0.5510, 0.5693, 0.5031])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_panorama_euler(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = EulerAncestralDiscreteScheduler(
+ beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear"
+ )
+ sd_pipe = StableDiffusionPanoramaPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.4024, 0.6510, 0.4901, 0.5378, 0.5813, 0.5622, 0.4795, 0.4467, 0.4952])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_panorama_pndm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ components["scheduler"] = PNDMScheduler(
+ beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", skip_prk_steps=True
+ )
+ sd_pipe = StableDiffusionPanoramaPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.6391, 0.6291, 0.4861, 0.5134, 0.5552, 0.4578, 0.5032, 0.5023, 0.4539])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionPanoramaNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, seed=0):
+ generator = torch.manual_seed(seed)
+ inputs = {
+ "prompt": "a photo of the dolomites",
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 7.5,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_panorama_default(self):
+ model_ckpt = "stabilityai/stable-diffusion-2-base"
+ scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
+ pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, safety_checker=None)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 2048, 3)
+
+ expected_slice = np.array(
+ [
+ 0.36968392,
+ 0.27025372,
+ 0.32446766,
+ 0.28379387,
+ 0.36363274,
+ 0.30733347,
+ 0.27100027,
+ 0.27054125,
+ 0.25536096,
+ ]
+ )
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-2
+
+ def test_stable_diffusion_panorama_k_lms(self):
+ pipe = StableDiffusionPanoramaPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-2-base", safety_checker=None
+ )
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+ assert image.shape == (1, 512, 2048, 3)
+
+        expected_slice = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
+
+ assert np.abs(expected_slice - image_slice).max() < 1e-2
+
+ def test_stable_diffusion_panorama_intermediate_state(self):
+ number_of_steps = 0
+
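+        # Records the intermediate latents passed to the pipeline callback and checks a
+        # slice of them at steps 1 and 2; the latent width of 256 corresponds to the
+        # default 2048-pixel panorama width divided by the VAE downscale factor of 8.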
+ def callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> None:
+ callback_fn.has_been_called = True
+ nonlocal number_of_steps
+ number_of_steps += 1
+ if step == 1:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 256)
+ latents_slice = latents[0, -3:, -3:, -1]
+
+ expected_slice = np.array(
+ [
+ 0.18681869,
+ 0.33907816,
+ 0.5361276,
+ 0.14432865,
+ -0.02856611,
+ -0.73941123,
+ 0.23397987,
+ 0.47322682,
+ -0.37823164,
+ ]
+ )
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+ elif step == 2:
+ latents = latents.detach().cpu().numpy()
+ assert latents.shape == (1, 4, 64, 256)
+ latents_slice = latents[0, -3:, -3:, -1]
+
+ expected_slice = np.array(
+ [
+ 0.18539645,
+ 0.33987248,
+ 0.5378559,
+ 0.14437142,
+ -0.02455261,
+ -0.7338317,
+ 0.23990755,
+ 0.47356272,
+ -0.3786505,
+ ]
+ )
+
+ assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
+
+ callback_fn.has_been_called = False
+
+ model_ckpt = "stabilityai/stable-diffusion-2-base"
+ scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
+ pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, safety_checker=None)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs()
+ pipe(**inputs, callback=callback_fn, callback_steps=1)
+ assert callback_fn.has_been_called
+ assert number_of_steps == 3
+
+ def test_stable_diffusion_panorama_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ model_ckpt = "stabilityai/stable-diffusion-2-base"
+ scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
+ pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, safety_checker=None)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
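+        # attention slicing plus sequential CPU offload keep only small parts of the
+        # model on the GPU at any one time; the peak-memory assertion below relies on this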
+ pipe.enable_attention_slicing(1)
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_inputs()
+ _ = pipe(**inputs)
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+        # make sure that less than 5.5 GB is allocated
+ assert mem_bytes < 5.5 * 10**9
diff --git a/tests/pipelines/stable_diffusion_safe/__init__.py b/tests/pipelines/stable_diffusion_safe/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_safe/test_safe_diffusion.py b/tests/pipelines/stable_diffusion_safe/test_safe_diffusion.py
new file mode 100644
index 0000000..478e465
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_safe/test_safe_diffusion.py
@@ -0,0 +1,435 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import AutoencoderKL, DDIMScheduler, LMSDiscreteScheduler, PNDMScheduler, UNet2DConditionModel
+from diffusers.pipelines.stable_diffusion_safe import StableDiffusionPipelineSafe as StableDiffusionPipeline
+from diffusers.utils.testing_utils import floats_tensor, nightly, require_torch_gpu, torch_device
+
+
+class SafeDiffusionPipelineFastTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ @property
+ def dummy_image(self):
+ batch_size = 1
+ num_channels = 3
+ sizes = (32, 32)
+
+ image = floats_tensor((batch_size, num_channels) + sizes, rng=random.Random(0)).to(torch_device)
+ return image
+
+ @property
+ def dummy_cond_unet(self):
+ torch.manual_seed(0)
+ model = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ return model
+
+ @property
+ def dummy_vae(self):
+ torch.manual_seed(0)
+ model = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ return model
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModel(config)
+
+ @property
+ def dummy_extractor(self):
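+        # Stub feature extractor: returns an object exposing empty `pixel_values` and a
+        # chainable `to()` so the pipeline can run without a real CLIP feature extractor.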
+ def extract(*args, **kwargs):
+ class Out:
+ def __init__(self):
+ self.pixel_values = torch.ones([0])
+
+ def to(self, device):
+                    self.pixel_values = self.pixel_values.to(device)
+ return self
+
+ return Out()
+
+ return extract
+
+ def test_safe_diffusion_ddim(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unet = self.dummy_cond_unet
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+        # assemble the Safe Stable Diffusion pipeline from the dummy components above
+ sd_pipe = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=6.0, num_inference_steps=2, output_type="np")
+ image = output.images
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_from_tuple = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=6.0,
+ num_inference_steps=2,
+ output_type="np",
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5756, 0.6118, 0.5005, 0.5041, 0.5471, 0.4726, 0.4976, 0.4865, 0.4864])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_pndm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unet = self.dummy_cond_unet
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ # make sure here that pndm scheduler skips prk
+ sd_pipe = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = sd_pipe([prompt], generator=generator, guidance_scale=6.0, num_inference_steps=2, output_type="np")
+
+ image = output.images
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_from_tuple = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=6.0,
+ num_inference_steps=2,
+ output_type="np",
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5125, 0.5716, 0.4828, 0.5060, 0.5650, 0.4768, 0.5185, 0.4895, 0.4993])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_no_safety_checker(self):
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-lms-pipe", safety_checker=None
+ )
+ assert isinstance(pipe, StableDiffusionPipeline)
+ assert isinstance(pipe.scheduler, LMSDiscreteScheduler)
+ assert pipe.safety_checker is None
+
+ image = pipe("example prompt", num_inference_steps=2).images[0]
+ assert image is not None
+
+ # check that there's no error when saving a pipeline with one of the models being None
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname)
+ pipe = StableDiffusionPipeline.from_pretrained(tmpdirname)
+
+ # sanity check that the pipeline still works
+ assert pipe.safety_checker is None
+ image = pipe("example prompt", num_inference_steps=2).images[0]
+ assert image is not None
+
+ @unittest.skipIf(torch_device != "cuda", "This test requires a GPU")
+ def test_stable_diffusion_fp16(self):
+ """Test that stable diffusion works with fp16"""
+ unet = self.dummy_cond_unet
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ # put models in fp16
+ unet = unet.half()
+ vae = vae.half()
+ bert = bert.half()
+
+ # make sure here that pndm scheduler skips prk
+ sd_pipe = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "A painting of a squirrel eating a burger"
+ image = sd_pipe([prompt], num_inference_steps=2, output_type="np").images
+
+ assert image.shape == (1, 64, 64, 3)
+
+
+@nightly
+@require_torch_gpu
+class SafeDiffusionPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_harm_safe_stable_diffusion(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", safety_checker=None)
+ sd_pipe.scheduler = LMSDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = (
+ "portrait of girl with smokey eyes makeup in abandoned hotel, grange clothes, redshift, wide high angle"
+ " coloured polaroid photograph with flash, kodak film, hyper real, stunning moody cinematography, with"
+ " anamorphic lenses, by maripol, fallen angels by wong kar - wai, style of suspiria and neon demon and"
+ " children from bahnhof zoo, detailed "
+ )
+ seed = 4003660346
+ guidance_scale = 7
+
+ # without safety guidance (sld_guidance_scale = 0)
+ generator = torch.manual_seed(seed)
+ output = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ sld_guidance_scale=0,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [0.2278, 0.2231, 0.2249, 0.2333, 0.2303, 0.1885, 0.2273, 0.2144, 0.2176]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+        # with safety guidance enabled (strong configuration)
+ generator = torch.manual_seed(seed)
+ output = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ sld_guidance_scale=2000,
+ sld_warmup_steps=7,
+ sld_threshold=0.025,
+ sld_momentum_scale=0.5,
+ sld_mom_beta=0.7,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [0.2383, 0.2276, 0.236, 0.2192, 0.2186, 0.2053, 0.1971, 0.1901, 0.1719]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_nudity_safe_stable_diffusion(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", safety_checker=None)
+ sd_pipe.scheduler = LMSDiscreteScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "padme amidala taking a bath artwork, safe for work, no nudity"
+ seed = 2734971755
+ guidance_scale = 7
+
+ generator = torch.manual_seed(seed)
+ output = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ sld_guidance_scale=0,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [0.3502, 0.3622, 0.3396, 0.3642, 0.3478, 0.3318, 0.35, 0.3348, 0.3297]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ generator = torch.manual_seed(seed)
+ output = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ sld_guidance_scale=2000,
+ sld_warmup_steps=7,
+ sld_threshold=0.025,
+ sld_momentum_scale=0.5,
+ sld_mom_beta=0.7,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = [0.5531, 0.5206, 0.4895, 0.5156, 0.5182, 0.4751, 0.4802, 0.4803, 0.4443]
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_nudity_safetychecker_safe_stable_diffusion(self):
+ sd_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = (
+ "the four horsewomen of the apocalypse, painting by tom of finland, gaston bussiere, craig mullins, j. c."
+ " leyendecker"
+ )
+ seed = 1044355234
+ guidance_scale = 12
+
+ generator = torch.manual_seed(seed)
+ output = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ sld_guidance_scale=0,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
+
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-7
+
+ generator = torch.manual_seed(seed)
+ output = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=guidance_scale,
+ num_inference_steps=50,
+ output_type="np",
+ width=512,
+ height=512,
+ sld_guidance_scale=2000,
+ sld_warmup_steps=7,
+ sld_threshold=0.025,
+ sld_momentum_scale=0.5,
+ sld_mom_beta=0.7,
+ )
+
+ image = output.images
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.5818, 0.6285, 0.6835, 0.6019, 0.625, 0.6754, 0.6096, 0.6334, 0.6561])
+ assert image.shape == (1, 512, 512, 3)
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
diff --git a/tests/pipelines/stable_diffusion_sag/__init__.py b/tests/pipelines/stable_diffusion_sag/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_sag/test_stable_diffusion_sag.py b/tests/pipelines/stable_diffusion_sag/test_stable_diffusion_sag.py
new file mode 100644
index 0000000..94a5616
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_sag/test_stable_diffusion_sag.py
@@ -0,0 +1,215 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ DEISMultistepScheduler,
+ DPMSolverMultistepScheduler,
+ EulerDiscreteScheduler,
+ StableDiffusionSAGPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, require_torch_gpu, torch_device
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class StableDiffusionSAGPipelineFastTests(PipelineLatentTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = StableDiffusionSAGPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=2,
+ sample_size=8,
+ norm_num_groups=1,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=8,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[4, 8],
+ norm_num_groups=1,
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=8,
+ num_hidden_layers=2,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
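+        # on mps the test seeds the global RNG instead of creating a device-bound
+        # generator, which keeps the dummy inputs reproducible on that backend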
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
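+        # `sag_scale` sets the strength of self-attention guidance, applied on top of
+        # the classifier-free guidance controlled by `guidance_scale`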
+ inputs = {
+ "prompt": ".",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 1.0,
+ "sag_scale": 1.0,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ @unittest.skip("Not necessary to test here.")
+ def test_xformers_attention_forwardGenerator_pass(self):
+ pass
+
+ def test_pipeline_different_schedulers(self):
+ pipeline = self.pipeline_class(**self.get_dummy_components())
+ inputs = self.get_dummy_inputs("cpu")
+
+ expected_image_size = (16, 16, 3)
+ for scheduler_cls in [DDIMScheduler, DEISMultistepScheduler, DPMSolverMultistepScheduler]:
+ pipeline.scheduler = scheduler_cls.from_config(pipeline.scheduler.config)
+ image = pipeline(**inputs).images[0]
+
+ shape = image.shape
+ assert shape == expected_image_size
+
+ pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
+
+ with self.assertRaises(ValueError):
+ # Karras schedulers are not supported
+ image = pipeline(**inputs).images[0]
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_diffusion_1(self):
+ sag_pipe = StableDiffusionSAGPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+ sag_pipe = sag_pipe.to(torch_device)
+ sag_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "."
+ generator = torch.manual_seed(0)
+ output = sag_pipe(
+ [prompt], generator=generator, guidance_scale=7.5, sag_scale=1.0, num_inference_steps=20, output_type="np"
+ )
+
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.1568, 0.1738, 0.1695, 0.1693, 0.1507, 0.1705, 0.1547, 0.1751, 0.1949])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-2
+
+ def test_stable_diffusion_2(self):
+ sag_pipe = StableDiffusionSAGPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
+ sag_pipe = sag_pipe.to(torch_device)
+ sag_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "."
+ generator = torch.manual_seed(0)
+ output = sag_pipe(
+ [prompt], generator=generator, guidance_scale=7.5, sag_scale=1.0, num_inference_steps=20, output_type="np"
+ )
+
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.3459, 0.2876, 0.2537, 0.3002, 0.2671, 0.2160, 0.3026, 0.2262, 0.2371])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-2
+
+ def test_stable_diffusion_2_non_square(self):
+ sag_pipe = StableDiffusionSAGPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
+ sag_pipe = sag_pipe.to(torch_device)
+ sag_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "."
+ generator = torch.manual_seed(0)
+ output = sag_pipe(
+ [prompt],
+ width=768,
+ height=512,
+ generator=generator,
+ guidance_scale=7.5,
+ sag_scale=1.0,
+ num_inference_steps=20,
+ output_type="np",
+ )
+
+ image = output.images
+
+ assert image.shape == (1, 512, 768, 3)
diff --git a/tests/pipelines/stable_diffusion_xl/__init__.py b/tests/pipelines/stable_diffusion_xl/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py
new file mode 100644
index 0000000..a9acebb
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py
@@ -0,0 +1,1102 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import copy
+import gc
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ DPMSolverMultistepScheduler,
+ EulerDiscreteScheduler,
+ HeunDiscreteScheduler,
+ LCMScheduler,
+ StableDiffusionXLImg2ImgPipeline,
+ StableDiffusionXLPipeline,
+ UNet2DConditionModel,
+ UniPCMultistepScheduler,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_image,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ TEXT_TO_IMAGE_BATCH_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+ TEXT_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_TO_IMAGE_PARAMS,
+)
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+ SDFunctionTesterMixin,
+ SDXLOptionalComponentsTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+class StableDiffusionXLPipelineFastTests(
+ SDFunctionTesterMixin,
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+ SDXLOptionalComponentsTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionXLPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS.union({"add_text_embeds", "add_time_ids"})
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(2, 4),
+ layers_per_block=2,
+ time_cond_proj_dim=time_cond_proj_dim,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ norm_num_groups=1,
+ )
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "image_encoder": None,
+ "feature_extractor": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 5.0,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_stable_diffusion_xl_euler(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5552, 0.5569, 0.4725, 0.4348, 0.4994, 0.4632, 0.5142, 0.5012, 0.47])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_xl_euler_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4917, 0.6555, 0.4348, 0.5219, 0.7324, 0.4855, 0.5168, 0.5447, 0.5156])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_xl_euler_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
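+        # passing explicit `timesteps` instead of `num_inference_steps` exercises the
+        # custom-timestep path used here with LCMScheduler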
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.4917, 0.6555, 0.4348, 0.5219, 0.7324, 0.4855, 0.5168, 0.5447, 0.5156])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_xl_prompt_embeds(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components)
+        sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # forward without prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt"] = 2 * [inputs["prompt"]]
+ inputs["num_images_per_prompt"] = 2
+
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ prompt = 2 * [inputs.pop("prompt")]
+
+ (
+ prompt_embeds,
+ negative_prompt_embeds,
+ pooled_prompt_embeds,
+ negative_pooled_prompt_embeds,
+ ) = sd_pipe.encode_prompt(prompt)
+
+ output = sd_pipe(
+ **inputs,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_prompt_embeds,
+ pooled_prompt_embeds=pooled_prompt_embeds,
+ negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
+ )
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # make sure that it's equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_stable_diffusion_xl_negative_prompt_embeds(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components)
+        sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # forward without prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ prompt = 3 * [inputs.pop("prompt")]
+
+ (
+ prompt_embeds,
+ negative_prompt_embeds,
+ pooled_prompt_embeds,
+ negative_pooled_prompt_embeds,
+ ) = sd_pipe.encode_prompt(prompt, negative_prompt=negative_prompt)
+
+ output = sd_pipe(
+ **inputs,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_prompt_embeds,
+ pooled_prompt_embeds=pooled_prompt_embeds,
+ negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
+ )
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # make sure that it's equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+ @require_torch_gpu
+ def test_stable_diffusion_xl_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
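+            # reset to the default attention processor so all three offloading variants
+            # produce directly comparable outputs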
+ pipe.unet.set_default_attn_processor()
+
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_stable_diffusion_xl_img2img_prompt_embeds_only(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # forward without prompt embeds
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with prompt embeds
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ (
+ prompt_embeds,
+ _,
+ pooled_prompt_embeds,
+ _,
+ ) = sd_pipe.encode_prompt(prompt)
+
+ output = sd_pipe(
+ **inputs,
+ prompt_embeds=prompt_embeds,
+ pooled_prompt_embeds=pooled_prompt_embeds,
+ )
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # make sure that it's equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_stable_diffusion_two_xl_mixture_of_denoiser_fast(self):
+ components = self.get_dummy_components()
+ pipe_1 = StableDiffusionXLPipeline(**components).to(torch_device)
+ pipe_1.unet.set_default_attn_processor()
+ pipe_2 = StableDiffusionXLImg2ImgPipeline(**components).to(torch_device)
+ pipe_2.unet.set_default_attn_processor()
+
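+        # helper: split denoising between the text-to-image and img2img pipelines via
+        # `denoising_end` / `denoising_start` and verify that each pipeline runs exactly
+        # the expected subset of scheduler timesteps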
+ def assert_run_mixture(
+ num_steps,
+ split,
+ scheduler_cls_orig,
+ expected_tss,
+ num_train_timesteps=pipe_1.scheduler.config.num_train_timesteps,
+ ):
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = num_steps
+
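+            # subclass the scheduler so that patching `step` below does not mutate the
+            # original scheduler class used elsewhere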
+ class scheduler_cls(scheduler_cls_orig):
+ pass
+
+ pipe_1.scheduler = scheduler_cls.from_config(pipe_1.scheduler.config)
+ pipe_2.scheduler = scheduler_cls.from_config(pipe_2.scheduler.config)
+
+ # Let's retrieve the number of timesteps we want to use
+ pipe_1.scheduler.set_timesteps(num_steps)
+ expected_steps = pipe_1.scheduler.timesteps.tolist()
+
+ if pipe_1.scheduler.order == 2:
+ expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss))
+ expected_steps_2 = expected_steps_1[-1:] + list(filter(lambda ts: ts < split, expected_tss))
+ expected_steps = expected_steps_1 + expected_steps_2
+ else:
+ expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss))
+ expected_steps_2 = list(filter(lambda ts: ts < split, expected_tss))
+
+            # monkey patch the scheduler's `step` method so that every timestep it
+            # receives is recorded in `done_steps` for verification
+ done_steps = []
+ old_step = copy.copy(scheduler_cls.step)
+
+ def new_step(self, *args, **kwargs):
+ done_steps.append(args[1].cpu().item()) # args[1] is always the passed `t`
+ return old_step(self, *args, **kwargs)
+
+ scheduler_cls.step = new_step
+
+ inputs_1 = {
+ **inputs,
+ **{
+ "denoising_end": 1.0 - (split / num_train_timesteps),
+ "output_type": "latent",
+ },
+ }
+ latents = pipe_1(**inputs_1).images[0]
+
+ assert expected_steps_1 == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ inputs_2 = {
+ **inputs,
+ **{
+ "denoising_start": 1.0 - (split / num_train_timesteps),
+ "image": latents,
+ },
+ }
+ pipe_2(**inputs_2).images[0]
+
+ assert expected_steps_2 == done_steps[len(expected_steps_1) :]
+ assert expected_steps == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ steps = 10
+ for split in [300, 700]:
+ for scheduler_cls_timesteps in [
+ (EulerDiscreteScheduler, [901, 801, 701, 601, 501, 401, 301, 201, 101, 1]),
+ (
+ HeunDiscreteScheduler,
+ [
+ 901.0,
+ 801.0,
+ 801.0,
+ 701.0,
+ 701.0,
+ 601.0,
+ 601.0,
+ 501.0,
+ 501.0,
+ 401.0,
+ 401.0,
+ 301.0,
+ 301.0,
+ 201.0,
+ 201.0,
+ 101.0,
+ 101.0,
+ 1.0,
+ 1.0,
+ ],
+ ),
+ ]:
+ assert_run_mixture(steps, split, scheduler_cls_timesteps[0], scheduler_cls_timesteps[1])
+
+ @slow
+ def test_stable_diffusion_two_xl_mixture_of_denoiser(self):
+ components = self.get_dummy_components()
+ pipe_1 = StableDiffusionXLPipeline(**components).to(torch_device)
+ pipe_1.unet.set_default_attn_processor()
+ pipe_2 = StableDiffusionXLImg2ImgPipeline(**components).to(torch_device)
+ pipe_2.unet.set_default_attn_processor()
+
+ def assert_run_mixture(
+ num_steps,
+ split,
+ scheduler_cls_orig,
+ expected_tss,
+ num_train_timesteps=pipe_1.scheduler.config.num_train_timesteps,
+ ):
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = num_steps
+
+ class scheduler_cls(scheduler_cls_orig):
+ pass
+
+ pipe_1.scheduler = scheduler_cls.from_config(pipe_1.scheduler.config)
+ pipe_2.scheduler = scheduler_cls.from_config(pipe_2.scheduler.config)
+
+ # Let's retrieve the number of timesteps we want to use
+ pipe_1.scheduler.set_timesteps(num_steps)
+ expected_steps = pipe_1.scheduler.timesteps.tolist()
+
+ if pipe_1.scheduler.order == 2:
+ expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss))
+ expected_steps_2 = expected_steps_1[-1:] + list(filter(lambda ts: ts < split, expected_tss))
+ expected_steps = expected_steps_1 + expected_steps_2
+ else:
+ expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss))
+ expected_steps_2 = list(filter(lambda ts: ts < split, expected_tss))
+
+            # monkey patch the scheduler's `step` method so that every timestep it
+            # receives is recorded in `done_steps` for verification
+ done_steps = []
+ old_step = copy.copy(scheduler_cls.step)
+
+ def new_step(self, *args, **kwargs):
+ done_steps.append(args[1].cpu().item()) # args[1] is always the passed `t`
+ return old_step(self, *args, **kwargs)
+
+ scheduler_cls.step = new_step
+
+ inputs_1 = {
+ **inputs,
+ **{
+ "denoising_end": 1.0 - (split / num_train_timesteps),
+ "output_type": "latent",
+ },
+ }
+ latents = pipe_1(**inputs_1).images[0]
+
+ assert expected_steps_1 == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ inputs_2 = {
+ **inputs,
+ **{
+ "denoising_start": 1.0 - (split / num_train_timesteps),
+ "image": latents,
+ },
+ }
+ pipe_2(**inputs_2).images[0]
+
+ assert expected_steps_2 == done_steps[len(expected_steps_1) :]
+ assert expected_steps == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ steps = 10
+ for split in [300, 500, 700]:
+ for scheduler_cls_timesteps in [
+ (DDIMScheduler, [901, 801, 701, 601, 501, 401, 301, 201, 101, 1]),
+ (EulerDiscreteScheduler, [901, 801, 701, 601, 501, 401, 301, 201, 101, 1]),
+ (DPMSolverMultistepScheduler, [901, 811, 721, 631, 541, 451, 361, 271, 181, 91]),
+ (UniPCMultistepScheduler, [901, 811, 721, 631, 541, 451, 361, 271, 181, 91]),
+ (
+ HeunDiscreteScheduler,
+ [
+ 901.0,
+ 801.0,
+ 801.0,
+ 701.0,
+ 701.0,
+ 601.0,
+ 601.0,
+ 501.0,
+ 501.0,
+ 401.0,
+ 401.0,
+ 301.0,
+ 301.0,
+ 201.0,
+ 201.0,
+ 101.0,
+ 101.0,
+ 1.0,
+ 1.0,
+ ],
+ ),
+ ]:
+ assert_run_mixture(steps, split, scheduler_cls_timesteps[0], scheduler_cls_timesteps[1])
+
+ steps = 25
+ for split in [300, 500, 700]:
+ for scheduler_cls_timesteps in [
+ (
+ DDIMScheduler,
+ [
+ 961,
+ 921,
+ 881,
+ 841,
+ 801,
+ 761,
+ 721,
+ 681,
+ 641,
+ 601,
+ 561,
+ 521,
+ 481,
+ 441,
+ 401,
+ 361,
+ 321,
+ 281,
+ 241,
+ 201,
+ 161,
+ 121,
+ 81,
+ 41,
+ 1,
+ ],
+ ),
+ (
+ EulerDiscreteScheduler,
+ [
+ 961.0,
+ 921.0,
+ 881.0,
+ 841.0,
+ 801.0,
+ 761.0,
+ 721.0,
+ 681.0,
+ 641.0,
+ 601.0,
+ 561.0,
+ 521.0,
+ 481.0,
+ 441.0,
+ 401.0,
+ 361.0,
+ 321.0,
+ 281.0,
+ 241.0,
+ 201.0,
+ 161.0,
+ 121.0,
+ 81.0,
+ 41.0,
+ 1.0,
+ ],
+ ),
+ (
+ DPMSolverMultistepScheduler,
+ [
+ 951,
+ 913,
+ 875,
+ 837,
+ 799,
+ 761,
+ 723,
+ 685,
+ 647,
+ 609,
+ 571,
+ 533,
+ 495,
+ 457,
+ 419,
+ 381,
+ 343,
+ 305,
+ 267,
+ 229,
+ 191,
+ 153,
+ 115,
+ 77,
+ 39,
+ ],
+ ),
+ (
+ UniPCMultistepScheduler,
+ [
+ 951,
+ 913,
+ 875,
+ 837,
+ 799,
+ 761,
+ 723,
+ 685,
+ 647,
+ 609,
+ 571,
+ 533,
+ 495,
+ 457,
+ 419,
+ 381,
+ 343,
+ 305,
+ 267,
+ 229,
+ 191,
+ 153,
+ 115,
+ 77,
+ 39,
+ ],
+ ),
+ (
+ HeunDiscreteScheduler,
+ [
+ 961.0,
+ 921.0,
+ 921.0,
+ 881.0,
+ 881.0,
+ 841.0,
+ 841.0,
+ 801.0,
+ 801.0,
+ 761.0,
+ 761.0,
+ 721.0,
+ 721.0,
+ 681.0,
+ 681.0,
+ 641.0,
+ 641.0,
+ 601.0,
+ 601.0,
+ 561.0,
+ 561.0,
+ 521.0,
+ 521.0,
+ 481.0,
+ 481.0,
+ 441.0,
+ 441.0,
+ 401.0,
+ 401.0,
+ 361.0,
+ 361.0,
+ 321.0,
+ 321.0,
+ 281.0,
+ 281.0,
+ 241.0,
+ 241.0,
+ 201.0,
+ 201.0,
+ 161.0,
+ 161.0,
+ 121.0,
+ 121.0,
+ 81.0,
+ 81.0,
+ 41.0,
+ 41.0,
+ 1.0,
+ 1.0,
+ ],
+ ),
+ ]:
+ assert_run_mixture(steps, split, scheduler_cls_timesteps[0], scheduler_cls_timesteps[1])
+
+ @slow
+ def test_stable_diffusion_three_xl_mixture_of_denoiser(self):
+ components = self.get_dummy_components()
+ pipe_1 = StableDiffusionXLPipeline(**components).to(torch_device)
+ pipe_1.unet.set_default_attn_processor()
+ pipe_2 = StableDiffusionXLImg2ImgPipeline(**components).to(torch_device)
+ pipe_2.unet.set_default_attn_processor()
+ pipe_3 = StableDiffusionXLImg2ImgPipeline(**components).to(torch_device)
+ pipe_3.unet.set_default_attn_processor()
+
+ def assert_run_mixture(
+ num_steps,
+ split_1,
+ split_2,
+ scheduler_cls_orig,
+ num_train_timesteps=pipe_1.scheduler.config.num_train_timesteps,
+ ):
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = num_steps
+
+ class scheduler_cls(scheduler_cls_orig):
+ pass
+
+ pipe_1.scheduler = scheduler_cls.from_config(pipe_1.scheduler.config)
+ pipe_2.scheduler = scheduler_cls.from_config(pipe_2.scheduler.config)
+ pipe_3.scheduler = scheduler_cls.from_config(pipe_3.scheduler.config)
+
+ # Let's retrieve the number of timesteps we want to use
+ pipe_1.scheduler.set_timesteps(num_steps)
+ expected_steps = pipe_1.scheduler.timesteps.tolist()
+
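+            # convert the fractional splits into absolute timestep thresholds, e.g. with
+            # 1000 training timesteps a split of 0.19 maps to a threshold of 810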
+ split_1_ts = num_train_timesteps - int(round(num_train_timesteps * split_1))
+ split_2_ts = num_train_timesteps - int(round(num_train_timesteps * split_2))
+
+ if pipe_1.scheduler.order == 2:
+ expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps))
+ expected_steps_2 = expected_steps_1[-1:] + list(
+ filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps)
+ )
+ expected_steps_3 = expected_steps_2[-1:] + list(filter(lambda ts: ts < split_2_ts, expected_steps))
+ expected_steps = expected_steps_1 + expected_steps_2 + expected_steps_3
+ else:
+ expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps))
+ expected_steps_2 = list(filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps))
+ expected_steps_3 = list(filter(lambda ts: ts < split_2_ts, expected_steps))
+
+            # monkey patch the scheduler's `step` method so that every timestep it
+            # receives is recorded in `done_steps` for verification
+ done_steps = []
+ old_step = copy.copy(scheduler_cls.step)
+
+ def new_step(self, *args, **kwargs):
+ done_steps.append(args[1].cpu().item()) # args[1] is always the passed `t`
+ return old_step(self, *args, **kwargs)
+
+ scheduler_cls.step = new_step
+
+ inputs_1 = {**inputs, **{"denoising_end": split_1, "output_type": "latent"}}
+ latents = pipe_1(**inputs_1).images[0]
+
+ assert (
+ expected_steps_1 == done_steps
+ ), f"Failure with {scheduler_cls.__name__} and {num_steps} and {split_1} and {split_2}"
+
+ with self.assertRaises(ValueError) as cm:
+ inputs_2 = {
+ **inputs,
+ **{
+ "denoising_start": split_2,
+ "denoising_end": split_1,
+ "image": latents,
+ "output_type": "latent",
+ },
+ }
+ pipe_2(**inputs_2).images[0]
+ assert "cannot be larger than or equal to `denoising_end`" in str(cm.exception)
+
+ inputs_2 = {
+ **inputs,
+ **{"denoising_start": split_1, "denoising_end": split_2, "image": latents, "output_type": "latent"},
+ }
+ pipe_2(**inputs_2).images[0]
+
+ assert expected_steps_2 == done_steps[len(expected_steps_1) :]
+
+ inputs_3 = {**inputs, **{"denoising_start": split_2, "image": latents}}
+ pipe_3(**inputs_3).images[0]
+
+ assert expected_steps_3 == done_steps[len(expected_steps_1) + len(expected_steps_2) :]
+ assert (
+ expected_steps == done_steps
+ ), f"Failure with {scheduler_cls.__name__} and {num_steps} and {split_1} and {split_2}"
+
+ for steps in [7, 11, 20]:
+ for split_1, split_2 in zip([0.19, 0.32], [0.81, 0.68]):
+ for scheduler_cls in [
+ DDIMScheduler,
+ EulerDiscreteScheduler,
+ DPMSolverMultistepScheduler,
+ UniPCMultistepScheduler,
+ HeunDiscreteScheduler,
+ ]:
+ assert_run_mixture(steps, split_1, split_2, scheduler_cls)
+
+ def test_stable_diffusion_xl_multi_prompts(self):
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+
+ # forward with single prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt_2"] = inputs["prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["prompt_2"] = "different prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ # manually set a negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same negative_prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = inputs["negative_prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = "different negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ def test_stable_diffusion_xl_negative_conditions(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice_with_no_neg_cond = image[0, -3:, -3:, -1]
+
+ image = sd_pipe(
+ **inputs,
+ negative_original_size=(512, 512),
+ negative_crops_coords_top_left=(0, 0),
+ negative_target_size=(1024, 1024),
+ ).images
+ image_slice_with_neg_cond = image[0, -3:, -3:, -1]
+
+ self.assertTrue(np.abs(image_slice_with_no_neg_cond - image_slice_with_neg_cond).max() > 1e-2)
+
+ def test_stable_diffusion_xl_save_from_pretrained(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ sd_pipe.save_pretrained(tmpdirname)
+ sd_pipe = StableDiffusionXLPipeline.from_pretrained(tmpdirname).to(torch_device)
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ pipe.unet.set_default_attn_processor()
+
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+
+ def test_pipeline_interrupt(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "hey"
+ num_inference_steps = 3
+
+ # store intermediate latents from the generation process
+ class PipelineState:
+ def __init__(self):
+ self.state = []
+
+ def apply(self, pipe, i, t, callback_kwargs):
+ self.state.append(callback_kwargs["latents"])
+ return callback_kwargs
+
+ pipe_state = PipelineState()
+ sd_pipe(
+ prompt,
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=pipe_state.apply,
+ ).images
+
+ # interrupt generation at step index
+ interrupt_step_idx = 1
+
+ def callback_on_step_end(pipe, i, t, callback_kwargs):
+ if i == interrupt_step_idx:
+ pipe._interrupt = True
+
+ return callback_kwargs
+
+ output_interrupted = sd_pipe(
+ prompt,
+ num_inference_steps=num_inference_steps,
+ output_type="latent",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=callback_on_step_end,
+ ).images
+
+ # fetch intermediate latents at the interrupted step
+ # from the completed generation process
+ intermediate_latent = pipe_state.state[interrupt_step_idx]
+
+ # compare the intermediate latent to the output of the interrupted process
+ # they should be the same
+ assert torch.allclose(intermediate_latent, output_interrupted, atol=1e-4)
+
+
+@slow
+class StableDiffusionXLPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_diffusion_lcm(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel.from_pretrained(
+ "latent-consistency/lcm-ssd-1b", torch_dtype=torch.float16, variant="fp16"
+ )
+ sd_pipe = StableDiffusionXLPipeline.from_pretrained(
+ "segmind/SSD-1B", unet=unet, torch_dtype=torch.float16, variant="fp16"
+ ).to(torch_device)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ prompt = "a red car standing on the side of the street"
+
+ image = sd_pipe(prompt, num_inference_steps=4, guidance_scale=8.0).images[0]
+
+ expected_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/lcm_full/stable_diffusion_ssd_1b_lcm.png"
+ )
+
+ image = sd_pipe.image_processor.pil_to_numpy(image)
+ expected_image = sd_pipe.image_processor.pil_to_numpy(expected_image)
+
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), expected_image.flatten())
+
+ assert max_diff < 1e-2
+
+ def test_download_ckpt_diff_format_is_same(self):
+ ckpt_path = (
+ "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors"
+ )
+
+ pipe = StableDiffusionXLPipeline.from_single_file(ckpt_path, torch_dtype=torch.float16)
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_default_attn_processor()
+ pipe.enable_model_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image_ckpt = pipe("a turtle", num_inference_steps=2, generator=generator, output_type="np").images[0]
+
+ pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
+ )
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_default_attn_processor()
+ pipe.enable_model_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image = pipe("a turtle", num_inference_steps=2, generator=generator, output_type="np").images[0]
+
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), image_ckpt.flatten())
+
+ assert max_diff < 6e-3
+
+ def test_single_file_component_configs(self):
+ pipe = StableDiffusionXLPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
+ )
+ ckpt_path = (
+ "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors"
+ )
+ single_file_pipe = StableDiffusionXLPipeline.from_single_file(
+ ckpt_path, variant="fp16", torch_dtype=torch.float16
+ )
+
+ for param_name, param_value in single_file_pipe.text_encoder.config.to_dict().items():
+ if param_name in ["torch_dtype", "architectures", "_name_or_path"]:
+ continue
+ assert pipe.text_encoder.config.to_dict()[param_name] == param_value
+
+ for param_name, param_value in single_file_pipe.text_encoder_2.config.to_dict().items():
+ if param_name in ["torch_dtype", "architectures", "_name_or_path"]:
+ continue
+ assert pipe.text_encoder_2.config.to_dict()[param_name] == param_value
+
+ PARAMS_TO_IGNORE = ["torch_dtype", "_name_or_path", "architectures", "_use_default_values"]
+ for param_name, param_value in single_file_pipe.unet.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ if param_name == "upcast_attention" and pipe.unet.config[param_name] is None:
+ pipe.unet.config[param_name] = False
+            assert (
+                pipe.unet.config[param_name] == param_value
+            ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.vae.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.vae.config[param_name] == param_value
+ ), f"{param_name} is differs between single file loading and pretrained loading"
diff --git a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_adapter.py b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_adapter.py
new file mode 100644
index 0000000..0bcffeb
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_adapter.py
@@ -0,0 +1,711 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from parameterized import parameterized
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+import diffusers
+from diffusers import (
+ AutoencoderKL,
+ EulerDiscreteScheduler,
+ LCMScheduler,
+ MultiAdapter,
+ StableDiffusionXLAdapterPipeline,
+ T2IAdapter,
+ UNet2DConditionModel,
+)
+from diffusers.utils import load_image, logging
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS, TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineTesterMixin,
+ SDXLOptionalComponentsTesterMixin,
+ assert_mean_pixel_difference,
+)
+
+
+enable_full_determinism()
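+# enable_full_determinism() forces deterministic PyTorch/CUDA behavior so that the hard-coded
+# expected slices in these tests are reproducible across runs.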
+
+
+class StableDiffusionXLAdapterPipelineFastTests(
+ IPAdapterTesterMixin, PipelineTesterMixin, SDXLOptionalComponentsTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionXLAdapterPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+
+ def get_dummy_components(self, adapter_type="full_adapter_xl", time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ time_cond_proj_dim=time_cond_proj_dim,
+ )
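+ # projection_class_embeddings_input_dim=80 mirrors SDXL's "text_time" conditioning: six
+ # micro-conditioning values (original_size, crops_coords_top_left, target_size) embedded with
+ # addition_time_embed_dim=8 and concatenated with the 32-dim pooled text embedding (6 * 8 + 32).
+ # time_cond_proj_dim adds the guidance-scale embedding input that the LCM tests below make use of.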
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ if adapter_type == "full_adapter_xl":
+ adapter = T2IAdapter(
+ in_channels=3,
+ channels=[32, 64],
+ num_res_blocks=2,
+ downscale_factor=4,
+ adapter_type=adapter_type,
+ )
+ elif adapter_type == "multi_adapter":
+ adapter = MultiAdapter(
+ [
+ T2IAdapter(
+ in_channels=3,
+ channels=[32, 64],
+ num_res_blocks=2,
+ downscale_factor=4,
+ adapter_type="full_adapter_xl",
+ ),
+ T2IAdapter(
+ in_channels=3,
+ channels=[32, 64],
+ num_res_blocks=2,
+ downscale_factor=4,
+ adapter_type="full_adapter_xl",
+ ),
+ ]
+ )
+ else:
+ raise ValueError(
+ f"Unknown adapter type: {adapter_type}, must be one of 'full_adapter_xl', or 'multi_adapter''"
+ )
+
+ components = {
+ "adapter": adapter,
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ # "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_components_with_full_downscaling(self, adapter_type="full_adapter_xl"):
+ """Get dummy components with x8 VAE downscaling and 3 UNet down blocks.
+ These dummy components are intended to fully-exercise the T2I-Adapter
+ downscaling behavior.
+ """
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=2,
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=1,
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 32, 32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ if adapter_type == "full_adapter_xl":
+ adapter = T2IAdapter(
+ in_channels=3,
+ channels=[32, 32, 64],
+ num_res_blocks=2,
+ downscale_factor=16,
+ adapter_type=adapter_type,
+ )
+ elif adapter_type == "multi_adapter":
+ adapter = MultiAdapter(
+ [
+ T2IAdapter(
+ in_channels=3,
+ channels=[32, 32, 64],
+ num_res_blocks=2,
+ downscale_factor=16,
+ adapter_type="full_adapter_xl",
+ ),
+ T2IAdapter(
+ in_channels=3,
+ channels=[32, 32, 64],
+ num_res_blocks=2,
+ downscale_factor=16,
+ adapter_type="full_adapter_xl",
+ ),
+ ]
+ )
+ else:
+ raise ValueError(
+ f"Unknown adapter type: {adapter_type}, must be one of 'full_adapter_xl', or 'multi_adapter''"
+ )
+
+ components = {
+ "adapter": adapter,
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ # "safety_checker": None,
+ "feature_extractor": None,
+ "image_encoder": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0, height=64, width=64, num_images=1):
+ if num_images == 1:
+ image = floats_tensor((1, 3, height, width), rng=random.Random(seed)).to(device)
+ else:
+ image = [
+ floats_tensor((1, 3, height, width), rng=random.Random(seed)).to(device) for _ in range(num_images)
+ ]
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 5.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_stable_diffusion_adapter_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLAdapterPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array(
+ [0.5752919, 0.6022097, 0.4728038, 0.49861962, 0.57084894, 0.4644975, 0.5193715, 0.5133664, 0.4729858]
+ )
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-3
+
+ @parameterized.expand(
+ [
+ # (dim=144) The internal feature map will be 9x9 after initial pixel unshuffling (downscaled x16).
+ (((4 * 2 + 1) * 16),),
+ # (dim=160) The internal feature map will be 5x5 after the first T2I down block (downscaled x32).
+ (((4 * 1 + 1) * 32),),
+ ]
+ )
+ def test_multiple_image_dimensions(self, dim):
+ """Test that the T2I-Adapter pipeline supports any input dimension that
+ is divisible by the adapter's `downscale_factor`. This test was added in
+ response to an issue where the T2I Adapter's downscaling padding
+ behavior did not match the UNet's behavior.
+
+ Note that we have selected `dim` values to produce odd resolutions at
+ each downscaling level.
+ """
+ components = self.get_dummy_components_with_full_downscaling()
+ sd_pipe = StableDiffusionXLAdapterPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device, height=dim, width=dim)
+ image = sd_pipe(**inputs).images
+
+ assert image.shape == (1, dim, dim, 3)
+
+ @parameterized.expand(["full_adapter", "full_adapter_xl", "light_adapter"])
+ def test_total_downscale_factor(self, adapter_type):
+ """Test that the T2IAdapter correctly reports its total_downscale_factor."""
+ batch_size = 1
+ in_channels = 3
+ out_channels = [320, 640, 1280, 1280]
+ in_image_size = 512
+
+ adapter = T2IAdapter(
+ in_channels=in_channels,
+ channels=out_channels,
+ num_res_blocks=2,
+ downscale_factor=8,
+ adapter_type=adapter_type,
+ )
+ adapter.to(torch_device)
+
+ in_image = floats_tensor((batch_size, in_channels, in_image_size, in_image_size)).to(torch_device)
+
+ adapter_state = adapter(in_image)
+
+ # Assume that the last element in `adapter_state` has been downsampled the most, and check
+ # that it matches the `total_downscale_factor`.
+ expected_out_image_size = in_image_size // adapter.total_downscale_factor
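+ # total_downscale_factor combines the initial pixel-unshuffle (downscale_factor=8) with the
+ # additional downsampling done inside the adapter's own blocks, which varies per adapter_type.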
+ assert adapter_state[-1].shape == (
+ batch_size,
+ out_channels[-1],
+ expected_out_image_size,
+ expected_out_image_size,
+ )
+
+ def test_save_load_optional_components(self):
+ return self._test_save_load_optional_components()
+
+ def test_adapter_sdxl_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLAdapterPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5425, 0.5385, 0.4964, 0.5045, 0.6149, 0.4974, 0.5469, 0.5332, 0.5426])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_adapter_sdxl_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLAdapterPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5425, 0.5385, 0.4964, 0.5045, 0.6149, 0.4974, 0.5469, 0.5332, 0.5426])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+
+class StableDiffusionXLMultiAdapterPipelineFastTests(
+ StableDiffusionXLAdapterPipelineFastTests, PipelineTesterMixin, unittest.TestCase
+):
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ return super().get_dummy_components("multi_adapter", time_cond_proj_dim=time_cond_proj_dim)
+
+ def get_dummy_components_with_full_downscaling(self):
+ return super().get_dummy_components_with_full_downscaling("multi_adapter")
+
+ def get_dummy_inputs(self, device, seed=0, height=64, width=64):
+ inputs = super().get_dummy_inputs(device, seed, height, width, num_images=2)
+ inputs["adapter_conditioning_scale"] = [0.5, 0.5]
+ return inputs
+
+ def test_stable_diffusion_adapter_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLAdapterPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array(
+ [0.5813032, 0.60995954, 0.47563356, 0.5056669, 0.57199144, 0.4631841, 0.5176794, 0.51252556, 0.47183886]
+ )
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-3
+
+ def test_inference_batch_consistent(
+ self, batch_sizes=[2, 4, 13], additional_params_copy_to_batched_inputs=["num_inference_steps"]
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ for batch_size in batch_sizes:
+ batched_inputs = {}
+ for name, value in inputs.items():
+ if name in self.batch_params:
+ # prompt is string
+ if name == "prompt":
+ len_prompt = len(value)
+ # make unequal batch sizes
+ batched_inputs[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+
+ # make last batch super long
+ batched_inputs[name][-1] = 100 * "very long"
+ elif name == "image":
+ batched_images = []
+
+ for image in value:
+ batched_images.append(batch_size * [image])
+
+ batched_inputs[name] = batched_images
+ else:
+ batched_inputs[name] = batch_size * [value]
+
+ elif name == "batch_size":
+ batched_inputs[name] = batch_size
+ else:
+ batched_inputs[name] = value
+
+ for arg in additional_params_copy_to_batched_inputs:
+ batched_inputs[arg] = inputs[arg]
+
+ batched_inputs["output_type"] = "np"
+
+ output = pipe(**batched_inputs)
+
+ assert len(output[0]) == batch_size
+
+ batched_inputs["output_type"] = "np"
+
+ output = pipe(**batched_inputs)[0]
+
+ assert output.shape[0] == batch_size
+
+ logger.setLevel(level=diffusers.logging.WARNING)
+
+ def test_num_images_per_prompt(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ batch_sizes = [1, 2]
+ num_images_per_prompts = [1, 2]
+
+ for batch_size in batch_sizes:
+ for num_images_per_prompt in num_images_per_prompts:
+ inputs = self.get_dummy_inputs(torch_device)
+
+ for key in inputs.keys():
+ if key in self.batch_params:
+ if key == "image":
+ batched_images = []
+
+ for image in inputs[key]:
+ batched_images.append(batch_size * [image])
+
+ inputs[key] = batched_images
+ else:
+ inputs[key] = batch_size * [inputs[key]]
+
+ images = pipe(**inputs, num_images_per_prompt=num_images_per_prompt)[0]
+
+ assert images.shape[0] == batch_size * num_images_per_prompt
+
+ def test_inference_batch_single_identical(
+ self,
+ batch_size=3,
+ test_max_difference=None,
+ test_mean_pixel_difference=None,
+ relax_max_difference=False,
+ expected_max_diff=2e-3,
+ additional_params_copy_to_batched_inputs=["num_inference_steps"],
+ ):
+ if test_max_difference is None:
+ # TODO(Pedro) - not sure why, but this is not reproducible on mps at the moment;
+ # make sure that batched and non-batched outputs are identical
+ test_max_difference = torch_device != "mps"
+
+ if test_mean_pixel_difference is None:
+ # TODO same as above
+ test_mean_pixel_difference = torch_device != "mps"
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ batched_inputs = {}
+ for name, value in inputs.items():
+ if name in self.batch_params:
+ # prompt is string
+ if name == "prompt":
+ len_prompt = len(value)
+ # make unequal batch sizes
+ batched_inputs[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+
+ # make last batch super long
+ batched_inputs[name][-1] = 100 * "very long"
+ elif name == "image":
+ batched_images = []
+
+ for image in value:
+ batched_images.append(batch_size * [image])
+
+ batched_inputs[name] = batched_images
+ else:
+ batched_inputs[name] = batch_size * [value]
+ elif name == "batch_size":
+ batched_inputs[name] = batch_size
+ elif name == "generator":
+ batched_inputs[name] = [self.get_generator(i) for i in range(batch_size)]
+ else:
+ batched_inputs[name] = value
+
+ for arg in additional_params_copy_to_batched_inputs:
+ batched_inputs[arg] = inputs[arg]
+
+ output_batch = pipe(**batched_inputs)
+ assert output_batch[0].shape[0] == batch_size
+
+ inputs["generator"] = self.get_generator(0)
+
+ output = pipe(**inputs)
+
+ logger.setLevel(level=diffusers.logging.WARNING)
+ if test_max_difference:
+ if relax_max_difference:
+ # Taking the median of the largest differences
+ # is resilient to outliers
+ diff = np.abs(output_batch[0][0] - output[0][0])
+ diff = diff.flatten()
+ diff.sort()
+ max_diff = np.median(diff[-5:])
+ else:
+ max_diff = np.abs(output_batch[0][0] - output[0][0]).max()
+ assert max_diff < expected_max_diff
+
+ if test_mean_pixel_difference:
+ assert_mean_pixel_difference(output_batch[0][0], output[0][0])
+
+ def test_adapter_sdxl_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLAdapterPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5313, 0.5375, 0.4942, 0.5021, 0.6142, 0.4968, 0.5434, 0.5311, 0.5448])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_adapter_sdxl_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLAdapterPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ output = sd_pipe(**inputs)
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+ expected_slice = np.array([0.5313, 0.5375, 0.4942, 0.5021, 0.6142, 0.4968, 0.5434, 0.5311, 0.5448])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+
+@slow
+@require_torch_gpu
+class AdapterSDXLPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_download_ckpt_diff_format_is_same(self):
+ ckpt_path = (
+ "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors"
+ )
+ adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-lineart-sdxl-1.0", torch_dtype=torch.float16)
+ prompt = "toy"
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/t2i_adapter/toy_canny.png"
+ )
+ pipe_single_file = StableDiffusionXLAdapterPipeline.from_single_file(
+ ckpt_path,
+ adapter=adapter,
+ torch_dtype=torch.float16,
+ )
+ pipe_single_file.enable_model_cpu_offload()
+ pipe_single_file.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ images_single_file = pipe_single_file(
+ prompt, image=image, generator=generator, output_type="np", num_inference_steps=3
+ ).images
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0",
+ adapter=adapter,
+ torch_dtype=torch.float16,
+ )
+ pipe.enable_model_cpu_offload()
+ images = pipe(prompt, image=image, generator=generator, output_type="np", num_inference_steps=3).images
+
+ assert images_single_file[0].shape == (768, 512, 3)
+ assert images[0].shape == (768, 512, 3)
+
+ max_diff = numpy_cosine_similarity_distance(images[0].flatten(), images_single_file[0].flatten())
+ assert max_diff < 5e-3
diff --git a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_img2img.py b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_img2img.py
new file mode 100644
index 0000000..9718aed
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_img2img.py
@@ -0,0 +1,850 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import (
+ CLIPImageProcessor,
+ CLIPTextConfig,
+ CLIPTextModel,
+ CLIPTextModelWithProjection,
+ CLIPTokenizer,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import (
+ AutoencoderKL,
+ AutoencoderTiny,
+ DDIMScheduler,
+ EulerDiscreteScheduler,
+ LCMScheduler,
+ StableDiffusionXLImg2ImgPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils import load_image
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+)
+from ..test_pipelines_common import (
+ IPAdapterTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+ SDXLOptionalComponentsTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+class StableDiffusionXLImg2ImgPipelineFastTests(
+ IPAdapterTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionXLImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width"}
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS.union(
+ {"add_text_embeds", "add_time_ids", "add_neg_time_ids"}
+ )
+
+ def get_dummy_components(self, skip_first_text_encoder=False, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ time_cond_proj_dim=time_cond_proj_dim,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=72, # 5 * 8 + 32
+ cross_attention_dim=64 if not skip_first_text_encoder else 32,
+ )
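+ # projection_class_embeddings_input_dim=72 follows the refiner-style conditioning: with
+ # requires_aesthetics_score=True the add_time_ids hold five values (original_size,
+ # crops_coords_top_left, aesthetic_score), so 5 * addition_time_embed_dim + 32 = 72.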
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ image_encoder_config = CLIPVisionConfig(
+ hidden_size=32,
+ image_size=224,
+ projection_dim=32,
+ intermediate_size=37,
+ num_attention_heads=4,
+ num_channels=3,
+ num_hidden_layers=5,
+ patch_size=14,
+ )
+
+ image_encoder = CLIPVisionModelWithProjection(image_encoder_config)
+
+ feature_extractor = CLIPImageProcessor(
+ crop_size=224,
+ do_center_crop=True,
+ do_normalize=True,
+ do_resize=True,
+ image_mean=[0.48145466, 0.4578275, 0.40821073],
+ image_std=[0.26862954, 0.26130258, 0.27577711],
+ resample=3,
+ size=224,
+ )
+
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder if not skip_first_text_encoder else None,
+ "tokenizer": tokenizer if not skip_first_text_encoder else None,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "requires_aesthetics_score": True,
+ "image_encoder": image_encoder,
+ "feature_extractor": feature_extractor,
+ }
+ return components
+
+ def get_dummy_tiny_autoencoder(self):
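+ # AutoencoderTiny is a small TAESD-style VAE; swapping it in below checks that the pipeline
+ # also works with a VAE whose config differs from the default AutoencoderKL.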
+ return AutoencoderTiny(in_channels=3, out_channels=3, latent_channels=4)
+
+ def test_components_function(self):
+ init_components = self.get_dummy_components()
+ init_components.pop("requires_aesthetics_score")
+ pipe = self.pipeline_class(**init_components)
+
+ self.assertTrue(hasattr(pipe, "components"))
+ self.assertTrue(set(pipe.components.keys()) == set(init_components.keys()))
+
+ def get_dummy_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image / 2 + 0.5
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 5.0,
+ "output_type": "np",
+ "strength": 0.8,
+ }
+ return inputs
+
+ def test_stable_diffusion_xl_img2img_euler(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+
+ expected_slice = np.array([0.4664, 0.4886, 0.4403, 0.6902, 0.5592, 0.4534, 0.5931, 0.5951, 0.5224])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_xl_img2img_euler_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+
+ expected_slice = np.array([0.5604, 0.4352, 0.4717, 0.5844, 0.5101, 0.6704, 0.6290, 0.5460, 0.5286])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_xl_img2img_euler_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+
+ expected_slice = np.array([0.5604, 0.4352, 0.4717, 0.5844, 0.5101, 0.6704, 0.6290, 0.5460, 0.5286])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ # TODO(Patrick, Sayak) - skip for now as this requires more refiner tests
+ def test_save_load_optional_components(self):
+ pass
+
+ def test_stable_diffusion_xl_img2img_negative_prompt_embeds(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # forward without prompt embeds
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with prompt embeds
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ prompt = 3 * [inputs.pop("prompt")]
+
+ (
+ prompt_embeds,
+ negative_prompt_embeds,
+ pooled_prompt_embeds,
+ negative_pooled_prompt_embeds,
+ ) = sd_pipe.encode_prompt(prompt, negative_prompt=negative_prompt)
+
+ output = sd_pipe(
+ **inputs,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_prompt_embeds,
+ pooled_prompt_embeds=pooled_prompt_embeds,
+ negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
+ )
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # make sure that it's equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_stable_diffusion_xl_img2img_tiny_autoencoder(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe.vae = self.get_dummy_tiny_autoencoder()
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.0, 0.0, 0.0106, 0.0, 0.0, 0.0087, 0.0052, 0.0062, 0.0177])
+
+ assert np.allclose(image_slice, expected_slice, atol=1e-4, rtol=1e-4)
+
+ @require_torch_gpu
+ def test_stable_diffusion_xl_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ pipe.unet.set_default_attn_processor()
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_stable_diffusion_xl_multi_prompts(self):
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+
+ # forward with single prompt
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ inputs["num_inference_steps"] = 5
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same prompt duplicated
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ inputs["num_inference_steps"] = 5
+ inputs["prompt_2"] = inputs["prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different prompt
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ inputs["num_inference_steps"] = 5
+ inputs["prompt_2"] = "different prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ # manually set a negative_prompt
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ inputs["num_inference_steps"] = 5
+ inputs["negative_prompt"] = "negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same negative_prompt duplicated
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ inputs["num_inference_steps"] = 5
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = inputs["negative_prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different negative_prompt
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ inputs["num_inference_steps"] = 5
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = "different negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ def test_stable_diffusion_xl_img2img_negative_conditions(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice_with_no_neg_conditions = image[0, -3:, -3:, -1]
+
+ image = sd_pipe(
+ **inputs,
+ negative_original_size=(512, 512),
+ negative_crops_coords_top_left=(
+ 0,
+ 0,
+ ),
+ negative_target_size=(1024, 1024),
+ ).images
+ image_slice_with_neg_conditions = image[0, -3:, -3:, -1]
+
+ assert (
+ np.abs(image_slice_with_no_neg_conditions.flatten() - image_slice_with_neg_conditions.flatten()).max()
+ > 1e-4
+ )
+
+ def test_pipeline_interrupt(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ prompt = "hey"
+ num_inference_steps = 5
+
+ # store intermediate latents from the generation process
+ class PipelineState:
+ def __init__(self):
+ self.state = []
+
+ def apply(self, pipe, i, t, callback_kwargs):
+ self.state.append(callback_kwargs["latents"])
+ return callback_kwargs
+
+ pipe_state = PipelineState()
+ sd_pipe(
+ prompt,
+ image=inputs["image"],
+ strength=0.8,
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=pipe_state.apply,
+ ).images
+
+ # interrupt generation at step index
+ interrupt_step_idx = 1
+
+ def callback_on_step_end(pipe, i, t, callback_kwargs):
+ if i == interrupt_step_idx:
+ pipe._interrupt = True
+
+ return callback_kwargs
+
+ output_interrupted = sd_pipe(
+ prompt,
+ image=inputs["image"],
+ strength=0.8,
+ num_inference_steps=num_inference_steps,
+ output_type="latent",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=callback_on_step_end,
+ ).images
+
+ # fetch intermediate latents at the interrupted step
+ # from the completed generation process
+ intermediate_latent = pipe_state.state[interrupt_step_idx]
+
+ # compare the intermediate latent to the output of the interrupted process
+ # they should be the same
+ assert torch.allclose(intermediate_latent, output_interrupted, atol=1e-4)
+
+
+class StableDiffusionXLImg2ImgRefinerOnlyPipelineFastTests(
+ PipelineLatentTesterMixin, PipelineTesterMixin, SDXLOptionalComponentsTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionXLImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width"}
+ required_optional_params = PipelineTesterMixin.required_optional_params - {"latents"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=72, # 5 * 8 + 32
+ cross_attention_dim=32,
+ )
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "tokenizer": None,
+ "text_encoder": None,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "requires_aesthetics_score": True,
+ "image_encoder": None,
+ "feature_extractor": None,
+ }
+ return components
+
+ def test_components_function(self):
+ init_components = self.get_dummy_components()
+ init_components.pop("requires_aesthetics_score")
+ pipe = self.pipeline_class(**init_components)
+
+ self.assertTrue(hasattr(pipe, "components"))
+ self.assertTrue(set(pipe.components.keys()) == set(init_components.keys()))
+
+ def get_dummy_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image / 2 + 0.5
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 5.0,
+ "output_type": "np",
+ "strength": 0.8,
+ }
+ return inputs
+
+ def test_stable_diffusion_xl_img2img_euler(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+
+ expected_slice = np.array([0.4745, 0.4924, 0.4338, 0.6468, 0.5547, 0.4419, 0.5646, 0.5897, 0.5146])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ @require_torch_gpu
+ def test_stable_diffusion_xl_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ pipe.unet.set_default_attn_processor()
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_stable_diffusion_xl_img2img_negative_conditions(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice_with_no_neg_conditions = image[0, -3:, -3:, -1]
+
+ image = sd_pipe(
+ **inputs,
+ negative_original_size=(512, 512),
+ negative_crops_coords_top_left=(
+ 0,
+ 0,
+ ),
+ negative_target_size=(1024, 1024),
+ ).images
+ image_slice_with_neg_conditions = image[0, -3:, -3:, -1]
+
+ assert (
+ np.abs(image_slice_with_no_neg_conditions.flatten() - image_slice_with_neg_conditions.flatten()).max()
+ > 1e-4
+ )
+
+ def test_stable_diffusion_xl_img2img_negative_prompt_embeds(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # forward without prompt embeds
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with prompt embeds
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ prompt = 3 * [inputs.pop("prompt")]
+
+ (
+ prompt_embeds,
+ negative_prompt_embeds,
+ pooled_prompt_embeds,
+ negative_pooled_prompt_embeds,
+ ) = sd_pipe.encode_prompt(prompt, negative_prompt=negative_prompt)
+
+ output = sd_pipe(
+ **inputs,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_prompt_embeds,
+ pooled_prompt_embeds=pooled_prompt_embeds,
+ negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
+ )
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # make sure that it's equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_stable_diffusion_xl_img2img_prompt_embeds_only(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # forward without prompt embeds
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with prompt embeds
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ prompt = 3 * [inputs.pop("prompt")]
+
+ (
+ prompt_embeds,
+ _,
+ pooled_prompt_embeds,
+ _,
+ ) = sd_pipe.encode_prompt(prompt)
+
+ output = sd_pipe(
+ **inputs,
+ prompt_embeds=prompt_embeds,
+ pooled_prompt_embeds=pooled_prompt_embeds,
+ )
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # make sure that it's equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
+
+
+@slow
+class StableDiffusionXLImg2ImgIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_download_ckpt_diff_format_is_same(self):
+ ckpt_path = "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors"
+ init_image = load_image(
+ "https://huggingface.co/datasets/diffusers/test-arrays/resolve/main"
+ "/stable_diffusion_img2img/sketch-mountains-input.png"
+ )
+
+ pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
+ )
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe.unet.set_default_attn_processor()
+ pipe.enable_model_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image = pipe(
+ prompt="mountains", image=init_image, num_inference_steps=5, generator=generator, output_type="np"
+ ).images[0]
+
+ pipe_single_file = StableDiffusionXLImg2ImgPipeline.from_single_file(ckpt_path, torch_dtype=torch.float16)
+ pipe_single_file.scheduler = DDIMScheduler.from_config(pipe_single_file.scheduler.config)
+ pipe_single_file.unet.set_default_attn_processor()
+ pipe_single_file.enable_model_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ image_single_file = pipe_single_file(
+ prompt="mountains", image=init_image, num_inference_steps=5, generator=generator, output_type="np"
+ ).images[0]
+
+ max_diff = numpy_cosine_similarity_distance(image.flatten(), image_single_file.flatten())
+
+ assert max_diff < 5e-2
+
+ def test_single_file_component_configs(self):
+ pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0",
+ torch_dtype=torch.float16,
+ variant="fp16",
+ )
+ ckpt_path = "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors"
+ single_file_pipe = StableDiffusionXLImg2ImgPipeline.from_single_file(ckpt_path, torch_dtype=torch.float16)
+
+ assert pipe.text_encoder is None
+ assert single_file_pipe.text_encoder is None
+
+ for param_name, param_value in single_file_pipe.text_encoder_2.config.to_dict().items():
+ if param_name in ["torch_dtype", "architectures", "_name_or_path"]:
+ continue
+ assert pipe.text_encoder_2.config.to_dict()[param_name] == param_value
+
+ PARAMS_TO_IGNORE = ["torch_dtype", "_name_or_path", "architectures", "_use_default_values"]
+ for param_name, param_value in single_file_pipe.unet.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.unet.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
+
+ for param_name, param_value in single_file_pipe.vae.config.items():
+ if param_name in PARAMS_TO_IGNORE:
+ continue
+ assert (
+ pipe.vae.config[param_name] == param_value
+ ), f"{param_name} differs between single file loading and pretrained loading"
diff --git a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_inpaint.py b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_inpaint.py
new file mode 100644
index 0000000..11c711e
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_inpaint.py
@@ -0,0 +1,810 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import copy
+import random
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import (
+ CLIPImageProcessor,
+ CLIPTextConfig,
+ CLIPTextModel,
+ CLIPTextModelWithProjection,
+ CLIPTokenizer,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ DPMSolverMultistepScheduler,
+ EulerDiscreteScheduler,
+ HeunDiscreteScheduler,
+ LCMScheduler,
+ StableDiffusionXLInpaintPipeline,
+ UNet2DConditionModel,
+ UniPCMultistepScheduler,
+)
+from diffusers.utils.testing_utils import enable_full_determinism, floats_tensor, require_torch_gpu, slow, torch_device
+
+from ..pipeline_params import (
+ TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_INPAINTING_PARAMS,
+ TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS,
+)
+from ..test_pipelines_common import IPAdapterTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class StableDiffusionXLInpaintPipelineFastTests(
+ IPAdapterTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableDiffusionXLInpaintPipeline
+ params = TEXT_GUIDED_IMAGE_INPAINTING_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ image_params = frozenset([])
+ # TODO: update image_params once pipeline is refactored with VaeImageProcessor.preprocess
+ image_latents_params = frozenset([])
+ callback_cfg_params = TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS.union(
+ {
+ "add_text_embeds",
+ "add_time_ids",
+ "mask",
+ "masked_image_latents",
+ }
+ )
+
+ def get_dummy_components(self, skip_first_text_encoder=False, time_cond_proj_dim=None):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ time_cond_proj_dim=time_cond_proj_dim,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SD2-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=72, # 5 * 8 + 32
+ cross_attention_dim=64 if not skip_first_text_encoder else 32,
+ )
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ torch.manual_seed(0)
+ image_encoder_config = CLIPVisionConfig(
+ hidden_size=32,
+ image_size=224,
+ projection_dim=32,
+ intermediate_size=37,
+ num_attention_heads=4,
+ num_channels=3,
+ num_hidden_layers=5,
+ patch_size=14,
+ )
+
+ image_encoder = CLIPVisionModelWithProjection(image_encoder_config)
+
+ feature_extractor = CLIPImageProcessor(
+ crop_size=224,
+ do_center_crop=True,
+ do_normalize=True,
+ do_resize=True,
+ image_mean=[0.48145466, 0.4578275, 0.40821073],
+ image_std=[0.26862954, 0.26130258, 0.27577711],
+ resample=3,
+ size=224,
+ )
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder if not skip_first_text_encoder else None,
+ "tokenizer": tokenizer if not skip_first_text_encoder else None,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "image_encoder": image_encoder,
+ "feature_extractor": feature_extractor,
+ "requires_aesthetics_score": True,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ # TODO: use tensor inputs instead of PIL, this is here just to leave the old expected_slices untouched
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
+ # create mask
+ image[8:, 8:, :] = 255
+ mask_image = Image.fromarray(np.uint8(image)).convert("L").resize((64, 64))
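+ # For inpainting, white mask pixels mark the region to repaint and black pixels are preserved,
+ # so the block set to 255 above becomes the area that gets inpainted.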
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": init_image,
+ "mask_image": mask_image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "strength": 1.0,
+ "output_type": "np",
+ }
+ return inputs
+
+ def get_dummy_inputs_2images(self, device, seed=0, img_res=64):
+ # Get random floats in [0, 1] as an image with spatial size (img_res, img_res)
+ image1 = floats_tensor((1, 3, img_res, img_res), rng=random.Random(seed)).to(device)
+ image2 = floats_tensor((1, 3, img_res, img_res), rng=random.Random(seed + 22)).to(device)
+ # Convert images to [-1, 1]
+ init_image1 = 2.0 * image1 - 1.0
+ init_image2 = 2.0 * image2 - 1.0
+
+ # empty mask
+ mask_image = torch.zeros((1, 1, img_res, img_res), device=device)
+
+ if str(device).startswith("mps"):
+ generator1 = torch.manual_seed(seed)
+ generator2 = torch.manual_seed(seed)
+ else:
+ generator1 = torch.Generator(device=device).manual_seed(seed)
+ generator2 = torch.Generator(device=device).manual_seed(seed)
+
+ inputs = {
+ "prompt": ["A painting of a squirrel eating a burger"] * 2,
+ "image": [init_image1, init_image2],
+ "mask_image": [mask_image] * 2,
+ "generator": [generator1, generator2],
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_components_function(self):
+ init_components = self.get_dummy_components()
+ init_components.pop("requires_aesthetics_score")
+ pipe = self.pipeline_class(**init_components)
+
+ self.assertTrue(hasattr(pipe, "components"))
+ self.assertTrue(set(pipe.components.keys()) == set(init_components.keys()))
+
+ def test_stable_diffusion_xl_inpaint_euler(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLInpaintPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.8029, 0.5523, 0.5825, 0.6003, 0.6702, 0.7018, 0.6369, 0.5955, 0.5123])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_xl_inpaint_euler_lcm(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLInpaintPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.6611, 0.5569, 0.5531, 0.5471, 0.5918, 0.6393, 0.5074, 0.5468, 0.5185])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_xl_inpaint_euler_lcm_custom_timesteps(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(time_cond_proj_dim=256)
+ sd_pipe = StableDiffusionXLInpaintPipeline(**components)
+ sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.config)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ del inputs["num_inference_steps"]
+ inputs["timesteps"] = [999, 499]
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.6611, 0.5569, 0.5531, 0.5471, 0.5918, 0.6393, 0.5074, 0.5468, 0.5185])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=3e-3)
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ # TODO(Patrick, Sayak) - skip for now as this requires more refiner tests
+ def test_save_load_optional_components(self):
+ pass
+
+ def test_stable_diffusion_xl_inpaint_negative_prompt_embeds(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLInpaintPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # forward without prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ inputs["negative_prompt"] = negative_prompt
+ inputs["prompt"] = 3 * [inputs["prompt"]]
+
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with prompt embeds
+ inputs = self.get_dummy_inputs(torch_device)
+ negative_prompt = 3 * ["this is a negative prompt"]
+ prompt = 3 * [inputs.pop("prompt")]
+
+ (
+ prompt_embeds,
+ negative_prompt_embeds,
+ pooled_prompt_embeds,
+ negative_pooled_prompt_embeds,
+ ) = sd_pipe.encode_prompt(prompt, negative_prompt=negative_prompt)
+
+ output = sd_pipe(
+ **inputs,
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_prompt_embeds,
+ pooled_prompt_embeds=pooled_prompt_embeds,
+ negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
+ )
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # make sure the two results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ @require_torch_gpu
+ def test_stable_diffusion_xl_offloads(self):
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLInpaintPipeline(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLInpaintPipeline(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLInpaintPipeline(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ pipe.unet.set_default_attn_processor()
+
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_stable_diffusion_xl_refiner(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components(skip_first_text_encoder=True)
+
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.7045, 0.4838, 0.5454, 0.6270, 0.6168, 0.6717, 0.6484, 0.5681, 0.4922])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_two_xl_mixture_of_denoiser_fast(self):
+ components = self.get_dummy_components()
+ pipe_1 = StableDiffusionXLInpaintPipeline(**components).to(torch_device)
+ pipe_1.unet.set_default_attn_processor()
+ pipe_2 = StableDiffusionXLInpaintPipeline(**components).to(torch_device)
+ pipe_2.unet.set_default_attn_processor()
+
+ def assert_run_mixture(
+ num_steps, split, scheduler_cls_orig, num_train_timesteps=pipe_1.scheduler.config.num_train_timesteps
+ ):
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = num_steps
+
+ class scheduler_cls(scheduler_cls_orig):
+ pass
+
+ pipe_1.scheduler = scheduler_cls.from_config(pipe_1.scheduler.config)
+ pipe_2.scheduler = scheduler_cls.from_config(pipe_2.scheduler.config)
+
+ # Let's retrieve the timesteps we want to use
+ pipe_1.scheduler.set_timesteps(num_steps)
+ expected_steps = pipe_1.scheduler.timesteps.tolist()
+
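+ # `denoising_end`/`denoising_start` are fractions of the full training schedule; timesteps are
+ # ordered from high to low, so the first denoiser handles every timestep >= split_ts and the
+ # second one handles the rest (2nd-order schedulers repeat the boundary step on the handoff)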
+ split_ts = num_train_timesteps - int(round(num_train_timesteps * split))
+
+ if pipe_1.scheduler.order == 2:
+ expected_steps_1 = list(filter(lambda ts: ts >= split_ts, expected_steps))
+ expected_steps_2 = expected_steps_1[-1:] + list(filter(lambda ts: ts < split_ts, expected_steps))
+ expected_steps = expected_steps_1 + expected_steps_2
+ else:
+ expected_steps_1 = list(filter(lambda ts: ts >= split_ts, expected_steps))
+ expected_steps_2 = list(filter(lambda ts: ts < split_ts, expected_steps))
+
+ # now we monkeypatch the scheduler's `step` method so that every timestep it is
+ # called with gets appended to `done_steps` for checking below
+ done_steps = []
+ old_step = copy.copy(scheduler_cls.step)
+
+ def new_step(self, *args, **kwargs):
+ done_steps.append(args[1].cpu().item()) # args[1] is always the passed `t`
+ return old_step(self, *args, **kwargs)
+
+ scheduler_cls.step = new_step
+
+ inputs_1 = {**inputs, **{"denoising_end": split, "output_type": "latent"}}
+ latents = pipe_1(**inputs_1).images[0]
+
+ assert expected_steps_1 == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ inputs_2 = {**inputs, **{"denoising_start": split, "image": latents}}
+ pipe_2(**inputs_2).images[0]
+
+ assert expected_steps_2 == done_steps[len(expected_steps_1) :]
+ assert expected_steps == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ for steps in [7, 20]:
+ assert_run_mixture(steps, 0.33, EulerDiscreteScheduler)
+ assert_run_mixture(steps, 0.33, HeunDiscreteScheduler)
+
+ @slow
+ def test_stable_diffusion_two_xl_mixture_of_denoiser(self):
+ components = self.get_dummy_components()
+ pipe_1 = StableDiffusionXLInpaintPipeline(**components).to(torch_device)
+ pipe_1.unet.set_default_attn_processor()
+ pipe_2 = StableDiffusionXLInpaintPipeline(**components).to(torch_device)
+ pipe_2.unet.set_default_attn_processor()
+
+ def assert_run_mixture(
+ num_steps, split, scheduler_cls_orig, num_train_timesteps=pipe_1.scheduler.config.num_train_timesteps
+ ):
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = num_steps
+
+ class scheduler_cls(scheduler_cls_orig):
+ pass
+
+ pipe_1.scheduler = scheduler_cls.from_config(pipe_1.scheduler.config)
+ pipe_2.scheduler = scheduler_cls.from_config(pipe_2.scheduler.config)
+
+ # Let's retrieve the timesteps we want to use
+ pipe_1.scheduler.set_timesteps(num_steps)
+ expected_steps = pipe_1.scheduler.timesteps.tolist()
+
+ split_ts = num_train_timesteps - int(round(num_train_timesteps * split))
+
+ if pipe_1.scheduler.order == 2:
+ expected_steps_1 = list(filter(lambda ts: ts >= split_ts, expected_steps))
+ expected_steps_2 = expected_steps_1[-1:] + list(filter(lambda ts: ts < split_ts, expected_steps))
+ expected_steps = expected_steps_1 + expected_steps_2
+ else:
+ expected_steps_1 = list(filter(lambda ts: ts >= split_ts, expected_steps))
+ expected_steps_2 = list(filter(lambda ts: ts < split_ts, expected_steps))
+
+ # now we monkeypatch the scheduler's `step` method so that every timestep it is
+ # called with gets appended to `done_steps` for checking below
+ done_steps = []
+ old_step = copy.copy(scheduler_cls.step)
+
+ def new_step(self, *args, **kwargs):
+ done_steps.append(args[1].cpu().item()) # args[1] is always the passed `t`
+ return old_step(self, *args, **kwargs)
+
+ scheduler_cls.step = new_step
+
+ inputs_1 = {**inputs, **{"denoising_end": split, "output_type": "latent"}}
+ latents = pipe_1(**inputs_1).images[0]
+
+ assert expected_steps_1 == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ inputs_2 = {**inputs, **{"denoising_start": split, "image": latents}}
+ pipe_2(**inputs_2).images[0]
+
+ assert expected_steps_2 == done_steps[len(expected_steps_1) :]
+ assert expected_steps == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}"
+
+ for steps in [5, 8, 20]:
+ for split in [0.33, 0.49, 0.71]:
+ for scheduler_cls in [
+ DDIMScheduler,
+ EulerDiscreteScheduler,
+ DPMSolverMultistepScheduler,
+ UniPCMultistepScheduler,
+ HeunDiscreteScheduler,
+ ]:
+ assert_run_mixture(steps, split, scheduler_cls)
+
+ @slow
+ def test_stable_diffusion_three_xl_mixture_of_denoiser(self):
+ components = self.get_dummy_components()
+ pipe_1 = StableDiffusionXLInpaintPipeline(**components).to(torch_device)
+ pipe_1.unet.set_default_attn_processor()
+ pipe_2 = StableDiffusionXLInpaintPipeline(**components).to(torch_device)
+ pipe_2.unet.set_default_attn_processor()
+ pipe_3 = StableDiffusionXLInpaintPipeline(**components).to(torch_device)
+ pipe_3.unet.set_default_attn_processor()
+
+ def assert_run_mixture(
+ num_steps,
+ split_1,
+ split_2,
+ scheduler_cls_orig,
+ num_train_timesteps=pipe_1.scheduler.config.num_train_timesteps,
+ ):
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = num_steps
+
+ class scheduler_cls(scheduler_cls_orig):
+ pass
+
+ pipe_1.scheduler = scheduler_cls.from_config(pipe_1.scheduler.config)
+ pipe_2.scheduler = scheduler_cls.from_config(pipe_2.scheduler.config)
+ pipe_3.scheduler = scheduler_cls.from_config(pipe_3.scheduler.config)
+
+ # Let's retrieve the timesteps we want to use
+ pipe_1.scheduler.set_timesteps(num_steps)
+ expected_steps = pipe_1.scheduler.timesteps.tolist()
+
+ split_1_ts = num_train_timesteps - int(round(num_train_timesteps * split_1))
+ split_2_ts = num_train_timesteps - int(round(num_train_timesteps * split_2))
+
+ if pipe_1.scheduler.order == 2:
+ expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps))
+ expected_steps_2 = expected_steps_1[-1:] + list(
+ filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps)
+ )
+ expected_steps_3 = expected_steps_2[-1:] + list(filter(lambda ts: ts < split_2_ts, expected_steps))
+ expected_steps = expected_steps_1 + expected_steps_2 + expected_steps_3
+ else:
+ expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps))
+ expected_steps_2 = list(filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps))
+ expected_steps_3 = list(filter(lambda ts: ts < split_2_ts, expected_steps))
+
+ # now we monkeypatch the scheduler's `step` method so that every timestep it is
+ # called with gets appended to `done_steps` for checking below
+ done_steps = []
+ old_step = copy.copy(scheduler_cls.step)
+
+ def new_step(self, *args, **kwargs):
+ done_steps.append(args[1].cpu().item()) # args[1] is always the passed `t`
+ return old_step(self, *args, **kwargs)
+
+ scheduler_cls.step = new_step
+
+ inputs_1 = {**inputs, **{"denoising_end": split_1, "output_type": "latent"}}
+ latents = pipe_1(**inputs_1).images[0]
+
+ assert (
+ expected_steps_1 == done_steps
+ ), f"Failure with {scheduler_cls.__name__} and {num_steps} and {split_1} and {split_2}"
+
+ inputs_2 = {
+ **inputs,
+ **{"denoising_start": split_1, "denoising_end": split_2, "image": latents, "output_type": "latent"},
+ }
+ pipe_2(**inputs_2).images[0]
+
+ assert expected_steps_2 == done_steps[len(expected_steps_1) :]
+
+ inputs_3 = {**inputs, **{"denoising_start": split_2, "image": latents}}
+ pipe_3(**inputs_3).images[0]
+
+ assert expected_steps_3 == done_steps[len(expected_steps_1) + len(expected_steps_2) :]
+ assert (
+ expected_steps == done_steps
+ ), f"Failure with {scheduler_cls.__name__} and {num_steps} and {split_1} and {split_2}"
+
+ for steps in [7, 11, 20]:
+ for split_1, split_2 in zip([0.19, 0.32], [0.81, 0.68]):
+ for scheduler_cls in [
+ DDIMScheduler,
+ EulerDiscreteScheduler,
+ DPMSolverMultistepScheduler,
+ UniPCMultistepScheduler,
+ HeunDiscreteScheduler,
+ ]:
+ assert_run_mixture(steps, split_1, split_2, scheduler_cls)
+
+ def test_stable_diffusion_xl_multi_prompts(self):
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+
+ # forward with single prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = 5
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = 5
+ inputs["prompt_2"] = inputs["prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = 5
+ inputs["prompt_2"] = "different prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ # manually set a negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = 5
+ inputs["negative_prompt"] = "negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_1 = output.images[0, -3:, -3:, -1]
+
+ # forward with same negative_prompt duplicated
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = 5
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = inputs["negative_prompt"]
+ output = sd_pipe(**inputs)
+ image_slice_2 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are equal
+ assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4
+
+ # forward with different negative_prompt
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = 5
+ inputs["negative_prompt"] = "negative prompt"
+ inputs["negative_prompt_2"] = "different negative prompt"
+ output = sd_pipe(**inputs)
+ image_slice_3 = output.images[0, -3:, -3:, -1]
+
+ # ensure the results are not equal
+ assert np.abs(image_slice_1.flatten() - image_slice_3.flatten()).max() > 1e-4
+
+ def test_stable_diffusion_xl_img2img_negative_conditions(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe(**inputs).images
+ image_slice_with_no_neg_conditions = image[0, -3:, -3:, -1]
+
+ image = sd_pipe(
+ **inputs,
+ negative_original_size=(512, 512),
+ negative_crops_coords_top_left=(
+ 0,
+ 0,
+ ),
+ negative_target_size=(1024, 1024),
+ ).images
+ image_slice_with_neg_conditions = image[0, -3:, -3:, -1]
+
+ assert (
+ np.abs(image_slice_with_no_neg_conditions.flatten() - image_slice_with_neg_conditions.flatten()).max()
+ > 1e-4
+ )
+
+ def test_stable_diffusion_xl_inpaint_mask_latents(self):
+ device = "cpu"
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # normal mask + normal image
+ ## `image`: pil, `mask_image`: pil, `masked_image_latents`: None
+ inputs = self.get_dummy_inputs(device)
+ inputs["strength"] = 0.9
+ out_0 = sd_pipe(**inputs).images
+
+ # image latents + mask latents
+ inputs = self.get_dummy_inputs(device)
+ image = sd_pipe.image_processor.preprocess(inputs["image"]).to(sd_pipe.device)
+ mask = sd_pipe.mask_processor.preprocess(inputs["mask_image"]).to(sd_pipe.device)
+ masked_image = image * (mask < 0.5)
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_latents = sd_pipe._encode_vae_image(image, generator=generator)
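+ # unused draw; presumably here just to advance the generator so the RNG state matches the reference run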
+ torch.randn((1, 4, 32, 32), generator=generator)
+ mask_latents = sd_pipe._encode_vae_image(masked_image, generator=generator)
+ inputs["image"] = image_latents
+ inputs["masked_image_latents"] = mask_latents
+ inputs["mask_image"] = mask
+ inputs["strength"] = 0.9
+ generator = torch.Generator(device=device).manual_seed(0)
+ torch.randn((1, 4, 32, 32), generator=generator)
+ inputs["generator"] = generator
+ out_1 = sd_pipe(**inputs).images
+ assert np.abs(out_0 - out_1).max() < 1e-2
+
+ def test_stable_diffusion_xl_inpaint_2_images(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ # test to confirm that passing the same image twice yields the same output
+ inputs = self.get_dummy_inputs(device)
+ gen1 = torch.Generator(device=device).manual_seed(0)
+ gen2 = torch.Generator(device=device).manual_seed(0)
+ for name in ["prompt", "image", "mask_image"]:
+ inputs[name] = [inputs[name]] * 2
+ inputs["generator"] = [gen1, gen2]
+ images = sd_pipe(**inputs).images
+
+ assert images.shape == (2, 64, 64, 3)
+
+ image_slice1 = images[0, -3:, -3:, -1]
+ image_slice2 = images[1, -3:, -3:, -1]
+ assert np.abs(image_slice1.flatten() - image_slice2.flatten()).max() < 1e-4
+
+ # test to confirm that passing two different images yields different outputs
+ inputs = self.get_dummy_inputs_2images(device)
+ images = sd_pipe(**inputs).images
+ assert images.shape == (2, 64, 64, 3)
+
+ image_slice1 = images[0, -3:, -3:, -1]
+ image_slice2 = images[1, -3:, -3:, -1]
+ assert np.abs(image_slice1.flatten() - image_slice2.flatten()).max() > 1e-2
+
+ def test_pipeline_interrupt(self):
+ components = self.get_dummy_components()
+ sd_pipe = StableDiffusionXLInpaintPipeline(**components)
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ prompt = "hey"
+ num_inference_steps = 5
+
+ # store intermediate latents from the generation process
+ class PipelineState:
+ def __init__(self):
+ self.state = []
+
+ def apply(self, pipe, i, t, callback_kwargs):
+ self.state.append(callback_kwargs["latents"])
+ return callback_kwargs
+
+ pipe_state = PipelineState()
+ sd_pipe(
+ prompt,
+ image=inputs["image"],
+ mask_image=inputs["mask_image"],
+ strength=0.8,
+ num_inference_steps=num_inference_steps,
+ output_type="np",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=pipe_state.apply,
+ ).images
+
+ # interrupt generation at step index
+ interrupt_step_idx = 1
+
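+ # setting the pipeline's private `_interrupt` flag from the callback makes the denoising loop
+ # skip all remaining steps, so the returned latents should match the ones recorded at that step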
+ def callback_on_step_end(pipe, i, t, callback_kwargs):
+ if i == interrupt_step_idx:
+ pipe._interrupt = True
+
+ return callback_kwargs
+
+ output_interrupted = sd_pipe(
+ prompt,
+ image=inputs["image"],
+ mask_image=inputs["mask_image"],
+ strength=0.8,
+ num_inference_steps=num_inference_steps,
+ output_type="latent",
+ generator=torch.Generator("cpu").manual_seed(0),
+ callback_on_step_end=callback_on_step_end,
+ ).images
+
+ # fetch intermediate latents at the interrupted step
+ # from the completed generation process
+ intermediate_latent = pipe_state.state[interrupt_step_idx]
+
+ # compare the intermediate latent to the output of the interrupted process
+ # they should be the same
+ assert torch.allclose(intermediate_latent, output_interrupted, atol=1e-4)
diff --git a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_instruction_pix2pix.py b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_instruction_pix2pix.py
new file mode 100644
index 0000000..0b4324f
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_instruction_pix2pix.py
@@ -0,0 +1,189 @@
+# coding=utf-8
+# Copyright 2024 Harutatsu Akiyama and HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ EulerDiscreteScheduler,
+ UNet2DConditionModel,
+)
+from diffusers.image_processor import VaeImageProcessor
+from diffusers.pipelines.stable_diffusion_xl.pipeline_stable_diffusion_xl_instruct_pix2pix import (
+ StableDiffusionXLInstructPix2PixPipeline,
+)
+from diffusers.utils.testing_utils import enable_full_determinism, floats_tensor, torch_device
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+)
+from ..test_pipelines_common import (
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+ SDXLOptionalComponentsTesterMixin,
+)
+
+
+enable_full_determinism()
+
+
+class StableDiffusionXLInstructPix2PixPipelineFastTests(
+ PipelineLatentTesterMixin,
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineTesterMixin,
+ SDXLOptionalComponentsTesterMixin,
+ unittest.TestCase,
+):
+ pipeline_class = StableDiffusionXLInstructPix2PixPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width", "cross_attention_kwargs"}
+ batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=8,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SDXL-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ timestep_spacing="leading",
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SDXL-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
+ image = image / 2 + 0.5
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "image_guidance_scale": 1,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_components_function(self):
+ init_components = self.get_dummy_components()
+ pipe = self.pipeline_class(**init_components)
+
+ self.assertTrue(hasattr(pipe, "components"))
+ self.assertTrue(set(pipe.components.keys()) == set(init_components.keys()))
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+ def test_attention_slicing_forward_pass(self):
+ super().test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+ # Overwrite the default test_latents_input because pix2pix encodes the image differently
+ def test_latents_input(self):
+ components = self.get_dummy_components()
+ pipe = StableDiffusionXLInstructPix2PixPipeline(**components)
+ pipe.image_processor = VaeImageProcessor(do_resize=False, do_normalize=False)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ out = pipe(**self.get_dummy_inputs_by_type(torch_device, input_image_type="pt"))[0]
+
+ vae = components["vae"]
+ inputs = self.get_dummy_inputs_by_type(torch_device, input_image_type="pt")
+
+ for image_param in self.image_latents_params:
+ if image_param in inputs.keys():
+ inputs[image_param] = vae.encode(inputs[image_param]).latent_dist.mode()
+
+ out_latents_inputs = pipe(**inputs)[0]
+
+ max_diff = np.abs(out - out_latents_inputs).max()
+ self.assertLess(max_diff, 1e-4, "passing latents as image input generates a different result from passing an image")
+
+ def test_cfg(self):
+ pass
+
+ def test_save_load_optional_components(self):
+ self._test_save_load_optional_components()
diff --git a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_k_diffusion.py b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_k_diffusion.py
new file mode 100644
index 0000000..d5d2abf
--- /dev/null
+++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_k_diffusion.py
@@ -0,0 +1,138 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import StableDiffusionXLKDiffusionPipeline
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, slow, torch_device
+
+
+enable_full_determinism()
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionXLKPipelineIntegrationTests(unittest.TestCase):
+ dtype = torch.float16
+
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_diffusion_xl(self):
+ sd_pipe = StableDiffusionXLKDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=self.dtype
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ sd_pipe.set_scheduler("sample_euler")
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+ output = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=9.0,
+ num_inference_steps=20,
+ height=512,
+ width=512,
+ output_type="np",
+ )
+
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array(
+ [0.79600024, 0.796546, 0.80682373, 0.79428387, 0.7905743, 0.8008807, 0.786183, 0.7835959, 0.797892]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_karras_sigmas(self):
+ sd_pipe = StableDiffusionXLKDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=self.dtype
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ sd_pipe.set_scheduler("sample_dpmpp_2m")
+
+ prompt = "A painting of a squirrel eating a burger"
+ generator = torch.manual_seed(0)
+ output = sd_pipe(
+ [prompt],
+ generator=generator,
+ guidance_scale=7.5,
+ num_inference_steps=15,
+ output_type="np",
+ use_karras_sigmas=True,
+ height=512,
+ width=512,
+ )
+
+ image = output.images
+
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array(
+ [0.9506951, 0.9527786, 0.95309967, 0.9511477, 0.952523, 0.9515326, 0.9511933, 0.9480397, 0.94930184]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_stable_diffusion_noise_sampler_seed(self):
+ sd_pipe = StableDiffusionXLKDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=self.dtype
+ )
+ sd_pipe = sd_pipe.to(torch_device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ sd_pipe.set_scheduler("sample_dpmpp_sde")
+
+ prompt = "A painting of a squirrel eating a burger"
+ seed = 0
+ images1 = sd_pipe(
+ [prompt],
+ generator=torch.manual_seed(seed),
+ noise_sampler_seed=seed,
+ guidance_scale=9.0,
+ num_inference_steps=20,
+ output_type="np",
+ height=512,
+ width=512,
+ ).images
+ images2 = sd_pipe(
+ [prompt],
+ generator=torch.manual_seed(seed),
+ noise_sampler_seed=seed,
+ guidance_scale=9.0,
+ num_inference_steps=20,
+ output_type="np",
+ height=512,
+ width=512,
+ ).images
+ assert images1.shape == (1, 512, 512, 3)
+ assert images2.shape == (1, 512, 512, 3)
+ assert np.abs(images1.flatten() - images2.flatten()).max() < 1e-2
diff --git a/tests/pipelines/stable_unclip/__init__.py b/tests/pipelines/stable_unclip/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_unclip/test_stable_unclip.py b/tests/pipelines/stable_unclip/test_stable_unclip.py
new file mode 100644
index 0000000..f05edf6
--- /dev/null
+++ b/tests/pipelines/stable_unclip/test_stable_unclip.py
@@ -0,0 +1,239 @@
+import gc
+import unittest
+
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ DDPMScheduler,
+ PriorTransformer,
+ StableUnCLIPPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.pipelines.stable_diffusion.stable_unclip_image_normalizer import StableUnCLIPImageNormalizer
+from diffusers.utils.testing_utils import enable_full_determinism, load_numpy, nightly, require_torch_gpu, torch_device
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import (
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+ assert_mean_pixel_difference,
+)
+
+
+enable_full_determinism()
+
+
+class StableUnCLIPPipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableUnCLIPPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+ # TODO(will) Expected attn_bias.stride(1) == 0 to be true, but got false
+ test_xformers_attention = False
+
+ def get_dummy_components(self):
+ embedder_hidden_size = 32
+ embedder_projection_dim = embedder_hidden_size
+
+ # prior components
+
+ torch.manual_seed(0)
+ prior_tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ torch.manual_seed(0)
+ prior_text_encoder = CLIPTextModelWithProjection(
+ CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=embedder_hidden_size,
+ projection_dim=embedder_projection_dim,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ )
+
+ torch.manual_seed(0)
+ prior = PriorTransformer(
+ num_attention_heads=2,
+ attention_head_dim=12,
+ embedding_dim=embedder_projection_dim,
+ num_layers=1,
+ )
+
+ torch.manual_seed(0)
+ prior_scheduler = DDPMScheduler(
+ variance_type="fixed_small_log",
+ prediction_type="sample",
+ num_train_timesteps=1000,
+ clip_sample=True,
+ clip_sample_range=5.0,
+ beta_schedule="squaredcos_cap_v2",
+ )
+
+ # regular denoising components
+
+ torch.manual_seed(0)
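+ # the image normalizer scales the CLIP image embeddings and the DDPM scheduler noise-augments them before they condition the UNet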
+ image_normalizer = StableUnCLIPImageNormalizer(embedding_dim=embedder_hidden_size)
+ image_noising_scheduler = DDPMScheduler(beta_schedule="squaredcos_cap_v2")
+
+ torch.manual_seed(0)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ torch.manual_seed(0)
+ text_encoder = CLIPTextModel(
+ CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=embedder_hidden_size,
+ projection_dim=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ )
+
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
+ up_block_types=("UpBlock2D", "CrossAttnUpBlock2D"),
+ block_out_channels=(32, 64),
+ attention_head_dim=(2, 4),
+ class_embed_type="projection",
+ # The class embeddings are the noise augmented image embeddings.
+ # I.e. the image embeddings concatenated with the noised embeddings of the same dimension
+ projection_class_embeddings_input_dim=embedder_projection_dim * 2,
+ cross_attention_dim=embedder_hidden_size,
+ layers_per_block=1,
+ upcast_attention=True,
+ use_linear_projection=True,
+ )
+
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_schedule="scaled_linear",
+ beta_start=0.00085,
+ beta_end=0.012,
+ prediction_type="v_prediction",
+ set_alpha_to_one=False,
+ steps_offset=1,
+ )
+
+ torch.manual_seed(0)
+ vae = AutoencoderKL()
+
+ components = {
+ # prior components
+ "prior_tokenizer": prior_tokenizer,
+ "prior_text_encoder": prior_text_encoder,
+ "prior": prior,
+ "prior_scheduler": prior_scheduler,
+ # image noising components
+ "image_normalizer": image_normalizer,
+ "image_noising_scheduler": image_noising_scheduler,
+ # regular denoising components
+ "tokenizer": tokenizer,
+ "text_encoder": text_encoder,
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "prior_num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ # Overriding PipelineTesterMixin::test_attention_slicing_forward_pass
+ # because UnCLIP GPU non-determinism requires a looser check.
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+
+ self._test_attention_slicing_forward_pass(test_max_difference=test_max_difference)
+
+ # Overriding PipelineTesterMixin::test_inference_batch_single_identical
+ # because UnCLIP non-determinism requires a looser check.
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=1e-3)
+
+
+@nightly
+@require_torch_gpu
+class StableUnCLIPPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_unclip(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/stable_unclip_2_1_l_anime_turtle_fp16.npy"
+ )
+
+ pipe = StableUnCLIPPipeline.from_pretrained("fusing/stable-unclip-2-1-l", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ # stable unclip will oom when integration tests are run on a V100,
+ # so turn on memory savings
+ pipe.enable_attention_slicing()
+ pipe.enable_sequential_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe("anime turle", generator=generator, output_type="np")
+
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
+
+ def test_stable_unclip_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = StableUnCLIPPipeline.from_pretrained("fusing/stable-unclip-2-1-l", torch_dtype=torch.float16)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+ pipe.enable_sequential_cpu_offload()
+
+ _ = pipe(
+ "anime turtle",
+ prior_num_inference_steps=2,
+ num_inference_steps=2,
+ output_type="np",
+ )
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 7 GB is allocated
+ assert mem_bytes < 7 * 10**9
diff --git a/tests/pipelines/stable_unclip/test_stable_unclip_img2img.py b/tests/pipelines/stable_unclip/test_stable_unclip_img2img.py
new file mode 100644
index 0000000..12f6a91
--- /dev/null
+++ b/tests/pipelines/stable_unclip/test_stable_unclip_img2img.py
@@ -0,0 +1,300 @@
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import (
+ CLIPImageProcessor,
+ CLIPTextConfig,
+ CLIPTextModel,
+ CLIPTokenizer,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import AutoencoderKL, DDIMScheduler, DDPMScheduler, StableUnCLIPImg2ImgPipeline, UNet2DConditionModel
+from diffusers.pipelines.pipeline_utils import DiffusionPipeline
+from diffusers.pipelines.stable_diffusion.stable_unclip_image_normalizer import StableUnCLIPImageNormalizer
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ skip_mps,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS, TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+from ..test_pipelines_common import (
+ PipelineKarrasSchedulerTesterMixin,
+ PipelineLatentTesterMixin,
+ PipelineTesterMixin,
+ assert_mean_pixel_difference,
+)
+
+
+enable_full_determinism()
+
+
+class StableUnCLIPImg2ImgPipelineFastTests(
+ PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+ pipeline_class = StableUnCLIPImg2ImgPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = frozenset(
+ []
+ ) # TODO: update image_params once pipeline is refactored with VaeImageProcessor.preprocess
+ image_latents_params = frozenset([])
+
+ def get_dummy_components(self):
+ embedder_hidden_size = 32
+ embedder_projection_dim = embedder_hidden_size
+
+ # image encoding components
+
+ feature_extractor = CLIPImageProcessor(crop_size=32, size=32)
+
+ torch.manual_seed(0)
+ image_encoder = CLIPVisionModelWithProjection(
+ CLIPVisionConfig(
+ hidden_size=embedder_hidden_size,
+ projection_dim=embedder_projection_dim,
+ num_hidden_layers=5,
+ num_attention_heads=4,
+ image_size=32,
+ intermediate_size=37,
+ patch_size=1,
+ )
+ )
+
+ # regular denoising components
+
+ torch.manual_seed(0)
+ image_normalizer = StableUnCLIPImageNormalizer(embedding_dim=embedder_hidden_size)
+ image_noising_scheduler = DDPMScheduler(beta_schedule="squaredcos_cap_v2")
+
+ torch.manual_seed(0)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ torch.manual_seed(0)
+ text_encoder = CLIPTextModel(
+ CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=embedder_hidden_size,
+ projection_dim=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ )
+
+ torch.manual_seed(0)
+ unet = UNet2DConditionModel(
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
+ up_block_types=("UpBlock2D", "CrossAttnUpBlock2D"),
+ block_out_channels=(32, 64),
+ attention_head_dim=(2, 4),
+ class_embed_type="projection",
+ # The class embeddings are the noise augmented image embeddings.
+ # I.e. the image embeddings concatenated with the noised embeddings of the same dimension
+ projection_class_embeddings_input_dim=embedder_projection_dim * 2,
+ cross_attention_dim=embedder_hidden_size,
+ layers_per_block=1,
+ upcast_attention=True,
+ use_linear_projection=True,
+ )
+
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_schedule="scaled_linear",
+ beta_start=0.00085,
+ beta_end=0.012,
+ prediction_type="v_prediction",
+ set_alpha_to_one=False,
+ steps_offset=1,
+ )
+
+ torch.manual_seed(0)
+ vae = AutoencoderKL()
+
+ components = {
+ # image encoding components
+ "feature_extractor": feature_extractor,
+ "image_encoder": image_encoder.eval(),
+ # image noising components
+ "image_normalizer": image_normalizer.eval(),
+ "image_noising_scheduler": image_noising_scheduler,
+ # regular denoising components
+ "tokenizer": tokenizer,
+ "text_encoder": text_encoder.eval(),
+ "unet": unet.eval(),
+ "scheduler": scheduler,
+ "vae": vae.eval(),
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0, pil_image=True):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ input_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+
+ if pil_image:
+ input_image = input_image * 0.5 + 0.5
+ input_image = input_image.clamp(0, 1)
+ input_image = input_image.cpu().permute(0, 2, 3, 1).float().numpy()
+ input_image = DiffusionPipeline.numpy_to_pil(input_image)[0]
+
+ return {
+ "prompt": "An anime racoon running a marathon",
+ "image": input_image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+
+ @skip_mps
+ def test_image_embeds_none(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = StableUnCLIPImg2ImgPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs.update({"image_embeds": None})
+ image = sd_pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 32, 32, 3)
+ expected_slice = np.array([0.3872, 0.7224, 0.5601, 0.4741, 0.6872, 0.5814, 0.4636, 0.3867, 0.5078])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ # Overriding PipelineTesterMixin::test_attention_slicing_forward_pass
+ # because GPU non-determinism requires a looser check.
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device in ["cpu", "mps"]
+
+ self._test_attention_slicing_forward_pass(test_max_difference=test_max_difference)
+
+ # Overriding PipelineTesterMixin::test_inference_batch_single_identical
+ # because non-determinism requires a looser check.
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=1e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(test_max_difference=False)
+
+
+@nightly
+@require_torch_gpu
+class StableUnCLIPImg2ImgPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_stable_unclip_l_img2img(self):
+ input_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/turtle.png"
+ )
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/stable_unclip_2_1_l_img2img_anime_turtle_fp16.npy"
+ )
+
+ pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
+ "fusing/stable-unclip-2-1-l-img2img", torch_dtype=torch.float16
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ # stable unclip will oom when integration tests are run on a V100,
+ # so turn on memory savings
+ pipe.enable_attention_slicing()
+ pipe.enable_sequential_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe(input_image, "anime turle", generator=generator, output_type="np")
+
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
+
+ def test_stable_unclip_h_img2img(self):
+ input_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/turtle.png"
+ )
+
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/stable_unclip_2_1_h_img2img_anime_turtle_fp16.npy"
+ )
+
+ pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
+ "fusing/stable-unclip-2-1-h-img2img", torch_dtype=torch.float16
+ )
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ # stable unclip will oom when integration tests are run on a V100,
+ # so turn on memory savings
+ pipe.enable_attention_slicing()
+ pipe.enable_sequential_cpu_offload()
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipe(input_image, "anime turle", generator=generator, output_type="np")
+
+ image = output.images[0]
+
+ assert image.shape == (768, 768, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
+
+ def test_stable_unclip_img2img_pipeline_with_sequential_cpu_offloading(self):
+ input_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/turtle.png"
+ )
+
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
+ "fusing/stable-unclip-2-1-h-img2img", torch_dtype=torch.float16
+ )
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+ pipe.enable_sequential_cpu_offload()
+
+ _ = pipe(
+ input_image,
+ "anime turtle",
+ num_inference_steps=2,
+ output_type="np",
+ )
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 7 GB is allocated
+ assert mem_bytes < 7 * 10**9
diff --git a/tests/pipelines/stable_video_diffusion/__init__.py b/tests/pipelines/stable_video_diffusion/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/stable_video_diffusion/test_stable_video_diffusion.py b/tests/pipelines/stable_video_diffusion/test_stable_video_diffusion.py
new file mode 100644
index 0000000..33cf4c7
--- /dev/null
+++ b/tests/pipelines/stable_video_diffusion/test_stable_video_diffusion.py
@@ -0,0 +1,554 @@
+import gc
+import random
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from transformers import (
+ CLIPImageProcessor,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+import diffusers
+from diffusers import (
+ AutoencoderKLTemporalDecoder,
+ EulerDiscreteScheduler,
+ StableVideoDiffusionPipeline,
+ UNetSpatioTemporalConditionModel,
+)
+from diffusers.utils import is_accelerate_available, is_accelerate_version, load_image, logging
+from diffusers.utils.import_utils import is_xformers_available
+from diffusers.utils.testing_utils import (
+ CaptureLogger,
+ enable_full_determinism,
+ floats_tensor,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+def to_np(tensor):
+ if isinstance(tensor, torch.Tensor):
+ tensor = tensor.detach().cpu().numpy()
+
+ return tensor
+
+
+class StableVideoDiffusionPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = StableVideoDiffusionPipeline
+ params = frozenset(["image"])
+ batch_params = frozenset(["image", "generator"])
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "generator",
+ "latents",
+ "return_dict",
+ ]
+ )
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNetSpatioTemporalConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=8,
+ out_channels=4,
+ down_block_types=(
+ "CrossAttnDownBlockSpatioTemporal",
+ "DownBlockSpatioTemporal",
+ ),
+ up_block_types=("UpBlockSpatioTemporal", "CrossAttnUpBlockSpatioTemporal"),
+ cross_attention_dim=32,
+ num_attention_heads=8,
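+ # 96 = 3 added time ids (fps, motion_bucket_id, noise_aug_strength) * addition_time_embed_dim (32)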
+ projection_class_embeddings_input_dim=96,
+ addition_time_embed_dim=32,
+ )
+ scheduler = EulerDiscreteScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ interpolation_type="linear",
+ num_train_timesteps=1000,
+ prediction_type="v_prediction",
+ sigma_max=700.0,
+ sigma_min=0.002,
+ steps_offset=1,
+ timestep_spacing="leading",
+ timestep_type="continuous",
+ trained_betas=None,
+ use_karras_sigmas=True,
+ )
+
+ torch.manual_seed(0)
+ vae = AutoencoderKLTemporalDecoder(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ latent_channels=4,
+ )
+
+ torch.manual_seed(0)
+ config = CLIPVisionConfig(
+ hidden_size=32,
+ projection_dim=32,
+ num_hidden_layers=5,
+ num_attention_heads=4,
+ image_size=32,
+ intermediate_size=37,
+ patch_size=1,
+ )
+ image_encoder = CLIPVisionModelWithProjection(config)
+
+ torch.manual_seed(0)
+ feature_extractor = CLIPImageProcessor(crop_size=32, size=32)
+ components = {
+ "unet": unet,
+ "image_encoder": image_encoder,
+ "scheduler": scheduler,
+ "vae": vae,
+ "feature_extractor": feature_extractor,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device="cpu").manual_seed(seed)
+
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(0)).to(device)
+ inputs = {
+ "generator": generator,
+ "image": image,
+ "num_inference_steps": 2,
+ "output_type": "pt",
+ "min_guidance_scale": 1.0,
+ "max_guidance_scale": 2.5,
+ "num_frames": 2,
+ "height": 32,
+ "width": 32,
+ }
+ return inputs
+
+ @unittest.skip("Deprecated functionality")
+ def test_attention_slicing_forward_pass(self):
+ pass
+
+ @unittest.skip("Batched inference works and outputs look correct, but the test is failing")
+ def test_inference_batch_single_identical(
+ self,
+ batch_size=2,
+ expected_max_diff=1e-4,
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for components in pipe.components.values():
+ if hasattr(components, "set_default_attn_processor"):
+ components.set_default_attn_processor()
+ pipe.to(torch_device)
+
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_dummy_inputs(torch_device)
+
+ # Reset the generator in case it has been used in self.get_dummy_inputs
+ inputs["generator"] = torch.Generator("cpu").manual_seed(0)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ batched_inputs = {}
+ batched_inputs.update(inputs)
+
+ batched_inputs["generator"] = [torch.Generator("cpu").manual_seed(0) for i in range(batch_size)]
+ batched_inputs["image"] = torch.cat([inputs["image"]] * batch_size, dim=0)
+
+ output = pipe(**inputs).frames
+ output_batch = pipe(**batched_inputs).frames
+
+ assert len(output_batch) == batch_size
+
+ max_diff = np.abs(to_np(output_batch[0]) - to_np(output[0])).max()
+ assert max_diff < expected_max_diff
+
+ @unittest.skip("Test is similar to test_inference_batch_single_identical")
+ def test_inference_batch_consistent(self):
+ pass
+
+ def test_np_output_type(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ inputs["output_type"] = "np"
+ output = pipe(**inputs).frames
+ self.assertTrue(isinstance(output, np.ndarray))
+ self.assertEqual(len(output.shape), 5)
+
+ def test_dict_tuple_outputs_equivalent(self, expected_max_difference=1e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator_device = "cpu"
+ output = pipe(**self.get_dummy_inputs(generator_device)).frames[0]
+ output_tuple = pipe(**self.get_dummy_inputs(generator_device), return_dict=False)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_tuple)).max()
+ self.assertLess(max_diff, expected_max_difference)
+
+ @unittest.skip("Test is currently failing")
+ def test_float16_inference(self, expected_max_diff=5e-2):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ components = self.get_dummy_components()
+ pipe_fp16 = self.pipeline_class(**components)
+ for component in pipe_fp16.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe_fp16.to(torch_device, torch.float16)
+ pipe_fp16.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output = pipe(**inputs).frames[0]
+
+ fp16_inputs = self.get_dummy_inputs(torch_device)
+ output_fp16 = pipe_fp16(**fp16_inputs).frames[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_fp16)).max()
+ self.assertLess(max_diff, expected_max_diff, "The outputs of the fp16 and fp32 pipelines are too different.")
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self, expected_max_diff=1e-2):
+ components = self.get_dummy_components()
+ for name, module in components.items():
+ if hasattr(module, "half"):
+ components[name] = module.to(torch_device).half()
+
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output = pipe(**inputs).frames[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir, torch_dtype=torch.float16)
+ for component in pipe_loaded.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for name, component in pipe_loaded.components.items():
+ if hasattr(component, "dtype"):
+ self.assertTrue(
+ component.dtype == torch.float16,
+ f"`{name}.dtype` switched from `float16` to {component.dtype} after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_loaded = pipe_loaded(**inputs).frames[0]
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(
+ max_diff, expected_max_diff, "The output of the fp16 pipeline changed after saving and loading."
+ )
+
+ def test_save_load_optional_components(self, expected_max_difference=1e-4):
+ if not hasattr(self.pipeline_class, "_optional_components"):
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ # set all optional components to None
+ for optional_component in pipe._optional_components:
+ setattr(pipe, optional_component, None)
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ output = pipe(**inputs).frames[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir, safe_serialization=False)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+ for component in pipe_loaded.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for optional_component in pipe._optional_components:
+ self.assertTrue(
+ getattr(pipe_loaded, optional_component) is None,
+ f"`{optional_component}` did not stay set to None after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(generator_device)
+ output_loaded = pipe_loaded(**inputs).frames[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(max_diff, expected_max_difference)
+
+ def test_save_load_local(self, expected_max_difference=9e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output = pipe(**inputs).frames[0]
+
+ logger = logging.get_logger("diffusers.pipelines.pipeline_utils")
+ logger.setLevel(diffusers.logging.INFO)
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir, safe_serialization=False)
+
+ with CaptureLogger(logger) as cap_logger:
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+
+ for name in pipe_loaded.components.keys():
+ if name not in pipe_loaded._optional_components:
+ assert name in str(cap_logger)
+
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_loaded = pipe_loaded(**inputs).frames[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(max_diff, expected_max_difference)
+
+ @unittest.skipIf(torch_device != "cuda", reason="CUDA and CPU are required to switch devices")
+ def test_to_device(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe.to("cpu")
+ model_devices = [
+ component.device.type for component in pipe.components.values() if hasattr(component, "device")
+ ]
+ self.assertTrue(all(device == "cpu" for device in model_devices))
+
+ output_cpu = pipe(**self.get_dummy_inputs("cpu")).frames[0]
+ self.assertTrue(np.isnan(output_cpu).sum() == 0)
+
+ pipe.to("cuda")
+ model_devices = [
+ component.device.type for component in pipe.components.values() if hasattr(component, "device")
+ ]
+ self.assertTrue(all(device == "cuda" for device in model_devices))
+
+ output_cuda = pipe(**self.get_dummy_inputs("cuda")).frames[0]
+ self.assertTrue(np.isnan(to_np(output_cuda)).sum() == 0)
+
+ def test_to_dtype(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ model_dtypes = [component.dtype for component in pipe.components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float32 for dtype in model_dtypes))
+
+ pipe.to(dtype=torch.float16)
+ model_dtypes = [component.dtype for component in pipe.components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float16 for dtype in model_dtypes))
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_accelerate_available() or is_accelerate_version("<", "0.14.0"),
+ reason="CPU offload is only available with CUDA and `accelerate v0.14.0` or higher",
+ )
+ def test_sequential_cpu_offload_forward_pass(self, expected_max_diff=1e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ output_without_offload = pipe(**inputs).frames[0]
+
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_dummy_inputs(generator_device)
+ output_with_offload = pipe(**inputs).frames[0]
+
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, expected_max_diff, "CPU offloading should not affect the inference results")
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_accelerate_available() or is_accelerate_version("<", "0.17.0"),
+ reason="CPU offload is only available with CUDA and `accelerate v0.17.0` or higher",
+ )
+ def test_model_cpu_offload_forward_pass(self, expected_max_diff=2e-4):
+ generator_device = "cpu"
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(generator_device)
+ output_without_offload = pipe(**inputs).frames[0]
+
+ pipe.enable_model_cpu_offload()
+ inputs = self.get_dummy_inputs(generator_device)
+ output_with_offload = pipe(**inputs).frames[0]
+
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, expected_max_diff, "CPU offloading should not affect the inference results")
+ offloaded_modules = [
+ v
+ for k, v in pipe.components.items()
+ if isinstance(v, torch.nn.Module) and k not in pipe._exclude_from_cpu_offload
+ ]
+ # every offloadable module should end up on the CPU once offloading is enabled
+ self.assertTrue(
+ all(v.device.type == "cpu" for v in offloaded_modules),
+ f"Not offloaded: {[v for v in offloaded_modules if v.device.type != 'cpu']}",
+ )
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ expected_max_diff = 9e-4
+
+ if not self.test_xformers_attention:
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_without_offload = pipe(**inputs).frames[0]
+ output_without_offload = (
+ output_without_offload.cpu() if torch.is_tensor(output_without_offload) else output_without_offload
+ )
+
+ pipe.enable_xformers_memory_efficient_attention()
+ inputs = self.get_dummy_inputs(torch_device)
+ output_with_offload = pipe(**inputs).frames[0]
+ output_with_offload = (
+ output_with_offload.cpu() if torch.is_tensor(output_with_offload) else output_with_offload
+ )
+
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, expected_max_diff, "XFormers attention should not affect the inference results")
+
+ def test_disable_cfg(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
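+ # a max guidance scale of 1.0 disables classifier-free guidance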
+ inputs["max_guidance_scale"] = 1.0
+ output = pipe(**inputs).frames
+ self.assertEqual(len(output.shape), 5)
+
+
+@slow
+@require_torch_gpu
+class StableVideoDiffusionPipelineSlowTests(unittest.TestCase):
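+ # These tests load the real stabilityai/stable-video-diffusion-img2vid checkpoint
+ # and therefore need a CUDA GPU and network access.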
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_sd_video(self):
+ pipe = StableVideoDiffusionPipeline.from_pretrained(
+ "stabilityai/stable-video-diffusion-img2vid",
+ variant="fp16",
+ torch_dtype=torch.float16,
+ )
+ pipe = pipe.to(torch_device)
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
+ )
+
+ generator = torch.Generator("cpu").manual_seed(0)
+ num_frames = 3
+
+ output = pipe(
+ image=image,
+ num_frames=num_frames,
+ generator=generator,
+ num_inference_steps=3,
+ output_type="np",
+ )
+
+ image = output.frames[0]
+ assert image.shape == (num_frames, 576, 1024, 3)
+
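+ # compare the bottom-right 3x3 patch of the first frame's last channel against reference values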
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.8592, 0.8645, 0.8499, 0.8722, 0.8769, 0.8421, 0.8557, 0.8528, 0.8285])
+ assert numpy_cosine_similarity_distance(image_slice.flatten(), expected_slice.flatten()) < 1e-3
diff --git a/tests/pipelines/test_pipeline_utils.py b/tests/pipelines/test_pipeline_utils.py
new file mode 100644
index 0000000..51d987d
--- /dev/null
+++ b/tests/pipelines/test_pipeline_utils.py
@@ -0,0 +1,134 @@
+import unittest
+
+from diffusers.pipelines.pipeline_utils import is_safetensors_compatible
+
+
+class IsSafetensorsCompatibleTests(unittest.TestCase):
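+ # These tests pin down the expected behavior of `is_safetensors_compatible`:
+ # it should return True only when every PyTorch (.bin) weight file has a
+ # matching .safetensors counterpart, optionally taking a `variant` into account.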
+ def test_all_is_compatible(self):
+ filenames = [
+ "safety_checker/pytorch_model.bin",
+ "safety_checker/model.safetensors",
+ "vae/diffusion_pytorch_model.bin",
+ "vae/diffusion_pytorch_model.safetensors",
+ "text_encoder/pytorch_model.bin",
+ "text_encoder/model.safetensors",
+ "unet/diffusion_pytorch_model.bin",
+ "unet/diffusion_pytorch_model.safetensors",
+ ]
+ self.assertTrue(is_safetensors_compatible(filenames))
+
+ def test_diffusers_model_is_compatible(self):
+ filenames = [
+ "unet/diffusion_pytorch_model.bin",
+ "unet/diffusion_pytorch_model.safetensors",
+ ]
+ self.assertTrue(is_safetensors_compatible(filenames))
+
+ def test_diffusers_model_is_not_compatible(self):
+ filenames = [
+ "safety_checker/pytorch_model.bin",
+ "safety_checker/model.safetensors",
+ "vae/diffusion_pytorch_model.bin",
+ "vae/diffusion_pytorch_model.safetensors",
+ "text_encoder/pytorch_model.bin",
+ "text_encoder/model.safetensors",
+ "unet/diffusion_pytorch_model.bin",
+ # Removed: 'unet/diffusion_pytorch_model.safetensors',
+ ]
+ self.assertFalse(is_safetensors_compatible(filenames))
+
+ def test_transformer_model_is_compatible(self):
+ filenames = [
+ "text_encoder/pytorch_model.bin",
+ "text_encoder/model.safetensors",
+ ]
+ self.assertTrue(is_safetensors_compatible(filenames))
+
+ def test_transformer_model_is_not_compatible(self):
+ filenames = [
+ "safety_checker/pytorch_model.bin",
+ "safety_checker/model.safetensors",
+ "vae/diffusion_pytorch_model.bin",
+ "vae/diffusion_pytorch_model.safetensors",
+ "text_encoder/pytorch_model.bin",
+ # Removed: 'text_encoder/model.safetensors',
+ "unet/diffusion_pytorch_model.bin",
+ "unet/diffusion_pytorch_model.safetensors",
+ ]
+ self.assertFalse(is_safetensors_compatible(filenames))
+
+ def test_all_is_compatible_variant(self):
+ filenames = [
+ "safety_checker/pytorch_model.fp16.bin",
+ "safety_checker/model.fp16.safetensors",
+ "vae/diffusion_pytorch_model.fp16.bin",
+ "vae/diffusion_pytorch_model.fp16.safetensors",
+ "text_encoder/pytorch_model.fp16.bin",
+ "text_encoder/model.fp16.safetensors",
+ "unet/diffusion_pytorch_model.fp16.bin",
+ "unet/diffusion_pytorch_model.fp16.safetensors",
+ ]
+ variant = "fp16"
+ self.assertTrue(is_safetensors_compatible(filenames, variant=variant))
+
+ def test_diffusers_model_is_compatible_variant(self):
+ filenames = [
+ "unet/diffusion_pytorch_model.fp16.bin",
+ "unet/diffusion_pytorch_model.fp16.safetensors",
+ ]
+ variant = "fp16"
+ self.assertTrue(is_safetensors_compatible(filenames, variant=variant))
+
+ def test_diffusers_model_is_compatible_variant_partial(self):
+ # pass variant but use the non-variant filenames
+ filenames = [
+ "unet/diffusion_pytorch_model.bin",
+ "unet/diffusion_pytorch_model.safetensors",
+ ]
+ variant = "fp16"
+ self.assertTrue(is_safetensors_compatible(filenames, variant=variant))
+
+ def test_diffusers_model_is_not_compatible_variant(self):
+ filenames = [
+ "safety_checker/pytorch_model.fp16.bin",
+ "safety_checker/model.fp16.safetensors",
+ "vae/diffusion_pytorch_model.fp16.bin",
+ "vae/diffusion_pytorch_model.fp16.safetensors",
+ "text_encoder/pytorch_model.fp16.bin",
+ "text_encoder/model.fp16.safetensors",
+ "unet/diffusion_pytorch_model.fp16.bin",
+ # Removed: 'unet/diffusion_pytorch_model.fp16.safetensors',
+ ]
+ variant = "fp16"
+ self.assertFalse(is_safetensors_compatible(filenames, variant=variant))
+
+ def test_transformer_model_is_compatible_variant(self):
+ filenames = [
+ "text_encoder/pytorch_model.fp16.bin",
+ "text_encoder/model.fp16.safetensors",
+ ]
+ variant = "fp16"
+ self.assertTrue(is_safetensors_compatible(filenames, variant=variant))
+
+ def test_transformer_model_is_compatible_variant_partial(self):
+ # pass variant but use the non-variant filenames
+ filenames = [
+ "text_encoder/pytorch_model.bin",
+ "text_encoder/model.safetensors",
+ ]
+ variant = "fp16"
+ self.assertTrue(is_safetensors_compatible(filenames, variant=variant))
+
+ def test_transformer_model_is_not_compatible_variant(self):
+ filenames = [
+ "safety_checker/pytorch_model.fp16.bin",
+ "safety_checker/model.fp16.safetensors",
+ "vae/diffusion_pytorch_model.fp16.bin",
+ "vae/diffusion_pytorch_model.fp16.safetensors",
+ "text_encoder/pytorch_model.fp16.bin",
+ # 'text_encoder/model.fp16.safetensors',
+ "unet/diffusion_pytorch_model.fp16.bin",
+ "unet/diffusion_pytorch_model.fp16.safetensors",
+ ]
+ variant = "fp16"
+ self.assertFalse(is_safetensors_compatible(filenames, variant=variant))
diff --git a/tests/pipelines/test_pipelines.py b/tests/pipelines/test_pipelines.py
new file mode 100644
index 0000000..8954456
--- /dev/null
+++ b/tests/pipelines/test_pipelines.py
@@ -0,0 +1,1932 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import json
+import os
+import random
+import shutil
+import sys
+import tempfile
+import traceback
+import unittest
+import unittest.mock as mock
+
+import numpy as np
+import PIL.Image
+import requests_mock
+import safetensors.torch
+import torch
+from parameterized import parameterized
+from PIL import Image
+from requests.exceptions import HTTPError
+from transformers import CLIPImageProcessor, CLIPModel, CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ ConfigMixin,
+ DDIMPipeline,
+ DDIMScheduler,
+ DDPMPipeline,
+ DDPMScheduler,
+ DiffusionPipeline,
+ DPMSolverMultistepScheduler,
+ EulerAncestralDiscreteScheduler,
+ EulerDiscreteScheduler,
+ LMSDiscreteScheduler,
+ ModelMixin,
+ PNDMScheduler,
+ StableDiffusionImg2ImgPipeline,
+ StableDiffusionInpaintPipelineLegacy,
+ StableDiffusionPipeline,
+ UNet2DConditionModel,
+ UNet2DModel,
+ UniPCMultistepScheduler,
+ logging,
+)
+from diffusers.pipelines.pipeline_utils import _get_pipeline_class
+from diffusers.schedulers.scheduling_utils import SCHEDULER_CONFIG_NAME
+from diffusers.utils import (
+ CONFIG_NAME,
+ WEIGHTS_NAME,
+)
+from diffusers.utils.testing_utils import (
+ CaptureLogger,
+ enable_full_determinism,
+ floats_tensor,
+ get_tests_dir,
+ load_numpy,
+ nightly,
+ require_compel,
+ require_flax,
+ require_onnxruntime,
+ require_python39_or_higher,
+ require_torch_2,
+ require_torch_gpu,
+ run_test_in_subprocess,
+ slow,
+ torch_device,
+)
+from diffusers.utils.torch_utils import is_compiled_module
+
+
+enable_full_determinism()
+
+
+# Will be run via run_test_in_subprocess
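+# (running in a subprocess avoids torch.compile / dynamo state leaking into other tests)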
+def _test_from_save_pretrained_dynamo(in_queue, out_queue, timeout):
+ error = None
+ try:
+ # 1. Load models
+ model = UNet2DModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=3,
+ out_channels=3,
+ down_block_types=("DownBlock2D", "AttnDownBlock2D"),
+ up_block_types=("AttnUpBlock2D", "UpBlock2D"),
+ )
+ model = torch.compile(model)
+ scheduler = DDPMScheduler(num_train_timesteps=10)
+
+ ddpm = DDPMPipeline(model, scheduler)
+
+ # previous diffusers versions stripped compilation off
+ # compiled modules
+ assert is_compiled_module(ddpm.unet)
+
+ ddpm.to(torch_device)
+ ddpm.set_progress_bar_config(disable=None)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ ddpm.save_pretrained(tmpdirname)
+ new_ddpm = DDPMPipeline.from_pretrained(tmpdirname)
+ new_ddpm.to(torch_device)
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+ image = ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+ new_image = new_ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+
+ assert np.abs(image - new_image).max() < 1e-5, "Models don't give the same forward pass"
+ except Exception:
+ error = f"{traceback.format_exc()}"
+
+ results = {"error": error}
+ out_queue.put(results, timeout=timeout)
+ out_queue.join()
+
+
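+# Minimal custom model/pipeline pair used by `test_custom_model_and_pipeline` below.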
+class CustomEncoder(ModelMixin, ConfigMixin):
+ def __init__(self):
+ super().__init__()
+
+
+class CustomPipeline(DiffusionPipeline):
+ def __init__(self, encoder: CustomEncoder, scheduler: DDIMScheduler):
+ super().__init__()
+ self.register_modules(encoder=encoder, scheduler=scheduler)
+
+
+class DownloadTests(unittest.TestCase):
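+ # Tests for `DiffusionPipeline.download`/`from_pretrained` hub behavior:
+ # request counts, variant selection, safetensors vs. .bin weights, ignored
+ # files, and textual-inversion loading.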
+ def test_one_request_upon_cached(self):
+ # TODO: For some reason this test fails on MPS where no HEAD call is made.
+ if torch_device == "mps":
+ return
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ with requests_mock.mock(real_http=True) as m:
+ DiffusionPipeline.download("hf-internal-testing/tiny-stable-diffusion-pipe", cache_dir=tmpdirname)
+
+ download_requests = [r.method for r in m.request_history]
+ assert download_requests.count("HEAD") == 15, "15 calls to files"
+ assert download_requests.count("GET") == 17, "15 calls to files + model_info + model_index.json"
+ assert (
+ len(download_requests) == 32
+ ), "2 calls per file (15 files) + send_telemetry, model_info and model_index.json"
+
+ with requests_mock.mock(real_http=True) as m:
+ DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", safety_checker=None, cache_dir=tmpdirname
+ )
+
+ cache_requests = [r.method for r in m.request_history]
+ assert cache_requests.count("HEAD") == 1, "model_index.json is only HEAD"
+ assert cache_requests.count("GET") == 1, "model info is only GET"
+ assert (
+ len(cache_requests) == 2
+ ), "We should call only `model_info` to check for _commit hash and `send_telemetry`"
+
+ def test_less_downloads_passed_object(self):
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ cached_folder = DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", safety_checker=None, cache_dir=tmpdirname
+ )
+
+ # make sure safety checker is not downloaded
+ assert "safety_checker" not in os.listdir(cached_folder)
+
+ # make sure rest is downloaded
+ assert "unet" in os.listdir(cached_folder)
+ assert "tokenizer" in os.listdir(cached_folder)
+ assert "vae" in os.listdir(cached_folder)
+ assert "model_index.json" in os.listdir(cached_folder)
+ assert "scheduler" in os.listdir(cached_folder)
+ assert "feature_extractor" in os.listdir(cached_folder)
+
+ def test_less_downloads_passed_object_calls(self):
+ # TODO: For some reason this test fails on MPS where no HEAD call is made.
+ if torch_device == "mps":
+ return
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ with requests_mock.mock(real_http=True) as m:
+ DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", safety_checker=None, cache_dir=tmpdirname
+ )
+
+ download_requests = [r.method for r in m.request_history]
+ # 15 - 2 because no call to config or model file for `safety_checker`
+ assert download_requests.count("HEAD") == 13, "13 calls to files"
+ # 17 - 2 because no call to config or model file for `safety_checker`
+ assert download_requests.count("GET") == 15, "13 calls to files + model_info + model_index.json"
+ assert (
+ len(download_requests) == 28
+ ), "2 calls per file (13 files) + send_telemetry, model_info and model_index.json"
+
+ with requests_mock.mock(real_http=True) as m:
+ DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", safety_checker=None, cache_dir=tmpdirname
+ )
+
+ cache_requests = [r.method for r in m.request_history]
+ assert cache_requests.count("HEAD") == 1, "model_index.json is only HEAD"
+ assert cache_requests.count("GET") == 1, "model info is only GET"
+ assert (
+ len(cache_requests) == 2
+ ), "We should call only `model_info` to check for _commit hash and `send_telemetry`"
+
+ def test_download_only_pytorch(self):
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ # pipeline has Flax weights
+ tmpdirname = DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", safety_checker=None, cache_dir=tmpdirname
+ )
+
+ all_root_files = [t[-1] for t in os.walk(os.path.join(tmpdirname))]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # None of the downloaded files should be a flax file even if we have some here:
+ # https://huggingface.co/hf-internal-testing/tiny-stable-diffusion-pipe/blob/main/unet/diffusion_flax_model.msgpack
+ assert not any(f.endswith(".msgpack") for f in files)
+ # We need to never convert this tiny model to safetensors for this test to pass
+ assert not any(f.endswith(".safetensors") for f in files)
+
+ def test_force_safetensors_error(self):
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ # this repo has no safetensors weights, so forcing safetensors must raise
+ with self.assertRaises(EnvironmentError):
+ tmpdirname = DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-pipe-no-safetensors",
+ safety_checker=None,
+ cache_dir=tmpdirname,
+ use_safetensors=True,
+ )
+
+ def test_download_safetensors(self):
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ # pipeline has Flax weights
+ tmpdirname = DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-pipe-safetensors",
+ safety_checker=None,
+ cache_dir=tmpdirname,
+ )
+
+ all_root_files = [t[-1] for t in os.walk(os.path.join(tmpdirname))]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # None of the downloaded files should be a pytorch file even if we have some here:
+ # https://huggingface.co/hf-internal-testing/tiny-stable-diffusion-pipe/blob/main/unet/diffusion_flax_model.msgpack
+ assert not any(f.endswith(".bin") for f in files)
+
+ def test_download_safetensors_index(self):
+ for variant in ["fp16", None]:
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-pipe-indexes",
+ cache_dir=tmpdirname,
+ use_safetensors=True,
+ variant=variant,
+ )
+
+ all_root_files = [t[-1] for t in os.walk(os.path.join(tmpdirname))]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # Only safetensors files matching the requested variant should be downloaded, see:
+ # https://huggingface.co/hf-internal-testing/tiny-stable-diffusion-pipe-indexes/tree/main/text_encoder
+ if variant is None:
+ assert not any("fp16" in f for f in files)
+ else:
+ model_files = [f for f in files if "safetensors" in f]
+ assert all("fp16" in f for f in model_files)
+
+ assert len([f for f in files if ".safetensors" in f]) == 8
+ assert not any(".bin" in f for f in files)
+
+ def test_download_bin_index(self):
+ for variant in ["fp16", None]:
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-pipe-indexes",
+ cache_dir=tmpdirname,
+ use_safetensors=False,
+ variant=variant,
+ )
+
+ all_root_files = [t[-1] for t in os.walk(os.path.join(tmpdirname))]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # None of the downloaded files should be a safetensors file even if we have some here:
+ # https://huggingface.co/hf-internal-testing/tiny-stable-diffusion-pipe-indexes/tree/main/text_encoder
+ if variant is None:
+ assert not any("fp16" in f for f in files)
+ else:
+ model_files = [f for f in files if "bin" in f]
+ assert all("fp16" in f for f in model_files)
+
+ assert len([f for f in files if ".bin" in f]) == 8
+ assert not any(".safetensors" in f for f in files)
+
+ def test_download_no_openvino_by_default(self):
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-open-vino",
+ cache_dir=tmpdirname,
+ )
+
+ all_root_files = [t[-1] for t in os.walk(os.path.join(tmpdirname))]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # make sure that by default no openvino weights are downloaded
+ assert all((f.endswith(".json") or f.endswith(".bin") or f.endswith(".txt")) for f in files)
+ assert not any("openvino_" in f for f in files)
+
+ def test_download_no_onnx_by_default(self):
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = DiffusionPipeline.download(
+ "hf-internal-testing/tiny-stable-diffusion-xl-pipe",
+ cache_dir=tmpdirname,
+ use_safetensors=False,
+ )
+
+ all_root_files = [t[-1] for t in os.walk(os.path.join(tmpdirname))]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # make sure that by default no onnx weights are downloaded for non-ONNX pipelines
+ assert all((f.endswith(".json") or f.endswith(".bin") or f.endswith(".txt")) for f in files)
+ assert not any((f.endswith(".onnx") or f.endswith(".pb")) for f in files)
+
+ @require_onnxruntime
+ def test_download_onnx_by_default_for_onnx_pipelines(self):
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = DiffusionPipeline.download(
+ "hf-internal-testing/tiny-random-OnnxStableDiffusionPipeline",
+ cache_dir=tmpdirname,
+ )
+
+ all_root_files = [t[-1] for t in os.walk(os.path.join(tmpdirname))]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # make sure that by default onnx weights are downloaded for ONNX pipelines
+ assert any((f.endswith(".json") or f.endswith(".bin") or f.endswith(".txt")) for f in files)
+ assert any((f.endswith(".onnx")) for f in files)
+ assert any((f.endswith(".pb")) for f in files)
+
+ def test_download_no_safety_checker(self):
+ prompt = "hello"
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+ pipe = pipe.to(torch_device)
+ generator = torch.manual_seed(0)
+ out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+
+ pipe_2 = StableDiffusionPipeline.from_pretrained("hf-internal-testing/tiny-stable-diffusion-torch")
+ pipe_2 = pipe_2.to(torch_device)
+ generator = torch.manual_seed(0)
+ out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+
+ assert np.max(np.abs(out - out_2)) < 1e-3
+
+ def test_load_no_safety_checker_explicit_locally(self):
+ prompt = "hello"
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+ pipe = pipe.to(torch_device)
+ generator = torch.manual_seed(0)
+ out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname)
+ pipe_2 = StableDiffusionPipeline.from_pretrained(tmpdirname, safety_checker=None)
+ pipe_2 = pipe_2.to(torch_device)
+
+ generator = torch.manual_seed(0)
+
+ out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+
+ assert np.max(np.abs(out - out_2)) < 1e-3
+
+ def test_load_no_safety_checker_default_locally(self):
+ prompt = "hello"
+ pipe = StableDiffusionPipeline.from_pretrained("hf-internal-testing/tiny-stable-diffusion-torch")
+ pipe = pipe.to(torch_device)
+
+ generator = torch.manual_seed(0)
+ out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname)
+ pipe_2 = StableDiffusionPipeline.from_pretrained(tmpdirname)
+ pipe_2 = pipe_2.to(torch_device)
+
+ generator = torch.manual_seed(0)
+
+ out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+
+ assert np.max(np.abs(out - out_2)) < 1e-3
+
+ def test_cached_files_are_used_when_no_internet(self):
+ # A mock response for an HTTP head request to emulate server down
+ response_mock = mock.Mock()
+ response_mock.status_code = 500
+ response_mock.headers = {}
+ response_mock.raise_for_status.side_effect = HTTPError
+ response_mock.json.return_value = {}
+
+ # Download this model to make sure it's in the cache.
+ orig_pipe = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+ orig_comps = {k: v for k, v in orig_pipe.components.items() if hasattr(v, "parameters")}
+
+ # Under the mock environment we get a 500 error when trying to reach the model.
+ with mock.patch("requests.request", return_value=response_mock):
+ # The pipeline should now be loaded from the local cache despite the server error.
+ pipe = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+ comps = {k: v for k, v in pipe.components.items() if hasattr(v, "parameters")}
+
+ for m1, m2 in zip(orig_comps.values(), comps.values()):
+ for p1, p2 in zip(m1.parameters(), m2.parameters()):
+ if p1.data.ne(p2.data).sum() > 0:
+ assert False, "Parameters not the same!"
+
+ def test_local_files_only_are_used_when_no_internet(self):
+ # A mock response for an HTTP head request to emulate server down
+ response_mock = mock.Mock()
+ response_mock.status_code = 500
+ response_mock.headers = {}
+ response_mock.raise_for_status.side_effect = HTTPError
+ response_mock.json.return_value = {}
+
+ # first check that with local files only the pipeline can only be used if cached
+ with self.assertRaises(FileNotFoundError):
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ orig_pipe = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", local_files_only=True, cache_dir=tmpdirname
+ )
+
+ # now download
+ orig_pipe = DiffusionPipeline.download("hf-internal-testing/tiny-stable-diffusion-torch")
+
+ # make sure it can be loaded with local_files_only
+ orig_pipe = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", local_files_only=True
+ )
+ orig_comps = {k: v for k, v in orig_pipe.components.items() if hasattr(v, "parameters")}
+
+ # Under the mock environment we get a 500 error when trying to connect to the internet.
+ # Make sure loading from the cache still works even without passing local_files_only.
+ with mock.patch("requests.request", return_value=response_mock):
+ # The pipeline should be loaded from the local cache.
+ pipe = DiffusionPipeline.from_pretrained("hf-internal-testing/tiny-stable-diffusion-torch")
+ comps = {k: v for k, v in pipe.components.items() if hasattr(v, "parameters")}
+
+ for m1, m2 in zip(orig_comps.values(), comps.values()):
+ for p1, p2 in zip(m1.parameters(), m2.parameters()):
+ if p1.data.ne(p2.data).sum() > 0:
+ assert False, "Parameters not the same!"
+
+ def test_download_from_variant_folder(self):
+ for use_safetensors in [False, True]:
+ other_format = ".bin" if use_safetensors else ".safetensors"
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = StableDiffusionPipeline.download(
+ "hf-internal-testing/stable-diffusion-all-variants",
+ cache_dir=tmpdirname,
+ use_safetensors=use_safetensors,
+ )
+ all_root_files = [t[-1] for t in os.walk(tmpdirname)]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # None of the downloaded files should be a variant file even if we have some here:
+ # https://huggingface.co/hf-internal-testing/stable-diffusion-all-variants/tree/main/unet
+ assert len(files) == 15, f"We should only download 15 files, not {len(files)}"
+ assert not any(f.endswith(other_format) for f in files)
+ # no variants
+ assert not any(len(f.split(".")) == 3 for f in files)
+
+ def test_download_variant_all(self):
+ for use_safetensors in [False, True]:
+ other_format = ".bin" if use_safetensors else ".safetensors"
+ this_format = ".safetensors" if use_safetensors else ".bin"
+ variant = "fp16"
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = StableDiffusionPipeline.download(
+ "hf-internal-testing/stable-diffusion-all-variants",
+ cache_dir=tmpdirname,
+ variant=variant,
+ use_safetensors=use_safetensors,
+ )
+ all_root_files = [t[-1] for t in os.walk(tmpdirname)]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # None of the downloaded files should be a non-variant file even if we have some here:
+ # https://huggingface.co/hf-internal-testing/stable-diffusion-all-variants/tree/main/unet
+ assert len(files) == 15, f"We should only download 15 files, not {len(files)}"
+ # unet, vae, text_encoder, safety_checker
+ assert len([f for f in files if f.endswith(f"{variant}{this_format}")]) == 4
+ # all checkpoints should have variant ending
+ assert not any(f.endswith(this_format) and not f.endswith(f"{variant}{this_format}") for f in files)
+ assert not any(f.endswith(other_format) for f in files)
+
+ def test_download_variant_partly(self):
+ for use_safetensors in [False, True]:
+ other_format = ".bin" if use_safetensors else ".safetensors"
+ this_format = ".safetensors" if use_safetensors else ".bin"
+ variant = "no_ema"
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = StableDiffusionPipeline.download(
+ "hf-internal-testing/stable-diffusion-all-variants",
+ cache_dir=tmpdirname,
+ variant=variant,
+ use_safetensors=use_safetensors,
+ )
+ all_root_files = [t[-1] for t in os.walk(tmpdirname)]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ unet_files = os.listdir(os.path.join(tmpdirname, "unet"))
+
+ # Some of the downloaded files should be non-variant files, check:
+ # https://huggingface.co/hf-internal-testing/stable-diffusion-all-variants/tree/main/unet
+ assert len(files) == 15, f"We should only download 15 files, not {len(files)}"
+ # only unet has "no_ema" variant
+ assert f"diffusion_pytorch_model.{variant}{this_format}" in unet_files
+ assert len([f for f in files if f.endswith(f"{variant}{this_format}")]) == 1
+ # vae, safety_checker and text_encoder should have no variant
+ assert sum(f.endswith(this_format) and not f.endswith(f"{variant}{this_format}") for f in files) == 3
+ assert not any(f.endswith(other_format) for f in files)
+
+ def test_download_broken_variant(self):
+ for use_safetensors in [False, True]:
+ # the text encoder is missing both the non-variant and the "no_ema" variant weights, so the following can't work
+ for variant in [None, "no_ema"]:
+ with self.assertRaises(OSError) as error_context:
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/stable-diffusion-broken-variants",
+ cache_dir=tmpdirname,
+ variant=variant,
+ use_safetensors=use_safetensors,
+ )
+
+ assert "Error no file name" in str(error_context.exception)
+
+ # text encoder has fp16 variants so we can load it
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ tmpdirname = StableDiffusionPipeline.download(
+ "hf-internal-testing/stable-diffusion-broken-variants",
+ use_safetensors=use_safetensors,
+ cache_dir=tmpdirname,
+ variant="fp16",
+ )
+
+ all_root_files = [t[-1] for t in os.walk(tmpdirname)]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # None of the downloaded files should be a non-variant file even if we have some here:
+ # https://huggingface.co/hf-internal-testing/stable-diffusion-broken-variants/tree/main/unet
+ assert len(files) == 15, f"We should only download 15 files, not {len(files)}"
+
+ def test_local_save_load_index(self):
+ prompt = "hello"
+ for variant in [None, "fp16"]:
+ for use_safe in [True, False]:
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-pipe-indexes",
+ variant=variant,
+ use_safetensors=use_safe,
+ safety_checker=None,
+ )
+ pipe = pipe.to(torch_device)
+ generator = torch.manual_seed(0)
+ out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname)
+ pipe_2 = StableDiffusionPipeline.from_pretrained(
+ tmpdirname, safe_serialization=use_safe, variant=variant
+ )
+ pipe_2 = pipe_2.to(torch_device)
+
+ generator = torch.manual_seed(0)
+
+ out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+
+ assert np.max(np.abs(out - out_2)) < 1e-3
+
+ def test_text_inversion_download(self):
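+ # Exercises `load_textual_inversion` with single- and multi-token embeddings
+ # loaded from local folders, A1111-style checkpoints, and in-memory state dicts.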
+ pipe = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+ pipe = pipe.to(torch_device)
+
+ num_tokens = len(pipe.tokenizer)
+
+ # single token load local
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ ten = {"<*>": torch.ones((32,))}
+ torch.save(ten, os.path.join(tmpdirname, "learned_embeds.bin"))
+
+ pipe.load_textual_inversion(tmpdirname)
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<*>")
+ assert token == num_tokens, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-1].sum().item() == 32
+ assert pipe._maybe_convert_prompt("<*>", pipe.tokenizer) == "<*>"
+
+ prompt = "hey <*>"
+ out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+ assert out.shape == (1, 128, 128, 3)
+
+ # single token load local with weight name
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ ten = {"<**>": 2 * torch.ones((1, 32))}
+ torch.save(ten, os.path.join(tmpdirname, "learned_embeds.bin"))
+
+ pipe.load_textual_inversion(tmpdirname, weight_name="learned_embeds.bin")
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<**>")
+ assert token == num_tokens + 1, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-1].sum().item() == 64
+ assert pipe._maybe_convert_prompt("<**>", pipe.tokenizer) == "<**>"
+
+ prompt = "hey <**>"
+ out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+ assert out.shape == (1, 128, 128, 3)
+
+ # multi token load
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ ten = {"<***>": torch.cat([3 * torch.ones((1, 32)), 4 * torch.ones((1, 32)), 5 * torch.ones((1, 32))])}
+ torch.save(ten, os.path.join(tmpdirname, "learned_embeds.bin"))
+
+ pipe.load_textual_inversion(tmpdirname)
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<***>")
+ token_1 = pipe.tokenizer.convert_tokens_to_ids("<***>_1")
+ token_2 = pipe.tokenizer.convert_tokens_to_ids("<***>_2")
+
+ assert token == num_tokens + 2, "Added token must be at spot `num_tokens`"
+ assert token_1 == num_tokens + 3, "Added token must be at spot `num_tokens`"
+ assert token_2 == num_tokens + 4, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-3].sum().item() == 96
+ assert pipe.text_encoder.get_input_embeddings().weight[-2].sum().item() == 128
+ assert pipe.text_encoder.get_input_embeddings().weight[-1].sum().item() == 160
+ assert pipe._maybe_convert_prompt("<***>", pipe.tokenizer) == "<***> <***>_1 <***>_2"
+
+ prompt = "hey <***>"
+ out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+ assert out.shape == (1, 128, 128, 3)
+
+ # multi token load a1111
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ ten = {
+ "string_to_param": {
+ "*": torch.cat([3 * torch.ones((1, 32)), 4 * torch.ones((1, 32)), 5 * torch.ones((1, 32))])
+ },
+ "name": "<****>",
+ }
+ torch.save(ten, os.path.join(tmpdirname, "a1111.bin"))
+
+ pipe.load_textual_inversion(tmpdirname, weight_name="a1111.bin")
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<****>")
+ token_1 = pipe.tokenizer.convert_tokens_to_ids("<****>_1")
+ token_2 = pipe.tokenizer.convert_tokens_to_ids("<****>_2")
+
+ assert token == num_tokens + 5, "Added token must be at spot `num_tokens`"
+ assert token_1 == num_tokens + 6, "Added token must be at spot `num_tokens`"
+ assert token_2 == num_tokens + 7, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-3].sum().item() == 96
+ assert pipe.text_encoder.get_input_embeddings().weight[-2].sum().item() == 128
+ assert pipe.text_encoder.get_input_embeddings().weight[-1].sum().item() == 160
+ assert pipe._maybe_convert_prompt("<****>", pipe.tokenizer) == "<****> <****>_1 <****>_2"
+
+ prompt = "hey <****>"
+ out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+ assert out.shape == (1, 128, 128, 3)
+
+ # multi embedding load
+ with tempfile.TemporaryDirectory() as tmpdirname1:
+ with tempfile.TemporaryDirectory() as tmpdirname2:
+ ten = {"<*****>": torch.ones((32,))}
+ torch.save(ten, os.path.join(tmpdirname1, "learned_embeds.bin"))
+
+ ten = {"<******>": 2 * torch.ones((1, 32))}
+ torch.save(ten, os.path.join(tmpdirname2, "learned_embeds.bin"))
+
+ pipe.load_textual_inversion([tmpdirname1, tmpdirname2])
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<*****>")
+ assert token == num_tokens + 8, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-2].sum().item() == 32
+ assert pipe._maybe_convert_prompt("<*****>", pipe.tokenizer) == "<*****>"
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<******>")
+ assert token == num_tokens + 9, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-1].sum().item() == 64
+ assert pipe._maybe_convert_prompt("<******>", pipe.tokenizer) == "<******>"
+
+ prompt = "hey <*****> <******>"
+ out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+ assert out.shape == (1, 128, 128, 3)
+
+ # single token state dict load
+ ten = {"<*******>": torch.ones((32,))}
+ pipe.load_textual_inversion(ten)
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<*******>")
+ assert token == num_tokens + 10, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-1].sum().item() == 32
+ assert pipe._maybe_convert_prompt("<*******>", pipe.tokenizer) == "<*******>"
+
+ prompt = "hey <*******>"
+ out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+ assert out.shape == (1, 128, 128, 3)
+
+ # multi embedding state dict load
+ ten1 = {"<********>": torch.ones((32,))}
+ ten2 = {"<*********>": 2 * torch.ones((1, 32))}
+
+ pipe.load_textual_inversion([ten1, ten2])
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<********>")
+ assert token == num_tokens + 11, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-2].sum().item() == 32
+ assert pipe._maybe_convert_prompt("<********>", pipe.tokenizer) == "<********>"
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<*********>")
+ assert token == num_tokens + 12, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-1].sum().item() == 64
+ assert pipe._maybe_convert_prompt("<*********>", pipe.tokenizer) == "<*********>"
+
+ prompt = "hey <********> <*********>"
+ out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+ assert out.shape == (1, 128, 128, 3)
+
+ # auto1111 multi-token state dict load
+ ten = {
+ "string_to_param": {
+ "*": torch.cat([3 * torch.ones((1, 32)), 4 * torch.ones((1, 32)), 5 * torch.ones((1, 32))])
+ },
+ "name": "<**********>",
+ }
+
+ pipe.load_textual_inversion(ten)
+
+ token = pipe.tokenizer.convert_tokens_to_ids("<**********>")
+ token_1 = pipe.tokenizer.convert_tokens_to_ids("<**********>_1")
+ token_2 = pipe.tokenizer.convert_tokens_to_ids("<**********>_2")
+
+ assert token == num_tokens + 13, "Added token must be at spot `num_tokens`"
+ assert token_1 == num_tokens + 14, "Added token must be at spot `num_tokens`"
+ assert token_2 == num_tokens + 15, "Added token must be at spot `num_tokens`"
+ assert pipe.text_encoder.get_input_embeddings().weight[-3].sum().item() == 96
+ assert pipe.text_encoder.get_input_embeddings().weight[-2].sum().item() == 128
+ assert pipe.text_encoder.get_input_embeddings().weight[-1].sum().item() == 160
+ assert pipe._maybe_convert_prompt("<**********>", pipe.tokenizer) == "<**********> <**********>_1 <**********>_2"
+
+ prompt = "hey <**********>"
+ out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+ assert out.shape == (1, 128, 128, 3)
+
+ # multiple references to multi embedding
+ ten = {"<***********>": torch.ones(3, 32)}
+ pipe.load_textual_inversion(ten)
+
+ assert (
+ pipe._maybe_convert_prompt("<***********> <***********>", pipe.tokenizer) == "<***********> <***********>_1 <***********>_2 <***********> <***********>_1 <***********>_2"
+ )
+
+ prompt = "hey <***********> <***********>"
+ out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+ assert out.shape == (1, 128, 128, 3)
+
+ def test_text_inversion_multi_tokens(self):
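+ # Loading two embeddings one by one, as a list, or as a stacked tensor
+ # should produce identical tokenizers and embedding weights.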
+ pipe1 = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+ pipe1 = pipe1.to(torch_device)
+
+ token1, token2 = "<*>", "<**>"
+ ten1 = torch.ones((32,))
+ ten2 = torch.ones((32,)) * 2
+
+ num_tokens = len(pipe1.tokenizer)
+
+ pipe1.load_textual_inversion(ten1, token=token1)
+ pipe1.load_textual_inversion(ten2, token=token2)
+ emb1 = pipe1.text_encoder.get_input_embeddings().weight
+
+ pipe2 = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+ pipe2 = pipe2.to(torch_device)
+ pipe2.load_textual_inversion([ten1, ten2], token=[token1, token2])
+ emb2 = pipe2.text_encoder.get_input_embeddings().weight
+
+ pipe3 = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+ pipe3 = pipe3.to(torch_device)
+ pipe3.load_textual_inversion(torch.stack([ten1, ten2], dim=0), token=[token1, token2])
+ emb3 = pipe3.text_encoder.get_input_embeddings().weight
+
+ assert len(pipe1.tokenizer) == len(pipe2.tokenizer) == len(pipe3.tokenizer) == num_tokens + 2
+ assert (
+ pipe1.tokenizer.convert_tokens_to_ids(token1)
+ == pipe2.tokenizer.convert_tokens_to_ids(token1)
+ == pipe3.tokenizer.convert_tokens_to_ids(token1)
+ == num_tokens
+ )
+ assert (
+ pipe1.tokenizer.convert_tokens_to_ids(token2)
+ == pipe2.tokenizer.convert_tokens_to_ids(token2)
+ == pipe3.tokenizer.convert_tokens_to_ids(token2)
+ == num_tokens + 1
+ )
+ assert emb1[num_tokens].sum().item() == emb2[num_tokens].sum().item() == emb3[num_tokens].sum().item()
+ assert (
+ emb1[num_tokens + 1].sum().item() == emb2[num_tokens + 1].sum().item() == emb3[num_tokens + 1].sum().item()
+ )
+
+ def test_download_ignore_files(self):
+ # Check https://huggingface.co/hf-internal-testing/tiny-stable-diffusion-pipe-ignore-files/blob/72f58636e5508a218c6b3f60550dc96445547817/model_index.json#L4
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ # pipeline has Flax weights
+ tmpdirname = DiffusionPipeline.download("hf-internal-testing/tiny-stable-diffusion-pipe-ignore-files")
+ all_root_files = [t[-1] for t in os.walk(os.path.join(tmpdirname))]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # the files ignored via the model_index.json (see the link above) must not be downloaded
+ assert not any(f in ["vae/diffusion_pytorch_model.bin", "text_encoder/config.json"] for f in files)
+ assert len(files) == 14
+
+ def test_get_pipeline_class_from_flax(self):
+ flax_config = {"_class_name": "FlaxStableDiffusionPipeline"}
+ config = {"_class_name": "StableDiffusionPipeline"}
+
+ # when loading a PyTorch Pipeline from a FlaxPipeline `model_index.json`, e.g.: https://huggingface.co/hf-internal-testing/tiny-stable-diffusion-lms-pipe/blob/7a9063578b325779f0f1967874a6771caa973cad/model_index.json#L2
+ # we need to make sure that we don't load the Flax Pipeline class, but instead the PyTorch pipeline class
+ assert _get_pipeline_class(DiffusionPipeline, flax_config) == _get_pipeline_class(DiffusionPipeline, config)
+
+
+class CustomPipelineTests(unittest.TestCase):
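+ # Tests for loading custom/community pipelines and custom components from the
+ # Hub, from the GitHub community examples, and from local files.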
+ def test_load_custom_pipeline(self):
+ pipeline = DiffusionPipeline.from_pretrained(
+ "google/ddpm-cifar10-32", custom_pipeline="hf-internal-testing/diffusers-dummy-pipeline"
+ )
+ pipeline = pipeline.to(torch_device)
+ # NOTE that `"CustomPipeline"` is not a class that is defined in this library, but solely on the Hub
+ # under https://huggingface.co/hf-internal-testing/diffusers-dummy-pipeline/blob/main/pipeline.py#L24
+ assert pipeline.__class__.__name__ == "CustomPipeline"
+
+ def test_load_custom_github(self):
+ pipeline = DiffusionPipeline.from_pretrained(
+ "google/ddpm-cifar10-32", custom_pipeline="one_step_unet", custom_revision="main"
+ )
+
+ # make sure that on "main" the pipeline outputs only ones because of: https://github.com/huggingface/diffusers/pull/1690
+ with torch.no_grad():
+ output = pipeline()
+
+ assert output.numel() == output.sum()
+
+ # hack since Python doesn't like overwriting modules: https://stackoverflow.com/questions/3105801/unload-a-module-in-python
+ # Could in the future work with hashes instead.
+ del sys.modules["diffusers_modules.git.one_step_unet"]
+
+ pipeline = DiffusionPipeline.from_pretrained(
+ "google/ddpm-cifar10-32", custom_pipeline="one_step_unet", custom_revision="0.10.2"
+ )
+ with torch.no_grad():
+ output = pipeline()
+
+ assert output.numel() != output.sum()
+
+ assert pipeline.__class__.__name__ == "UnetSchedulerOneForwardPipeline"
+
+ def test_run_custom_pipeline(self):
+ pipeline = DiffusionPipeline.from_pretrained(
+ "google/ddpm-cifar10-32", custom_pipeline="hf-internal-testing/diffusers-dummy-pipeline"
+ )
+ pipeline = pipeline.to(torch_device)
+ images, output_str = pipeline(num_inference_steps=2, output_type="np")
+
+ assert images[0].shape == (1, 32, 32, 3)
+
+ # compare output to https://huggingface.co/hf-internal-testing/diffusers-dummy-pipeline/blob/main/pipeline.py#L102
+ assert output_str == "This is a test"
+
+ def test_remote_components(self):
+ # make sure that trust remote code has to be passed
+ with self.assertRaises(ValueError):
+ pipeline = DiffusionPipeline.from_pretrained("hf-internal-testing/tiny-sdxl-custom-components")
+
+ # Check that loading only the custom components "my_unet" and "my_scheduler" works
+ pipeline = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-sdxl-custom-components", trust_remote_code=True
+ )
+
+ assert pipeline.config.unet == ("diffusers_modules.local.my_unet_model", "MyUNetModel")
+ assert pipeline.config.scheduler == ("diffusers_modules.local.my_scheduler", "MyScheduler")
+ assert pipeline.__class__.__name__ == "StableDiffusionXLPipeline"
+
+ pipeline = pipeline.to(torch_device)
+ images = pipeline("test", num_inference_steps=2, output_type="np")[0]
+
+ assert images.shape == (1, 64, 64, 3)
+
+ # Check that loading the custom components "my_unet", "my_scheduler" and an explicit custom pipeline works
+ pipeline = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-sdxl-custom-components", custom_pipeline="my_pipeline", trust_remote_code=True
+ )
+
+ assert pipeline.config.unet == ("diffusers_modules.local.my_unet_model", "MyUNetModel")
+ assert pipeline.config.scheduler == ("diffusers_modules.local.my_scheduler", "MyScheduler")
+ assert pipeline.__class__.__name__ == "MyPipeline"
+
+ pipeline = pipeline.to(torch_device)
+ images = pipeline("test", num_inference_steps=2, output_type="np")[0]
+
+ assert images.shape == (1, 64, 64, 3)
+
+ def test_remote_auto_custom_pipe(self):
+ # make sure that trust remote code has to be passed
+ with self.assertRaises(ValueError):
+ pipeline = DiffusionPipeline.from_pretrained("hf-internal-testing/tiny-sdxl-custom-all")
+
+ # Check that loading the custom components "my_unet", "my_scheduler" and the auto custom pipeline works
+ pipeline = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-sdxl-custom-all", trust_remote_code=True
+ )
+
+ assert pipeline.config.unet == ("diffusers_modules.local.my_unet_model", "MyUNetModel")
+ assert pipeline.config.scheduler == ("diffusers_modules.local.my_scheduler", "MyScheduler")
+ assert pipeline.__class__.__name__ == "MyPipeline"
+
+ pipeline = pipeline.to(torch_device)
+ images = pipeline("test", num_inference_steps=2, output_type="np")[0]
+
+ assert images.shape == (1, 64, 64, 3)
+
+ def test_local_custom_pipeline_repo(self):
+ local_custom_pipeline_path = get_tests_dir("fixtures/custom_pipeline")
+ pipeline = DiffusionPipeline.from_pretrained(
+ "google/ddpm-cifar10-32", custom_pipeline=local_custom_pipeline_path
+ )
+ pipeline = pipeline.to(torch_device)
+ images, output_str = pipeline(num_inference_steps=2, output_type="np")
+
+ assert pipeline.__class__.__name__ == "CustomLocalPipeline"
+ assert images[0].shape == (1, 32, 32, 3)
+ # compare to https://github.com/huggingface/diffusers/blob/main/tests/fixtures/custom_pipeline/pipeline.py#L102
+ assert output_str == "This is a local test"
+
+ def test_local_custom_pipeline_file(self):
+ local_custom_pipeline_path = get_tests_dir("fixtures/custom_pipeline")
+ local_custom_pipeline_path = os.path.join(local_custom_pipeline_path, "what_ever.py")
+ pipeline = DiffusionPipeline.from_pretrained(
+ "google/ddpm-cifar10-32", custom_pipeline=local_custom_pipeline_path
+ )
+ pipeline = pipeline.to(torch_device)
+ images, output_str = pipeline(num_inference_steps=2, output_type="np")
+
+ assert pipeline.__class__.__name__ == "CustomLocalPipeline"
+ assert images[0].shape == (1, 32, 32, 3)
+ # compare to https://github.com/huggingface/diffusers/blob/main/tests/fixtures/custom_pipeline/pipeline.py#L102
+ assert output_str == "This is a local test"
+
+ def test_custom_model_and_pipeline(self):
+ pipe = CustomPipeline(
+ encoder=CustomEncoder(),
+ scheduler=DDIMScheduler(),
+ )
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname, safe_serialization=False)
+
+ pipe_new = CustomPipeline.from_pretrained(tmpdirname)
+ pipe_new.save_pretrained(tmpdirname)
+
+ conf_1 = dict(pipe.config)
+ conf_2 = dict(pipe_new.config)
+
+ del conf_2["_name_or_path"]
+
+ assert conf_1 == conf_2
+
+ @slow
+ @require_torch_gpu
+ def test_download_from_git(self):
+ # Because adaptive_avg_pool2d_backward_cuda
+ # does not have a deterministic implementation.
+ clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
+
+ feature_extractor = CLIPImageProcessor.from_pretrained(clip_model_id)
+ clip_model = CLIPModel.from_pretrained(clip_model_id, torch_dtype=torch.float16)
+
+ pipeline = DiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ custom_pipeline="clip_guided_stable_diffusion",
+ clip_model=clip_model,
+ feature_extractor=feature_extractor,
+ torch_dtype=torch.float16,
+ )
+ pipeline.enable_attention_slicing()
+ pipeline = pipeline.to(torch_device)
+
+ # NOTE that `"CLIPGuidedStableDiffusion"` is not a class defined in the PyPI package of the library, but lives solely in the community examples folder of GitHub under:
+ # https://github.com/huggingface/diffusers/blob/main/examples/community/clip_guided_stable_diffusion.py
+ assert pipeline.__class__.__name__ == "CLIPGuidedStableDiffusion"
+
+ image = pipeline("a prompt", num_inference_steps=2, output_type="np").images[0]
+ assert image.shape == (512, 512, 3)
+
+ def test_save_pipeline_change_config(self):
+ pipe = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.save_pretrained(tmpdirname)
+ pipe = DiffusionPipeline.from_pretrained(tmpdirname)
+
+ assert pipe.scheduler.__class__.__name__ == "PNDMScheduler"
+
+ # let's make sure that changing the scheduler is correctly reflected
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+ pipe.save_pretrained(tmpdirname)
+ pipe = DiffusionPipeline.from_pretrained(tmpdirname)
+
+ assert pipe.scheduler.__class__.__name__ == "DPMSolverMultistepScheduler"
+
+
+class PipelineFastTests(unittest.TestCase):
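+ """Fast tests for DiffusionPipeline construction, saving/loading, configuration and device handling, built from tiny dummy components."""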
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def dummy_image(self):
+ batch_size = 1
+ num_channels = 3
+ sizes = (32, 32)
+
+ image = floats_tensor((batch_size, num_channels) + sizes, rng=random.Random(0)).to(torch_device)
+ return image
+
+ def dummy_uncond_unet(self, sample_size=32):
+ torch.manual_seed(0)
+ model = UNet2DModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=sample_size,
+ in_channels=3,
+ out_channels=3,
+ down_block_types=("DownBlock2D", "AttnDownBlock2D"),
+ up_block_types=("AttnUpBlock2D", "UpBlock2D"),
+ )
+ return model
+
+ def dummy_cond_unet(self, sample_size=32):
+ torch.manual_seed(0)
+ model = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=sample_size,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ return model
+
+ @property
+ def dummy_vae(self):
+ torch.manual_seed(0)
+ model = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ return model
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModel(config)
+
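+ # Minimal stand-in for a feature extractor: calling it returns an object that exposes
+ # `pixel_values` and a chainable `.to()`, which is all these tests need from it.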
+ @property
+ def dummy_extractor(self):
+ def extract(*args, **kwargs):
+ class Out:
+ def __init__(self):
+ self.pixel_values = torch.ones([0])
+
+ def to(self, device):
+ self.pixel_values.to(device)
+ return self
+
+ return Out()
+
+ return extract
+
+ @parameterized.expand(
+ [
+ [DDIMScheduler, DDIMPipeline, 32],
+ [DDPMScheduler, DDPMPipeline, 32],
+ [DDIMScheduler, DDIMPipeline, (32, 64)],
+ [DDPMScheduler, DDPMPipeline, (64, 32)],
+ ]
+ )
+ def test_uncond_unet_components(self, scheduler_fn=DDPMScheduler, pipeline_fn=DDPMPipeline, sample_size=32):
+ unet = self.dummy_uncond_unet(sample_size)
+ scheduler = scheduler_fn()
+ pipeline = pipeline_fn(unet, scheduler).to(torch_device)
+
+ generator = torch.manual_seed(0)
+ out_image = pipeline(
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+ sample_size = (sample_size, sample_size) if isinstance(sample_size, int) else sample_size
+ assert out_image.shape == (1, *sample_size, 3)
+
+ def test_stable_diffusion_components(self):
+ """Test that components property works correctly"""
+ unet = self.dummy_cond_unet()
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ image = self.dummy_image().cpu().permute(0, 2, 3, 1)[0]
+ init_image = Image.fromarray(np.uint8(image)).convert("RGB")
+ mask_image = Image.fromarray(np.uint8(image + 4)).convert("RGB").resize((32, 32))
+
+ # make sure here that pndm scheduler skips prk
+ inpaint = StableDiffusionInpaintPipelineLegacy(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ ).to(torch_device)
+ img2img = StableDiffusionImg2ImgPipeline(**inpaint.components, image_encoder=None).to(torch_device)
+ text2img = StableDiffusionPipeline(**inpaint.components, image_encoder=None).to(torch_device)
+
+ prompt = "A painting of a squirrel eating a burger"
+
+ generator = torch.manual_seed(0)
+ image_inpaint = inpaint(
+ [prompt],
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ image=init_image,
+ mask_image=mask_image,
+ ).images
+ image_img2img = img2img(
+ [prompt],
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ image=init_image,
+ ).images
+ image_text2img = text2img(
+ [prompt],
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ ).images
+
+ assert image_inpaint.shape == (1, 32, 32, 3)
+ assert image_img2img.shape == (1, 32, 32, 3)
+ assert image_text2img.shape == (1, 64, 64, 3)
+
+ @require_torch_gpu
+ def test_pipe_false_offload_warn(self):
+ unet = self.dummy_cond_unet()
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ sd = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+
+ sd.enable_model_cpu_offload()
+
+ logger = logging.get_logger("diffusers.pipelines.pipeline_utils")
+ with CaptureLogger(logger) as cap_logger:
+ sd.to("cuda")
+
+ assert "It is strongly recommended against doing so" in str(cap_logger)
+
+ sd = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+
+ def test_set_scheduler(self):
+ unet = self.dummy_cond_unet()
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ sd = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+
+ sd.scheduler = DDIMScheduler.from_config(sd.scheduler.config)
+ assert isinstance(sd.scheduler, DDIMScheduler)
+ sd.scheduler = DDPMScheduler.from_config(sd.scheduler.config)
+ assert isinstance(sd.scheduler, DDPMScheduler)
+ sd.scheduler = PNDMScheduler.from_config(sd.scheduler.config)
+ assert isinstance(sd.scheduler, PNDMScheduler)
+ sd.scheduler = LMSDiscreteScheduler.from_config(sd.scheduler.config)
+ assert isinstance(sd.scheduler, LMSDiscreteScheduler)
+ sd.scheduler = EulerDiscreteScheduler.from_config(sd.scheduler.config)
+ assert isinstance(sd.scheduler, EulerDiscreteScheduler)
+ sd.scheduler = EulerAncestralDiscreteScheduler.from_config(sd.scheduler.config)
+ assert isinstance(sd.scheduler, EulerAncestralDiscreteScheduler)
+ sd.scheduler = DPMSolverMultistepScheduler.from_config(sd.scheduler.config)
+ assert isinstance(sd.scheduler, DPMSolverMultistepScheduler)
+
+ def test_set_component_to_none(self):
+ unet = self.dummy_cond_unet()
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ pipeline = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ prompt = "This is a flower"
+
+ out_image = pipeline(
+ prompt=prompt,
+ generator=generator,
+ num_inference_steps=1,
+ output_type="np",
+ ).images
+
+ pipeline.feature_extractor = None
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ out_image_2 = pipeline(
+ prompt=prompt,
+ generator=generator,
+ num_inference_steps=1,
+ output_type="np",
+ ).images
+
+ assert out_image.shape == (1, 64, 64, 3)
+ assert np.abs(out_image - out_image_2).max() < 1e-3
+
+ def test_optional_components_is_none(self):
+ unet = self.dummy_cond_unet()
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ items = {
+ "feature_extractor": self.dummy_extractor,
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": bert,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ # we don't add an image encoder
+ }
+
+ pipeline = StableDiffusionPipeline(**items)
+
+ assert sorted(pipeline.components.keys()) == sorted(["image_encoder"] + list(items.keys()))
+ assert pipeline.image_encoder is None
+
+ def test_set_scheduler_consistency(self):
+ unet = self.dummy_cond_unet()
+ pndm = PNDMScheduler.from_config("hf-internal-testing/tiny-stable-diffusion-torch", subfolder="scheduler")
+ ddim = DDIMScheduler.from_config("hf-internal-testing/tiny-stable-diffusion-torch", subfolder="scheduler")
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ sd = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=pndm,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+
+ pndm_config = sd.scheduler.config
+ sd.scheduler = DDPMScheduler.from_config(pndm_config)
+ sd.scheduler = PNDMScheduler.from_config(sd.scheduler.config)
+ pndm_config_2 = sd.scheduler.config
+ pndm_config_2 = {k: v for k, v in pndm_config_2.items() if k in pndm_config}
+
+ assert dict(pndm_config) == dict(pndm_config_2)
+
+ sd = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=ddim,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+
+ ddim_config = sd.scheduler.config
+ sd.scheduler = LMSDiscreteScheduler.from_config(ddim_config)
+ sd.scheduler = DDIMScheduler.from_config(sd.scheduler.config)
+ ddim_config_2 = sd.scheduler.config
+ ddim_config_2 = {k: v for k, v in ddim_config_2.items() if k in ddim_config}
+
+ assert dict(ddim_config) == dict(ddim_config_2)
+
+ def test_save_safe_serialization(self):
+ pipeline = StableDiffusionPipeline.from_pretrained("hf-internal-testing/tiny-stable-diffusion-torch")
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipeline.save_pretrained(tmpdirname, safe_serialization=True)
+
+ # Validate that the VAE safetensors file exists and is of the correct format
+ vae_path = os.path.join(tmpdirname, "vae", "diffusion_pytorch_model.safetensors")
+ assert os.path.exists(vae_path), f"Could not find {vae_path}"
+ _ = safetensors.torch.load_file(vae_path)
+
+ # Validate that the UNet safetensors file exists and is of the correct format
+ unet_path = os.path.join(tmpdirname, "unet", "diffusion_pytorch_model.safetensors")
+ assert os.path.exists(unet_path), f"Could not find {unet_path}"
+ _ = safetensors.torch.load_file(unet_path)
+
+ # Validate that the text encoder safetensors file exists and is of the correct format
+ text_encoder_path = os.path.join(tmpdirname, "text_encoder", "model.safetensors")
+ assert os.path.exists(text_encoder_path), f"Could not find {text_encoder_path}"
+ _ = safetensors.torch.load_file(text_encoder_path)
+
+ pipeline = StableDiffusionPipeline.from_pretrained(tmpdirname)
+ assert pipeline.unet is not None
+ assert pipeline.vae is not None
+ assert pipeline.text_encoder is not None
+ assert pipeline.scheduler is not None
+ assert pipeline.feature_extractor is not None
+
+ def test_no_pytorch_download_when_doing_safetensors(self):
+ # by default we don't download the PyTorch weights when safetensors are available
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ _ = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/diffusers-stable-diffusion-tiny-all", cache_dir=tmpdirname
+ )
+
+ path = os.path.join(
+ tmpdirname,
+ "models--hf-internal-testing--diffusers-stable-diffusion-tiny-all",
+ "snapshots",
+ "07838d72e12f9bcec1375b0482b80c1d399be843",
+ "unet",
+ )
+ # safetensors exists
+ assert os.path.exists(os.path.join(path, "diffusion_pytorch_model.safetensors"))
+ # pytorch does not
+ assert not os.path.exists(os.path.join(path, "diffusion_pytorch_model.bin"))
+
+ def test_no_safetensors_download_when_doing_pytorch(self):
+ use_safetensors = False
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ _ = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/diffusers-stable-diffusion-tiny-all",
+ cache_dir=tmpdirname,
+ use_safetensors=use_safetensors,
+ )
+
+ path = os.path.join(
+ tmpdirname,
+ "models--hf-internal-testing--diffusers-stable-diffusion-tiny-all",
+ "snapshots",
+ "07838d72e12f9bcec1375b0482b80c1d399be843",
+ "unet",
+ )
+ # safetensors does not exist
+ assert not os.path.exists(os.path.join(path, "diffusion_pytorch_model.safetensors"))
+ # pytorch does
+ assert os.path.exists(os.path.join(path, "diffusion_pytorch_model.bin"))
+
+ def test_optional_components(self):
+ unet = self.dummy_cond_unet()
+ pndm = PNDMScheduler.from_config("hf-internal-testing/tiny-stable-diffusion-torch", subfolder="scheduler")
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ orig_sd = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=pndm,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=unet,
+ feature_extractor=self.dummy_extractor,
+ )
+ sd = orig_sd
+
+ assert sd.config.requires_safety_checker is True
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ sd.save_pretrained(tmpdirname)
+
+ # Test that passing None works
+ sd = StableDiffusionPipeline.from_pretrained(
+ tmpdirname, feature_extractor=None, safety_checker=None, requires_safety_checker=False
+ )
+
+ assert sd.config.requires_safety_checker is False
+ assert sd.config.safety_checker == (None, None)
+ assert sd.config.feature_extractor == (None, None)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ sd.save_pretrained(tmpdirname)
+
+ # Test that loading previous None works
+ sd = StableDiffusionPipeline.from_pretrained(tmpdirname)
+
+ assert sd.config.requires_safety_checker is False
+ assert sd.config.safety_checker == (None, None)
+ assert sd.config.feature_extractor == (None, None)
+
+ orig_sd.save_pretrained(tmpdirname)
+
+ # Test that loading works without the safety_checker directory
+ shutil.rmtree(os.path.join(tmpdirname, "safety_checker"))
+ with open(os.path.join(tmpdirname, sd.config_name)) as f:
+ config = json.load(f)
+ config["safety_checker"] = [None, None]
+ with open(os.path.join(tmpdirname, sd.config_name), "w") as f:
+ json.dump(config, f)
+
+ sd = StableDiffusionPipeline.from_pretrained(tmpdirname, requires_safety_checker=False)
+ sd.save_pretrained(tmpdirname)
+ sd = StableDiffusionPipeline.from_pretrained(tmpdirname)
+
+ assert sd.config.requires_safety_checker is False
+ assert sd.config.safety_checker == (None, None)
+ assert sd.config.feature_extractor == (None, None)
+
+ # Test that loading from deleted model index works
+ with open(os.path.join(tmpdirname, sd.config_name)) as f:
+ config = json.load(f)
+ del config["safety_checker"]
+ del config["feature_extractor"]
+ with open(os.path.join(tmpdirname, sd.config_name), "w") as f:
+ json.dump(config, f)
+
+ sd = StableDiffusionPipeline.from_pretrained(tmpdirname)
+
+ assert sd.config.requires_safety_checker is False
+ assert sd.config.safety_checker == (None, None)
+ assert sd.config.feature_extractor == (None, None)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ sd.save_pretrained(tmpdirname)
+
+ # Test that partially loading works
+ sd = StableDiffusionPipeline.from_pretrained(tmpdirname, feature_extractor=self.dummy_extractor)
+
+ assert sd.config.requires_safety_checker is False
+ assert sd.config.safety_checker == (None, None)
+ assert sd.config.feature_extractor != (None, None)
+
+ # Test that partially loading works
+ sd = StableDiffusionPipeline.from_pretrained(
+ tmpdirname,
+ feature_extractor=self.dummy_extractor,
+ safety_checker=unet,
+ requires_safety_checker=[True, True],
+ )
+
+ assert sd.config.requires_safety_checker == [True, True]
+ assert sd.config.safety_checker != (None, None)
+ assert sd.config.feature_extractor != (None, None)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ sd.save_pretrained(tmpdirname)
+ sd = StableDiffusionPipeline.from_pretrained(tmpdirname, feature_extractor=self.dummy_extractor)
+
+ assert sd.config.requires_safety_checker == [True, True]
+ assert sd.config.safety_checker != (None, None)
+ assert sd.config.feature_extractor != (None, None)
+
+ def test_name_or_path(self):
+ model_path = "hf-internal-testing/tiny-stable-diffusion-torch"
+ sd = DiffusionPipeline.from_pretrained(model_path)
+
+ assert sd.name_or_path == model_path
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ sd.save_pretrained(tmpdirname)
+ sd = DiffusionPipeline.from_pretrained(tmpdirname)
+
+ assert sd.name_or_path == tmpdirname
+
+ def test_error_no_variant_available(self):
+ variant = "fp16"
+ with self.assertRaises(ValueError) as error_context:
+ _ = StableDiffusionPipeline.download(
+ "hf-internal-testing/diffusers-stable-diffusion-tiny-all", variant=variant
+ )
+
+ assert "but no such modeling files are available" in str(error_context.exception)
+ assert variant in str(error_context.exception)
+
+ def test_pipe_to(self):
+ unet = self.dummy_cond_unet()
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ sd = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+
+ device_type = torch.device(torch_device).type
+
+ sd1 = sd.to(device_type)
+ sd2 = sd.to(torch.device(device_type))
+ sd3 = sd.to(device_type, torch.float32)
+ sd4 = sd.to(device=device_type)
+ sd5 = sd.to(torch_device=device_type)
+ sd6 = sd.to(device_type, dtype=torch.float32)
+ sd7 = sd.to(device_type, torch_dtype=torch.float32)
+
+ assert sd1.device.type == device_type
+ assert sd2.device.type == device_type
+ assert sd3.device.type == device_type
+ assert sd4.device.type == device_type
+ assert sd5.device.type == device_type
+ assert sd6.device.type == device_type
+ assert sd7.device.type == device_type
+
+ sd1 = sd.to(torch.float16)
+ sd2 = sd.to(None, torch.float16)
+ sd3 = sd.to(dtype=torch.float16)
+ sd4 = sd.to(dtype=torch.float16)
+ sd5 = sd.to(None, dtype=torch.float16)
+ sd6 = sd.to(None, torch_dtype=torch.float16)
+
+ assert sd1.dtype == torch.float16
+ assert sd2.dtype == torch.float16
+ assert sd3.dtype == torch.float16
+ assert sd4.dtype == torch.float16
+ assert sd5.dtype == torch.float16
+ assert sd6.dtype == torch.float16
+
+ sd1 = sd.to(device=device_type, dtype=torch.float16)
+ sd2 = sd.to(torch_device=device_type, torch_dtype=torch.float16)
+ sd3 = sd.to(device_type, torch.float16)
+
+ assert sd1.dtype == torch.float16
+ assert sd2.dtype == torch.float16
+ assert sd3.dtype == torch.float16
+
+ assert sd1.device.type == device_type
+ assert sd2.device.type == device_type
+ assert sd3.device.type == device_type
+
+ def test_pipe_same_device_id_offload(self):
+ unet = self.dummy_cond_unet()
+ scheduler = PNDMScheduler(skip_prk_steps=True)
+ vae = self.dummy_vae
+ bert = self.dummy_text_encoder
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ sd = StableDiffusionPipeline(
+ unet=unet,
+ scheduler=scheduler,
+ vae=vae,
+ text_encoder=bert,
+ tokenizer=tokenizer,
+ safety_checker=None,
+ feature_extractor=self.dummy_extractor,
+ )
+
+ sd.enable_model_cpu_offload(gpu_id=5)
+ assert sd._offload_gpu_id == 5
+ sd.maybe_free_model_hooks()
+ assert sd._offload_gpu_id == 5
+
+
+@slow
+@require_torch_gpu
+class PipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_smart_download(self):
+ model_id = "hf-internal-testing/unet-pipeline-dummy"
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ _ = DiffusionPipeline.from_pretrained(model_id, cache_dir=tmpdirname, force_download=True)
+ local_repo_name = "--".join(["models"] + model_id.split("/"))
+ snapshot_dir = os.path.join(tmpdirname, local_repo_name, "snapshots")
+ snapshot_dir = os.path.join(snapshot_dir, os.listdir(snapshot_dir)[0])
+
+ # inspect all downloaded files to make sure that everything is included
+ assert os.path.isfile(os.path.join(snapshot_dir, DiffusionPipeline.config_name))
+ assert os.path.isfile(os.path.join(snapshot_dir, CONFIG_NAME))
+ assert os.path.isfile(os.path.join(snapshot_dir, SCHEDULER_CONFIG_NAME))
+ assert os.path.isfile(os.path.join(snapshot_dir, WEIGHTS_NAME))
+ assert os.path.isfile(os.path.join(snapshot_dir, "scheduler", SCHEDULER_CONFIG_NAME))
+ assert os.path.isfile(os.path.join(snapshot_dir, "unet", WEIGHTS_NAME))
+ # let's make sure the super large numpy file:
+ # https://huggingface.co/hf-internal-testing/unet-pipeline-dummy/blob/main/big_array.npy
+ # is not downloaded, while all the expected files above are
+ assert not os.path.isfile(os.path.join(snapshot_dir, "big_array.npy"))
+
+ def test_warning_unused_kwargs(self):
+ model_id = "hf-internal-testing/unet-pipeline-dummy"
+ logger = logging.get_logger("diffusers.pipelines")
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ with CaptureLogger(logger) as cap_logger:
+ DiffusionPipeline.from_pretrained(
+ model_id,
+ not_used=True,
+ cache_dir=tmpdirname,
+ force_download=True,
+ )
+
+ assert (
+ cap_logger.out.strip().split("\n")[-1]
+ == "Keyword arguments {'not_used': True} are not expected by DDPMPipeline and will be ignored."
+ )
+
+ def test_from_save_pretrained(self):
+ # 1. Load models
+ model = UNet2DModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=3,
+ out_channels=3,
+ down_block_types=("DownBlock2D", "AttnDownBlock2D"),
+ up_block_types=("AttnUpBlock2D", "UpBlock2D"),
+ )
+ scheduler = DDPMScheduler(num_train_timesteps=10)
+
+ ddpm = DDPMPipeline(model, scheduler)
+ ddpm.to(torch_device)
+ ddpm.set_progress_bar_config(disable=None)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ ddpm.save_pretrained(tmpdirname)
+ new_ddpm = DDPMPipeline.from_pretrained(tmpdirname)
+ new_ddpm.to(torch_device)
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+ image = ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+ new_image = new_ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+
+ assert np.abs(image - new_image).max() < 1e-5, "Models don't give the same forward pass"
+
+ @require_python39_or_higher
+ @require_torch_2
+ def test_from_save_pretrained_dynamo(self):
+ run_test_in_subprocess(test_case=self, target_func=_test_from_save_pretrained_dynamo, inputs=None)
+
+ def test_from_pretrained_hub(self):
+ model_path = "google/ddpm-cifar10-32"
+
+ scheduler = DDPMScheduler(num_train_timesteps=10)
+
+ ddpm = DDPMPipeline.from_pretrained(model_path, scheduler=scheduler)
+ ddpm = ddpm.to(torch_device)
+ ddpm.set_progress_bar_config(disable=None)
+
+ ddpm_from_hub = DiffusionPipeline.from_pretrained(model_path, scheduler=scheduler)
+ ddpm_from_hub = ddpm_from_hub.to(torch_device)
+ ddpm_from_hub.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+ image = ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+ new_image = ddpm_from_hub(generator=generator, num_inference_steps=5, output_type="numpy").images
+
+ assert np.abs(image - new_image).max() < 1e-5, "Models don't give the same forward pass"
+
+ def test_from_pretrained_hub_pass_model(self):
+ model_path = "google/ddpm-cifar10-32"
+
+ scheduler = DDPMScheduler(num_train_timesteps=10)
+
+ # pass unet into DiffusionPipeline
+ unet = UNet2DModel.from_pretrained(model_path)
+ ddpm_from_hub_custom_model = DiffusionPipeline.from_pretrained(model_path, unet=unet, scheduler=scheduler)
+ ddpm_from_hub_custom_model = ddpm_from_hub_custom_model.to(torch_device)
+ ddpm_from_hub_custom_model.set_progress_bar_config(disable=None)
+
+ ddpm_from_hub = DiffusionPipeline.from_pretrained(model_path, scheduler=scheduler)
+ ddpm_from_hub = ddpm_from_hub.to(torch_device)
+ ddpm_from_hub.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+ image = ddpm_from_hub_custom_model(generator=generator, num_inference_steps=5, output_type="numpy").images
+
+ generator = torch.Generator(device=torch_device).manual_seed(0)
+ new_image = ddpm_from_hub(generator=generator, num_inference_steps=5, output_type="numpy").images
+
+ assert np.abs(image - new_image).max() < 1e-5, "Models don't give the same forward pass"
+
+ def test_output_format(self):
+ model_path = "google/ddpm-cifar10-32"
+
+ scheduler = DDIMScheduler.from_pretrained(model_path)
+ pipe = DDIMPipeline.from_pretrained(model_path, scheduler=scheduler)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ images = pipe(output_type="numpy").images
+ assert images.shape == (1, 32, 32, 3)
+ assert isinstance(images, np.ndarray)
+
+ images = pipe(output_type="pil", num_inference_steps=4).images
+ assert isinstance(images, list)
+ assert len(images) == 1
+ assert isinstance(images[0], PIL.Image.Image)
+
+ # use PIL by default
+ images = pipe(num_inference_steps=4).images
+ assert isinstance(images, list)
+ assert isinstance(images[0], PIL.Image.Image)
+
+ @require_flax
+ def test_from_flax_from_pt(self):
+ pipe_pt = StableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-torch", safety_checker=None
+ )
+ pipe_pt.to(torch_device)
+
+ from diffusers import FlaxStableDiffusionPipeline
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe_pt.save_pretrained(tmpdirname)
+
+ pipe_flax, params = FlaxStableDiffusionPipeline.from_pretrained(
+ tmpdirname, safety_checker=None, from_pt=True
+ )
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ pipe_flax.save_pretrained(tmpdirname, params=params)
+ pipe_pt_2 = StableDiffusionPipeline.from_pretrained(tmpdirname, safety_checker=None, from_flax=True)
+ pipe_pt_2.to(torch_device)
+
+ prompt = "Hello"
+
+ generator = torch.manual_seed(0)
+ image_0 = pipe_pt(
+ [prompt],
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ ).images[0]
+
+ generator = torch.manual_seed(0)
+ image_1 = pipe_pt_2(
+ [prompt],
+ generator=generator,
+ num_inference_steps=2,
+ output_type="np",
+ ).images[0]
+
+ assert np.abs(image_0 - image_1).sum() < 1e-5, "Models don't give the same forward pass"
+
+ @require_compel
+ def test_weighted_prompts_compel(self):
+ from compel import Compel
+
+ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+ pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+ pipe.enable_model_cpu_offload()
+ pipe.enable_attention_slicing()
+
+ compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
+
+ prompt = "a red cat playing with a ball{}"
+
+ prompts = [prompt.format(s) for s in ["", "++", "--"]]
+
+ prompt_embeds = compel(prompts)
+
+ generator = [torch.Generator(device="cpu").manual_seed(33) for _ in range(prompt_embeds.shape[0])]
+
+ images = pipe(
+ prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20, output_type="numpy"
+ ).images
+
+ for i, image in enumerate(images):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ f"/compel/forest_{i}.npy"
+ )
+
+ assert np.abs(image - expected_image).max() < 3e-1
+
+
+@nightly
+@require_torch_gpu
+class PipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_ddpm_ddim_equality_batched(self):
+ seed = 0
+ model_id = "google/ddpm-cifar10-32"
+
+ unet = UNet2DModel.from_pretrained(model_id)
+ ddpm_scheduler = DDPMScheduler()
+ ddim_scheduler = DDIMScheduler()
+
+ ddpm = DDPMPipeline(unet=unet, scheduler=ddpm_scheduler)
+ ddpm.to(torch_device)
+ ddpm.set_progress_bar_config(disable=None)
+
+ ddim = DDIMPipeline(unet=unet, scheduler=ddim_scheduler)
+ ddim.to(torch_device)
+ ddim.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device=torch_device).manual_seed(seed)
+ ddpm_images = ddpm(batch_size=2, generator=generator, output_type="numpy").images
+
+ generator = torch.Generator(device=torch_device).manual_seed(seed)
+ ddim_images = ddim(
+ batch_size=2,
+ generator=generator,
+ num_inference_steps=1000,
+ eta=1.0,
+ output_type="numpy",
+ use_clipped_model_output=True, # Need this to make DDIM match DDPM
+ ).images
+
+ # the values aren't exactly equal, but the images look the same visually
+ assert np.abs(ddpm_images - ddim_images).max() < 1e-1
diff --git a/tests/pipelines/test_pipelines_auto.py b/tests/pipelines/test_pipelines_auto.py
new file mode 100644
index 0000000..284cdb0
--- /dev/null
+++ b/tests/pipelines/test_pipelines_auto.py
@@ -0,0 +1,353 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import os
+import shutil
+import unittest
+from collections import OrderedDict
+from pathlib import Path
+
+import torch
+from transformers import CLIPVisionConfig, CLIPVisionModelWithProjection
+
+from diffusers import (
+ AutoPipelineForImage2Image,
+ AutoPipelineForInpainting,
+ AutoPipelineForText2Image,
+ ControlNetModel,
+ DiffusionPipeline,
+)
+from diffusers.pipelines.auto_pipeline import (
+ AUTO_IMAGE2IMAGE_PIPELINES_MAPPING,
+ AUTO_INPAINT_PIPELINES_MAPPING,
+ AUTO_TEXT2IMAGE_PIPELINES_MAPPING,
+)
+from diffusers.utils.testing_utils import slow
+
+
+PRETRAINED_MODEL_REPO_MAPPING = OrderedDict(
+ [
+ ("stable-diffusion", "runwayml/stable-diffusion-v1-5"),
+ ("if", "DeepFloyd/IF-I-XL-v1.0"),
+ ("kandinsky", "kandinsky-community/kandinsky-2-1"),
+ ("kandinsky22", "kandinsky-community/kandinsky-2-2-decoder"),
+ ]
+)
+
+
+class AutoPipelineFastTest(unittest.TestCase):
+ @property
+ def dummy_image_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPVisionConfig(
+ hidden_size=1,
+ projection_dim=1,
+ num_hidden_layers=1,
+ num_attention_heads=1,
+ image_size=1,
+ intermediate_size=1,
+ patch_size=1,
+ )
+ return CLIPVisionModelWithProjection(config)
+
+ def test_from_pipe_consistent(self):
+ pipe = AutoPipelineForText2Image.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", requires_safety_checker=False
+ )
+ original_config = dict(pipe.config)
+
+ pipe = AutoPipelineForImage2Image.from_pipe(pipe)
+ assert dict(pipe.config) == original_config
+
+ pipe = AutoPipelineForText2Image.from_pipe(pipe)
+ assert dict(pipe.config) == original_config
+
+ def test_from_pipe_override(self):
+ pipe = AutoPipelineForText2Image.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", requires_safety_checker=False
+ )
+
+ pipe = AutoPipelineForImage2Image.from_pipe(pipe, requires_safety_checker=True)
+ assert pipe.config.requires_safety_checker is True
+
+ pipe = AutoPipelineForText2Image.from_pipe(pipe, requires_safety_checker=True)
+ assert pipe.config.requires_safety_checker is True
+
+ def test_from_pipe_consistent_sdxl(self):
+ pipe = AutoPipelineForImage2Image.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-xl-pipe",
+ requires_aesthetics_score=True,
+ force_zeros_for_empty_prompt=False,
+ )
+
+ original_config = dict(pipe.config)
+
+ pipe = AutoPipelineForText2Image.from_pipe(pipe)
+ pipe = AutoPipelineForImage2Image.from_pipe(pipe)
+
+ assert dict(pipe.config) == original_config
+
+ def test_kwargs_local_files_only(self):
+ repo = "hf-internal-testing/tiny-stable-diffusion-torch"
+ tmpdirname = DiffusionPipeline.download(repo)
+ tmpdirname = Path(tmpdirname)
+
+ # edit commit_id so that it's not the latest commit
+ commit_id = tmpdirname.name
+ new_commit_id = commit_id + "hug"
+
+ ref_dir = tmpdirname.parent.parent / "refs/main"
+ with open(ref_dir, "w") as f:
+ f.write(new_commit_id)
+
+ new_tmpdirname = tmpdirname.parent / new_commit_id
+ os.rename(tmpdirname, new_tmpdirname)
+
+ try:
+ AutoPipelineForText2Image.from_pretrained(repo, local_files_only=True)
+ except OSError:
+ assert False, "not able to load local files"
+
+ shutil.rmtree(tmpdirname.parent.parent)
+
+ def test_from_pipe_controlnet_text2img(self):
+ pipe = AutoPipelineForText2Image.from_pretrained("hf-internal-testing/tiny-stable-diffusion-pipe")
+ controlnet = ControlNetModel.from_pretrained("hf-internal-testing/tiny-controlnet")
+
+ pipe = AutoPipelineForText2Image.from_pipe(pipe, controlnet=controlnet)
+ assert pipe.__class__.__name__ == "StableDiffusionControlNetPipeline"
+ assert "controlnet" in pipe.components
+
+ pipe = AutoPipelineForText2Image.from_pipe(pipe, controlnet=None)
+ assert pipe.__class__.__name__ == "StableDiffusionPipeline"
+ assert "controlnet" not in pipe.components
+
+ def test_from_pipe_controlnet_img2img(self):
+ pipe = AutoPipelineForImage2Image.from_pretrained("hf-internal-testing/tiny-stable-diffusion-pipe")
+ controlnet = ControlNetModel.from_pretrained("hf-internal-testing/tiny-controlnet")
+
+ pipe = AutoPipelineForImage2Image.from_pipe(pipe, controlnet=controlnet)
+ assert pipe.__class__.__name__ == "StableDiffusionControlNetImg2ImgPipeline"
+ assert "controlnet" in pipe.components
+
+ pipe = AutoPipelineForImage2Image.from_pipe(pipe, controlnet=None)
+ assert pipe.__class__.__name__ == "StableDiffusionImg2ImgPipeline"
+ assert "controlnet" not in pipe.components
+
+ def test_from_pipe_controlnet_inpaint(self):
+ pipe = AutoPipelineForInpainting.from_pretrained("hf-internal-testing/tiny-stable-diffusion-torch")
+ controlnet = ControlNetModel.from_pretrained("hf-internal-testing/tiny-controlnet")
+
+ pipe = AutoPipelineForInpainting.from_pipe(pipe, controlnet=controlnet)
+ assert pipe.__class__.__name__ == "StableDiffusionControlNetInpaintPipeline"
+ assert "controlnet" in pipe.components
+
+ pipe = AutoPipelineForInpainting.from_pipe(pipe, controlnet=None)
+ assert pipe.__class__.__name__ == "StableDiffusionInpaintPipeline"
+ assert "controlnet" not in pipe.components
+
+ def test_from_pipe_controlnet_new_task(self):
+ pipe_text2img = AutoPipelineForText2Image.from_pretrained("hf-internal-testing/tiny-stable-diffusion-torch")
+ controlnet = ControlNetModel.from_pretrained("hf-internal-testing/tiny-controlnet")
+
+ pipe_control_img2img = AutoPipelineForImage2Image.from_pipe(pipe_text2img, controlnet=controlnet)
+ assert pipe_control_img2img.__class__.__name__ == "StableDiffusionControlNetImg2ImgPipeline"
+ assert "controlnet" in pipe_control_img2img.components
+
+ pipe_inpaint = AutoPipelineForInpainting.from_pipe(pipe_control_img2img, controlnet=None)
+ assert pipe_inpaint.__class__.__name__ == "StableDiffusionInpaintPipeline"
+ assert "controlnet" not in pipe_inpaint.components
+
+ # testing `from_pipe` for text2img controlnet
+ ## 1. from a different controlnet pipe, without controlnet argument
+ pipe_control_text2img = AutoPipelineForText2Image.from_pipe(pipe_control_img2img)
+ assert pipe_control_text2img.__class__.__name__ == "StableDiffusionControlNetPipeline"
+ assert "controlnet" in pipe_control_text2img.components
+
+ ## 2. from a different controlnet pipe, with controlnet argument
+ pipe_control_text2img = AutoPipelineForText2Image.from_pipe(pipe_control_img2img, controlnet=controlnet)
+ assert pipe_control_text2img.__class__.__name__ == "StableDiffusionControlNetPipeline"
+ assert "controlnet" in pipe_control_text2img.components
+
+ ## 3. from same controlnet pipeline class, with a different controlnet component
+ pipe_control_text2img = AutoPipelineForText2Image.from_pipe(pipe_control_text2img, controlnet=controlnet)
+ assert pipe_control_text2img.__class__.__name__ == "StableDiffusionControlNetPipeline"
+ assert "controlnet" in pipe_control_text2img.components
+
+ # testing from_pipe for inpainting
+ ## 1. from a different controlnet pipeline class
+ pipe_control_inpaint = AutoPipelineForInpainting.from_pipe(pipe_control_img2img)
+ assert pipe_control_inpaint.__class__.__name__ == "StableDiffusionControlNetInpaintPipeline"
+ assert "controlnet" in pipe_control_inpaint.components
+
+ ## 2. from a different controlnet pipe, with a different controlnet
+ pipe_control_inpaint = AutoPipelineForInpainting.from_pipe(pipe_control_img2img, controlnet=controlnet)
+ assert pipe_control_inpaint.__class__.__name__ == "StableDiffusionControlNetInpaintPipeline"
+ assert "controlnet" in pipe_control_inpaint.components
+
+ ## 3. from same controlnet pipe, with a different controlnet
+ pipe_control_inpaint = AutoPipelineForInpainting.from_pipe(pipe_control_inpaint, controlnet=controlnet)
+ assert pipe_control_inpaint.__class__.__name__ == "StableDiffusionControlNetInpaintPipeline"
+ assert "controlnet" in pipe_control_inpaint.components
+
+ # testing from_pipe from img2img controlnet
+ ## 1. from a different controlnet pipe, without controlnet argument
+ pipe_control_img2img = AutoPipelineForImage2Image.from_pipe(pipe_control_text2img)
+ assert pipe_control_img2img.__class__.__name__ == "StableDiffusionControlNetImg2ImgPipeline"
+ assert "controlnet" in pipe_control_img2img.components
+
+ ## 2. from a different controlnet pipe, with a different controlnet component
+ pipe_control_img2img = AutoPipelineForImage2Image.from_pipe(pipe_control_text2img, controlnet=controlnet)
+ assert pipe_control_img2img.__class__.__name__ == "StableDiffusionControlNetImg2ImgPipeline"
+ assert "controlnet" in pipe_control_img2img.components
+
+ ## 3. from same controlnet pipeline class, with a different controlnet
+ pipe_control_img2img = AutoPipelineForImage2Image.from_pipe(pipe_control_img2img, controlnet=controlnet)
+ assert pipe_control_img2img.__class__.__name__ == "StableDiffusionControlNetImg2ImgPipeline"
+ assert "controlnet" in pipe_control_img2img.components
+
+ def test_from_pipe_optional_components(self):
+ image_encoder = self.dummy_image_encoder
+
+ pipe = AutoPipelineForText2Image.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-pipe",
+ image_encoder=image_encoder,
+ )
+
+ pipe = AutoPipelineForImage2Image.from_pipe(pipe)
+ assert pipe.image_encoder is not None
+
+ pipe = AutoPipelineForText2Image.from_pipe(pipe, image_encoder=None)
+ assert pipe.image_encoder is None
+
+
+@slow
+class AutoPipelineIntegrationTest(unittest.TestCase):
+ def test_pipe_auto(self):
+ for model_name, model_repo in PRETRAINED_MODEL_REPO_MAPPING.items():
+ # test txt2img
+ pipe_txt2img = AutoPipelineForText2Image.from_pretrained(
+ model_repo, variant="fp16", torch_dtype=torch.float16
+ )
+ self.assertIsInstance(pipe_txt2img, AUTO_TEXT2IMAGE_PIPELINES_MAPPING[model_name])
+
+ pipe_to = AutoPipelineForText2Image.from_pipe(pipe_txt2img)
+ self.assertIsInstance(pipe_to, AUTO_TEXT2IMAGE_PIPELINES_MAPPING[model_name])
+
+ pipe_to = AutoPipelineForImage2Image.from_pipe(pipe_txt2img)
+ self.assertIsInstance(pipe_to, AUTO_IMAGE2IMAGE_PIPELINES_MAPPING[model_name])
+
+ if "kandinsky" not in model_name:
+ pipe_to = AutoPipelineForInpainting.from_pipe(pipe_txt2img)
+ self.assertIsInstance(pipe_to, AUTO_INPAINT_PIPELINES_MAPPING[model_name])
+
+ del pipe_txt2img, pipe_to
+ gc.collect()
+
+ # test img2img
+
+ pipe_img2img = AutoPipelineForImage2Image.from_pretrained(
+ model_repo, variant="fp16", torch_dtype=torch.float16
+ )
+ self.assertIsInstance(pipe_img2img, AUTO_IMAGE2IMAGE_PIPELINES_MAPPING[model_name])
+
+ pipe_to = AutoPipelineForText2Image.from_pipe(pipe_img2img)
+ self.assertIsInstance(pipe_to, AUTO_TEXT2IMAGE_PIPELINES_MAPPING[model_name])
+
+ pipe_to = AutoPipelineForImage2Image.from_pipe(pipe_img2img)
+ self.assertIsInstance(pipe_to, AUTO_IMAGE2IMAGE_PIPELINES_MAPPING[model_name])
+
+ if "kandinsky" not in model_name:
+ pipe_to = AutoPipelineForInpainting.from_pipe(pipe_img2img)
+ self.assertIsInstance(pipe_to, AUTO_INPAINT_PIPELINES_MAPPING[model_name])
+
+ del pipe_img2img, pipe_to
+ gc.collect()
+
+ # test inpaint
+
+ if "kandinsky" not in model_name:
+ pipe_inpaint = AutoPipelineForInpainting.from_pretrained(
+ model_repo, variant="fp16", torch_dtype=torch.float16
+ )
+ self.assertIsInstance(pipe_inpaint, AUTO_INPAINT_PIPELINES_MAPPING[model_name])
+
+ pipe_to = AutoPipelineForText2Image.from_pipe(pipe_inpaint)
+ self.assertIsInstance(pipe_to, AUTO_TEXT2IMAGE_PIPELINES_MAPPING[model_name])
+
+ pipe_to = AutoPipelineForImage2Image.from_pipe(pipe_inpaint)
+ self.assertIsInstance(pipe_to, AUTO_IMAGE2IMAGE_PIPELINES_MAPPING[model_name])
+
+ pipe_to = AutoPipelineForInpainting.from_pipe(pipe_inpaint)
+ self.assertIsInstance(pipe_to, AUTO_INPAINT_PIPELINES_MAPPING[model_name])
+
+ del pipe_inpaint, pipe_to
+ gc.collect()
+
+ def test_from_pipe_consistent(self):
+ for model_name, model_repo in PRETRAINED_MODEL_REPO_MAPPING.items():
+ if model_name in ["kandinsky", "kandinsky22"]:
+ auto_pipes = [AutoPipelineForText2Image, AutoPipelineForImage2Image]
+ else:
+ auto_pipes = [AutoPipelineForText2Image, AutoPipelineForImage2Image, AutoPipelineForInpainting]
+
+ # test from_pretrained
+ for pipe_from_class in auto_pipes:
+ pipe_from = pipe_from_class.from_pretrained(model_repo, variant="fp16", torch_dtype=torch.float16)
+ pipe_from_config = dict(pipe_from.config)
+
+ for pipe_to_class in auto_pipes:
+ pipe_to = pipe_to_class.from_pipe(pipe_from)
+ self.assertEqual(dict(pipe_to.config), pipe_from_config)
+
+ del pipe_from, pipe_to
+ gc.collect()
+
+ def test_controlnet(self):
+ # test from_pretrained
+ model_repo = "runwayml/stable-diffusion-v1-5"
+ controlnet_repo = "lllyasviel/sd-controlnet-canny"
+
+ controlnet = ControlNetModel.from_pretrained(controlnet_repo, torch_dtype=torch.float16)
+
+ pipe_txt2img = AutoPipelineForText2Image.from_pretrained(
+ model_repo, controlnet=controlnet, torch_dtype=torch.float16
+ )
+ self.assertIsInstance(pipe_txt2img, AUTO_TEXT2IMAGE_PIPELINES_MAPPING["stable-diffusion-controlnet"])
+
+ pipe_img2img = AutoPipelineForImage2Image.from_pretrained(
+ model_repo, controlnet=controlnet, torch_dtype=torch.float16
+ )
+ self.assertIsInstance(pipe_img2img, AUTO_IMAGE2IMAGE_PIPELINES_MAPPING["stable-diffusion-controlnet"])
+
+ pipe_inpaint = AutoPipelineForInpainting.from_pretrained(
+ model_repo, controlnet=controlnet, torch_dtype=torch.float16
+ )
+ self.assertIsInstance(pipe_inpaint, AUTO_INPAINT_PIPELINES_MAPPING["stable-diffusion-controlnet"])
+
+ # test from_pipe
+ for pipe_from in [pipe_txt2img, pipe_img2img, pipe_inpaint]:
+ pipe_to = AutoPipelineForText2Image.from_pipe(pipe_from)
+ self.assertIsInstance(pipe_to, AUTO_TEXT2IMAGE_PIPELINES_MAPPING["stable-diffusion-controlnet"])
+ self.assertEqual(dict(pipe_to.config), dict(pipe_txt2img.config))
+
+ pipe_to = AutoPipelineForImage2Image.from_pipe(pipe_from)
+ self.assertIsInstance(pipe_to, AUTO_IMAGE2IMAGE_PIPELINES_MAPPING["stable-diffusion-controlnet"])
+ self.assertEqual(dict(pipe_to.config), dict(pipe_img2img.config))
+
+ pipe_to = AutoPipelineForInpainting.from_pipe(pipe_from)
+ self.assertIsInstance(pipe_to, AUTO_INPAINT_PIPELINES_MAPPING["stable-diffusion-controlnet"])
+ self.assertEqual(dict(pipe_to.config), dict(pipe_inpaint.config))
diff --git a/tests/pipelines/test_pipelines_combined.py b/tests/pipelines/test_pipelines_combined.py
new file mode 100644
index 0000000..adedd54
--- /dev/null
+++ b/tests/pipelines/test_pipelines_combined.py
@@ -0,0 +1,128 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import torch
+from huggingface_hub import ModelCard
+
+from diffusers import (
+ DDPMScheduler,
+ DiffusionPipeline,
+ KandinskyV22CombinedPipeline,
+ KandinskyV22Pipeline,
+ KandinskyV22PriorPipeline,
+)
+from diffusers.pipelines.pipeline_utils import CONNECTED_PIPES_KEYS
+
+
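+# Helper for the tests below: treats two state dicts as equal when each tensor pair differs by
+# less than 1e-3 in summed absolute value, rather than requiring exact equality.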
+def state_dicts_almost_equal(sd1, sd2):
+ sd1 = dict(sorted(sd1.items()))
+ sd2 = dict(sorted(sd2.items()))
+
+ models_are_equal = True
+ for ten1, ten2 in zip(sd1.values(), sd2.values()):
+ if (ten1 - ten2).abs().sum() > 1e-3:
+ models_are_equal = False
+
+ return models_are_equal
+
+
+class CombinedPipelineFastTest(unittest.TestCase):
+ def modelcard_has_connected_pipeline(self, model_id):
+ modelcard = ModelCard.load(model_id)
+ connected_pipes = {prefix: getattr(modelcard.data, prefix, [None])[0] for prefix in CONNECTED_PIPES_KEYS}
+ connected_pipes = {k: v for k, v in connected_pipes.items() if v is not None}
+
+ return len(connected_pipes) > 0
+
+ def test_correct_modelcard_format(self):
+ # hf-internal-testing/tiny-random-kandinsky-v22-prior has no metadata
+ assert not self.modelcard_has_connected_pipeline("hf-internal-testing/tiny-random-kandinsky-v22-prior")
+
+ # see https://huggingface.co/hf-internal-testing/tiny-random-kandinsky-v22-decoder/blob/8baff9897c6be017013e21b5c562e5a381646c7e/README.md?code=true#L2
+ assert self.modelcard_has_connected_pipeline("hf-internal-testing/tiny-random-kandinsky-v22-decoder")
+
+ def test_load_connected_checkpoint_when_specified(self):
+ pipeline_prior = DiffusionPipeline.from_pretrained("hf-internal-testing/tiny-random-kandinsky-v22-prior")
+ pipeline_prior_connected = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-random-kandinsky-v22-prior", load_connected_pipeline=True
+ )
+
+ # Passing `load_connected_pipeline` to prior is a no-op as the pipeline has no connected pipeline
+ assert pipeline_prior.__class__ == pipeline_prior_connected.__class__
+
+ pipeline = DiffusionPipeline.from_pretrained("hf-internal-testing/tiny-random-kandinsky-v22-decoder")
+ pipeline_connected = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-random-kandinsky-v22-decoder", load_connected_pipeline=True
+ )
+
+ # Passing `load_connected_pipeline` to decoder loads the combined pipeline
+ assert pipeline.__class__ != pipeline_connected.__class__
+ assert pipeline.__class__ == KandinskyV22Pipeline
+ assert pipeline_connected.__class__ == KandinskyV22CombinedPipeline
+
+ # check that loaded components match prior and decoder components
+ assert set(pipeline_connected.components.keys()) == set(
+ ["prior_" + k for k in pipeline_prior.components.keys()] + list(pipeline.components.keys())
+ )
+
+ def test_load_connected_checkpoint_default(self):
+ prior = KandinskyV22PriorPipeline.from_pretrained("hf-internal-testing/tiny-random-kandinsky-v22-prior")
+ decoder = KandinskyV22Pipeline.from_pretrained("hf-internal-testing/tiny-random-kandinsky-v22-decoder")
+
+ # check that combined pipeline loads both prior & decoder because of
+ # https://huggingface.co/hf-internal-testing/tiny-random-kandinsky-v22-decoder/blob/8baff9897c6be017013e21b5c562e5a381646c7e/README.md?code=true#L3
+ assert (
+ KandinskyV22CombinedPipeline._load_connected_pipes
+ ) # combined pipelines will download more checkpoints than just the one specified
+ pipeline = KandinskyV22CombinedPipeline.from_pretrained(
+ "hf-internal-testing/tiny-random-kandinsky-v22-decoder"
+ )
+
+ prior_comps = prior.components
+ decoder_comps = decoder.components
+ for k, component in pipeline.components.items():
+ if k.startswith("prior_"):
+ k = k[6:] # strip the "prior_" prefix
+ comp = prior_comps[k]
+ else:
+ comp = decoder_comps[k]
+
+ if isinstance(component, torch.nn.Module):
+ assert state_dicts_almost_equal(component.state_dict(), comp.state_dict())
+ elif hasattr(component, "config"):
+ assert dict(component.config) == dict(comp.config)
+ else:
+ assert component.__class__ == comp.__class__
+
+ def test_load_connected_checkpoint_with_passed_obj(self):
+ pipeline = KandinskyV22CombinedPipeline.from_pretrained(
+ "hf-internal-testing/tiny-random-kandinsky-v22-decoder"
+ )
+ prior_scheduler = DDPMScheduler.from_config(pipeline.prior_scheduler.config)
+ scheduler = DDPMScheduler.from_config(pipeline.scheduler.config)
+
+ # make sure the schedulers we pass are of a different class than the pipeline's defaults
+ assert pipeline.prior_scheduler.__class__ != prior_scheduler.__class__
+ assert pipeline.scheduler.__class__ != scheduler.__class__
+
+ pipeline_new = KandinskyV22CombinedPipeline.from_pretrained(
+ "hf-internal-testing/tiny-random-kandinsky-v22-decoder",
+ prior_scheduler=prior_scheduler,
+ scheduler=scheduler,
+ )
+ assert dict(pipeline_new.prior_scheduler.config) == dict(prior_scheduler.config)
+ assert dict(pipeline_new.scheduler.config) == dict(scheduler.config)
diff --git a/tests/pipelines/test_pipelines_common.py b/tests/pipelines/test_pipelines_common.py
new file mode 100644
index 0000000..2b29e3a
--- /dev/null
+++ b/tests/pipelines/test_pipelines_common.py
@@ -0,0 +1,1614 @@
+import contextlib
+import gc
+import inspect
+import io
+import json
+import os
+import re
+import tempfile
+import unittest
+import uuid
+from typing import Any, Callable, Dict, Union
+
+import numpy as np
+import PIL.Image
+import torch
+from huggingface_hub import ModelCard, delete_repo
+from huggingface_hub.utils import is_jinja_available
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+import diffusers
+from diffusers import (
+ AsymmetricAutoencoderKL,
+ AutoencoderKL,
+ AutoencoderTiny,
+ ConsistencyDecoderVAE,
+ DDIMScheduler,
+ DiffusionPipeline,
+ StableDiffusionPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.image_processor import VaeImageProcessor
+from diffusers.loaders import IPAdapterMixin
+from diffusers.models.unets.unet_3d_condition import UNet3DConditionModel
+from diffusers.models.unets.unet_i2vgen_xl import I2VGenXLUNet
+from diffusers.models.unets.unet_motion_model import UNetMotionModel
+from diffusers.pipelines.pipeline_utils import StableDiffusionMixin
+from diffusers.schedulers import KarrasDiffusionSchedulers
+from diffusers.utils import logging
+from diffusers.utils.import_utils import is_accelerate_available, is_accelerate_version, is_xformers_available
+from diffusers.utils.testing_utils import (
+ CaptureLogger,
+ require_torch,
+ torch_device,
+)
+
+from ..models.autoencoders.test_models_vae import (
+ get_asym_autoencoder_kl_config,
+ get_autoencoder_kl_config,
+ get_autoencoder_tiny_config,
+ get_consistency_vae_config,
+)
+from ..models.unets.test_models_unet_2d_condition import create_ip_adapter_state_dict
+from ..others.test_utils import TOKEN, USER, is_staging_test
+
+
+def to_np(tensor):
+ if isinstance(tensor, torch.Tensor):
+ tensor = tensor.detach().cpu().numpy()
+
+ return tensor
+
+
+def check_same_shape(tensor_list):
+ shapes = [tensor.shape for tensor in tensor_list]
+ return all(shape == shapes[0] for shape in shapes[1:])
+
+
+class SDFunctionTesterMixin:
+ """
+ This mixin is designed to be used with PipelineTesterMixin and unittest.TestCase classes.
+ It provides a set of common tests for PyTorch pipelines that inherit from StableDiffusionMixin, e.g. vae_slicing, vae_tiling, freeu, etc.
+ """
+
+ def test_vae_slicing(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ # components["scheduler"] = LMSDiscreteScheduler.from_config(components["scheduler"].config)
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ image_count = 4
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["prompt"] = [inputs["prompt"]] * image_count
+ if "image" in inputs: # fix batch size mismatch in I2V_Gen pipeline
+ inputs["image"] = [inputs["image"]] * image_count
+ output_1 = pipe(**inputs)
+
+ # make sure sliced vae decode yields the same result
+ pipe.enable_vae_slicing()
+ inputs = self.get_dummy_inputs(device)
+ inputs["prompt"] = [inputs["prompt"]] * image_count
+ if "image" in inputs:
+ inputs["image"] = [inputs["image"]] * image_count
+ inputs["return_dict"] = False
+ output_2 = pipe(**inputs)
+
+ assert np.abs(output_2[0].flatten() - output_1[0].flatten()).max() < 1e-2
+
+ def test_vae_tiling(self):
+ components = self.get_dummy_components()
+
+ # the safety checker isn't needed for this test, so drop it if present
+ if "safety_checker" in components:
+ components["safety_checker"] = None
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["return_dict"] = False
+
+ # Test that tiled decode yields the same result as the regular (non-tiled) decode
+ output_1 = pipe(**inputs)[0]
+
+ # make sure tiled vae decode yields the same result
+ pipe.enable_vae_tiling()
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["return_dict"] = False
+ output_2 = pipe(**inputs)[0]
+
+ assert np.abs(output_2 - output_1).max() < 5e-1
+
+ # test that tiled decode works with various shapes
+ shapes = [(1, 4, 73, 97), (1, 4, 97, 73), (1, 4, 49, 65), (1, 4, 65, 49)]
+ for shape in shapes:
+ zeros = torch.zeros(shape).to(torch_device)
+ pipe.vae.decode(zeros)
+
+ def test_freeu_enabled(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["return_dict"] = False
+ output = pipe(**inputs)[0]
+
+ pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["return_dict"] = False
+ output_freeu = pipe(**inputs)[0]
+
+ assert not np.allclose(
+ output[0, -3:, -3:, -1], output_freeu[0, -3:, -3:, -1]
+ ), "Enabling of FreeU should lead to different results."
+
+ def test_freeu_disabled(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["return_dict"] = False
+ output = pipe(**inputs)[0]
+
+ pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)
+ pipe.disable_freeu()
+
+ freeu_keys = {"s1", "s2", "b1", "b2"}
+ for upsample_block in pipe.unet.up_blocks:
+ for key in freeu_keys:
+ assert getattr(upsample_block, key) is None, f"Disabling of FreeU should have set {key} to None."
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["return_dict"] = False
+ output_no_freeu = pipe(**inputs)[0]
+ assert np.allclose(
+ output, output_no_freeu, atol=1e-2
+ ), f"Disabling of FreeU should lead to results similar to the default pipeline results but Max Abs Error={np.abs(output_no_freeu - output).max()}."
+
+ def test_fused_qkv_projections(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["return_dict"] = False
+ image = pipe(**inputs)[0]
+ original_image_slice = image[0, -3:, -3:, -1]
+
+ pipe.fuse_qkv_projections()
+ inputs = self.get_dummy_inputs(device)
+ inputs["return_dict"] = False
+ image_fused = pipe(**inputs)[0]
+ image_slice_fused = image_fused[0, -3:, -3:, -1]
+
+ pipe.unfuse_qkv_projections()
+ inputs = self.get_dummy_inputs(device)
+ inputs["return_dict"] = False
+ image_disabled = pipe(**inputs)[0]
+ image_slice_disabled = image_disabled[0, -3:, -3:, -1]
+
+ assert np.allclose(
+ original_image_slice, image_slice_fused, atol=1e-2, rtol=1e-2
+ ), "Fusion of QKV projections shouldn't affect the outputs."
+ assert np.allclose(
+ image_slice_fused, image_slice_disabled, atol=1e-2, rtol=1e-2
+ ), "Outputs, with QKV projection fusion enabled, shouldn't change when fused QKV projections are disabled."
+ assert np.allclose(
+ original_image_slice, image_slice_disabled, atol=1e-2, rtol=1e-2
+ ), "Original outputs should match when fused QKV projections are disabled."
+
+
+class IPAdapterTesterMixin:
+ """
+ This mixin is designed to be used with PipelineTesterMixin and unittest.TestCase classes.
+ It provides a set of common tests for pipelines that support IP Adapters.
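+
+    A minimal sketch (hypothetical `MyIPAdapterPipeline`, not defined in this file): the pipeline
+    under test must inherit from `IPAdapterMixin` and accept `ip_adapter_image` /
+    `ip_adapter_image_embeds` in `__call__`:
+
+        class MyIPAdapterPipelineFastTests(IPAdapterTesterMixin, PipelineTesterMixin, unittest.TestCase):
+            pipeline_class = MyIPAdapterPipeline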
+ """
+
+ def test_pipeline_signature(self):
+ parameters = inspect.signature(self.pipeline_class.__call__).parameters
+
+ assert issubclass(self.pipeline_class, IPAdapterMixin)
+ self.assertIn(
+ "ip_adapter_image",
+ parameters,
+ "`ip_adapter_image` argument must be supported by the `__call__` method",
+ )
+ self.assertIn(
+ "ip_adapter_image_embeds",
+ parameters,
+ "`ip_adapter_image_embeds` argument must be supported by the `__call__` method",
+ )
+
+ def _get_dummy_image_embeds(self, cross_attention_dim: int = 32):
+ return torch.randn((2, 1, cross_attention_dim), device=torch_device)
+
+ def _modify_inputs_for_ip_adapter_test(self, inputs: Dict[str, Any]):
+ parameters = inspect.signature(self.pipeline_class.__call__).parameters
+ if "image" in parameters.keys() and "strength" in parameters.keys():
+ inputs["num_inference_steps"] = 4
+
+ inputs["output_type"] = "np"
+ inputs["return_dict"] = False
+ return inputs
+
+ def test_ip_adapter_single(self, expected_max_diff: float = 1e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components).to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ cross_attention_dim = pipe.unet.config.get("cross_attention_dim", 32)
+
+ # forward pass without ip adapter
+ inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
+ output_without_adapter = pipe(**inputs)[0]
+
+ adapter_state_dict = create_ip_adapter_state_dict(pipe.unet)
+ pipe.unet._load_ip_adapter_weights(adapter_state_dict)
+
+ # forward pass with single ip adapter, but scale=0 which should have no effect
+ inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
+ inputs["ip_adapter_image_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)]
+ pipe.set_ip_adapter_scale(0.0)
+ output_without_adapter_scale = pipe(**inputs)[0]
+
+ # forward pass with single ip adapter, but with scale of adapter weights
+ inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
+ inputs["ip_adapter_image_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)]
+ pipe.set_ip_adapter_scale(42.0)
+ output_with_adapter_scale = pipe(**inputs)[0]
+
+ max_diff_without_adapter_scale = np.abs(output_without_adapter_scale - output_without_adapter).max()
+ max_diff_with_adapter_scale = np.abs(output_with_adapter_scale - output_without_adapter).max()
+
+ self.assertLess(
+ max_diff_without_adapter_scale,
+ expected_max_diff,
+            "Output without ip-adapter must be the same as normal inference",
+ )
+ self.assertGreater(
+ max_diff_with_adapter_scale, 1e-2, "Output with ip-adapter must be different from normal inference"
+ )
+
+ def test_ip_adapter_multi(self, expected_max_diff: float = 1e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components).to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ cross_attention_dim = pipe.unet.config.get("cross_attention_dim", 32)
+
+ # forward pass without ip adapter
+ inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
+ output_without_adapter = pipe(**inputs)[0]
+
+ adapter_state_dict_1 = create_ip_adapter_state_dict(pipe.unet)
+ adapter_state_dict_2 = create_ip_adapter_state_dict(pipe.unet)
+ pipe.unet._load_ip_adapter_weights([adapter_state_dict_1, adapter_state_dict_2])
+
+ # forward pass with multi ip adapter, but scale=0 which should have no effect
+ inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
+ inputs["ip_adapter_image_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)] * 2
+ pipe.set_ip_adapter_scale([0.0, 0.0])
+ output_without_multi_adapter_scale = pipe(**inputs)[0]
+
+ # forward pass with multi ip adapter, but with scale of adapter weights
+ inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
+ inputs["ip_adapter_image_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)] * 2
+ pipe.set_ip_adapter_scale([42.0, 42.0])
+ output_with_multi_adapter_scale = pipe(**inputs)[0]
+
+ max_diff_without_multi_adapter_scale = np.abs(
+ output_without_multi_adapter_scale - output_without_adapter
+ ).max()
+ max_diff_with_multi_adapter_scale = np.abs(output_with_multi_adapter_scale - output_without_adapter).max()
+ self.assertLess(
+ max_diff_without_multi_adapter_scale,
+ expected_max_diff,
+            "Output without multi-ip-adapter must be the same as normal inference",
+ )
+ self.assertGreater(
+ max_diff_with_multi_adapter_scale,
+ 1e-2,
+ "Output with multi-ip-adapter scale must be different from normal inference",
+ )
+
+ def test_ip_adapter_cfg(self, expected_max_diff: float = 1e-4):
+ parameters = inspect.signature(self.pipeline_class.__call__).parameters
+
+ if "guidance_scale" not in parameters:
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components).to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ cross_attention_dim = pipe.unet.config.get("cross_attention_dim", 32)
+
+ adapter_state_dict = create_ip_adapter_state_dict(pipe.unet)
+ pipe.unet._load_ip_adapter_weights(adapter_state_dict)
+ pipe.set_ip_adapter_scale(1.0)
+
+ # forward pass with CFG not applied
+ inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
+ inputs["ip_adapter_image_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)[0].unsqueeze(0)]
+ inputs["guidance_scale"] = 1.0
+ out_no_cfg = pipe(**inputs)[0]
+
+ # forward pass with CFG applied
+ inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
+ inputs["ip_adapter_image_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)]
+ inputs["guidance_scale"] = 7.5
+ out_cfg = pipe(**inputs)[0]
+
+ assert out_cfg.shape == out_no_cfg.shape
+
+
+class PipelineLatentTesterMixin:
+ """
+ This mixin is designed to be used with PipelineTesterMixin and unittest.TestCase classes.
+    It provides a set of common tests for PyTorch pipelines that have a VAE, e.g.
+    equivalence of different input and output types, etc.
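+
+    Child test classes are expected to declare which `__call__` arguments carry images and which
+    may carry latents, e.g. (a sketch; the exact parameter names depend on the pipeline under test):
+
+        image_params = frozenset(["image"])
+        image_latents_params = frozenset(["image"])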
+ """
+
+ @property
+ def image_params(self) -> frozenset:
+ raise NotImplementedError(
+ "You need to set the attribute `image_params` in the child test class. "
+            "`image_params` are checked to ensure all accepted input image types (i.e. `pt`, `pil`, `np`) produce the same results"
+ )
+
+ @property
+ def image_latents_params(self) -> frozenset:
+ raise NotImplementedError(
+ "You need to set the attribute `image_latents_params` in the child test class. "
+            "`image_latents_params` are checked to ensure passing latents directly produces the same results"
+ )
+
+ def get_dummy_inputs_by_type(self, device, seed=0, input_image_type="pt", output_type="np"):
+ inputs = self.get_dummy_inputs(device, seed)
+
+ def convert_to_pt(image):
+ if isinstance(image, torch.Tensor):
+ input_image = image
+ elif isinstance(image, np.ndarray):
+ input_image = VaeImageProcessor.numpy_to_pt(image)
+ elif isinstance(image, PIL.Image.Image):
+ input_image = VaeImageProcessor.pil_to_numpy(image)
+ input_image = VaeImageProcessor.numpy_to_pt(input_image)
+ else:
+ raise ValueError(f"unsupported input_image_type {type(image)}")
+ return input_image
+
+ def convert_pt_to_type(image, input_image_type):
+ if input_image_type == "pt":
+ input_image = image
+ elif input_image_type == "np":
+ input_image = VaeImageProcessor.pt_to_numpy(image)
+ elif input_image_type == "pil":
+ input_image = VaeImageProcessor.pt_to_numpy(image)
+ input_image = VaeImageProcessor.numpy_to_pil(input_image)
+ else:
+ raise ValueError(f"unsupported input_image_type {input_image_type}.")
+ return input_image
+
+ for image_param in self.image_params:
+ if image_param in inputs.keys():
+ inputs[image_param] = convert_pt_to_type(
+ convert_to_pt(inputs[image_param]).to(device), input_image_type
+ )
+
+ inputs["output_type"] = output_type
+
+ return inputs
+
+ def test_pt_np_pil_outputs_equivalent(self, expected_max_diff=1e-4):
+ self._test_pt_np_pil_outputs_equivalent(expected_max_diff=expected_max_diff)
+
+ def _test_pt_np_pil_outputs_equivalent(self, expected_max_diff=1e-4, input_image_type="pt"):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ output_pt = pipe(
+ **self.get_dummy_inputs_by_type(torch_device, input_image_type=input_image_type, output_type="pt")
+ )[0]
+ output_np = pipe(
+ **self.get_dummy_inputs_by_type(torch_device, input_image_type=input_image_type, output_type="np")
+ )[0]
+ output_pil = pipe(
+ **self.get_dummy_inputs_by_type(torch_device, input_image_type=input_image_type, output_type="pil")
+ )[0]
+
+ max_diff = np.abs(output_pt.cpu().numpy().transpose(0, 2, 3, 1) - output_np).max()
+ self.assertLess(
+            max_diff, expected_max_diff, "`output_type=='pt'` generates different results from `output_type=='np'`"
+ )
+
+ max_diff = np.abs(np.array(output_pil[0]) - (output_np * 255).round()).max()
+        self.assertLess(max_diff, 2.0, "`output_type=='pil'` generates different results from `output_type=='np'`")
+
+ def test_pt_np_pil_inputs_equivalent(self):
+ if len(self.image_params) == 0:
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ out_input_pt = pipe(**self.get_dummy_inputs_by_type(torch_device, input_image_type="pt"))[0]
+ out_input_np = pipe(**self.get_dummy_inputs_by_type(torch_device, input_image_type="np"))[0]
+ out_input_pil = pipe(**self.get_dummy_inputs_by_type(torch_device, input_image_type="pil"))[0]
+
+ max_diff = np.abs(out_input_pt - out_input_np).max()
+        self.assertLess(max_diff, 1e-4, "`input_type=='pt'` generates different results from `input_type=='np'`")
+        max_diff = np.abs(out_input_pil - out_input_np).max()
+        self.assertLess(max_diff, 1e-2, "`input_type=='pil'` generates different results from `input_type=='np'`")
+
+ def test_latents_input(self):
+ if len(self.image_latents_params) == 0:
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.image_processor = VaeImageProcessor(do_resize=False, do_normalize=False)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ out = pipe(**self.get_dummy_inputs_by_type(torch_device, input_image_type="pt"))[0]
+
+ vae = components["vae"]
+ inputs = self.get_dummy_inputs_by_type(torch_device, input_image_type="pt")
+ generator = inputs["generator"]
+ for image_param in self.image_latents_params:
+ if image_param in inputs.keys():
+ inputs[image_param] = (
+ vae.encode(inputs[image_param]).latent_dist.sample(generator) * vae.config.scaling_factor
+ )
+ out_latents_inputs = pipe(**inputs)[0]
+
+ max_diff = np.abs(out - out_latents_inputs).max()
+        self.assertLess(max_diff, 1e-4, "passing latents as image input generates different results from passing an image")
+
+ def test_multi_vae(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ block_out_channels = pipe.vae.config.block_out_channels
+ norm_num_groups = pipe.vae.config.norm_num_groups
+
+ vae_classes = [AutoencoderKL, AsymmetricAutoencoderKL, ConsistencyDecoderVAE, AutoencoderTiny]
+ configs = [
+ get_autoencoder_kl_config(block_out_channels, norm_num_groups),
+ get_asym_autoencoder_kl_config(block_out_channels, norm_num_groups),
+ get_consistency_vae_config(block_out_channels, norm_num_groups),
+ get_autoencoder_tiny_config(block_out_channels),
+ ]
+
+ out_np = pipe(**self.get_dummy_inputs_by_type(torch_device, input_image_type="np"))[0]
+
+ for vae_cls, config in zip(vae_classes, configs):
+ vae = vae_cls(**config)
+ vae = vae.to(torch_device)
+ components["vae"] = vae
+ vae_pipe = self.pipeline_class(**components)
+ out_vae_np = vae_pipe(**self.get_dummy_inputs_by_type(torch_device, input_image_type="np"))[0]
+
+ assert out_vae_np.shape == out_np.shape
+
+
+@require_torch
+class PipelineKarrasSchedulerTesterMixin:
+ """
+ This mixin is designed to be used with unittest.TestCase classes.
+    It provides a set of common tests for each PyTorch pipeline that makes use of KarrasDiffusionSchedulers.
+ """
+
+ def test_karras_schedulers_shape(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+
+ # make sure that PNDM does not need warm-up
+ pipe.scheduler.register_to_config(skip_prk_steps=True)
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["num_inference_steps"] = 2
+
+ if "strength" in inputs:
+ inputs["num_inference_steps"] = 4
+ inputs["strength"] = 0.5
+
+ outputs = []
+ for scheduler_enum in KarrasDiffusionSchedulers:
+ if "KDPM2" in scheduler_enum.name:
+ inputs["num_inference_steps"] = 5
+
+ scheduler_cls = getattr(diffusers, scheduler_enum.name)
+ pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
+ output = pipe(**inputs)[0]
+ outputs.append(output)
+
+ if "KDPM2" in scheduler_enum.name:
+ inputs["num_inference_steps"] = 2
+
+ assert check_same_shape(outputs)
+
+
+@require_torch
+class PipelineTesterMixin:
+ """
+ This mixin is designed to be used with unittest.TestCase classes.
+ It provides a set of common tests for each PyTorch pipeline, e.g. saving and loading the pipeline,
+ equivalence of dict and tuple outputs, etc.
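+
+    A minimal sketch of a concrete test class (everything named `My*` below is a placeholder;
+    `TEXT_TO_IMAGE_PARAMS` and `TEXT_TO_IMAGE_BATCH_PARAMS` refer to the common sets defined in
+    `pipeline_params.py`):
+
+        class MyPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+            pipeline_class = MyPipeline
+            params = TEXT_TO_IMAGE_PARAMS
+            batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+
+            def get_dummy_components(self):
+                ...  # return a dict of small, randomly initialized components (unet, vae, scheduler, ...)
+
+            def get_dummy_inputs(self, device, seed=0):
+                ...  # return the kwargs passed to `pipeline.__call__` (prompt, generator, num_inference_steps, ...)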
+ """
+
+ # Canonical parameters that are passed to `__call__` regardless
+ # of the type of pipeline. They are always optional and have common
+ # sense default values.
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "num_images_per_prompt",
+ "generator",
+ "latents",
+ "output_type",
+ "return_dict",
+ ]
+ )
+
+ # set these parameters to False in the child class if the pipeline does not support the corresponding functionality
+ test_attention_slicing = True
+
+ test_xformers_attention = True
+
+ def get_generator(self, seed):
+ device = torch_device if torch_device != "mps" else "cpu"
+ generator = torch.Generator(device).manual_seed(seed)
+ return generator
+
+ @property
+ def pipeline_class(self) -> Union[Callable, DiffusionPipeline]:
+ raise NotImplementedError(
+ "You need to set the attribute `pipeline_class = ClassNameOfPipeline` in the child test class. "
+ "See existing pipeline tests for reference."
+ )
+
+ def get_dummy_components(self):
+ raise NotImplementedError(
+ "You need to implement `get_dummy_components(self)` in the child test class. "
+ "See existing pipeline tests for reference."
+ )
+
+ def get_dummy_inputs(self, device, seed=0):
+ raise NotImplementedError(
+ "You need to implement `get_dummy_inputs(self, device, seed)` in the child test class. "
+ "See existing pipeline tests for reference."
+ )
+
+ @property
+ def params(self) -> frozenset:
+ raise NotImplementedError(
+ "You need to set the attribute `params` in the child test class. "
+            "`params` are checked to ensure all of its values are present in the `__call__` signature. "
+            "You can set `params` using one of the common sets of parameters defined in `pipeline_params.py`, "
+            "e.g., `TEXT_TO_IMAGE_PARAMS` defines the common parameters used in text to "
+            "image pipelines, including prompts and prompt embedding overrides. "
+            "If your pipeline's set of arguments has minor changes from one of the common sets of arguments, "
+ "do not make modifications to the existing common sets of arguments. I.e. a text to image pipeline "
+ "with non-configurable height and width arguments should set the attribute as "
+ "`params = TEXT_TO_IMAGE_PARAMS - {'height', 'width'}`. "
+ "See existing pipeline tests for reference."
+ )
+
+ @property
+ def batch_params(self) -> frozenset:
+ raise NotImplementedError(
+ "You need to set the attribute `batch_params` in the child test class. "
+ "`batch_params` are the parameters required to be batched when passed to the pipeline's "
+ "`__call__` method. `pipeline_params.py` provides some common sets of parameters such as "
+ "`TEXT_TO_IMAGE_BATCH_PARAMS`, `IMAGE_VARIATION_BATCH_PARAMS`, etc... If your pipeline's "
+ "set of batch arguments has minor changes from one of the common sets of batch arguments, "
+ "do not make modifications to the existing common sets of batch arguments. I.e. a text to "
+            "image pipeline whose `negative_prompt` is not batched should set the attribute as "
+ "`batch_params = TEXT_TO_IMAGE_BATCH_PARAMS - {'negative_prompt'}`. "
+ "See existing pipeline tests for reference."
+ )
+
+ @property
+ def callback_cfg_params(self) -> frozenset:
+ raise NotImplementedError(
+            "You need to set the attribute `callback_cfg_params` in the child test class if it needs to run `test_callback_cfg`. "
+            "`callback_cfg_params` are the parameters that need to be passed to the pipeline's callback "
+            "function when dynamically adjusting `guidance_scale`. They are variables that require special "
+            "treatment when `do_classifier_free_guidance` is `True`. `pipeline_params.py` provides some common "
+            "sets of parameters such as `TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS`. If your pipeline's "
+            "set of cfg arguments has minor changes from one of the common sets of cfg arguments, "
+            "do not make modifications to the existing common sets of cfg arguments. I.e. for inpaint pipelines, you "
+            "need to adjust the batch size of `mask` and `masked_image_latents`, so you should set the attribute as "
+            "`callback_cfg_params = TEXT_TO_IMAGE_CFG_PARAMS.union({'mask', 'masked_image_latents'})`"
+ )
+
+ def tearDown(self):
+ # clean up the VRAM after each test in case of CUDA runtime errors
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_save_load_local(self, expected_max_difference=5e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output = pipe(**inputs)[0]
+
+ logger = logging.get_logger("diffusers.pipelines.pipeline_utils")
+ logger.setLevel(diffusers.logging.INFO)
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir, safe_serialization=False)
+
+ with CaptureLogger(logger) as cap_logger:
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+
+ for component in pipe_loaded.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ for name in pipe_loaded.components.keys():
+ if name not in pipe_loaded._optional_components:
+ assert name in str(cap_logger)
+
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(max_diff, expected_max_difference)
+
+ def test_pipeline_call_signature(self):
+ self.assertTrue(
+ hasattr(self.pipeline_class, "__call__"), f"{self.pipeline_class} should have a `__call__` method"
+ )
+
+ parameters = inspect.signature(self.pipeline_class.__call__).parameters
+
+ optional_parameters = set()
+
+ for k, v in parameters.items():
+ if v.default != inspect._empty:
+ optional_parameters.add(k)
+
+ parameters = set(parameters.keys())
+ parameters.remove("self")
+ parameters.discard("kwargs") # kwargs can be added if arguments of pipeline call function are deprecated
+
+ remaining_required_parameters = set()
+
+ for param in self.params:
+ if param not in parameters:
+ remaining_required_parameters.add(param)
+
+ self.assertTrue(
+ len(remaining_required_parameters) == 0,
+ f"Required parameters not present: {remaining_required_parameters}",
+ )
+
+ remaining_required_optional_parameters = set()
+
+ for param in self.required_optional_params:
+ if param not in optional_parameters:
+ remaining_required_optional_parameters.add(param)
+
+ self.assertTrue(
+ len(remaining_required_optional_parameters) == 0,
+ f"Required optional parameters not present: {remaining_required_optional_parameters}",
+ )
+
+ def test_inference_batch_consistent(self, batch_sizes=[2]):
+ self._test_inference_batch_consistent(batch_sizes=batch_sizes)
+
+ def _test_inference_batch_consistent(
+ self, batch_sizes=[2], additional_params_copy_to_batched_inputs=["num_inference_steps"], batch_generator=True
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["generator"] = self.get_generator(0)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # prepare batched inputs
+ batched_inputs = []
+ for batch_size in batch_sizes:
+ batched_input = {}
+ batched_input.update(inputs)
+
+ for name in self.batch_params:
+ if name not in inputs:
+ continue
+
+ value = inputs[name]
+ if name == "prompt":
+ len_prompt = len(value)
+ # make unequal batch sizes
+ batched_input[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+
+ # make last batch super long
+ batched_input[name][-1] = 100 * "very long"
+
+ else:
+ batched_input[name] = batch_size * [value]
+
+ if batch_generator and "generator" in inputs:
+ batched_input["generator"] = [self.get_generator(i) for i in range(batch_size)]
+
+ if "batch_size" in inputs:
+ batched_input["batch_size"] = batch_size
+
+ batched_inputs.append(batched_input)
+
+ logger.setLevel(level=diffusers.logging.WARNING)
+ for batch_size, batched_input in zip(batch_sizes, batched_inputs):
+ output = pipe(**batched_input)
+ assert len(output[0]) == batch_size
+
+ def test_inference_batch_single_identical(self, batch_size=3, expected_max_diff=1e-4):
+ self._test_inference_batch_single_identical(batch_size=batch_size, expected_max_diff=expected_max_diff)
+
+ def _test_inference_batch_single_identical(
+ self,
+ batch_size=2,
+ expected_max_diff=1e-4,
+ additional_params_copy_to_batched_inputs=["num_inference_steps"],
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for components in pipe.components.values():
+ if hasattr(components, "set_default_attn_processor"):
+ components.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_dummy_inputs(torch_device)
+        # Reset generator in case it has been used in self.get_dummy_inputs
+ inputs["generator"] = self.get_generator(0)
+
+ logger = logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ batched_inputs = {}
+ batched_inputs.update(inputs)
+
+ for name in self.batch_params:
+ if name not in inputs:
+ continue
+
+ value = inputs[name]
+ if name == "prompt":
+ len_prompt = len(value)
+ batched_inputs[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+ batched_inputs[name][-1] = 100 * "very long"
+
+ else:
+ batched_inputs[name] = batch_size * [value]
+
+ if "generator" in inputs:
+ batched_inputs["generator"] = [self.get_generator(i) for i in range(batch_size)]
+
+ if "batch_size" in inputs:
+ batched_inputs["batch_size"] = batch_size
+
+ for arg in additional_params_copy_to_batched_inputs:
+ batched_inputs[arg] = inputs[arg]
+
+ output = pipe(**inputs)
+ output_batch = pipe(**batched_inputs)
+
+ assert output_batch[0].shape[0] == batch_size
+
+ max_diff = np.abs(to_np(output_batch[0][0]) - to_np(output[0][0])).max()
+ assert max_diff < expected_max_diff
+
+ def test_dict_tuple_outputs_equivalent(self, expected_max_difference=1e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator_device = "cpu"
+ output = pipe(**self.get_dummy_inputs(generator_device))[0]
+ output_tuple = pipe(**self.get_dummy_inputs(generator_device), return_dict=False)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_tuple)).max()
+ self.assertLess(max_diff, expected_max_difference)
+
+ def test_components_function(self):
+ init_components = self.get_dummy_components()
+ init_components = {k: v for k, v in init_components.items() if not isinstance(v, (str, int, float))}
+
+ pipe = self.pipeline_class(**init_components)
+
+ self.assertTrue(hasattr(pipe, "components"))
+ self.assertTrue(set(pipe.components.keys()) == set(init_components.keys()))
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_float16_inference(self, expected_max_diff=5e-2):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ components = self.get_dummy_components()
+ pipe_fp16 = self.pipeline_class(**components)
+ for component in pipe_fp16.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe_fp16.to(torch_device, torch.float16)
+ pipe_fp16.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ # Reset generator in case it is used inside dummy inputs
+ if "generator" in inputs:
+ inputs["generator"] = self.get_generator(0)
+
+ output = pipe(**inputs)[0]
+
+ fp16_inputs = self.get_dummy_inputs(torch_device)
+ # Reset generator in case it is used inside dummy inputs
+ if "generator" in fp16_inputs:
+ fp16_inputs["generator"] = self.get_generator(0)
+
+ output_fp16 = pipe_fp16(**fp16_inputs)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_fp16)).max()
+ self.assertLess(max_diff, expected_max_diff, "The outputs of the fp16 and fp32 pipelines are too different.")
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self, expected_max_diff=1e-2):
+ components = self.get_dummy_components()
+ for name, module in components.items():
+ if hasattr(module, "half"):
+ components[name] = module.to(torch_device).half()
+
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir, torch_dtype=torch.float16)
+ for component in pipe_loaded.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for name, component in pipe_loaded.components.items():
+ if hasattr(component, "dtype"):
+ self.assertTrue(
+ component.dtype == torch.float16,
+ f"`{name}.dtype` switched from `float16` to {component.dtype} after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_loaded = pipe_loaded(**inputs)[0]
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(
+ max_diff, expected_max_diff, "The output of the fp16 pipeline changed after saving and loading."
+ )
+
+ def test_save_load_optional_components(self, expected_max_difference=1e-4):
+ if not hasattr(self.pipeline_class, "_optional_components"):
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ # set all optional components to None
+ for optional_component in pipe._optional_components:
+ setattr(pipe, optional_component, None)
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir, safe_serialization=False)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+ for component in pipe_loaded.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for optional_component in pipe._optional_components:
+ self.assertTrue(
+ getattr(pipe_loaded, optional_component) is None,
+ f"`{optional_component}` did not stay set to None after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(generator_device)
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(max_diff, expected_max_difference)
+
+ @unittest.skipIf(torch_device != "cuda", reason="CUDA and CPU are required to switch devices")
+ def test_to_device(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe.to("cpu")
+ model_devices = [component.device.type for component in components.values() if hasattr(component, "device")]
+ self.assertTrue(all(device == "cpu" for device in model_devices))
+
+ output_cpu = pipe(**self.get_dummy_inputs("cpu"))[0]
+ self.assertTrue(np.isnan(output_cpu).sum() == 0)
+
+ pipe.to("cuda")
+ model_devices = [component.device.type for component in components.values() if hasattr(component, "device")]
+ self.assertTrue(all(device == "cuda" for device in model_devices))
+
+ output_cuda = pipe(**self.get_dummy_inputs("cuda"))[0]
+ self.assertTrue(np.isnan(to_np(output_cuda)).sum() == 0)
+
+ def test_to_dtype(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ model_dtypes = [component.dtype for component in components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float32 for dtype in model_dtypes))
+
+ pipe.to(dtype=torch.float16)
+ model_dtypes = [component.dtype for component in components.values() if hasattr(component, "dtype")]
+ self.assertTrue(all(dtype == torch.float16 for dtype in model_dtypes))
+
+ def test_attention_slicing_forward_pass(self, expected_max_diff=1e-3):
+ self._test_attention_slicing_forward_pass(expected_max_diff=expected_max_diff)
+
+ def _test_attention_slicing_forward_pass(
+ self, test_max_difference=True, test_mean_pixel_difference=True, expected_max_diff=1e-3
+ ):
+ if not self.test_attention_slicing:
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ output_without_slicing = pipe(**inputs)[0]
+
+ pipe.enable_attention_slicing(slice_size=1)
+ inputs = self.get_dummy_inputs(generator_device)
+ output_with_slicing = pipe(**inputs)[0]
+
+ if test_max_difference:
+ max_diff = np.abs(to_np(output_with_slicing) - to_np(output_without_slicing)).max()
+ self.assertLess(max_diff, expected_max_diff, "Attention slicing should not affect the inference results")
+
+ if test_mean_pixel_difference:
+ assert_mean_pixel_difference(to_np(output_with_slicing[0]), to_np(output_without_slicing[0]))
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_accelerate_available() or is_accelerate_version("<", "0.14.0"),
+ reason="CPU offload is only available with CUDA and `accelerate v0.14.0` or higher",
+ )
+ def test_sequential_cpu_offload_forward_pass(self, expected_max_diff=1e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+ output_without_offload = pipe(**inputs)[0]
+
+ pipe.enable_sequential_cpu_offload()
+
+ inputs = self.get_dummy_inputs(generator_device)
+ output_with_offload = pipe(**inputs)[0]
+
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, expected_max_diff, "CPU offloading should not affect the inference results")
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_accelerate_available() or is_accelerate_version("<", "0.17.0"),
+ reason="CPU offload is only available with CUDA and `accelerate v0.17.0` or higher",
+ )
+ def test_model_cpu_offload_forward_pass(self, expected_max_diff=2e-4):
+ generator_device = "cpu"
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(generator_device)
+ output_without_offload = pipe(**inputs)[0]
+
+ pipe.enable_model_cpu_offload()
+ inputs = self.get_dummy_inputs(generator_device)
+ output_with_offload = pipe(**inputs)[0]
+
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, expected_max_diff, "CPU offloading should not affect the inference results")
+ offloaded_modules = [
+ v
+ for k, v in pipe.components.items()
+ if isinstance(v, torch.nn.Module) and k not in pipe._exclude_from_cpu_offload
+ ]
+        self.assertTrue(
+            all(v.device.type == "cpu" for v in offloaded_modules),
+            f"Not offloaded: {[v for v in offloaded_modules if v.device.type != 'cpu']}",
+        )
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass()
+
+ def _test_xformers_attention_forwardGenerator_pass(
+ self, test_max_difference=True, test_mean_pixel_difference=True, expected_max_diff=1e-4
+ ):
+ if not self.test_xformers_attention:
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ output_without_offload = pipe(**inputs)[0]
+ output_without_offload = (
+ output_without_offload.cpu() if torch.is_tensor(output_without_offload) else output_without_offload
+ )
+
+ pipe.enable_xformers_memory_efficient_attention()
+ inputs = self.get_dummy_inputs(torch_device)
+ output_with_offload = pipe(**inputs)[0]
+ output_with_offload = (
+            output_with_offload.cpu() if torch.is_tensor(output_with_offload) else output_with_offload
+ )
+
+ if test_max_difference:
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, expected_max_diff, "XFormers attention should not affect the inference results")
+
+ if test_mean_pixel_difference:
+ assert_mean_pixel_difference(output_with_offload[0], output_without_offload[0])
+
+ def test_progress_bar(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ with io.StringIO() as stderr, contextlib.redirect_stderr(stderr):
+ _ = pipe(**inputs)
+ stderr = stderr.getvalue()
+ # we can't calculate the number of progress steps beforehand e.g. for strength-dependent img2img,
+ # so we just match "5" in "#####| 1/5 [00:01<00:00]"
+ max_steps = re.search("/(.*?) ", stderr).group(1)
+ self.assertTrue(max_steps is not None and len(max_steps) > 0)
+ self.assertTrue(
+ f"{max_steps}/{max_steps}" in stderr, "Progress bar should be enabled and stopped at the max step"
+ )
+
+ pipe.set_progress_bar_config(disable=True)
+ with io.StringIO() as stderr, contextlib.redirect_stderr(stderr):
+ _ = pipe(**inputs)
+ self.assertTrue(stderr.getvalue() == "", "Progress bar should be disabled")
+
+ def test_num_images_per_prompt(self):
+ sig = inspect.signature(self.pipeline_class.__call__)
+
+ if "num_images_per_prompt" not in sig.parameters:
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ batch_sizes = [1, 2]
+ num_images_per_prompts = [1, 2]
+
+ for batch_size in batch_sizes:
+ for num_images_per_prompt in num_images_per_prompts:
+ inputs = self.get_dummy_inputs(torch_device)
+
+ for key in inputs.keys():
+ if key in self.batch_params:
+ inputs[key] = batch_size * [inputs[key]]
+
+ images = pipe(**inputs, num_images_per_prompt=num_images_per_prompt)[0]
+
+ assert images.shape[0] == batch_size * num_images_per_prompt
+
+ def test_cfg(self):
+ sig = inspect.signature(self.pipeline_class.__call__)
+
+ if "guidance_scale" not in sig.parameters:
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ inputs["guidance_scale"] = 1.0
+ out_no_cfg = pipe(**inputs)[0]
+
+ inputs["guidance_scale"] = 7.5
+ out_cfg = pipe(**inputs)[0]
+
+ assert out_cfg.shape == out_no_cfg.shape
+
+ def test_callback_inputs(self):
+ sig = inspect.signature(self.pipeline_class.__call__)
+ has_callback_tensor_inputs = "callback_on_step_end_tensor_inputs" in sig.parameters
+ has_callback_step_end = "callback_on_step_end" in sig.parameters
+
+ if not (has_callback_tensor_inputs and has_callback_step_end):
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ self.assertTrue(
+ hasattr(pipe, "_callback_tensor_inputs"),
+ f" {self.pipeline_class} should have `_callback_tensor_inputs` that defines a list of tensor variables its callback function can use as inputs",
+ )
+
+ def callback_inputs_subset(pipe, i, t, callback_kwargs):
+            # iterate over callback args
+ for tensor_name, tensor_value in callback_kwargs.items():
+ # check that we're only passing in allowed tensor inputs
+ assert tensor_name in pipe._callback_tensor_inputs
+
+ return callback_kwargs
+
+ def callback_inputs_all(pipe, i, t, callback_kwargs):
+ for tensor_name in pipe._callback_tensor_inputs:
+ assert tensor_name in callback_kwargs
+
+            # iterate over callback args
+ for tensor_name, tensor_value in callback_kwargs.items():
+ # check that we're only passing in allowed tensor inputs
+ assert tensor_name in pipe._callback_tensor_inputs
+
+ return callback_kwargs
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ # Test passing in a subset
+ inputs["callback_on_step_end"] = callback_inputs_subset
+ inputs["callback_on_step_end_tensor_inputs"] = ["latents"]
+ inputs["output_type"] = "latent"
+ output = pipe(**inputs)[0]
+
+        # Test passing in everything
+ inputs["callback_on_step_end"] = callback_inputs_all
+ inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
+ inputs["output_type"] = "latent"
+ output = pipe(**inputs)[0]
+
+ def callback_inputs_change_tensor(pipe, i, t, callback_kwargs):
+ is_last = i == (pipe.num_timesteps - 1)
+ if is_last:
+ callback_kwargs["latents"] = torch.zeros_like(callback_kwargs["latents"])
+ return callback_kwargs
+
+ inputs["callback_on_step_end"] = callback_inputs_change_tensor
+ inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
+ inputs["output_type"] = "latent"
+ output = pipe(**inputs)[0]
+ assert output.abs().sum() == 0
+
+ def test_callback_cfg(self):
+ sig = inspect.signature(self.pipeline_class.__call__)
+ has_callback_tensor_inputs = "callback_on_step_end_tensor_inputs" in sig.parameters
+ has_callback_step_end = "callback_on_step_end" in sig.parameters
+
+ if not (has_callback_tensor_inputs and has_callback_step_end):
+ return
+
+ if "guidance_scale" not in sig.parameters:
+ return
+
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ self.assertTrue(
+ hasattr(pipe, "_callback_tensor_inputs"),
+ f" {self.pipeline_class} should have `_callback_tensor_inputs` that defines a list of tensor variables its callback function can use as inputs",
+ )
+
+ def callback_increase_guidance(pipe, i, t, callback_kwargs):
+ pipe._guidance_scale += 1.0
+
+ return callback_kwargs
+
+ inputs = self.get_dummy_inputs(torch_device)
+
+ # use cfg guidance because some pipelines modify the shape of the latents
+ # outside of the denoising loop
+ inputs["guidance_scale"] = 2.0
+ inputs["callback_on_step_end"] = callback_increase_guidance
+ inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
+ _ = pipe(**inputs)[0]
+
+ # we increase the guidance scale by 1.0 at every step
+ # check that the guidance scale is increased by the number of scheduler timesteps
+ # accounts for models that modify the number of inference steps based on strength
+ assert pipe.guidance_scale == (inputs["guidance_scale"] + pipe.num_timesteps)
+
+ def test_StableDiffusionMixin_component(self):
+        """Any pipeline that inherits from `StableDiffusionMixin` should have `vae` and `unet` components."""
+ if not issubclass(self.pipeline_class, StableDiffusionMixin):
+ return
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ self.assertTrue(hasattr(pipe, "vae") and isinstance(pipe.vae, (AutoencoderKL, AutoencoderTiny)))
+ self.assertTrue(
+ hasattr(pipe, "unet")
+ and isinstance(pipe.unet, (UNet2DConditionModel, UNet3DConditionModel, I2VGenXLUNet, UNetMotionModel))
+ )
+
+
+@is_staging_test
+class PipelinePushToHubTester(unittest.TestCase):
+ identifier = uuid.uuid4()
+ repo_id = f"test-pipeline-{identifier}"
+ org_repo_id = f"valid_org/{repo_id}-org"
+
+ def get_pipeline_components(self):
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ dummy_vocab = {"<|startoftext|>": 0, "<|endoftext|>": 1, "!": 2}
+ vocab_path = os.path.join(tmpdir, "vocab.json")
+ with open(vocab_path, "w") as f:
+ json.dump(dummy_vocab, f)
+
+            merges = "Ġ t\nĠt h"
+ merges_path = os.path.join(tmpdir, "merges.txt")
+ with open(merges_path, "w") as f:
+ f.writelines(merges)
+ tokenizer = CLIPTokenizer(vocab_file=vocab_path, merges_file=merges_path)
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "safety_checker": None,
+ "feature_extractor": None,
+ }
+ return components
+
+ def test_push_to_hub(self):
+ components = self.get_pipeline_components()
+ pipeline = StableDiffusionPipeline(**components)
+ pipeline.push_to_hub(self.repo_id, token=TOKEN)
+
+ new_model = UNet2DConditionModel.from_pretrained(f"{USER}/{self.repo_id}", subfolder="unet")
+ unet = components["unet"]
+ for p1, p2 in zip(unet.parameters(), new_model.parameters()):
+ self.assertTrue(torch.equal(p1, p2))
+
+ # Reset repo
+ delete_repo(token=TOKEN, repo_id=self.repo_id)
+
+ # Push to hub via save_pretrained
+ with tempfile.TemporaryDirectory() as tmp_dir:
+ pipeline.save_pretrained(tmp_dir, repo_id=self.repo_id, push_to_hub=True, token=TOKEN)
+
+ new_model = UNet2DConditionModel.from_pretrained(f"{USER}/{self.repo_id}", subfolder="unet")
+ for p1, p2 in zip(unet.parameters(), new_model.parameters()):
+ self.assertTrue(torch.equal(p1, p2))
+
+ # Reset repo
+ delete_repo(self.repo_id, token=TOKEN)
+
+ def test_push_to_hub_in_organization(self):
+ components = self.get_pipeline_components()
+ pipeline = StableDiffusionPipeline(**components)
+ pipeline.push_to_hub(self.org_repo_id, token=TOKEN)
+
+ new_model = UNet2DConditionModel.from_pretrained(self.org_repo_id, subfolder="unet")
+ unet = components["unet"]
+ for p1, p2 in zip(unet.parameters(), new_model.parameters()):
+ self.assertTrue(torch.equal(p1, p2))
+
+ # Reset repo
+ delete_repo(token=TOKEN, repo_id=self.org_repo_id)
+
+ # Push to hub via save_pretrained
+ with tempfile.TemporaryDirectory() as tmp_dir:
+ pipeline.save_pretrained(tmp_dir, push_to_hub=True, token=TOKEN, repo_id=self.org_repo_id)
+
+ new_model = UNet2DConditionModel.from_pretrained(self.org_repo_id, subfolder="unet")
+ for p1, p2 in zip(unet.parameters(), new_model.parameters()):
+ self.assertTrue(torch.equal(p1, p2))
+
+ # Reset repo
+ delete_repo(self.org_repo_id, token=TOKEN)
+
+ @unittest.skipIf(
+ not is_jinja_available(),
+ reason="Model card tests cannot be performed without Jinja installed.",
+ )
+ def test_push_to_hub_library_name(self):
+ components = self.get_pipeline_components()
+ pipeline = StableDiffusionPipeline(**components)
+ pipeline.push_to_hub(self.repo_id, token=TOKEN)
+
+ model_card = ModelCard.load(f"{USER}/{self.repo_id}", token=TOKEN).data
+ assert model_card.library_name == "diffusers"
+
+ # Reset repo
+ delete_repo(self.repo_id, token=TOKEN)
+
+
+# For SDXL and its derivative pipelines (such as ControlNet), we have the text encoders
+# and the tokenizers as optional components. So, we need to override the `test_save_load_optional_components()`
+# test for all such pipelines. This requires us to use a custom `encode_prompt()` function.
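+# A typical child test class (sketch; the names below are placeholders) simply forwards to the helper
+# defined in this mixin:
+#
+#     class MySDXLPipelineFastTests(SDXLOptionalComponentsTesterMixin, PipelineTesterMixin, unittest.TestCase):
+#         def test_save_load_optional_components(self):
+#             self._test_save_load_optional_components()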
+class SDXLOptionalComponentsTesterMixin:
+ def encode_prompt(
+ self, tokenizers, text_encoders, prompt: str, num_images_per_prompt: int = 1, negative_prompt: str = None
+ ):
+ device = text_encoders[0].device
+
+ if isinstance(prompt, str):
+ prompt = [prompt]
+ batch_size = len(prompt)
+
+ prompt_embeds_list = []
+ for tokenizer, text_encoder in zip(tokenizers, text_encoders):
+ text_inputs = tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+
+ text_input_ids = text_inputs.input_ids
+
+ prompt_embeds = text_encoder(text_input_ids.to(device), output_hidden_states=True)
+ pooled_prompt_embeds = prompt_embeds[0]
+ prompt_embeds = prompt_embeds.hidden_states[-2]
+ prompt_embeds_list.append(prompt_embeds)
+
+ prompt_embeds = torch.concat(prompt_embeds_list, dim=-1)
+
+ if negative_prompt is None:
+ negative_prompt_embeds = torch.zeros_like(prompt_embeds)
+ negative_pooled_prompt_embeds = torch.zeros_like(pooled_prompt_embeds)
+ else:
+ negative_prompt = batch_size * [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
+
+ negative_prompt_embeds_list = []
+ for tokenizer, text_encoder in zip(tokenizers, text_encoders):
+ uncond_input = tokenizer(
+ negative_prompt,
+ padding="max_length",
+ max_length=tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+
+ negative_prompt_embeds = text_encoder(uncond_input.input_ids.to(device), output_hidden_states=True)
+ negative_pooled_prompt_embeds = negative_prompt_embeds[0]
+ negative_prompt_embeds = negative_prompt_embeds.hidden_states[-2]
+ negative_prompt_embeds_list.append(negative_prompt_embeds)
+
+ negative_prompt_embeds = torch.concat(negative_prompt_embeds_list, dim=-1)
+
+ bs_embed, seq_len, _ = prompt_embeds.shape
+
+ # duplicate text embeddings for each generation per prompt, using mps friendly method
+ prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
+ prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)
+
+ # for classifier-free guidance
+ # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
+ seq_len = negative_prompt_embeds.shape[1]
+
+ negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1)
+ negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
+
+ pooled_prompt_embeds = pooled_prompt_embeds.repeat(1, num_images_per_prompt).view(
+ bs_embed * num_images_per_prompt, -1
+ )
+
+ # for classifier-free guidance
+ negative_pooled_prompt_embeds = negative_pooled_prompt_embeds.repeat(1, num_images_per_prompt).view(
+ bs_embed * num_images_per_prompt, -1
+ )
+
+ return prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds
+
+ def _test_save_load_optional_components(self, expected_max_difference=1e-4):
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ for optional_component in pipe._optional_components:
+ setattr(pipe, optional_component, None)
+
+ for component in pipe.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ generator_device = "cpu"
+ inputs = self.get_dummy_inputs(generator_device)
+
+ tokenizer = components.pop("tokenizer")
+ tokenizer_2 = components.pop("tokenizer_2")
+ text_encoder = components.pop("text_encoder")
+ text_encoder_2 = components.pop("text_encoder_2")
+
+ tokenizers = [tokenizer, tokenizer_2] if tokenizer is not None else [tokenizer_2]
+ text_encoders = [text_encoder, text_encoder_2] if text_encoder is not None else [text_encoder_2]
+ prompt = inputs.pop("prompt")
+ (
+ prompt_embeds,
+ negative_prompt_embeds,
+ pooled_prompt_embeds,
+ negative_pooled_prompt_embeds,
+ ) = self.encode_prompt(tokenizers, text_encoders, prompt)
+ inputs["prompt_embeds"] = prompt_embeds
+ inputs["negative_prompt_embeds"] = negative_prompt_embeds
+ inputs["pooled_prompt_embeds"] = pooled_prompt_embeds
+ inputs["negative_pooled_prompt_embeds"] = negative_pooled_prompt_embeds
+
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir)
+ for component in pipe_loaded.components.values():
+ if hasattr(component, "set_default_attn_processor"):
+ component.set_default_attn_processor()
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for optional_component in pipe._optional_components:
+ self.assertTrue(
+ getattr(pipe_loaded, optional_component) is None,
+ f"`{optional_component}` did not stay set to None after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(generator_device)
+ _ = inputs.pop("prompt")
+ inputs["prompt_embeds"] = prompt_embeds
+ inputs["negative_prompt_embeds"] = negative_prompt_embeds
+ inputs["pooled_prompt_embeds"] = pooled_prompt_embeds
+ inputs["negative_pooled_prompt_embeds"] = negative_pooled_prompt_embeds
+
+ output_loaded = pipe_loaded(**inputs)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(max_diff, expected_max_difference)
+
+
+# Some models (e.g. unCLIP) are extremely likely to significantly deviate depending on which hardware is used.
+# This helper function is used to check that the image does not deviate from a reference image by more than
+# 10 intensity levels per pixel on average.
+def assert_mean_pixel_difference(image, expected_image, expected_max_diff=10):
+ image = np.asarray(DiffusionPipeline.numpy_to_pil(image)[0], dtype=np.float32)
+ expected_image = np.asarray(DiffusionPipeline.numpy_to_pil(expected_image)[0], dtype=np.float32)
+ avg_diff = np.abs(image - expected_image).mean()
+    assert avg_diff < expected_max_diff, f"Error: image deviates by {avg_diff} on average from the reference image"
diff --git a/tests/pipelines/test_pipelines_flax.py b/tests/pipelines/test_pipelines_flax.py
new file mode 100644
index 0000000..3e7a2b3
--- /dev/null
+++ b/tests/pipelines/test_pipelines_flax.py
@@ -0,0 +1,260 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import tempfile
+import unittest
+
+import numpy as np
+
+from diffusers.utils import is_flax_available
+from diffusers.utils.testing_utils import require_flax, slow
+
+
+if is_flax_available():
+ import jax
+ import jax.numpy as jnp
+ from flax.jax_utils import replicate
+ from flax.training.common_utils import shard
+
+ from diffusers import FlaxDDIMScheduler, FlaxDiffusionPipeline, FlaxStableDiffusionPipeline
+
+
+@require_flax
+class DownloadTests(unittest.TestCase):
+ def test_download_only_pytorch(self):
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ # pipeline has Flax weights
+ _ = FlaxDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", safety_checker=None, cache_dir=tmpdirname
+ )
+
+ all_root_files = [t[-1] for t in os.walk(os.path.join(tmpdirname, os.listdir(tmpdirname)[0], "snapshots"))]
+ files = [item for sublist in all_root_files for item in sublist]
+
+ # None of the downloaded files should be a PyTorch file even if we have some here:
+ # https://huggingface.co/hf-internal-testing/tiny-stable-diffusion-pipe/blob/main/unet/diffusion_pytorch_model.bin
+ assert not any(f.endswith(".bin") for f in files)
+
+
+@slow
+@require_flax
+class FlaxPipelineTests(unittest.TestCase):
+ def test_dummy_all_tpus(self):
+ pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", safety_checker=None
+ )
+
+ prompt = (
+ "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of"
+ " field, close up, split lighting, cinematic"
+ )
+
+ prng_seed = jax.random.PRNGKey(0)
+ num_inference_steps = 4
+
+ num_samples = jax.device_count()
+ prompt = num_samples * [prompt]
+ prompt_ids = pipeline.prepare_inputs(prompt)
+
+ # shard inputs and rng
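+ # `replicate` copies the pipeline params onto every local device and `shard` splits the leading
+ # (batch) dimension across devices, so the jitted call below runs one prompt per device.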
+ params = replicate(params)
+ prng_seed = jax.random.split(prng_seed, num_samples)
+ prompt_ids = shard(prompt_ids)
+
+ images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
+
+ assert images.shape == (num_samples, 1, 64, 64, 3)
+ if jax.device_count() == 8:
+ assert np.abs(np.abs(images[0, 0, :2, :2, -2:], dtype=np.float32).sum() - 4.1514745) < 1e-3
+ assert np.abs(np.abs(images, dtype=np.float32).sum() - 49947.875) < 5e-1
+
+ images_pil = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
+ assert len(images_pil) == num_samples
+
+ def test_stable_diffusion_v1_4(self):
+ pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", revision="flax", safety_checker=None
+ )
+
+ prompt = (
+ "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of"
+ " field, close up, split lighting, cinematic"
+ )
+
+ prng_seed = jax.random.PRNGKey(0)
+ num_inference_steps = 50
+
+ num_samples = jax.device_count()
+ prompt = num_samples * [prompt]
+ prompt_ids = pipeline.prepare_inputs(prompt)
+
+ # shard inputs and rng
+ params = replicate(params)
+ prng_seed = jax.random.split(prng_seed, num_samples)
+ prompt_ids = shard(prompt_ids)
+
+ images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
+
+ assert images.shape == (num_samples, 1, 512, 512, 3)
+ if jax.device_count() == 8:
+ assert np.abs((np.abs(images[0, 0, :2, :2, -2:], dtype=np.float32).sum() - 0.05652401)) < 1e-2
+ assert np.abs((np.abs(images, dtype=np.float32).sum() - 2383808.2)) < 5e-1
+
+ def test_stable_diffusion_v1_4_bfloat_16(self):
+ pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", revision="bf16", dtype=jnp.bfloat16, safety_checker=None
+ )
+
+ prompt = (
+ "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of"
+ " field, close up, split lighting, cinematic"
+ )
+
+ prng_seed = jax.random.PRNGKey(0)
+ num_inference_steps = 50
+
+ num_samples = jax.device_count()
+ prompt = num_samples * [prompt]
+ prompt_ids = pipeline.prepare_inputs(prompt)
+
+ # shard inputs and rng
+ params = replicate(params)
+ prng_seed = jax.random.split(prng_seed, num_samples)
+ prompt_ids = shard(prompt_ids)
+
+ images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
+
+ assert images.shape == (num_samples, 1, 512, 512, 3)
+ if jax.device_count() == 8:
+ assert np.abs((np.abs(images[0, 0, :2, :2, -2:], dtype=np.float32).sum() - 0.04003906)) < 5e-2
+ assert np.abs((np.abs(images, dtype=np.float32).sum() - 2373516.75)) < 5e-1
+
+ def test_stable_diffusion_v1_4_bfloat_16_with_safety(self):
+ pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4", revision="bf16", dtype=jnp.bfloat16
+ )
+
+ prompt = (
+ "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of"
+ " field, close up, split lighting, cinematic"
+ )
+
+ prng_seed = jax.random.PRNGKey(0)
+ num_inference_steps = 50
+
+ num_samples = jax.device_count()
+ prompt = num_samples * [prompt]
+ prompt_ids = pipeline.prepare_inputs(prompt)
+
+ # shard inputs and rng
+ params = replicate(params)
+ prng_seed = jax.random.split(prng_seed, num_samples)
+ prompt_ids = shard(prompt_ids)
+
+ images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
+
+ assert images.shape == (num_samples, 1, 512, 512, 3)
+ if jax.device_count() == 8:
+ assert np.abs((np.abs(images[0, 0, :2, :2, -2:], dtype=np.float32).sum() - 0.04003906)) < 5e-2
+ assert np.abs((np.abs(images, dtype=np.float32).sum() - 2373516.75)) < 5e-1
+
+ def test_stable_diffusion_v1_4_bfloat_16_ddim(self):
+ scheduler = FlaxDDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ set_alpha_to_one=False,
+ steps_offset=1,
+ )
+
+ pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ revision="bf16",
+ dtype=jnp.bfloat16,
+ scheduler=scheduler,
+ safety_checker=None,
+ )
+ scheduler_state = scheduler.create_state()
+
+ params["scheduler"] = scheduler_state
+
+ prompt = (
+ "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of"
+ " field, close up, split lighting, cinematic"
+ )
+
+ prng_seed = jax.random.PRNGKey(0)
+ num_inference_steps = 50
+
+ num_samples = jax.device_count()
+ prompt = num_samples * [prompt]
+ prompt_ids = pipeline.prepare_inputs(prompt)
+
+ # shard inputs and rng
+ params = replicate(params)
+ prng_seed = jax.random.split(prng_seed, num_samples)
+ prompt_ids = shard(prompt_ids)
+
+ images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
+
+ assert images.shape == (num_samples, 1, 512, 512, 3)
+ if jax.device_count() == 8:
+ assert np.abs((np.abs(images[0, 0, :2, :2, -2:], dtype=np.float32).sum() - 0.045043945)) < 5e-2
+ assert np.abs((np.abs(images, dtype=np.float32).sum() - 2347693.5)) < 5e-1
+
+ def test_jax_memory_efficient_attention(self):
+ prompt = (
+ "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of"
+ " field, close up, split lighting, cinematic"
+ )
+
+ num_samples = jax.device_count()
+ prompt = num_samples * [prompt]
+ prng_seed = jax.random.split(jax.random.PRNGKey(0), num_samples)
+
+ pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ revision="bf16",
+ dtype=jnp.bfloat16,
+ safety_checker=None,
+ )
+
+ params = replicate(params)
+ prompt_ids = pipeline.prepare_inputs(prompt)
+ prompt_ids = shard(prompt_ids)
+ images = pipeline(prompt_ids, params, prng_seed, jit=True).images
+ assert images.shape == (num_samples, 1, 512, 512, 3)
+ slice = images[2, 0, 256, 10:17, 1]
+
+ # With memory efficient attention
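+ # `use_memory_efficient_attention=True` switches the attention layers to the chunked,
+ # memory-efficient implementation; the outputs should stay numerically close to the default path.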
+ pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
+ "CompVis/stable-diffusion-v1-4",
+ revision="bf16",
+ dtype=jnp.bfloat16,
+ safety_checker=None,
+ use_memory_efficient_attention=True,
+ )
+
+ params = replicate(params)
+ prompt_ids = pipeline.prepare_inputs(prompt)
+ prompt_ids = shard(prompt_ids)
+ images_eff = pipeline(prompt_ids, params, prng_seed, jit=True).images
+ assert images_eff.shape == (num_samples, 1, 512, 512, 3)
+ slice_eff = images_eff[2, 0, 256, 10:17, 1]
+
+ # The results are visually very similar. However, the max diff is `1` and the `sum` over the
+ # 8 images is exactly `256`, which is suspicious, so only a representative slice is compared for now.
+ assert abs(slice_eff - slice).max() < 1e-2
diff --git a/tests/pipelines/test_pipelines_onnx_common.py b/tests/pipelines/test_pipelines_onnx_common.py
new file mode 100644
index 0000000..575ecd0
--- /dev/null
+++ b/tests/pipelines/test_pipelines_onnx_common.py
@@ -0,0 +1,12 @@
+from diffusers.utils.testing_utils import require_onnxruntime
+
+
+@require_onnxruntime
+class OnnxPipelineTesterMixin:
+ """
+ This mixin is designed to be used with unittest.TestCase classes.
+ It provides a set of common tests for each ONNXRuntime pipeline, e.g. saving and loading the pipeline,
+ equivalence of dict and tuple outputs, etc.
+ """
+
+ pass
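+
+
+# Illustrative usage (the class and pipeline names below are hypothetical):
+#
+#   class OnnxStableDiffusionPipelineFastTests(OnnxPipelineTesterMixin, unittest.TestCase):
+#       pipeline_class = OnnxStableDiffusionPipeline
+#       ...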
diff --git a/tests/pipelines/text_to_video_synthesis/__init__.py b/tests/pipelines/text_to_video_synthesis/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/text_to_video_synthesis/test_text_to_video.py b/tests/pipelines/text_to_video_synthesis/test_text_to_video.py
new file mode 100644
index 0000000..9dc4801
--- /dev/null
+++ b/tests/pipelines/text_to_video_synthesis/test_text_to_video.py
@@ -0,0 +1,215 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ TextToVideoSDPipeline,
+ UNet3DConditionModel,
+)
+from diffusers.utils import is_xformers_available
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_numpy,
+ numpy_cosine_similarity_distance,
+ require_torch_gpu,
+ skip_mps,
+ slow,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin, SDFunctionTesterMixin
+
+
+enable_full_determinism()
+
+
+@skip_mps
+class TextToVideoSDPipelineFastTests(PipelineTesterMixin, SDFunctionTesterMixin, unittest.TestCase):
+ pipeline_class = TextToVideoSDPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ # `output_type` is deliberately not part of the required optional params.
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "generator",
+ "latents",
+ "return_dict",
+ "callback",
+ "callback_steps",
+ ]
+ )
+
+ def get_dummy_components(self):
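+ # All components below are deliberately tiny (few channels and layers, 32x32 sample size) so the
+ # fast tests can run end-to-end on CPU in seconds.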
+ torch.manual_seed(0)
+ unet = UNet3DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=1,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock3D", "DownBlock3D"),
+ up_block_types=("UpBlock3D", "CrossAttnUpBlock3D"),
+ cross_attention_dim=4,
+ attention_head_dim=4,
+ norm_num_groups=2,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=(8,),
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=32,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=4,
+ intermediate_size=16,
+ layer_norm_eps=1e-05,
+ num_attention_heads=2,
+ num_hidden_layers=2,
+ pad_token_id=1,
+ vocab_size=1000,
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "pt",
+ }
+ return inputs
+
+ def test_text_to_video_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = TextToVideoSDPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["output_type"] = "np"
+ frames = sd_pipe(**inputs).frames
+
+ image_slice = frames[0][0][-3:, -3:, -1]
+
+ assert frames[0][0].shape == (32, 32, 3)
+ expected_slice = np.array([0.7537, 0.1752, 0.6157, 0.5508, 0.4240, 0.4110, 0.4838, 0.5648, 0.5094])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ @unittest.skipIf(torch_device != "cuda", reason="Feature isn't heavily used. Test in CUDA environment only.")
+ def test_attention_slicing_forward_pass(self):
+ self._test_attention_slicing_forward_pass(test_mean_pixel_difference=False, expected_max_diff=3e-3)
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(test_mean_pixel_difference=False, expected_max_diff=1e-2)
+
+ # (todo): sayakpaul
+ @unittest.skip(reason="Batching needs to be properly figured out first for this pipeline.")
+ def test_inference_batch_consistent(self):
+ pass
+
+ # (todo): sayakpaul
+ @unittest.skip(reason="Batching needs to be properly figured out first for this pipeline.")
+ def test_inference_batch_single_identical(self):
+ pass
+
+ @unittest.skip(reason="`num_images_per_prompt` argument is not supported for this pipeline.")
+ def test_num_images_per_prompt(self):
+ pass
+
+ def test_progress_bar(self):
+ return super().test_progress_bar()
+
+
+@slow
+@skip_mps
+@require_torch_gpu
+class TextToVideoSDPipelineSlowTests(unittest.TestCase):
+ def test_two_step_model(self):
+ expected_video = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/text-to-video/video_2step.npy"
+ )
+
+ pipe = TextToVideoSDPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b")
+ pipe = pipe.to(torch_device)
+
+ prompt = "Spiderman is surfing"
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ video_frames = pipe(prompt, generator=generator, num_inference_steps=2, output_type="np").frames
+ assert numpy_cosine_similarity_distance(expected_video.flatten(), video_frames.flatten()) < 1e-4
+
+ def test_two_step_model_with_freeu(self):
+
+ pipe = TextToVideoSDPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b")
+ pipe = pipe.to(torch_device)
+
+ prompt = "Spiderman is surfing"
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)
+ video_frames = pipe(prompt, generator=generator, num_inference_steps=2, output_type="np").frames
+ video = video_frames[0, 0, -3:, -3:, -1].flatten()
+
+ expected_video = [0.3643, 0.3455, 0.3831, 0.3923, 0.2978, 0.3247, 0.3278, 0.3201, 0.3475]
+
+ assert np.abs(expected_video - video).mean() < 5e-2
diff --git a/tests/pipelines/text_to_video_synthesis/test_text_to_video_zero.py b/tests/pipelines/text_to_video_synthesis/test_text_to_video_zero.py
new file mode 100644
index 0000000..b93d9ee
--- /dev/null
+++ b/tests/pipelines/text_to_video_synthesis/test_text_to_video_zero.py
@@ -0,0 +1,42 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import torch
+
+from diffusers import DDIMScheduler, TextToVideoZeroPipeline
+from diffusers.utils.testing_utils import load_pt, nightly, require_torch_gpu
+
+from ..test_pipelines_common import assert_mean_pixel_difference
+
+
+@nightly
+@require_torch_gpu
+class TextToVideoZeroPipelineSlowTests(unittest.TestCase):
+ def test_full_model(self):
+ model_id = "runwayml/stable-diffusion-v1-5"
+ pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ generator = torch.Generator(device="cuda").manual_seed(0)
+
+ prompt = "A bear is playing a guitar on Times Square"
+ result = pipe(prompt=prompt, generator=generator).images
+
+ expected_result = load_pt(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/text-to-video/A bear is playing a guitar on Times Square.pt"
+ )
+
+ assert_mean_pixel_difference(result, expected_result)
diff --git a/tests/pipelines/text_to_video_synthesis/test_text_to_video_zero_sdxl.py b/tests/pipelines/text_to_video_synthesis/test_text_to_video_zero_sdxl.py
new file mode 100644
index 0000000..1d1d945
--- /dev/null
+++ b/tests/pipelines/text_to_video_synthesis/test_text_to_video_zero_sdxl.py
@@ -0,0 +1,405 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import contextlib
+import inspect
+import io
+import re
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import AutoencoderKL, DDIMScheduler, TextToVideoZeroSDXLPipeline, UNet2DConditionModel
+from diffusers.utils.import_utils import is_accelerate_available, is_accelerate_version
+from diffusers.utils.testing_utils import enable_full_determinism, nightly, require_torch_gpu, torch_device
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+def to_np(tensor):
+ if isinstance(tensor, torch.Tensor):
+ tensor = tensor.detach().cpu().numpy()
+
+ return tensor
+
+
+class TextToVideoZeroSDXLPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = TextToVideoZeroSDXLPipeline
+ params = TEXT_TO_IMAGE_PARAMS
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+ generator_device = "cpu"
+
+ def get_dummy_components(self, seed=0):
+ torch.manual_seed(seed)
+ unet = UNet2DConditionModel(
+ block_out_channels=(2, 4),
+ layers_per_block=2,
+ sample_size=2,
+ norm_num_groups=2,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ # SDXL-specific config below
+ attention_head_dim=(2, 4),
+ use_linear_projection=True,
+ addition_embed_type="text_time",
+ addition_time_embed_dim=8,
+ transformer_layers_per_block=(1, 2),
+ projection_class_embeddings_input_dim=80, # 6 * 8 + 32
+ cross_attention_dim=64,
+ )
+ scheduler = DDIMScheduler(
+ num_train_timesteps=1000,
+ beta_start=0.0001,
+ beta_end=0.02,
+ beta_schedule="linear",
+ trained_betas=None,
+ clip_sample=True,
+ set_alpha_to_one=True,
+ steps_offset=0,
+ prediction_type="epsilon",
+ thresholding=False,
+ dynamic_thresholding_ratio=0.995,
+ clip_sample_range=1.0,
+ sample_max_value=1.0,
+ timestep_spacing="leading",
+ rescale_betas_zero_snr=False,
+ )
+ torch.manual_seed(seed)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=128,
+ )
+ torch.manual_seed(seed)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ # SD2-specific config below
+ hidden_act="gelu",
+ projection_dim=32,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_encoder_2": text_encoder_2,
+ "tokenizer_2": tokenizer_2,
+ "image_encoder": None,
+ "feature_extractor": None,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A panda dancing in Antarctica",
+ "generator": generator,
+ "num_inference_steps": 5,
+ "t0": 1,
+ "t1": 3,
+ "height": 64,
+ "width": 64,
+ "video_length": 3,
+ "output_type": "np",
+ }
+ return inputs
+
+ def get_generator(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ return generator
+
+ def test_text_to_video_zero_sdxl(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+
+ inputs = self.get_dummy_inputs(self.generator_device)
+ result = pipe(**inputs).images
+
+ first_frame_slice = result[0, -3:, -3:, -1]
+ last_frame_slice = result[-1, -3:, -3:, 0]
+
+ expected_slice1 = np.array([0.48, 0.58, 0.53, 0.59, 0.50, 0.44, 0.60, 0.65, 0.52])
+ expected_slice2 = np.array([0.66, 0.49, 0.40, 0.70, 0.47, 0.51, 0.73, 0.65, 0.52])
+
+ assert np.abs(first_frame_slice.flatten() - expected_slice1).max() < 1e-2
+ assert np.abs(last_frame_slice.flatten() - expected_slice2).max() < 1e-2
+
+ @unittest.skip(
+ reason="Cannot call `set_default_attn_processor` as this pipeline uses a specific attention processor."
+ )
+ def test_attention_slicing_forward_pass(self):
+ pass
+
+ def test_cfg(self):
+ sig = inspect.signature(self.pipeline_class.__call__)
+ if "guidance_scale" not in sig.parameters:
+ return
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(self.generator_device)
+
+ inputs["guidance_scale"] = 1.0
+ out_no_cfg = pipe(**inputs)[0]
+
+ inputs["guidance_scale"] = 7.5
+ out_cfg = pipe(**inputs)[0]
+
+ assert out_cfg.shape == out_no_cfg.shape
+
+ def test_dict_tuple_outputs_equivalent(self, expected_max_difference=1e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(self.generator_device))[0]
+ output_tuple = pipe(**self.get_dummy_inputs(self.generator_device), return_dict=False)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_tuple)).max()
+ self.assertLess(max_diff, expected_max_difference)
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_float16_inference(self, expected_max_diff=5e-2):
+ components = self.get_dummy_components()
+ for name, module in components.items():
+ if hasattr(module, "half"):
+ components[name] = module.to(torch_device).half()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ components = self.get_dummy_components()
+ pipe_fp16 = self.pipeline_class(**components)
+ pipe_fp16.to(torch_device, torch.float16)
+ pipe_fp16.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(self.generator_device)
+ # Reset generator in case it is used inside dummy inputs
+ if "generator" in inputs:
+ inputs["generator"] = self.get_generator(self.generator_device)
+
+ output = pipe(**inputs)[0]
+
+ fp16_inputs = self.get_dummy_inputs(self.generator_device)
+ # Reset generator in case it is used inside dummy inputs
+ if "generator" in fp16_inputs:
+ fp16_inputs["generator"] = self.get_generator(self.generator_device)
+
+ output_fp16 = pipe_fp16(**fp16_inputs)[0]
+
+ max_diff = np.abs(to_np(output) - to_np(output_fp16)).max()
+ self.assertLess(max_diff, expected_max_diff, "The outputs of the fp16 and fp32 pipelines are too different.")
+
+ @unittest.skip(reason="Batching needs to be properly figured out first for this pipeline.")
+ def test_inference_batch_consistent(self):
+ pass
+
+ @unittest.skip(
+ reason="Cannot call `set_default_attn_processor` as this pipeline uses a specific attention processor."
+ )
+ def test_inference_batch_single_identical(self):
+ pass
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_accelerate_available() or is_accelerate_version("<", "0.17.0"),
+ reason="CPU offload is only available with CUDA and `accelerate v0.17.0` or higher",
+ )
+ def test_model_cpu_offload_forward_pass(self, expected_max_diff=2e-4):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(self.generator_device)
+ output_without_offload = pipe(**inputs)[0]
+
+ pipe.enable_model_cpu_offload()
+ inputs = self.get_dummy_inputs(self.generator_device)
+ output_with_offload = pipe(**inputs)[0]
+
+ max_diff = np.abs(to_np(output_with_offload) - to_np(output_without_offload)).max()
+ self.assertLess(max_diff, expected_max_diff, "CPU offloading should not affect the inference results")
+
+ @unittest.skip(reason="`num_images_per_prompt` argument is not supported for this pipeline.")
+ def test_pipeline_call_signature(self):
+ pass
+
+ def test_progress_bar(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+
+ inputs = self.get_dummy_inputs(self.generator_device)
+ with io.StringIO() as stderr, contextlib.redirect_stderr(stderr):
+ _ = pipe(**inputs)
+ stderr = stderr.getvalue()
+ # we can't calculate the number of progress steps beforehand e.g. for strength-dependent img2img,
+ # so we just match "5" in "#####| 1/5 [00:01<00:00]"
+ max_steps = re.search("/(.*?) ", stderr).group(1)
+ self.assertTrue(max_steps is not None and len(max_steps) > 0)
+ self.assertTrue(
+ f"{max_steps}/{max_steps}" in stderr, "Progress bar should be enabled and stopped at the max step"
+ )
+
+ pipe.set_progress_bar_config(disable=True)
+ with io.StringIO() as stderr, contextlib.redirect_stderr(stderr):
+ _ = pipe(**inputs)
+ self.assertTrue(stderr.getvalue() == "", "Progress bar should be disabled")
+
+ @unittest.skipIf(torch_device != "cuda", reason="float16 requires CUDA")
+ def test_save_load_float16(self, expected_max_diff=1e-2):
+ components = self.get_dummy_components()
+ for name, module in components.items():
+ if hasattr(module, "half"):
+ components[name] = module.to(torch_device).half()
+
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(self.generator_device)
+ output = pipe(**inputs)[0]
+
+ with tempfile.TemporaryDirectory() as tmpdir:
+ pipe.save_pretrained(tmpdir)
+ pipe_loaded = self.pipeline_class.from_pretrained(tmpdir, torch_dtype=torch.float16)
+ pipe_loaded.to(torch_device)
+ pipe_loaded.set_progress_bar_config(disable=None)
+
+ for name, component in pipe_loaded.components.items():
+ if hasattr(component, "dtype"):
+ self.assertTrue(
+ component.dtype == torch.float16,
+ f"`{name}.dtype` switched from `float16` to {component.dtype} after loading.",
+ )
+
+ inputs = self.get_dummy_inputs(self.generator_device)
+ output_loaded = pipe_loaded(**inputs)[0]
+ max_diff = np.abs(to_np(output) - to_np(output_loaded)).max()
+ self.assertLess(
+ max_diff, expected_max_diff, "The output of the fp16 pipeline changed after saving and loading."
+ )
+
+ @unittest.skip(
+ reason="Cannot call `set_default_attn_processor` as this pipeline uses a specific attention processor."
+ )
+ def test_save_load_local(self):
+ pass
+
+ @unittest.skip(
+ reason="Cannot call `set_default_attn_processor` as this pipeline uses a specific attention processor."
+ )
+ def test_save_load_optional_components(self):
+ pass
+
+ @unittest.skip(
+ reason="Cannot call `set_default_attn_processor` as this pipeline uses a specific attention processor."
+ )
+ def test_sequential_cpu_offload_forward_pass(self):
+ pass
+
+ @unittest.skipIf(torch_device != "cuda", reason="CUDA and CPU are required to switch devices")
+ def test_to_device(self):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe.to("cpu")
+ model_devices = [component.device.type for component in components.values() if hasattr(component, "device")]
+ self.assertTrue(all(device == "cpu" for device in model_devices))
+
+ output_cpu = pipe(**self.get_dummy_inputs("cpu"))[0] # generator set to cpu
+ self.assertTrue(np.isnan(output_cpu).sum() == 0)
+
+ pipe.to("cuda")
+ model_devices = [component.device.type for component in components.values() if hasattr(component, "device")]
+ self.assertTrue(all(device == "cuda" for device in model_devices))
+
+ output_cuda = pipe(**self.get_dummy_inputs("cpu"))[0] # generator set to cpu
+ self.assertTrue(np.isnan(to_np(output_cuda)).sum() == 0)
+
+ @unittest.skip(
+ reason="Cannot call `set_default_attn_processor` as this pipeline uses a specific attention processor."
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ pass
+
+
+@nightly
+@require_torch_gpu
+class TextToVideoZeroSDXLPipelineSlowTests(unittest.TestCase):
+ def test_full_model(self):
+ model_id = "stabilityai/stable-diffusion-xl-base-1.0"
+ pipe = TextToVideoZeroSDXLPipeline.from_pretrained(
+ model_id, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.enable_vae_slicing()
+
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ generator = torch.Generator(device="cpu").manual_seed(0)
+
+ prompt = "A panda dancing in Antarctica"
+ result = pipe(prompt=prompt, generator=generator).images
+
+ first_frame_slice = result[0, -3:, -3:, -1]
+ last_frame_slice = result[-1, -3:, -3:, 0]
+
+ expected_slice1 = np.array([0.57, 0.57, 0.57, 0.57, 0.57, 0.56, 0.55, 0.56, 0.56])
+ expected_slice2 = np.array([0.54, 0.53, 0.53, 0.53, 0.53, 0.52, 0.53, 0.53, 0.53])
+
+ assert np.abs(first_frame_slice.flatten() - expected_slice1).max() < 1e-2
+ assert np.abs(last_frame_slice.flatten() - expected_slice2).max() < 1e-2
diff --git a/tests/pipelines/text_to_video_synthesis/test_video_to_video.py b/tests/pipelines/text_to_video_synthesis/test_video_to_video.py
new file mode 100644
index 0000000..7f28d12
--- /dev/null
+++ b/tests/pipelines/text_to_video_synthesis/test_video_to_video.py
@@ -0,0 +1,224 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+ AutoencoderKL,
+ DDIMScheduler,
+ UNet3DConditionModel,
+ VideoToVideoSDPipeline,
+)
+from diffusers.utils import is_xformers_available
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ is_flaky,
+ nightly,
+ numpy_cosine_similarity_distance,
+ skip_mps,
+ torch_device,
+)
+
+from ..pipeline_params import (
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+)
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+@skip_mps
+class VideoToVideoSDPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = VideoToVideoSDPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS.union({"video"}) - {"image", "width", "height"}
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS.union({"video"}) - {"image"}
+ test_attention_slicing = False
+
+ # `output_type` is deliberately not part of the required optional params.
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "generator",
+ "latents",
+ "return_dict",
+ "callback",
+ "callback_steps",
+ ]
+ )
+
+ def get_dummy_components(self):
+ torch.manual_seed(0)
+ unet = UNet3DConditionModel(
+ block_out_channels=(4, 8),
+ layers_per_block=1,
+ sample_size=32,
+ in_channels=4,
+ out_channels=4,
+ down_block_types=("CrossAttnDownBlock3D", "DownBlock3D"),
+ up_block_types=("UpBlock3D", "CrossAttnUpBlock3D"),
+ cross_attention_dim=32,
+ attention_head_dim=4,
+ norm_num_groups=2,
+ )
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=True,
+ set_alpha_to_one=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[
+ 8,
+ ],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=[
+ "DownEncoderBlock2D",
+ ],
+ up_block_types=["UpDecoderBlock2D"],
+ latent_channels=4,
+ sample_size=32,
+ norm_num_groups=2,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ hidden_act="gelu",
+ projection_dim=512,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ }
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ # 3 frames
+ video = floats_tensor((1, 3, 3, 32, 32), rng=random.Random(seed)).to(device)
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "A painting of a squirrel eating a burger",
+ "video": video,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "pt",
+ }
+ return inputs
+
+ def test_text_to_video_default_case(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ sd_pipe = VideoToVideoSDPipeline(**components)
+ sd_pipe = sd_pipe.to(device)
+ sd_pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(device)
+ inputs["output_type"] = "np"
+ frames = sd_pipe(**inputs).frames
+ image_slice = frames[0][0][-3:, -3:, -1]
+
+ assert frames[0][0].shape == (32, 32, 3)
+ expected_slice = np.array([0.6391, 0.5350, 0.5202, 0.5521, 0.5453, 0.5393, 0.6652, 0.5270, 0.5185])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
+ @is_flaky()
+ def test_save_load_optional_components(self):
+ super().test_save_load_optional_components(expected_max_difference=0.001)
+
+ @is_flaky()
+ def test_dict_tuple_outputs_equivalent(self):
+ super().test_dict_tuple_outputs_equivalent()
+
+ @is_flaky()
+ def test_save_load_local(self):
+ super().test_save_load_local()
+
+ @unittest.skipIf(
+ torch_device != "cuda" or not is_xformers_available(),
+ reason="XFormers attention is only available with CUDA and `xformers` installed",
+ )
+ def test_xformers_attention_forwardGenerator_pass(self):
+ self._test_xformers_attention_forwardGenerator_pass(test_mean_pixel_difference=False, expected_max_diff=5e-3)
+
+ # (todo): sayakpaul
+ @unittest.skip(reason="Batching needs to be properly figured out first for this pipeline.")
+ def test_inference_batch_consistent(self):
+ pass
+
+ # (todo): sayakpaul
+ @unittest.skip(reason="Batching needs to be properly figured out first for this pipeline.")
+ def test_inference_batch_single_identical(self):
+ pass
+
+ @unittest.skip(reason="`num_images_per_prompt` argument is not supported for this pipeline.")
+ def test_num_images_per_prompt(self):
+ pass
+
+ def test_progress_bar(self):
+ return super().test_progress_bar()
+
+
+@nightly
+@skip_mps
+class VideoToVideoSDPipelineSlowTests(unittest.TestCase):
+ def test_two_step_model(self):
+ pipe = VideoToVideoSDPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
+ pipe.enable_model_cpu_offload()
+
+ # 10 frames
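+ # video tensor layout: (batch, num_frames, channels, height, width)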
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ video = torch.randn((1, 10, 3, 320, 576), generator=generator)
+
+ prompt = "Spiderman is surfing"
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ video_frames = pipe(prompt, video=video, generator=generator, num_inference_steps=3, output_type="np").frames
+
+ expected_array = np.array(
+ [0.17114258, 0.13720703, 0.08886719, 0.14819336, 0.1730957, 0.24584961, 0.22021484, 0.35180664, 0.2607422]
+ )
+ output_array = video_frames[0, 0, :3, :3, 0].flatten()
+ assert numpy_cosine_similarity_distance(expected_array, output_array) < 1e-3
diff --git a/tests/pipelines/unclip/__init__.py b/tests/pipelines/unclip/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/unclip/test_unclip.py b/tests/pipelines/unclip/test_unclip.py
new file mode 100644
index 0000000..60c5c52
--- /dev/null
+++ b/tests/pipelines/unclip/test_unclip.py
@@ -0,0 +1,507 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import PriorTransformer, UnCLIPPipeline, UnCLIPScheduler, UNet2DConditionModel, UNet2DModel
+from diffusers.pipelines.unclip.text_proj import UnCLIPTextProjModel
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ skip_mps,
+ torch_device,
+)
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class UnCLIPPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = UnCLIPPipeline
+ params = TEXT_TO_IMAGE_PARAMS - {
+ "negative_prompt",
+ "height",
+ "width",
+ "negative_prompt_embeds",
+ "guidance_scale",
+ "prompt_embeds",
+ "cross_attention_kwargs",
+ }
+ batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+ required_optional_params = [
+ "generator",
+ "return_dict",
+ "prior_num_inference_steps",
+ "decoder_num_inference_steps",
+ "super_res_num_inference_steps",
+ ]
+ test_xformers_attention = False
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 100
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=self.text_embedder_hidden_size,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModelWithProjection(config)
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "num_attention_heads": 2,
+ "attention_head_dim": 12,
+ "embedding_dim": self.text_embedder_hidden_size,
+ "num_layers": 1,
+ }
+
+ model = PriorTransformer(**model_kwargs)
+ return model
+
+ @property
+ def dummy_text_proj(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "clip_embeddings_dim": self.text_embedder_hidden_size,
+ "time_embed_dim": self.time_embed_dim,
+ "cross_attention_dim": self.cross_attention_dim,
+ }
+
+ model = UnCLIPTextProjModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_decoder(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "sample_size": 32,
+ # RGB input channels
+ "in_channels": 3,
+ # out_channels is double in_channels because the model predicts both mean and variance
+ "out_channels": 6,
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": "identity",
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_super_res_kwargs(self):
+ return {
+ "sample_size": 64,
+ "layers_per_block": 1,
+ "down_block_types": ("ResnetDownsampleBlock2D", "ResnetDownsampleBlock2D"),
+ "up_block_types": ("ResnetUpsampleBlock2D", "ResnetUpsampleBlock2D"),
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "in_channels": 6,
+ "out_channels": 3,
+ }
+
+ @property
+ def dummy_super_res_first(self):
+ torch.manual_seed(0)
+
+ model = UNet2DModel(**self.dummy_super_res_kwargs)
+ return model
+
+ @property
+ def dummy_super_res_last(self):
+ # seeded differently to get a different unet from `self.dummy_super_res_first`
+ torch.manual_seed(1)
+
+ model = UNet2DModel(**self.dummy_super_res_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+ decoder = self.dummy_decoder
+ text_proj = self.dummy_text_proj
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ super_res_first = self.dummy_super_res_first
+ super_res_last = self.dummy_super_res_last
+
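+ # unCLIP chains three diffusion stages, each with its own scheduler: the prior (text embedding ->
+ # image embedding), the decoder (image embedding -> base image), and the super-resolution unets.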
+ prior_scheduler = UnCLIPScheduler(
+ variance_type="fixed_small_log",
+ prediction_type="sample",
+ num_train_timesteps=1000,
+ clip_sample_range=5.0,
+ )
+
+ decoder_scheduler = UnCLIPScheduler(
+ variance_type="learned_range",
+ prediction_type="epsilon",
+ num_train_timesteps=1000,
+ )
+
+ super_res_scheduler = UnCLIPScheduler(
+ variance_type="fixed_small_log",
+ prediction_type="epsilon",
+ num_train_timesteps=1000,
+ )
+
+ components = {
+ "prior": prior,
+ "decoder": decoder,
+ "text_proj": text_proj,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "super_res_first": super_res_first,
+ "super_res_last": super_res_last,
+ "prior_scheduler": prior_scheduler,
+ "decoder_scheduler": decoder_scheduler,
+ "super_res_scheduler": super_res_scheduler,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "generator": generator,
+ "prior_num_inference_steps": 2,
+ "decoder_num_inference_steps": 2,
+ "super_res_num_inference_steps": 2,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def test_unclip(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(
+ **self.get_dummy_inputs(device),
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array(
+ [
+ 0.9997,
+ 0.9988,
+ 0.0028,
+ 0.9997,
+ 0.9984,
+ 0.9965,
+ 0.0029,
+ 0.9986,
+ 0.0025,
+ ]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_unclip_passed_text_embed(self):
+ device = torch.device("cpu")
+
+ class DummyScheduler:
+ init_noise_sigma = 1
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ prior = components["prior"]
+ decoder = components["decoder"]
+ super_res_first = components["super_res_first"]
+ tokenizer = components["tokenizer"]
+ text_encoder = components["text_encoder"]
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ dtype = prior.dtype
+ batch_size = 1
+
+ shape = (batch_size, prior.config.embedding_dim)
+ prior_latents = pipe.prepare_latents(
+ shape, dtype=dtype, device=device, generator=generator, latents=None, scheduler=DummyScheduler()
+ )
+ shape = (batch_size, decoder.config.in_channels, decoder.config.sample_size, decoder.config.sample_size)
+ decoder_latents = pipe.prepare_latents(
+ shape, dtype=dtype, device=device, generator=generator, latents=None, scheduler=DummyScheduler()
+ )
+
+ shape = (
+ batch_size,
+ super_res_first.config.in_channels // 2,
+ super_res_first.config.sample_size,
+ super_res_first.config.sample_size,
+ )
+ super_res_latents = pipe.prepare_latents(
+ shape, dtype=dtype, device=device, generator=generator, latents=None, scheduler=DummyScheduler()
+ )
+
+ pipe.set_progress_bar_config(disable=None)
+
+ prompt = "this is a prompt example"
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ output = pipe(
+ [prompt],
+ generator=generator,
+ prior_num_inference_steps=2,
+ decoder_num_inference_steps=2,
+ super_res_num_inference_steps=2,
+ prior_latents=prior_latents,
+ decoder_latents=decoder_latents,
+ super_res_latents=super_res_latents,
+ output_type="np",
+ )
+ image = output.images
+
+ text_inputs = tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=tokenizer.model_max_length,
+ return_tensors="pt",
+ )
+ text_model_output = text_encoder(text_inputs.input_ids)
+ text_attention_mask = text_inputs.attention_mask
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ image_from_text = pipe(
+ generator=generator,
+ prior_num_inference_steps=2,
+ decoder_num_inference_steps=2,
+ super_res_num_inference_steps=2,
+ prior_latents=prior_latents,
+ decoder_latents=decoder_latents,
+ super_res_latents=super_res_latents,
+ text_model_output=text_model_output,
+ text_attention_mask=text_attention_mask,
+ output_type="np",
+ )[0]
+
+ # make sure passing text embeddings manually is identical
+ assert np.abs(image - image_from_text).max() < 1e-4
+
+ # Overriding PipelineTesterMixin::test_attention_slicing_forward_pass
+ # because UnCLIP GPU non-determinism requires a looser check.
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+
+ self._test_attention_slicing_forward_pass(test_max_difference=test_max_difference, expected_max_diff=0.01)
+
+ # Overriding PipelineTesterMixin::test_inference_batch_single_identical
+ # because UnCLIP non-determinism requires a looser check.
+ @skip_mps
+ def test_inference_batch_single_identical(self):
+ additional_params_copy_to_batched_inputs = [
+ "prior_num_inference_steps",
+ "decoder_num_inference_steps",
+ "super_res_num_inference_steps",
+ ]
+
+ self._test_inference_batch_single_identical(
+ additional_params_copy_to_batched_inputs=additional_params_copy_to_batched_inputs, expected_max_diff=5e-3
+ )
+
+ def test_inference_batch_consistent(self):
+ additional_params_copy_to_batched_inputs = [
+ "prior_num_inference_steps",
+ "decoder_num_inference_steps",
+ "super_res_num_inference_steps",
+ ]
+
+ if torch_device == "mps":
+ # TODO: MPS errors with larger batch sizes
+ batch_sizes = [2, 3]
+ self._test_inference_batch_consistent(
+ batch_sizes=batch_sizes,
+ additional_params_copy_to_batched_inputs=additional_params_copy_to_batched_inputs,
+ )
+ else:
+ self._test_inference_batch_consistent(
+ additional_params_copy_to_batched_inputs=additional_params_copy_to_batched_inputs
+ )
+
+ @skip_mps
+ def test_dict_tuple_outputs_equivalent(self):
+ return super().test_dict_tuple_outputs_equivalent()
+
+ @skip_mps
+ def test_save_load_local(self):
+ return super().test_save_load_local(expected_max_difference=5e-3)
+
+ @skip_mps
+ def test_save_load_optional_components(self):
+ return super().test_save_load_optional_components()
+
+ @unittest.skip("UnCLIP produces very large differences in fp16 vs fp32. Test is not useful.")
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=1.0)
+
+
+@nightly
+class UnCLIPPipelineCPUIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_unclip_karlo_cpu_fp32(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/unclip/karlo_v1_alpha_horse_cpu.npy"
+ )
+
+ pipeline = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")
+ pipeline.set_progress_bar_config(disable=None)
+
+ generator = torch.manual_seed(0)
+ output = pipeline(
+ "horse",
+ num_images_per_prompt=1,
+ generator=generator,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (256, 256, 3)
+ assert np.abs(expected_image - image).max() < 1e-1
+
+
+@nightly
+@require_torch_gpu
+class UnCLIPPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_unclip_karlo(self):
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/unclip/karlo_v1_alpha_horse_fp16.npy"
+ )
+
+ pipeline = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
+ pipeline = pipeline.to(torch_device)
+ pipeline.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipeline(
+ "horse",
+ generator=generator,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (256, 256, 3)
+
+ assert_mean_pixel_difference(image, expected_image)
+
+ def test_unclip_pipeline_with_sequential_cpu_offloading(self):
+ torch.cuda.empty_cache()
+ torch.cuda.reset_max_memory_allocated()
+ torch.cuda.reset_peak_memory_stats()
+
+ pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+ pipe.enable_sequential_cpu_offload()
+
+ _ = pipe(
+ "horse",
+ num_images_per_prompt=1,
+ prior_num_inference_steps=2,
+ decoder_num_inference_steps=2,
+ super_res_num_inference_steps=2,
+ output_type="np",
+ )
+
+ mem_bytes = torch.cuda.max_memory_allocated()
+ # make sure that less than 7 GB is allocated
+ assert mem_bytes < 7 * 10**9
diff --git a/tests/pipelines/unclip/test_unclip_image_variation.py b/tests/pipelines/unclip/test_unclip_image_variation.py
new file mode 100644
index 0000000..ab3aea5
--- /dev/null
+++ b/tests/pipelines/unclip/test_unclip_image_variation.py
@@ -0,0 +1,531 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import (
+ CLIPImageProcessor,
+ CLIPTextConfig,
+ CLIPTextModelWithProjection,
+ CLIPTokenizer,
+ CLIPVisionConfig,
+ CLIPVisionModelWithProjection,
+)
+
+from diffusers import (
+ DiffusionPipeline,
+ UnCLIPImageVariationPipeline,
+ UnCLIPScheduler,
+ UNet2DConditionModel,
+ UNet2DModel,
+)
+from diffusers.pipelines.unclip.text_proj import UnCLIPTextProjModel
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ load_numpy,
+ nightly,
+ require_torch_gpu,
+ skip_mps,
+ torch_device,
+)
+
+from ..pipeline_params import IMAGE_VARIATION_BATCH_PARAMS, IMAGE_VARIATION_PARAMS
+from ..test_pipelines_common import PipelineTesterMixin, assert_mean_pixel_difference
+
+
+enable_full_determinism()
+
+
+class UnCLIPImageVariationPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = UnCLIPImageVariationPipeline
+ params = IMAGE_VARIATION_PARAMS - {"height", "width", "guidance_scale"}
+ batch_params = IMAGE_VARIATION_BATCH_PARAMS
+
+ required_optional_params = [
+ "generator",
+ "return_dict",
+ "decoder_num_inference_steps",
+ "super_res_num_inference_steps",
+ ]
+ test_xformers_attention = False
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def cross_attention_dim(self):
+ return 100
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=self.text_embedder_hidden_size,
+ projection_dim=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModelWithProjection(config)
+
+ @property
+ def dummy_image_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPVisionConfig(
+ hidden_size=self.text_embedder_hidden_size,
+ projection_dim=self.text_embedder_hidden_size,
+ num_hidden_layers=5,
+ num_attention_heads=4,
+ image_size=32,
+ intermediate_size=37,
+ patch_size=1,
+ )
+ return CLIPVisionModelWithProjection(config)
+
+ @property
+ def dummy_text_proj(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "clip_embeddings_dim": self.text_embedder_hidden_size,
+ "time_embed_dim": self.time_embed_dim,
+ "cross_attention_dim": self.cross_attention_dim,
+ }
+
+ model = UnCLIPTextProjModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_decoder(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "sample_size": 32,
+            # RGB input channels
+            "in_channels": 3,
+            # Out channels is double the in channels because the model predicts both mean and variance
+ "out_channels": 6,
+ "down_block_types": ("ResnetDownsampleBlock2D", "SimpleCrossAttnDownBlock2D"),
+ "up_block_types": ("SimpleCrossAttnUpBlock2D", "ResnetUpsampleBlock2D"),
+ "mid_block_type": "UNetMidBlock2DSimpleCrossAttn",
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "layers_per_block": 1,
+ "cross_attention_dim": self.cross_attention_dim,
+ "attention_head_dim": 4,
+ "resnet_time_scale_shift": "scale_shift",
+ "class_embed_type": "identity",
+ }
+
+ model = UNet2DConditionModel(**model_kwargs)
+ return model
+
+ @property
+ def dummy_super_res_kwargs(self):
+ return {
+ "sample_size": 64,
+ "layers_per_block": 1,
+ "down_block_types": ("ResnetDownsampleBlock2D", "ResnetDownsampleBlock2D"),
+ "up_block_types": ("ResnetUpsampleBlock2D", "ResnetUpsampleBlock2D"),
+ "block_out_channels": (self.block_out_channels_0, self.block_out_channels_0 * 2),
+ "in_channels": 6,
+ "out_channels": 3,
+ }
+
+ @property
+ def dummy_super_res_first(self):
+ torch.manual_seed(0)
+
+ model = UNet2DModel(**self.dummy_super_res_kwargs)
+ return model
+
+ @property
+ def dummy_super_res_last(self):
+        # seeded differently to get a different unet from `self.dummy_super_res_first`
+ torch.manual_seed(1)
+
+ model = UNet2DModel(**self.dummy_super_res_kwargs)
+ return model
+
+ def get_dummy_components(self):
+ decoder = self.dummy_decoder
+ text_proj = self.dummy_text_proj
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ super_res_first = self.dummy_super_res_first
+ super_res_last = self.dummy_super_res_last
+
+ decoder_scheduler = UnCLIPScheduler(
+ variance_type="learned_range",
+ prediction_type="epsilon",
+ num_train_timesteps=1000,
+ )
+
+ super_res_scheduler = UnCLIPScheduler(
+ variance_type="fixed_small_log",
+ prediction_type="epsilon",
+ num_train_timesteps=1000,
+ )
+
+ feature_extractor = CLIPImageProcessor(crop_size=32, size=32)
+
+ image_encoder = self.dummy_image_encoder
+
+ return {
+ "decoder": decoder,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "text_proj": text_proj,
+ "feature_extractor": feature_extractor,
+ "image_encoder": image_encoder,
+ "super_res_first": super_res_first,
+ "super_res_last": super_res_last,
+ "decoder_scheduler": decoder_scheduler,
+ "super_res_scheduler": super_res_scheduler,
+ }
+
+ def get_dummy_inputs(self, device, seed=0, pil_image=True):
+ input_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ if pil_image:
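+            # Rescale to [0, 1], clamp, and convert the tensor into a PIL image.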
+ input_image = input_image * 0.5 + 0.5
+ input_image = input_image.clamp(0, 1)
+ input_image = input_image.cpu().permute(0, 2, 3, 1).float().numpy()
+ input_image = DiffusionPipeline.numpy_to_pil(input_image)[0]
+
+ return {
+ "image": input_image,
+ "generator": generator,
+ "decoder_num_inference_steps": 2,
+ "super_res_num_inference_steps": 2,
+ "output_type": "np",
+ }
+
+ def test_unclip_image_variation_input_tensor(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ pipeline_inputs = self.get_dummy_inputs(device, pil_image=False)
+
+ output = pipe(**pipeline_inputs)
+ image = output.images
+
+ tuple_pipeline_inputs = self.get_dummy_inputs(device, pil_image=False)
+
+ image_from_tuple = pipe(
+ **tuple_pipeline_inputs,
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array(
+ [
+ 0.9997,
+ 0.0002,
+ 0.9997,
+ 0.9997,
+ 0.9969,
+ 0.0023,
+ 0.9997,
+ 0.9969,
+ 0.9970,
+ ]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_unclip_image_variation_input_image(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ pipeline_inputs = self.get_dummy_inputs(device, pil_image=True)
+
+ output = pipe(**pipeline_inputs)
+ image = output.images
+
+ tuple_pipeline_inputs = self.get_dummy_inputs(device, pil_image=True)
+
+ image_from_tuple = pipe(
+ **tuple_pipeline_inputs,
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.9997, 0.0003, 0.9997, 0.9997, 0.9970, 0.0024, 0.9997, 0.9971, 0.9971])
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_unclip_image_variation_input_list_images(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ pipeline_inputs = self.get_dummy_inputs(device, pil_image=True)
+ pipeline_inputs["image"] = [
+ pipeline_inputs["image"],
+ pipeline_inputs["image"],
+ ]
+
+ output = pipe(**pipeline_inputs)
+ image = output.images
+
+ tuple_pipeline_inputs = self.get_dummy_inputs(device, pil_image=True)
+ tuple_pipeline_inputs["image"] = [
+ tuple_pipeline_inputs["image"],
+ tuple_pipeline_inputs["image"],
+ ]
+
+ image_from_tuple = pipe(
+ **tuple_pipeline_inputs,
+ return_dict=False,
+ )[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (2, 64, 64, 3)
+
+ expected_slice = np.array(
+ [
+ 0.9997,
+ 0.9989,
+ 0.0008,
+ 0.0021,
+ 0.9960,
+ 0.0018,
+ 0.0014,
+ 0.0002,
+ 0.9933,
+ ]
+ )
+
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ def test_unclip_passed_image_embed(self):
+ device = torch.device("cpu")
+
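+        # Minimal scheduler stand-in: prepare_latents only reads `init_noise_sigma` to scale the initial noise.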
+ class DummyScheduler:
+ init_noise_sigma = 1
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ dtype = pipe.decoder.dtype
+ batch_size = 1
+
+ shape = (
+ batch_size,
+ pipe.decoder.config.in_channels,
+ pipe.decoder.config.sample_size,
+ pipe.decoder.config.sample_size,
+ )
+ decoder_latents = pipe.prepare_latents(
+ shape, dtype=dtype, device=device, generator=generator, latents=None, scheduler=DummyScheduler()
+ )
+
+ shape = (
+ batch_size,
+ pipe.super_res_first.config.in_channels // 2,
+ pipe.super_res_first.config.sample_size,
+ pipe.super_res_first.config.sample_size,
+ )
+ super_res_latents = pipe.prepare_latents(
+ shape, dtype=dtype, device=device, generator=generator, latents=None, scheduler=DummyScheduler()
+ )
+
+ pipeline_inputs = self.get_dummy_inputs(device, pil_image=False)
+
+ img_out_1 = pipe(
+ **pipeline_inputs, decoder_latents=decoder_latents, super_res_latents=super_res_latents
+ ).images
+
+ pipeline_inputs = self.get_dummy_inputs(device, pil_image=False)
+ # Don't pass image, instead pass embedding
+ image = pipeline_inputs.pop("image")
+ image_embeddings = pipe.image_encoder(image).image_embeds
+
+ img_out_2 = pipe(
+ **pipeline_inputs,
+ decoder_latents=decoder_latents,
+ super_res_latents=super_res_latents,
+ image_embeddings=image_embeddings,
+ ).images
+
+        # make sure passing image embeddings manually gives identical results
+ assert np.abs(img_out_1 - img_out_2).max() < 1e-4
+
+ # Overriding PipelineTesterMixin::test_attention_slicing_forward_pass
+    # because UnCLIP GPU non-determinism requires a looser check.
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+
+        # Check is relaxed because there is no torch 2.0 sliced-attention added-KV processor
+ expected_max_diff = 1e-2
+
+ self._test_attention_slicing_forward_pass(
+ test_max_difference=test_max_difference, expected_max_diff=expected_max_diff
+ )
+
+ # Overriding PipelineTesterMixin::test_inference_batch_single_identical
+    # because UnCLIP non-determinism requires a looser check.
+ @unittest.skip("UnCLIP produces very large differences. Test is not useful.")
+ @skip_mps
+ def test_inference_batch_single_identical(self):
+ additional_params_copy_to_batched_inputs = [
+ "decoder_num_inference_steps",
+ "super_res_num_inference_steps",
+ ]
+ self._test_inference_batch_single_identical(
+ additional_params_copy_to_batched_inputs=additional_params_copy_to_batched_inputs, expected_max_diff=5e-3
+ )
+
+ def test_inference_batch_consistent(self):
+ additional_params_copy_to_batched_inputs = [
+ "decoder_num_inference_steps",
+ "super_res_num_inference_steps",
+ ]
+
+ if torch_device == "mps":
+ # TODO: MPS errors with larger batch sizes
+ batch_sizes = [2, 3]
+ self._test_inference_batch_consistent(
+ batch_sizes=batch_sizes,
+ additional_params_copy_to_batched_inputs=additional_params_copy_to_batched_inputs,
+ )
+ else:
+ self._test_inference_batch_consistent(
+ additional_params_copy_to_batched_inputs=additional_params_copy_to_batched_inputs
+ )
+
+ @skip_mps
+ def test_dict_tuple_outputs_equivalent(self):
+ return super().test_dict_tuple_outputs_equivalent()
+
+ @unittest.skip("UnCLIP produces very large difference. Test is not useful.")
+ @skip_mps
+ def test_save_load_local(self):
+ return super().test_save_load_local(expected_max_difference=4e-3)
+
+ @skip_mps
+ def test_save_load_optional_components(self):
+ return super().test_save_load_optional_components()
+
+ @unittest.skip("UnCLIP produces very large difference in fp16 vs fp32. Test is not useful.")
+ def test_float16_inference(self):
+ super().test_float16_inference(expected_max_diff=1.0)
+
+
+@nightly
+@require_torch_gpu
+class UnCLIPImageVariationPipelineIntegrationTests(unittest.TestCase):
+ def tearDown(self):
+ # clean up the VRAM after each test
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def test_unclip_image_variation_karlo(self):
+ input_image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unclip/cat.png"
+ )
+ expected_image = load_numpy(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
+ "/unclip/karlo_v1_alpha_cat_variation_fp16.npy"
+ )
+
+ pipeline = UnCLIPImageVariationPipeline.from_pretrained(
+ "kakaobrain/karlo-v1-alpha-image-variations", torch_dtype=torch.float16
+ )
+ pipeline = pipeline.to(torch_device)
+ pipeline.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device="cpu").manual_seed(0)
+ output = pipeline(
+ input_image,
+ generator=generator,
+ output_type="np",
+ )
+
+ image = output.images[0]
+
+ assert image.shape == (256, 256, 3)
+
+ assert_mean_pixel_difference(image, expected_image, 15)
diff --git a/tests/pipelines/unidiffuser/__init__.py b/tests/pipelines/unidiffuser/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/unidiffuser/test_unidiffuser.py b/tests/pipelines/unidiffuser/test_unidiffuser.py
new file mode 100644
index 0000000..ba8026d
--- /dev/null
+++ b/tests/pipelines/unidiffuser/test_unidiffuser.py
@@ -0,0 +1,790 @@
+import gc
+import random
+import traceback
+import unittest
+
+import numpy as np
+import torch
+from PIL import Image
+from transformers import (
+ CLIPImageProcessor,
+ CLIPTextModel,
+ CLIPTokenizer,
+ CLIPVisionModelWithProjection,
+ GPT2Tokenizer,
+)
+
+from diffusers import (
+ AutoencoderKL,
+ DPMSolverMultistepScheduler,
+ UniDiffuserModel,
+ UniDiffuserPipeline,
+ UniDiffuserTextDecoder,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ nightly,
+ require_torch_2,
+ require_torch_gpu,
+ run_test_in_subprocess,
+ torch_device,
+)
+from diffusers.utils.torch_utils import randn_tensor
+
+from ..pipeline_params import (
+ IMAGE_TO_IMAGE_IMAGE_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
+ TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+)
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+# Will be run via run_test_in_subprocess
+def _test_unidiffuser_compile(in_queue, out_queue, timeout):
+ error = None
+ try:
+ inputs = in_queue.get(timeout=timeout)
+ torch_device = inputs.pop("torch_device")
+ seed = inputs.pop("seed")
+ inputs["generator"] = torch.Generator(device=torch_device).manual_seed(seed)
+
+ pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1")
+ # pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe = pipe.to(torch_device)
+
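+        # Use channels_last memory format and compile the UNet before running the torch.compile smoke test.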
+ pipe.unet.to(memory_format=torch.channels_last)
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ image = pipe(**inputs).images
+ image_slice = image[0, -3:, -3:, -1].flatten()
+
+ assert image.shape == (1, 512, 512, 3)
+ expected_slice = np.array([0.2402, 0.2375, 0.2285, 0.2378, 0.2407, 0.2263, 0.2354, 0.2307, 0.2520])
+ assert np.abs(image_slice - expected_slice).max() < 1e-1
+ except Exception:
+ error = f"{traceback.format_exc()}"
+
+ results = {"error": error}
+ out_queue.put(results, timeout=timeout)
+ out_queue.join()
+
+
+class UniDiffuserPipelineFastTests(
+ PipelineTesterMixin, PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, unittest.TestCase
+):
+ pipeline_class = UniDiffuserPipeline
+ params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS
+ batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
+ image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+ # vae_latents, not latents, is the argument that corresponds to VAE latent inputs
+ image_latents_params = frozenset(["vae_latents"])
+
+ def get_dummy_components(self):
+ unet = UniDiffuserModel.from_pretrained(
+ "hf-internal-testing/unidiffuser-diffusers-test",
+ subfolder="unet",
+ )
+
+ scheduler = DPMSolverMultistepScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ solver_order=3,
+ )
+
+ vae = AutoencoderKL.from_pretrained(
+ "hf-internal-testing/unidiffuser-diffusers-test",
+ subfolder="vae",
+ )
+
+ text_encoder = CLIPTextModel.from_pretrained(
+ "hf-internal-testing/unidiffuser-diffusers-test",
+ subfolder="text_encoder",
+ )
+ clip_tokenizer = CLIPTokenizer.from_pretrained(
+ "hf-internal-testing/unidiffuser-diffusers-test",
+ subfolder="clip_tokenizer",
+ )
+
+ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+ "hf-internal-testing/unidiffuser-diffusers-test",
+ subfolder="image_encoder",
+ )
+ # From the Stable Diffusion Image Variation pipeline tests
+ clip_image_processor = CLIPImageProcessor(crop_size=32, size=32)
+ # image_processor = CLIPImageProcessor.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ text_tokenizer = GPT2Tokenizer.from_pretrained(
+ "hf-internal-testing/unidiffuser-diffusers-test",
+ subfolder="text_tokenizer",
+ )
+ text_decoder = UniDiffuserTextDecoder.from_pretrained(
+ "hf-internal-testing/unidiffuser-diffusers-test",
+ subfolder="text_decoder",
+ )
+
+ components = {
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "image_encoder": image_encoder,
+ "clip_image_processor": clip_image_processor,
+ "clip_tokenizer": clip_tokenizer,
+ "text_decoder": text_decoder,
+ "text_tokenizer": text_tokenizer,
+ "unet": unet,
+ "scheduler": scheduler,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image.cpu().permute(0, 2, 3, 1)[0]
+ image = Image.fromarray(np.uint8(image)).convert("RGB")
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "an elephant under the sea",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ }
+ return inputs
+
+ def get_fixed_latents(self, device, seed=0):
+ if isinstance(device, str):
+ device = torch.device(device)
+ generator = torch.Generator(device=device).manual_seed(seed)
+ # Hardcode the shapes for now.
+ prompt_latents = randn_tensor((1, 77, 32), generator=generator, device=device, dtype=torch.float32)
+ vae_latents = randn_tensor((1, 4, 16, 16), generator=generator, device=device, dtype=torch.float32)
+ clip_latents = randn_tensor((1, 1, 32), generator=generator, device=device, dtype=torch.float32)
+
+ latents = {
+ "prompt_latents": prompt_latents,
+ "vae_latents": vae_latents,
+ "clip_latents": clip_latents,
+ }
+ return latents
+
+ def get_dummy_inputs_with_latents(self, device, seed=0):
+ # image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ # image = image.cpu().permute(0, 2, 3, 1)[0]
+ # image = Image.fromarray(np.uint8(image)).convert("RGB")
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg",
+ )
+ image = image.resize((32, 32))
+ latents = self.get_fixed_latents(device, seed=seed)
+
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+
+ inputs = {
+ "prompt": "an elephant under the sea",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 2,
+ "guidance_scale": 6.0,
+ "output_type": "numpy",
+ "prompt_latents": latents.get("prompt_latents"),
+ "vae_latents": latents.get("vae_latents"),
+ "clip_latents": latents.get("clip_latents"),
+ }
+ return inputs
+
+ def test_unidiffuser_default_joint_v0(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'joint'
+ unidiffuser_pipe.set_joint_mode()
+ assert unidiffuser_pipe.mode == "joint"
+
+ # inputs = self.get_dummy_inputs(device)
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete prompt and image for joint inference.
+ del inputs["prompt"]
+ del inputs["image"]
+ sample = unidiffuser_pipe(**inputs)
+ image = sample.images
+ text = sample.text
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_img_slice = np.array([0.5760, 0.6270, 0.6571, 0.4965, 0.4638, 0.5663, 0.5254, 0.5068, 0.5716])
+ assert np.abs(image_slice.flatten() - expected_img_slice).max() < 1e-3
+
+ expected_text_prefix = " no no no "
+ assert text[0][:10] == expected_text_prefix
+
+ def test_unidiffuser_default_joint_no_cfg_v0(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'joint'
+ unidiffuser_pipe.set_joint_mode()
+ assert unidiffuser_pipe.mode == "joint"
+
+ # inputs = self.get_dummy_inputs(device)
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete prompt and image for joint inference.
+ del inputs["prompt"]
+ del inputs["image"]
+ # Set guidance scale to 1.0 to turn off CFG
+ inputs["guidance_scale"] = 1.0
+ sample = unidiffuser_pipe(**inputs)
+ image = sample.images
+ text = sample.text
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_img_slice = np.array([0.5760, 0.6270, 0.6571, 0.4965, 0.4638, 0.5663, 0.5254, 0.5068, 0.5716])
+ assert np.abs(image_slice.flatten() - expected_img_slice).max() < 1e-3
+
+ expected_text_prefix = " no no no "
+ assert text[0][:10] == expected_text_prefix
+
+ def test_unidiffuser_default_text2img_v0(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'text2img'
+ unidiffuser_pipe.set_text_to_image_mode()
+ assert unidiffuser_pipe.mode == "text2img"
+
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete image for text-conditioned image generation
+ del inputs["image"]
+ image = unidiffuser_pipe(**inputs).images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.5758, 0.6269, 0.6570, 0.4967, 0.4639, 0.5664, 0.5257, 0.5067, 0.5715])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_unidiffuser_default_image_0(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'img'
+ unidiffuser_pipe.set_image_mode()
+ assert unidiffuser_pipe.mode == "img"
+
+ inputs = self.get_dummy_inputs(device)
+        # Delete prompt and image for unconditional ("marginal") image generation.
+ del inputs["prompt"]
+ del inputs["image"]
+ image = unidiffuser_pipe(**inputs).images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.5760, 0.6270, 0.6571, 0.4966, 0.4638, 0.5663, 0.5254, 0.5068, 0.5715])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_unidiffuser_default_text_v0(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+        # Set mode to 'text'
+ unidiffuser_pipe.set_text_mode()
+ assert unidiffuser_pipe.mode == "text"
+
+ inputs = self.get_dummy_inputs(device)
+ # Delete prompt and image for unconditional ("marginal") text generation.
+ del inputs["prompt"]
+ del inputs["image"]
+ text = unidiffuser_pipe(**inputs).text
+
+ expected_text_prefix = " no no no "
+ assert text[0][:10] == expected_text_prefix
+
+ def test_unidiffuser_default_img2text_v0(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'img2text'
+ unidiffuser_pipe.set_image_to_text_mode()
+ assert unidiffuser_pipe.mode == "img2text"
+
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete text for image-conditioned text generation
+ del inputs["prompt"]
+ text = unidiffuser_pipe(**inputs).text
+
+ expected_text_prefix = " no no no "
+ assert text[0][:10] == expected_text_prefix
+
+ def test_unidiffuser_default_joint_v1(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unidiffuser_pipe = UniDiffuserPipeline.from_pretrained("hf-internal-testing/unidiffuser-test-v1")
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'joint'
+ unidiffuser_pipe.set_joint_mode()
+ assert unidiffuser_pipe.mode == "joint"
+
+ # inputs = self.get_dummy_inputs(device)
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete prompt and image for joint inference.
+ del inputs["prompt"]
+ del inputs["image"]
+ inputs["data_type"] = 1
+ sample = unidiffuser_pipe(**inputs)
+ image = sample.images
+ text = sample.text
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_img_slice = np.array([0.5760, 0.6270, 0.6571, 0.4965, 0.4638, 0.5663, 0.5254, 0.5068, 0.5716])
+ assert np.abs(image_slice.flatten() - expected_img_slice).max() < 1e-3
+
+ expected_text_prefix = " no no no "
+ assert text[0][:10] == expected_text_prefix
+
+ def test_unidiffuser_default_text2img_v1(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unidiffuser_pipe = UniDiffuserPipeline.from_pretrained("hf-internal-testing/unidiffuser-test-v1")
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'text2img'
+ unidiffuser_pipe.set_text_to_image_mode()
+ assert unidiffuser_pipe.mode == "text2img"
+
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete image for text-conditioned image generation
+ del inputs["image"]
+ image = unidiffuser_pipe(**inputs).images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.5758, 0.6269, 0.6570, 0.4967, 0.4639, 0.5664, 0.5257, 0.5067, 0.5715])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
+
+ def test_unidiffuser_default_img2text_v1(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ unidiffuser_pipe = UniDiffuserPipeline.from_pretrained("hf-internal-testing/unidiffuser-test-v1")
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'img2text'
+ unidiffuser_pipe.set_image_to_text_mode()
+ assert unidiffuser_pipe.mode == "img2text"
+
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete text for image-conditioned text generation
+ del inputs["prompt"]
+ text = unidiffuser_pipe(**inputs).text
+
+ expected_text_prefix = " no no no "
+ assert text[0][:10] == expected_text_prefix
+
+ def test_unidiffuser_text2img_multiple_images(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'text2img'
+ unidiffuser_pipe.set_text_to_image_mode()
+ assert unidiffuser_pipe.mode == "text2img"
+
+ inputs = self.get_dummy_inputs(device)
+ # Delete image for text-conditioned image generation
+ del inputs["image"]
+ inputs["num_images_per_prompt"] = 2
+ inputs["num_prompts_per_image"] = 3
+ image = unidiffuser_pipe(**inputs).images
+ assert image.shape == (2, 32, 32, 3)
+
+ def test_unidiffuser_img2text_multiple_prompts(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'img2text'
+ unidiffuser_pipe.set_image_to_text_mode()
+ assert unidiffuser_pipe.mode == "img2text"
+
+ inputs = self.get_dummy_inputs(device)
+ # Delete text for image-conditioned text generation
+ del inputs["prompt"]
+ inputs["num_images_per_prompt"] = 2
+ inputs["num_prompts_per_image"] = 3
+ text = unidiffuser_pipe(**inputs).text
+
+ assert len(text) == 3
+
+ def test_unidiffuser_text2img_multiple_images_with_latents(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'text2img'
+ unidiffuser_pipe.set_text_to_image_mode()
+ assert unidiffuser_pipe.mode == "text2img"
+
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete image for text-conditioned image generation
+ del inputs["image"]
+ inputs["num_images_per_prompt"] = 2
+ inputs["num_prompts_per_image"] = 3
+ image = unidiffuser_pipe(**inputs).images
+ assert image.shape == (2, 32, 32, 3)
+
+ def test_unidiffuser_img2text_multiple_prompts_with_latents(self):
+ device = "cpu" # ensure determinism for the device-dependent torch.Generator
+ components = self.get_dummy_components()
+ unidiffuser_pipe = UniDiffuserPipeline(**components)
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'img2text'
+ unidiffuser_pipe.set_image_to_text_mode()
+ assert unidiffuser_pipe.mode == "img2text"
+
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete text for image-conditioned text generation
+ del inputs["prompt"]
+ inputs["num_images_per_prompt"] = 2
+ inputs["num_prompts_per_image"] = 3
+ text = unidiffuser_pipe(**inputs).text
+
+ assert len(text) == 3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=2e-4)
+
+ @require_torch_gpu
+ def test_unidiffuser_default_joint_v1_cuda_fp16(self):
+ device = "cuda"
+ unidiffuser_pipe = UniDiffuserPipeline.from_pretrained(
+ "hf-internal-testing/unidiffuser-test-v1", torch_dtype=torch.float16
+ )
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'joint'
+ unidiffuser_pipe.set_joint_mode()
+ assert unidiffuser_pipe.mode == "joint"
+
+ inputs = self.get_dummy_inputs_with_latents(device)
+ # Delete prompt and image for joint inference.
+ del inputs["prompt"]
+ del inputs["image"]
+ inputs["data_type"] = 1
+ sample = unidiffuser_pipe(**inputs)
+ image = sample.images
+ text = sample.text
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_img_slice = np.array([0.5049, 0.5498, 0.5854, 0.3052, 0.4460, 0.6489, 0.5122, 0.4810, 0.6138])
+ assert np.abs(image_slice.flatten() - expected_img_slice).max() < 1e-3
+
+ expected_text_prefix = '" This This'
+ assert text[0][: len(expected_text_prefix)] == expected_text_prefix
+
+ @require_torch_gpu
+ def test_unidiffuser_default_text2img_v1_cuda_fp16(self):
+ device = "cuda"
+ unidiffuser_pipe = UniDiffuserPipeline.from_pretrained(
+ "hf-internal-testing/unidiffuser-test-v1", torch_dtype=torch.float16
+ )
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'text2img'
+ unidiffuser_pipe.set_text_to_image_mode()
+ assert unidiffuser_pipe.mode == "text2img"
+
+ inputs = self.get_dummy_inputs_with_latents(device)
+        # Delete image for text-conditioned image generation
+ del inputs["image"]
+ inputs["data_type"] = 1
+ sample = unidiffuser_pipe(**inputs)
+ image = sample.images
+ assert image.shape == (1, 32, 32, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_img_slice = np.array([0.5054, 0.5498, 0.5854, 0.3052, 0.4458, 0.6489, 0.5122, 0.4810, 0.6138])
+ assert np.abs(image_slice.flatten() - expected_img_slice).max() < 1e-3
+
+ @require_torch_gpu
+ def test_unidiffuser_default_img2text_v1_cuda_fp16(self):
+ device = "cuda"
+ unidiffuser_pipe = UniDiffuserPipeline.from_pretrained(
+ "hf-internal-testing/unidiffuser-test-v1", torch_dtype=torch.float16
+ )
+ unidiffuser_pipe = unidiffuser_pipe.to(device)
+ unidiffuser_pipe.set_progress_bar_config(disable=None)
+
+ # Set mode to 'img2text'
+ unidiffuser_pipe.set_image_to_text_mode()
+ assert unidiffuser_pipe.mode == "img2text"
+
+ inputs = self.get_dummy_inputs_with_latents(device)
+        # Delete prompt for image-conditioned text generation
+ del inputs["prompt"]
+ inputs["data_type"] = 1
+ text = unidiffuser_pipe(**inputs).text
+
+ expected_text_prefix = '" This This'
+ assert text[0][: len(expected_text_prefix)] == expected_text_prefix
+
+
+@nightly
+@require_torch_gpu
+class UniDiffuserPipelineSlowTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, seed=0, generate_latents=False):
+ generator = torch.manual_seed(seed)
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
+ )
+ inputs = {
+ "prompt": "an elephant under the sea",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 8.0,
+ "output_type": "numpy",
+ }
+ if generate_latents:
+ latents = self.get_fixed_latents(device, seed=seed)
+ for latent_name, latent_tensor in latents.items():
+ inputs[latent_name] = latent_tensor
+ return inputs
+
+ def get_fixed_latents(self, device, seed=0):
+ if isinstance(device, str):
+ device = torch.device(device)
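+        # A CPU generator keeps these fixed latents reproducible across devices.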
+ latent_device = torch.device("cpu")
+ generator = torch.Generator(device=latent_device).manual_seed(seed)
+ # Hardcode the shapes for now.
+ prompt_latents = randn_tensor((1, 77, 768), generator=generator, device=device, dtype=torch.float32)
+ vae_latents = randn_tensor((1, 4, 64, 64), generator=generator, device=device, dtype=torch.float32)
+ clip_latents = randn_tensor((1, 1, 512), generator=generator, device=device, dtype=torch.float32)
+
+ # Move latents onto desired device.
+ prompt_latents = prompt_latents.to(device)
+ vae_latents = vae_latents.to(device)
+ clip_latents = clip_latents.to(device)
+
+ latents = {
+ "prompt_latents": prompt_latents,
+ "vae_latents": vae_latents,
+ "clip_latents": clip_latents,
+ }
+ return latents
+
+ def test_unidiffuser_default_joint_v1(self):
+ pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1")
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ # inputs = self.get_dummy_inputs(device)
+ inputs = self.get_inputs(device=torch_device, generate_latents=True)
+ # Delete prompt and image for joint inference.
+ del inputs["prompt"]
+ del inputs["image"]
+ sample = pipe(**inputs)
+ image = sample.images
+ text = sample.text
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_img_slice = np.array([0.2402, 0.2375, 0.2285, 0.2378, 0.2407, 0.2263, 0.2354, 0.2307, 0.2520])
+ assert np.abs(image_slice.flatten() - expected_img_slice).max() < 1e-1
+
+ expected_text_prefix = "a living room"
+ assert text[0][: len(expected_text_prefix)] == expected_text_prefix
+
+ def test_unidiffuser_default_text2img_v1(self):
+ pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1")
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(device=torch_device, generate_latents=True)
+ del inputs["image"]
+ sample = pipe(**inputs)
+ image = sample.images
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.0242, 0.0103, 0.0022, 0.0129, 0.0000, 0.0090, 0.0376, 0.0508, 0.0005])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+ def test_unidiffuser_default_img2text_v1(self):
+ pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1")
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(device=torch_device, generate_latents=True)
+ del inputs["prompt"]
+ sample = pipe(**inputs)
+ text = sample.text
+
+ expected_text_prefix = "An astronaut"
+ assert text[0][: len(expected_text_prefix)] == expected_text_prefix
+
+ @unittest.skip(reason="Skip torch.compile test to speed up the slow test suite.")
+ @require_torch_2
+ def test_unidiffuser_compile(self, seed=0):
+ inputs = self.get_inputs(torch_device, seed=seed, generate_latents=True)
+ # Delete prompt and image for joint inference.
+ del inputs["prompt"]
+ del inputs["image"]
+ # Can't pickle a Generator object
+ del inputs["generator"]
+ inputs["torch_device"] = torch_device
+ inputs["seed"] = seed
+ run_test_in_subprocess(test_case=self, target_func=_test_unidiffuser_compile, inputs=inputs)
+
+
+@nightly
+@require_torch_gpu
+class UniDiffuserPipelineNightlyTests(unittest.TestCase):
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def get_inputs(self, device, seed=0, generate_latents=False):
+ generator = torch.manual_seed(seed)
+ image = load_image(
+ "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
+ )
+ inputs = {
+ "prompt": "an elephant under the sea",
+ "image": image,
+ "generator": generator,
+ "num_inference_steps": 3,
+ "guidance_scale": 8.0,
+ "output_type": "numpy",
+ }
+ if generate_latents:
+ latents = self.get_fixed_latents(device, seed=seed)
+ for latent_name, latent_tensor in latents.items():
+ inputs[latent_name] = latent_tensor
+ return inputs
+
+ def get_fixed_latents(self, device, seed=0):
+ if isinstance(device, str):
+ device = torch.device(device)
+ latent_device = torch.device("cpu")
+ generator = torch.Generator(device=latent_device).manual_seed(seed)
+ # Hardcode the shapes for now.
+ prompt_latents = randn_tensor((1, 77, 768), generator=generator, device=device, dtype=torch.float32)
+ vae_latents = randn_tensor((1, 4, 64, 64), generator=generator, device=device, dtype=torch.float32)
+ clip_latents = randn_tensor((1, 1, 512), generator=generator, device=device, dtype=torch.float32)
+
+ # Move latents onto desired device.
+ prompt_latents = prompt_latents.to(device)
+ vae_latents = vae_latents.to(device)
+ clip_latents = clip_latents.to(device)
+
+ latents = {
+ "prompt_latents": prompt_latents,
+ "vae_latents": vae_latents,
+ "clip_latents": clip_latents,
+ }
+ return latents
+
+ def test_unidiffuser_default_joint_v1_fp16(self):
+ pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ # inputs = self.get_dummy_inputs(device)
+ inputs = self.get_inputs(device=torch_device, generate_latents=True)
+ # Delete prompt and image for joint inference.
+ del inputs["prompt"]
+ del inputs["image"]
+ sample = pipe(**inputs)
+ image = sample.images
+ text = sample.text
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_img_slice = np.array([0.2402, 0.2375, 0.2285, 0.2378, 0.2407, 0.2263, 0.2354, 0.2307, 0.2520])
+ assert np.abs(image_slice.flatten() - expected_img_slice).max() < 2e-1
+
+ expected_text_prefix = "a living room"
+ assert text[0][: len(expected_text_prefix)] == expected_text_prefix
+
+ def test_unidiffuser_default_text2img_v1_fp16(self):
+ pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(device=torch_device, generate_latents=True)
+ del inputs["image"]
+ sample = pipe(**inputs)
+ image = sample.images
+ assert image.shape == (1, 512, 512, 3)
+
+ image_slice = image[0, -3:, -3:, -1]
+ expected_slice = np.array([0.0242, 0.0103, 0.0022, 0.0129, 0.0000, 0.0090, 0.0376, 0.0508, 0.0005])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-1
+
+ def test_unidiffuser_default_img2text_v1_fp16(self):
+ pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1", torch_dtype=torch.float16)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ pipe.enable_attention_slicing()
+
+ inputs = self.get_inputs(device=torch_device, generate_latents=True)
+ del inputs["prompt"]
+ sample = pipe(**inputs)
+ text = sample.text
+
+ expected_text_prefix = "An astronaut"
+ assert text[0][: len(expected_text_prefix)] == expected_text_prefix
diff --git a/tests/pipelines/wuerstchen/__init__.py b/tests/pipelines/wuerstchen/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/pipelines/wuerstchen/test_wuerstchen_combined.py b/tests/pipelines/wuerstchen/test_wuerstchen_combined.py
new file mode 100644
index 0000000..0caed15
--- /dev/null
+++ b/tests/pipelines/wuerstchen/test_wuerstchen_combined.py
@@ -0,0 +1,239 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import DDPMWuerstchenScheduler, WuerstchenCombinedPipeline
+from diffusers.pipelines.wuerstchen import PaellaVQModel, WuerstchenDiffNeXt, WuerstchenPrior
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, torch_device
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class WuerstchenCombinedPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = WuerstchenCombinedPipeline
+ params = ["prompt"]
+ batch_params = ["prompt", "negative_prompt"]
+ required_optional_params = [
+ "generator",
+ "height",
+ "width",
+ "latents",
+ "prior_guidance_scale",
+ "decoder_guidance_scale",
+ "negative_prompt",
+ "num_inference_steps",
+ "return_dict",
+ "prior_num_inference_steps",
+ "output_type",
+ ]
+ test_xformers_attention = True
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {"c_in": 2, "c": 8, "depth": 2, "c_cond": 32, "c_r": 8, "nhead": 2}
+ model = WuerstchenPrior(**model_kwargs)
+ return model.eval()
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_prior_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModel(config).eval()
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ projection_dim=self.text_embedder_hidden_size,
+ hidden_size=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModel(config).eval()
+
+ @property
+ def dummy_vqgan(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "bottleneck_blocks": 1,
+ "num_vq_embeddings": 2,
+ }
+ model = PaellaVQModel(**model_kwargs)
+ return model.eval()
+
+ @property
+ def dummy_decoder(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "c_cond": self.text_embedder_hidden_size,
+ "c_hidden": [320],
+ "nhead": [-1],
+ "blocks": [4],
+ "level_config": ["CT"],
+ "clip_embd": self.text_embedder_hidden_size,
+ "inject_effnet": [False],
+ }
+
+ model = WuerstchenDiffNeXt(**model_kwargs)
+ return model.eval()
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+ prior_text_encoder = self.dummy_prior_text_encoder
+
+ scheduler = DDPMWuerstchenScheduler()
+ tokenizer = self.dummy_tokenizer
+
+ text_encoder = self.dummy_text_encoder
+ decoder = self.dummy_decoder
+ vqgan = self.dummy_vqgan
+
+ components = {
+ "tokenizer": tokenizer,
+ "text_encoder": text_encoder,
+ "decoder": decoder,
+ "vqgan": vqgan,
+ "scheduler": scheduler,
+ "prior_prior": prior,
+ "prior_text_encoder": prior_text_encoder,
+ "prior_tokenizer": tokenizer,
+ "prior_scheduler": scheduler,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "generator": generator,
+ "prior_guidance_scale": 4.0,
+ "decoder_guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "prior_num_inference_steps": 2,
+ "output_type": "np",
+ "height": 128,
+ "width": 128,
+ }
+ return inputs
+
+ def test_wuerstchen(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+ image_from_tuple = pipe(**self.get_dummy_inputs(device), return_dict=False)[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+        image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 128, 128, 3)
+
+ expected_slice = np.array([0.7616304, 0.0, 1.0, 0.0, 1.0, 0.0, 0.05925313, 0.0, 0.951898])
+
+ assert (
+ np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_slice.flatten()}"
+ assert (
+ np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+ ), f" expected_slice {expected_slice}, but got {image_from_tuple_slice.flatten()}"
+
+ @require_torch_gpu
+ def test_offloads(self):
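+        # The pipeline should produce matching images whether it runs fully on the GPU or with sequential / model CPU offloading.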
+ pipes = []
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components).to(torch_device)
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_sequential_cpu_offload()
+ pipes.append(sd_pipe)
+
+ components = self.get_dummy_components()
+ sd_pipe = self.pipeline_class(**components)
+ sd_pipe.enable_model_cpu_offload()
+ pipes.append(sd_pipe)
+
+ image_slices = []
+ for pipe in pipes:
+ inputs = self.get_dummy_inputs(torch_device)
+ image = pipe(**inputs).images
+
+ image_slices.append(image[0, -3:, -3:, -1].flatten())
+
+ assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
+ assert np.abs(image_slices[0] - image_slices[2]).max() < 1e-3
+
+ def test_inference_batch_single_identical(self):
+ super().test_inference_batch_single_identical(expected_max_diff=1e-2)
+
+ @unittest.skip(reason="flakey and float16 requires CUDA")
+ def test_float16_inference(self):
+ super().test_float16_inference()
+
+ def test_callback_inputs(self):
+ pass
+
+ def test_callback_cfg(self):
+ pass
diff --git a/tests/pipelines/wuerstchen/test_wuerstchen_decoder.py b/tests/pipelines/wuerstchen/test_wuerstchen_decoder.py
new file mode 100644
index 0000000..4675501
--- /dev/null
+++ b/tests/pipelines/wuerstchen/test_wuerstchen_decoder.py
@@ -0,0 +1,188 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import DDPMWuerstchenScheduler, WuerstchenDecoderPipeline
+from diffusers.pipelines.wuerstchen import PaellaVQModel, WuerstchenDiffNeXt
+from diffusers.utils.testing_utils import enable_full_determinism, skip_mps, torch_device
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class WuerstchenDecoderPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = WuerstchenDecoderPipeline
+ params = ["prompt"]
+ batch_params = ["image_embeddings", "prompt", "negative_prompt"]
+ required_optional_params = [
+ "num_images_per_prompt",
+ "num_inference_steps",
+ "latents",
+ "negative_prompt",
+ "guidance_scale",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+ callback_cfg_params = ["image_embeddings", "text_encoder_hidden_states"]
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ projection_dim=self.text_embedder_hidden_size,
+ hidden_size=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModel(config).eval()
+
+ @property
+ def dummy_vqgan(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "bottleneck_blocks": 1,
+ "num_vq_embeddings": 2,
+ }
+ model = PaellaVQModel(**model_kwargs)
+ return model.eval()
+
+ @property
+ def dummy_decoder(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "c_cond": self.text_embedder_hidden_size,
+ "c_hidden": [320],
+ "nhead": [-1],
+ "blocks": [4],
+ "level_config": ["CT"],
+ "clip_embd": self.text_embedder_hidden_size,
+ "inject_effnet": [False],
+ }
+
+ model = WuerstchenDiffNeXt(**model_kwargs)
+ return model.eval()
+
+ def get_dummy_components(self):
+ decoder = self.dummy_decoder
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+ vqgan = self.dummy_vqgan
+
+ scheduler = DDPMWuerstchenScheduler()
+
+ components = {
+ "decoder": decoder,
+ "vqgan": vqgan,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "scheduler": scheduler,
+ "latent_dim_scale": 4.0,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image_embeddings": torch.ones((1, 4, 4, 4), device=device),
+ "prompt": "horse",
+ "generator": generator,
+ "guidance_scale": 1.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_wuerstchen_decoder(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.images
+
+        image_from_tuple = pipe(**self.get_dummy_inputs(device), return_dict=False)[0]
+
+ image_slice = image[0, -3:, -3:, -1]
+ image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
+
+ assert image.shape == (1, 64, 64, 3)
+
+ expected_slice = np.array([0.0000, 0.0000, 0.0089, 1.0000, 1.0000, 0.3927, 1.0000, 1.0000, 1.0000])
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 1e-2
+
+ @skip_mps
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(expected_max_diff=1e-5)
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+ test_mean_pixel_difference = False
+
+ self._test_attention_slicing_forward_pass(
+ test_max_difference=test_max_difference,
+ test_mean_pixel_difference=test_mean_pixel_difference,
+ )
+
+ @unittest.skip(reason="bf16 not supported and requires CUDA")
+ def test_float16_inference(self):
+ super().test_float16_inference()
diff --git a/tests/pipelines/wuerstchen/test_wuerstchen_prior.py b/tests/pipelines/wuerstchen/test_wuerstchen_prior.py
new file mode 100644
index 0000000..200e4d1
--- /dev/null
+++ b/tests/pipelines/wuerstchen/test_wuerstchen_prior.py
@@ -0,0 +1,296 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import DDPMWuerstchenScheduler, WuerstchenPriorPipeline
+from diffusers.loaders import AttnProcsLayers
+from diffusers.models.attention_processor import (
+ LoRAAttnProcessor,
+ LoRAAttnProcessor2_0,
+)
+from diffusers.pipelines.wuerstchen import WuerstchenPrior
+from diffusers.utils.import_utils import is_peft_available
+from diffusers.utils.testing_utils import enable_full_determinism, require_peft_backend, skip_mps, torch_device
+
+
+if is_peft_available():
+ from peft import LoraConfig
+ from peft.tuners.tuners_utils import BaseTunerLayer
+
+from ..test_pipelines_common import PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+def create_prior_lora_layers(unet: nn.Module):
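+    # Create a LoRA attention processor for every attention layer of the prior,
+    # preferring the torch 2.0 scaled_dot_product_attention variant when available.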
+ lora_attn_procs = {}
+ for name in unet.attn_processors.keys():
+ lora_attn_processor_class = (
+ LoRAAttnProcessor2_0 if hasattr(F, "scaled_dot_product_attention") else LoRAAttnProcessor
+ )
+ lora_attn_procs[name] = lora_attn_processor_class(
+ hidden_size=unet.config.c,
+ )
+ unet_lora_layers = AttnProcsLayers(lora_attn_procs)
+ return lora_attn_procs, unet_lora_layers
+
+
+class WuerstchenPriorPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+ pipeline_class = WuerstchenPriorPipeline
+ params = ["prompt"]
+ batch_params = ["prompt", "negative_prompt"]
+ required_optional_params = [
+ "num_images_per_prompt",
+ "generator",
+ "num_inference_steps",
+ "latents",
+ "negative_prompt",
+ "guidance_scale",
+ "output_type",
+ "return_dict",
+ ]
+ test_xformers_attention = False
+ callback_cfg_params = ["text_encoder_hidden_states"]
+
+ @property
+ def text_embedder_hidden_size(self):
+ return 32
+
+ @property
+ def time_input_dim(self):
+ return 32
+
+ @property
+ def block_out_channels_0(self):
+ return self.time_input_dim
+
+ @property
+ def time_embed_dim(self):
+ return self.time_input_dim * 4
+
+ @property
+ def dummy_tokenizer(self):
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+ return tokenizer
+
+ @property
+ def dummy_text_encoder(self):
+ torch.manual_seed(0)
+ config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=self.text_embedder_hidden_size,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ return CLIPTextModel(config).eval()
+
+ @property
+ def dummy_prior(self):
+ torch.manual_seed(0)
+
+ model_kwargs = {
+ "c_in": 2,
+ "c": 8,
+ "depth": 2,
+ "c_cond": 32,
+ "c_r": 8,
+ "nhead": 2,
+ }
+
+ model = WuerstchenPrior(**model_kwargs)
+ return model.eval()
+
+ def get_dummy_components(self):
+ prior = self.dummy_prior
+ text_encoder = self.dummy_text_encoder
+ tokenizer = self.dummy_tokenizer
+
+ scheduler = DDPMWuerstchenScheduler()
+
+ components = {
+ "prior": prior,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
+ "scheduler": scheduler,
+ }
+
+ return components
+
+ def get_dummy_inputs(self, device, seed=0):
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "prompt": "horse",
+ "generator": generator,
+ "guidance_scale": 4.0,
+ "num_inference_steps": 2,
+ "output_type": "np",
+ }
+ return inputs
+
+ def test_wuerstchen_prior(self):
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output = pipe(**self.get_dummy_inputs(device))
+ image = output.image_embeddings
+
+ image_from_tuple = pipe(**self.get_dummy_inputs(device), return_dict=False)[0]
+
+ image_slice = image[0, 0, 0, -10:]
+ image_from_tuple_slice = image_from_tuple[0, 0, 0, -10:]
+ assert image.shape == (1, 2, 24, 24)
+
+ expected_slice = np.array(
+ [
+ -7172.837,
+ -3438.855,
+ -1093.312,
+ 388.8835,
+ -7471.467,
+ -7998.1206,
+ -5328.259,
+ 218.00089,
+ -2731.5745,
+ -8056.734,
+ ]
+ )
+ assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-2
+ assert np.abs(image_from_tuple_slice.flatten() - expected_slice).max() < 5e-2
+
+ @skip_mps
+ def test_inference_batch_single_identical(self):
+ self._test_inference_batch_single_identical(
+ expected_max_diff=2e-1,
+ )
+
+ @skip_mps
+ def test_attention_slicing_forward_pass(self):
+ test_max_difference = torch_device == "cpu"
+ test_mean_pixel_difference = False
+
+ self._test_attention_slicing_forward_pass(
+ test_max_difference=test_max_difference,
+ test_mean_pixel_difference=test_mean_pixel_difference,
+ )
+
+ @unittest.skip(reason="flaky for now")
+ def test_float16_inference(self):
+ super().test_float16_inference()
+
+ # override because we need to make sure latent_mean and latent_std are set to 0
+ def test_callback_inputs(self):
+ components = self.get_dummy_components()
+ components["latent_mean"] = 0
+ components["latent_std"] = 0
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ self.assertTrue(
+ hasattr(pipe, "_callback_tensor_inputs"),
+ f" {self.pipeline_class} should have `_callback_tensor_inputs` that defines a list of tensor variables its callback function can use as inputs",
+ )
+
+ def callback_inputs_test(pipe, i, t, callback_kwargs):
+ missing_callback_inputs = set()
+ for v in pipe._callback_tensor_inputs:
+ if v not in callback_kwargs:
+ missing_callback_inputs.add(v)
+ self.assertTrue(
+ len(missing_callback_inputs) == 0, f"Missing callback tensor inputs: {missing_callback_inputs}"
+ )
+ last_i = pipe.num_timesteps - 1
+ if i == last_i:
+ callback_kwargs["latents"] = torch.zeros_like(callback_kwargs["latents"])
+ return callback_kwargs
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["callback_on_step_end"] = callback_inputs_test
+ inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
+ inputs["output_type"] = "latent"
+
+ output = pipe(**inputs)[0]
+ assert output.abs().sum() == 0
+
+ def check_if_lora_correctly_set(self, model) -> bool:
+ """
+ Checks if the LoRA layers are correctly set with peft
+ """
+ for module in model.modules():
+ if isinstance(module, BaseTunerLayer):
+ return True
+ return False
+
+ def get_lora_components(self):
+ prior = self.dummy_prior
+
+ prior_lora_config = LoraConfig(
+ r=4, lora_alpha=4, target_modules=["to_q", "to_k", "to_v", "to_out.0"], init_lora_weights=False
+ )
+
+ prior_lora_attn_procs, prior_lora_layers = create_prior_lora_layers(prior)
+
+ lora_components = {
+ "prior_lora_layers": prior_lora_layers,
+ "prior_lora_attn_procs": prior_lora_attn_procs,
+ }
+
+ return prior, prior_lora_config, lora_components
+
+ @require_peft_backend
+ def test_inference_with_prior_lora(self):
+ _, prior_lora_config, _ = self.get_lora_components()
+ device = "cpu"
+
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe = pipe.to(device)
+
+ pipe.set_progress_bar_config(disable=None)
+
+ output_no_lora = pipe(**self.get_dummy_inputs(device))
+ image_embed = output_no_lora.image_embeddings
+ self.assertTrue(image_embed.shape == (1, 2, 24, 24))
+
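+ # attach a LoRA adapter via PEFT and check that inference still yields embeddings of the same shape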
+ pipe.prior.add_adapter(prior_lora_config)
+ self.assertTrue(self.check_if_lora_correctly_set(pipe.prior), "Lora not correctly set in prior")
+
+ output_lora = pipe(**self.get_dummy_inputs(device))
+ lora_image_embed = output_lora.image_embeddings
+
+ self.assertTrue(image_embed.shape == lora_image_embed.shape)
diff --git a/tests/schedulers/__init__.py b/tests/schedulers/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/schedulers/test_scheduler_consistency_model.py b/tests/schedulers/test_scheduler_consistency_model.py
new file mode 100644
index 0000000..4f773d7
--- /dev/null
+++ b/tests/schedulers/test_scheduler_consistency_model.py
@@ -0,0 +1,189 @@
+import torch
+
+from diffusers import CMStochasticIterativeScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class CMStochasticIterativeSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (CMStochasticIterativeScheduler,)
+ num_inference_steps = 10
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 201,
+ "sigma_min": 0.002,
+ "sigma_max": 80.0,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ # Override test_step_shape to add CMStochasticIterativeScheduler-specific logic regarding timesteps
+ # Problem is that we don't know two timesteps that will always be in the timestep schedule from only the scheduler
+ # config; scaled sigma_max is always in the timestep schedule, but sigma_min is in the sigma schedule while scaled
+ # sigma_min is not in the timestep schedule
+ def test_step_shape(self):
+ num_inference_steps = 10
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = self.scheduler_classes[0](**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ timestep_0 = scheduler.timesteps[0]
+ timestep_1 = scheduler.timesteps[1]
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ output_0 = scheduler.step(residual, timestep_0, sample).prev_sample
+ output_1 = scheduler.step(residual, timestep_1, sample).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_clip_denoised(self):
+ for clip_denoised in [True, False]:
+ self.check_over_configs(clip_denoised=clip_denoised)
+
+ def test_full_loop_no_noise_onestep(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 1
+ scheduler.set_timesteps(num_inference_steps)
+ timesteps = scheduler.timesteps
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ for i, t in enumerate(timesteps):
+ # 1. scale model input
+ scaled_sample = scheduler.scale_model_input(sample, t)
+
+ # 2. predict noise residual
+ residual = model(scaled_sample, t)
+
+ # 3. predict previous sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 192.7614) < 1e-2
+ assert abs(result_mean.item() - 0.2510) < 1e-3
+
+ def test_full_loop_no_noise_multistep(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [106, 0]
+ scheduler.set_timesteps(timesteps=timesteps)
+ timesteps = scheduler.timesteps
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ for t in timesteps:
+ # 1. scale model input
+ scaled_sample = scheduler.scale_model_input(sample, t)
+
+ # 2. predict noise residual
+ residual = model(scaled_sample, t)
+
+ # 3. predict previous sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 347.6357) < 1e-2
+ assert abs(result_mean.item() - 0.4527) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ t_start = 8
+
+ scheduler.set_timesteps(num_inference_steps)
+ timesteps = scheduler.timesteps
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ noise = self.dummy_noise_deter
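+ # resume denoising from an intermediate timestep, mimicking an image-to-image style loop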
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for t in timesteps:
+ # 1. scale model input
+ scaled_sample = scheduler.scale_model_input(sample, t)
+
+ # 2. predict noise residual
+ residual = model(scaled_sample, t)
+
+ # 3. predict previous sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 763.9186) < 1e-2, f" expected result sum 763.9186, but got {result_sum}"
+ assert abs(result_mean.item() - 0.9947) < 1e-3, f" expected result mean 0.9947, but got {result_mean}"
+
+ def test_custom_timesteps_increasing_order(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [39, 30, 12, 15, 0]
+
+ with self.assertRaises(ValueError, msg="`timesteps` must be in descending order."):
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ def test_custom_timesteps_passing_both_num_inference_steps_and_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [39, 30, 12, 1, 0]
+ num_inference_steps = len(timesteps)
+
+ with self.assertRaises(ValueError, msg="Can only pass one of `num_inference_steps` or `timesteps`."):
+ scheduler.set_timesteps(num_inference_steps=num_inference_steps, timesteps=timesteps)
+
+ def test_custom_timesteps_too_large(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [scheduler.config.num_train_timesteps]
+
+ with self.assertRaises(
+ ValueError,
+ msg=f"`timesteps` must start before `self.config.train_timesteps`: {scheduler.config.num_train_timesteps}",
+ ):
+ scheduler.set_timesteps(timesteps=timesteps)
diff --git a/tests/schedulers/test_scheduler_ddim.py b/tests/schedulers/test_scheduler_ddim.py
new file mode 100644
index 0000000..13b353a
--- /dev/null
+++ b/tests/schedulers/test_scheduler_ddim.py
@@ -0,0 +1,176 @@
+import torch
+
+from diffusers import DDIMScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class DDIMSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DDIMScheduler,)
+ forward_default_kwargs = (("eta", 0.0), ("num_inference_steps", 50))
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "clip_sample": True,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def full_loop(self, **config):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps, eta = 10, 0.0
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ for t in scheduler.timesteps:
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample, eta).prev_sample
+
+ return sample
+
+ def test_timesteps(self):
+ for timesteps in [100, 500, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_steps_offset(self):
+ for steps_offset in [0, 1]:
+ self.check_over_configs(steps_offset=steps_offset)
+
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(steps_offset=1)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(5)
+ assert torch.equal(scheduler.timesteps, torch.LongTensor([801, 601, 401, 201, 1]))
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_clip_sample(self):
+ for clip_sample in [True, False]:
+ self.check_over_configs(clip_sample=clip_sample)
+
+ def test_timestep_spacing(self):
+ for timestep_spacing in ["trailing", "leading"]:
+ self.check_over_configs(timestep_spacing=timestep_spacing)
+
+ def test_rescale_betas_zero_snr(self):
+ for rescale_betas_zero_snr in [True, False]:
+ self.check_over_configs(rescale_betas_zero_snr=rescale_betas_zero_snr)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ )
+
+ def test_time_indices(self):
+ for t in [1, 10, 49]:
+ self.check_over_forward(time_step=t)
+
+ def test_inference_steps(self):
+ for t, num_inference_steps in zip([1, 10, 50], [10, 50, 500]):
+ self.check_over_forward(time_step=t, num_inference_steps=num_inference_steps)
+
+ def test_eta(self):
+ for t, eta in zip([1, 10, 49], [0.0, 0.5, 1.0]):
+ self.check_over_forward(time_step=t, eta=eta)
+
+ def test_variance(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ assert torch.sum(torch.abs(scheduler._get_variance(0, 0) - 0.0)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(420, 400) - 0.14771)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(980, 960) - 0.32460)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(0, 0) - 0.0)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(487, 486) - 0.00979)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(999, 998) - 0.02)) < 1e-5
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 172.0067) < 1e-2
+ assert abs(result_mean.item() - 0.223967) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 52.5302) < 1e-2
+ assert abs(result_mean.item() - 0.0684) < 1e-3
+
+ def test_full_loop_with_set_alpha_to_one(self):
+ # We specify a different beta_start so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=True, beta_start=0.01)
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 149.8295) < 1e-2
+ assert abs(result_mean.item() - 0.1951) < 1e-3
+
+ def test_full_loop_with_no_set_alpha_to_one(self):
+ # We specify a different beta_start so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=False, beta_start=0.01)
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 149.0784) < 1e-2
+ assert abs(result_mean.item() - 0.1941) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps, eta = 10, 0.0
+ t_start = 8
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for t in timesteps:
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample, eta).prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 354.5418) < 1e-2, f" expected result sum 354.5418, but got {result_sum}"
+ assert abs(result_mean.item() - 0.4616) < 1e-3, f" expected result mean 0.4616, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_ddim_inverse.py b/tests/schedulers/test_scheduler_ddim_inverse.py
new file mode 100644
index 0000000..696f576
--- /dev/null
+++ b/tests/schedulers/test_scheduler_ddim_inverse.py
@@ -0,0 +1,135 @@
+import torch
+
+from diffusers import DDIMInverseScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class DDIMInverseSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DDIMInverseScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 50),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "clip_sample": True,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def full_loop(self, **config):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ for t in scheduler.timesteps:
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ return sample
+
+ def test_timesteps(self):
+ for timesteps in [100, 500, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_steps_offset(self):
+ for steps_offset in [0, 1]:
+ self.check_over_configs(steps_offset=steps_offset)
+
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(steps_offset=1)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(5)
+ assert torch.equal(scheduler.timesteps, torch.LongTensor([1, 201, 401, 601, 801]))
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_clip_sample(self):
+ for clip_sample in [True, False]:
+ self.check_over_configs(clip_sample=clip_sample)
+
+ def test_timestep_spacing(self):
+ for timestep_spacing in ["trailing", "leading"]:
+ self.check_over_configs(timestep_spacing=timestep_spacing)
+
+ def test_rescale_betas_zero_snr(self):
+ for rescale_betas_zero_snr in [True, False]:
+ self.check_over_configs(rescale_betas_zero_snr=rescale_betas_zero_snr)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ )
+
+ def test_time_indices(self):
+ for t in [1, 10, 49]:
+ self.check_over_forward(time_step=t)
+
+ def test_inference_steps(self):
+ for t, num_inference_steps in zip([1, 10, 50], [10, 50, 500]):
+ self.check_over_forward(time_step=t, num_inference_steps=num_inference_steps)
+
+ def test_add_noise_device(self):
+ pass
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 671.6816) < 1e-2
+ assert abs(result_mean.item() - 0.8746) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 1394.2185) < 1e-2
+ assert abs(result_mean.item() - 1.8154) < 1e-3
+
+ def test_full_loop_with_set_alpha_to_one(self):
+ # We specify a different beta_start so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=True, beta_start=0.01)
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 539.9622) < 1e-2
+ assert abs(result_mean.item() - 0.7031) < 1e-3
+
+ def test_full_loop_with_no_set_alpha_to_one(self):
+ # We specify a different beta_start so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=False, beta_start=0.01)
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 542.6722) < 1e-2
+ assert abs(result_mean.item() - 0.7066) < 1e-3
diff --git a/tests/schedulers/test_scheduler_ddim_parallel.py b/tests/schedulers/test_scheduler_ddim_parallel.py
new file mode 100644
index 0000000..5434d08
--- /dev/null
+++ b/tests/schedulers/test_scheduler_ddim_parallel.py
@@ -0,0 +1,216 @@
+# Copyright 2024 ParaDiGMS authors and The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch
+
+from diffusers import DDIMParallelScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class DDIMParallelSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DDIMParallelScheduler,)
+ forward_default_kwargs = (("eta", 0.0), ("num_inference_steps", 50))
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "clip_sample": True,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def full_loop(self, **config):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps, eta = 10, 0.0
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ for t in scheduler.timesteps:
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample, eta).prev_sample
+
+ return sample
+
+ def test_timesteps(self):
+ for timesteps in [100, 500, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_steps_offset(self):
+ for steps_offset in [0, 1]:
+ self.check_over_configs(steps_offset=steps_offset)
+
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(steps_offset=1)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(5)
+ assert torch.equal(scheduler.timesteps, torch.LongTensor([801, 601, 401, 201, 1]))
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_clip_sample(self):
+ for clip_sample in [True, False]:
+ self.check_over_configs(clip_sample=clip_sample)
+
+ def test_timestep_spacing(self):
+ for timestep_spacing in ["trailing", "leading"]:
+ self.check_over_configs(timestep_spacing=timestep_spacing)
+
+ def test_rescale_betas_zero_snr(self):
+ for rescale_betas_zero_snr in [True, False]:
+ self.check_over_configs(rescale_betas_zero_snr=rescale_betas_zero_snr)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ )
+
+ def test_time_indices(self):
+ for t in [1, 10, 49]:
+ self.check_over_forward(time_step=t)
+
+ def test_inference_steps(self):
+ for t, num_inference_steps in zip([1, 10, 50], [10, 50, 500]):
+ self.check_over_forward(time_step=t, num_inference_steps=num_inference_steps)
+
+ def test_eta(self):
+ for t, eta in zip([1, 10, 49], [0.0, 0.5, 1.0]):
+ self.check_over_forward(time_step=t, eta=eta)
+
+ def test_variance(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ assert torch.sum(torch.abs(scheduler._get_variance(0, 0) - 0.0)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(420, 400) - 0.14771)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(980, 960) - 0.32460)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(0, 0) - 0.0)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(487, 486) - 0.00979)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(999, 998) - 0.02)) < 1e-5
+
+ def test_batch_step_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps, eta = 10, 0.0
+ scheduler.set_timesteps(num_inference_steps)
+
+ model = self.dummy_model()
+ sample1 = self.dummy_sample_deter
+ sample2 = self.dummy_sample_deter + 0.1
+ sample3 = self.dummy_sample_deter - 0.1
+
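+ # stack three perturbed samples, pair each with its own timestep, and flatten so
+ # batch_step_no_noise denoises them as a single batch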
+ per_sample_batch = sample1.shape[0]
+ samples = torch.stack([sample1, sample2, sample3], dim=0)
+ timesteps = torch.arange(num_inference_steps)[0:3, None].repeat(1, per_sample_batch)
+
+ residual = model(samples.flatten(0, 1), timesteps.flatten(0, 1))
+ pred_prev_sample = scheduler.batch_step_no_noise(residual, timesteps.flatten(0, 1), samples.flatten(0, 1), eta)
+
+ result_sum = torch.sum(torch.abs(pred_prev_sample))
+ result_mean = torch.mean(torch.abs(pred_prev_sample))
+
+ assert abs(result_sum.item() - 1147.7904) < 1e-2
+ assert abs(result_mean.item() - 0.4982) < 1e-3
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 172.0067) < 1e-2
+ assert abs(result_mean.item() - 0.223967) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 52.5302) < 1e-2
+ assert abs(result_mean.item() - 0.0684) < 1e-3
+
+ def test_full_loop_with_set_alpha_to_one(self):
+ # We specify a different beta_start so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=True, beta_start=0.01)
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 149.8295) < 1e-2
+ assert abs(result_mean.item() - 0.1951) < 1e-3
+
+ def test_full_loop_with_no_set_alpha_to_one(self):
+ # We specify a different beta_start so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=False, beta_start=0.01)
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 149.0784) < 1e-2
+ assert abs(result_mean.item() - 0.1941) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps, eta = 10, 0.0
+ t_start = 8
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for t in timesteps:
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample, eta).prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 354.5418) < 1e-2, f" expected result sum 354.5418, but got {result_sum}"
+ assert abs(result_mean.item() - 0.4616) < 1e-3, f" expected result mean 0.4616, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_ddpm.py b/tests/schedulers/test_scheduler_ddpm.py
new file mode 100644
index 0000000..056b5d8
--- /dev/null
+++ b/tests/schedulers/test_scheduler_ddpm.py
@@ -0,0 +1,222 @@
+import torch
+
+from diffusers import DDPMScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class DDPMSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DDPMScheduler,)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "variance_type": "fixed_small",
+ "clip_sample": True,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [1, 5, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_variance_type(self):
+ for variance in ["fixed_small", "fixed_large", "other"]:
+ self.check_over_configs(variance_type=variance)
+
+ def test_clip_sample(self):
+ for clip_sample in [True, False]:
+ self.check_over_configs(clip_sample=clip_sample)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "sample", "v_prediction"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ )
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "sample", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_time_indices(self):
+ for t in [0, 500, 999]:
+ self.check_over_forward(time_step=t)
+
+ def test_variance(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ assert torch.sum(torch.abs(scheduler._get_variance(0) - 0.0)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(487) - 0.00979)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(999) - 0.02)) < 1e-5
+
+ def test_rescale_betas_zero_snr(self):
+ for rescale_betas_zero_snr in [True, False]:
+ self.check_over_configs(rescale_betas_zero_snr=rescale_betas_zero_snr)
+
+ def test_full_loop_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_trained_timesteps = len(scheduler)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(0)
+
+ for t in reversed(range(num_trained_timesteps)):
+ # 1. predict noise residual
+ residual = model(sample, t)
+
+ # 2. predict previous mean of sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+
+ # if t > 0:
+ # noise = self.dummy_sample_deter
+ # variance = scheduler.get_variance(t) ** (0.5) * noise
+ #
+ # sample = pred_prev_sample + variance
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 258.9606) < 1e-2
+ assert abs(result_mean.item() - 0.3372) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_trained_timesteps = len(scheduler)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(0)
+
+ for t in reversed(range(num_trained_timesteps)):
+ # 1. predict noise residual
+ residual = model(sample, t)
+
+ # 2. predict previous mean of sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+
+ # if t > 0:
+ # noise = self.dummy_sample_deter
+ # variance = scheduler.get_variance(t) ** (0.5) * noise
+ #
+ # sample = pred_prev_sample + variance
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 202.0296) < 1e-2
+ assert abs(result_mean.item() - 0.2631) < 1e-3
+
+ def test_custom_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 1, 0]
+
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ scheduler_timesteps = scheduler.timesteps
+
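+ # previous_timestep should walk down the custom schedule and return -1 after the final step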
+ for i, timestep in enumerate(scheduler_timesteps):
+ if i == len(timesteps) - 1:
+ expected_prev_t = -1
+ else:
+ expected_prev_t = timesteps[i + 1]
+
+ prev_t = scheduler.previous_timestep(timestep)
+ prev_t = prev_t.item()
+
+ self.assertEqual(prev_t, expected_prev_t)
+
+ def test_custom_timesteps_increasing_order(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 51, 0]
+
+ with self.assertRaises(ValueError, msg="`custom_timesteps` must be in descending order."):
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ def test_custom_timesteps_passing_both_num_inference_steps_and_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 1, 0]
+ num_inference_steps = len(timesteps)
+
+ with self.assertRaises(ValueError, msg="Can only pass one of `num_inference_steps` or `custom_timesteps`."):
+ scheduler.set_timesteps(num_inference_steps=num_inference_steps, timesteps=timesteps)
+
+ def test_custom_timesteps_too_large(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [scheduler.config.num_train_timesteps]
+
+ with self.assertRaises(
+ ValueError,
+ msg=f"`timesteps` must start before `self.config.train_timesteps`: {scheduler.config.num_train_timesteps}",
+ ):
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_trained_timesteps = len(scheduler)
+ t_start = num_trained_timesteps - 2
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(0)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for t in timesteps:
+ # 1. predict noise residual
+ residual = model(sample, t)
+
+ # 2. predict previous mean of sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 387.9466) < 1e-2, f" expected result sum 387.9466, but got {result_sum}"
+ assert abs(result_mean.item() - 0.5051) < 1e-3, f" expected result mean 0.5051, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_ddpm_parallel.py b/tests/schedulers/test_scheduler_ddpm_parallel.py
new file mode 100644
index 0000000..c358ad9
--- /dev/null
+++ b/tests/schedulers/test_scheduler_ddpm_parallel.py
@@ -0,0 +1,251 @@
+# Copyright 2024 ParaDiGMS authors and The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch
+
+from diffusers import DDPMParallelScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class DDPMParallelSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DDPMParallelScheduler,)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "variance_type": "fixed_small",
+ "clip_sample": True,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [1, 5, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_variance_type(self):
+ for variance in ["fixed_small", "fixed_large", "other"]:
+ self.check_over_configs(variance_type=variance)
+
+ def test_clip_sample(self):
+ for clip_sample in [True, False]:
+ self.check_over_configs(clip_sample=clip_sample)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "sample", "v_prediction"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ )
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "sample", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_time_indices(self):
+ for t in [0, 500, 999]:
+ self.check_over_forward(time_step=t)
+
+ def test_variance(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ assert torch.sum(torch.abs(scheduler._get_variance(0) - 0.0)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(487) - 0.00979)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(999) - 0.02)) < 1e-5
+
+ def test_rescale_betas_zero_snr(self):
+ for rescale_betas_zero_snr in [True, False]:
+ self.check_over_configs(rescale_betas_zero_snr=rescale_betas_zero_snr)
+
+ def test_batch_step_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_trained_timesteps = len(scheduler)
+
+ model = self.dummy_model()
+ sample1 = self.dummy_sample_deter
+ sample2 = self.dummy_sample_deter + 0.1
+ sample3 = self.dummy_sample_deter - 0.1
+
+ per_sample_batch = sample1.shape[0]
+ samples = torch.stack([sample1, sample2, sample3], dim=0)
+ timesteps = torch.arange(num_trained_timesteps)[0:3, None].repeat(1, per_sample_batch)
+
+ residual = model(samples.flatten(0, 1), timesteps.flatten(0, 1))
+ pred_prev_sample = scheduler.batch_step_no_noise(residual, timesteps.flatten(0, 1), samples.flatten(0, 1))
+
+ result_sum = torch.sum(torch.abs(pred_prev_sample))
+ result_mean = torch.mean(torch.abs(pred_prev_sample))
+
+ assert abs(result_sum.item() - 1153.1833) < 1e-2
+ assert abs(result_mean.item() - 0.5005) < 1e-3
+
+ def test_full_loop_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_trained_timesteps = len(scheduler)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(0)
+
+ for t in reversed(range(num_trained_timesteps)):
+ # 1. predict noise residual
+ residual = model(sample, t)
+
+ # 2. predict previous mean of sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 258.9606) < 1e-2
+ assert abs(result_mean.item() - 0.3372) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_trained_timesteps = len(scheduler)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(0)
+
+ for t in reversed(range(num_trained_timesteps)):
+ # 1. predict noise residual
+ residual = model(sample, t)
+
+ # 2. predict previous mean of sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 202.0296) < 1e-2
+ assert abs(result_mean.item() - 0.2631) < 1e-3
+
+ def test_custom_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 1, 0]
+
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ scheduler_timesteps = scheduler.timesteps
+
+ for i, timestep in enumerate(scheduler_timesteps):
+ if i == len(timesteps) - 1:
+ expected_prev_t = -1
+ else:
+ expected_prev_t = timesteps[i + 1]
+
+ prev_t = scheduler.previous_timestep(timestep)
+ prev_t = prev_t.item()
+
+ self.assertEqual(prev_t, expected_prev_t)
+
+ def test_custom_timesteps_increasing_order(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 51, 0]
+
+ with self.assertRaises(ValueError, msg="`custom_timesteps` must be in descending order."):
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ def test_custom_timesteps_passing_both_num_inference_steps_and_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 1, 0]
+ num_inference_steps = len(timesteps)
+
+ with self.assertRaises(ValueError, msg="Can only pass one of `num_inference_steps` or `custom_timesteps`."):
+ scheduler.set_timesteps(num_inference_steps=num_inference_steps, timesteps=timesteps)
+
+ def test_custom_timesteps_too_large(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [scheduler.config.num_train_timesteps]
+
+ with self.assertRaises(
+ ValueError,
+ msg=f"`timesteps` must start before `self.config.train_timesteps`: {scheduler.config.num_train_timesteps}",
+ ):
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_trained_timesteps = len(scheduler)
+ t_start = num_trained_timesteps - 2
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(0)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for t in timesteps:
+ # 1. predict noise residual
+ residual = model(sample, t)
+
+ # 2. predict previous mean of sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 387.9466) < 1e-2, f" expected result sum 387.9466, but got {result_sum}"
+ assert abs(result_mean.item() - 0.5051) < 1e-3, f" expected result mean 0.5051, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_deis.py b/tests/schedulers/test_scheduler_deis.py
new file mode 100644
index 0000000..b2823a0
--- /dev/null
+++ b/tests/schedulers/test_scheduler_deis.py
@@ -0,0 +1,265 @@
+import tempfile
+
+import torch
+
+from diffusers import (
+ DEISMultistepScheduler,
+ DPMSolverMultistepScheduler,
+ DPMSolverSinglestepScheduler,
+ UniPCMultistepScheduler,
+)
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class DEISMultistepSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DEISMultistepScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 25),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "solver_order": 2,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
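+ # save and reload the scheduler config, then check that step() outputs stay identical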
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ new_scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output, new_output = sample, sample
+ for t in range(time_step, time_step + scheduler.config.solver_order + 1):
+ t = scheduler.timesteps[t]
+ output = scheduler.step(residual, t, output, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, t, new_output, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ pass
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ # copy over dummy past residuals
+ new_scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residual (must be after setting timesteps)
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def full_loop(self, scheduler=None, **config):
+ if scheduler is None:
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ return sample
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ time_step_0 = scheduler.timesteps[5]
+ time_step_1 = scheduler.timesteps[6]
+
+ output_0 = scheduler.step(residual, time_step_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, time_step_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_switch(self):
+ # make sure that iterating over schedulers with same config names gives same results
+ # for defaults
+ scheduler = DEISMultistepScheduler(**self.get_scheduler_config())
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.23916) < 1e-3
+
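+ # round-trip the config through the other multistep schedulers and back to DEIS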
+ scheduler = DPMSolverSinglestepScheduler.from_config(scheduler.config)
+ scheduler = DPMSolverMultistepScheduler.from_config(scheduler.config)
+ scheduler = UniPCMultistepScheduler.from_config(scheduler.config)
+ scheduler = DEISMultistepScheduler.from_config(scheduler.config)
+
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.23916) < 1e-3
+
+ def test_timesteps(self):
+ for timesteps in [25, 50, 100, 999, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for order in [1, 2, 3]:
+ for solver_type in ["logrho"]:
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ algorithm_type="deis",
+ solver_order=order,
+ solver_type=solver_type,
+ )
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_solver_order_and_type(self):
+ for algorithm_type in ["deis"]:
+ for solver_type in ["logrho"]:
+ for order in [1, 2, 3]:
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ sample = self.full_loop(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ assert not torch.isnan(sample).any(), "Samples have nan numbers"
+
+ def test_lower_order_final(self):
+ self.check_over_configs(lower_order_final=True)
+ self.check_over_configs(lower_order_final=False)
+
+ def test_inference_steps(self):
+ for num_inference_steps in [1, 2, 3, 5, 10, 50, 100, 999, 1000]:
+ self.check_over_forward(num_inference_steps=num_inference_steps, time_step=0)
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.23916) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.091) < 1e-3
+
+ def test_fp16_support(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(thresholding=True, dynamic_thresholding_ratio=0)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.half()
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ assert sample.dtype == torch.float16
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ t_start = 8
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 315.3016) < 1e-2, f" expected result sum 315.3016, but got {result_sum}"
+ assert abs(result_mean.item() - 0.41054) < 1e-3, f" expected result mean 0.41054, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_dpm_multi.py b/tests/schedulers/test_scheduler_dpm_multi.py
new file mode 100644
index 0000000..fcf8881
--- /dev/null
+++ b/tests/schedulers/test_scheduler_dpm_multi.py
@@ -0,0 +1,318 @@
+import tempfile
+
+import torch
+
+from diffusers import (
+ DEISMultistepScheduler,
+ DPMSolverMultistepScheduler,
+ DPMSolverSinglestepScheduler,
+ UniPCMultistepScheduler,
+)
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class DPMSolverMultistepSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DPMSolverMultistepScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 25),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "solver_order": 2,
+ "prediction_type": "epsilon",
+ "thresholding": False,
+ "sample_max_value": 1.0,
+ "algorithm_type": "dpmsolver++",
+ "solver_type": "midpoint",
+ "lower_order_final": False,
+ "euler_at_final": False,
+ "lambda_min_clipped": -float("inf"),
+ "variance_type": None,
+ "final_sigmas_type": "sigma_min",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ new_scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output, new_output = sample, sample
+ for t in range(time_step, time_step + scheduler.config.solver_order + 1):
+ t = new_scheduler.timesteps[t]
+ output = scheduler.step(residual, t, output, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, t, new_output, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
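+ # The save/load round-trip is already exercised in check_over_configs above, so the common test is a no-op here.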
+ def test_from_save_pretrained(self):
+ pass
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ # copy over dummy past residuals
+ new_scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residual (must be after setting timesteps)
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ time_step = new_scheduler.timesteps[time_step]
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
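+ # Run a full 10-step denoising loop with the dummy model and return the final sample.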
+ def full_loop(self, scheduler=None, **config):
+ if scheduler is None:
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ return sample
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ time_step_0 = scheduler.timesteps[5]
+ time_step_1 = scheduler.timesteps[6]
+
+ output_0 = scheduler.step(residual, time_step_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, time_step_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_timesteps(self):
+ for timesteps in [25, 50, 100, 999, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for order in [1, 2, 3]:
+ for solver_type in ["midpoint", "heun"]:
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ algorithm_type="dpmsolver++",
+ solver_order=order,
+ solver_type=solver_type,
+ )
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_solver_order_and_type(self):
+ for algorithm_type in ["dpmsolver", "dpmsolver++", "sde-dpmsolver", "sde-dpmsolver++"]:
+ for solver_type in ["midpoint", "heun"]:
+ for order in [1, 2, 3]:
+ for prediction_type in ["epsilon", "sample"]:
+ if algorithm_type in ["sde-dpmsolver", "sde-dpmsolver++"]:
+ if order == 3:
+ continue
+ else:
+ self.check_over_configs(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ sample = self.full_loop(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ assert not torch.isnan(sample).any(), "Samples have nan numbers"
+
+ def test_lower_order_final(self):
+ self.check_over_configs(lower_order_final=True)
+ self.check_over_configs(lower_order_final=False)
+
+ def test_euler_at_final(self):
+ self.check_over_configs(euler_at_final=True)
+ self.check_over_configs(euler_at_final=False)
+
+ def test_lambda_min_clipped(self):
+ self.check_over_configs(lambda_min_clipped=-float("inf"))
+ self.check_over_configs(lambda_min_clipped=-5.1)
+
+ def test_variance_type(self):
+ self.check_over_configs(variance_type=None)
+ self.check_over_configs(variance_type="learned_range")
+
+ def test_inference_steps(self):
+ for num_inference_steps in [1, 2, 3, 5, 10, 50, 100, 999, 1000]:
+ self.check_over_forward(num_inference_steps=num_inference_steps, time_step=0)
+
+ def test_rescale_betas_zero_snr(self):
+ for rescale_betas_zero_snr in [True, False]:
+ self.check_over_configs(rescale_betas_zero_snr=rescale_betas_zero_snr)
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.3301) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ t_start = 5
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 318.4111) < 1e-2, f"expected result sum 318.4111, but got {result_sum}"
+ assert abs(result_mean.item() - 0.4146) < 1e-3, f"expected result mean 0.4146, but got {result_mean}"
+
+ def test_full_loop_no_noise_thres(self):
+ sample = self.full_loop(thresholding=True, dynamic_thresholding_ratio=0.87, sample_max_value=0.5)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 1.1364) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2251) < 1e-3
+
+ def test_full_loop_with_karras_and_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction", use_karras_sigmas=True)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2096) < 1e-3
+
+ def test_full_loop_with_lu_and_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction", use_lu_lambdas=True)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.1554) < 1e-3
+
+ def test_switch(self):
+ # make sure that switching between schedulers that share the same config
+ # gives the same results for the default config
+ scheduler = DPMSolverMultistepScheduler(**self.get_scheduler_config())
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.3301) < 1e-3
+
+ scheduler = DPMSolverSinglestepScheduler.from_config(scheduler.config)
+ scheduler = UniPCMultistepScheduler.from_config(scheduler.config)
+ scheduler = DEISMultistepScheduler.from_config(scheduler.config)
+ scheduler = DPMSolverMultistepScheduler.from_config(scheduler.config)
+
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.3301) < 1e-3
+
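+ # Thresholding on half-precision inputs must not upcast the sample; the final output has to stay in float16.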
+ def test_fp16_support(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(thresholding=True, dynamic_thresholding_ratio=0)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.half()
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ assert sample.dtype == torch.float16
+
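+ # Requesting as many inference steps as training timesteps should still yield exactly one timestep per step.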
+ def test_duplicated_timesteps(self, **config):
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(scheduler.config.num_train_timesteps)
+ assert len(scheduler.timesteps) == scheduler.num_inference_steps
diff --git a/tests/schedulers/test_scheduler_dpm_multi_inverse.py b/tests/schedulers/test_scheduler_dpm_multi_inverse.py
new file mode 100644
index 0000000..014c901
--- /dev/null
+++ b/tests/schedulers/test_scheduler_dpm_multi_inverse.py
@@ -0,0 +1,267 @@
+import tempfile
+
+import torch
+
+from diffusers import DPMSolverMultistepInverseScheduler, DPMSolverMultistepScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class DPMSolverMultistepSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DPMSolverMultistepInverseScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 25),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "solver_order": 2,
+ "prediction_type": "epsilon",
+ "thresholding": False,
+ "sample_max_value": 1.0,
+ "algorithm_type": "dpmsolver++",
+ "solver_type": "midpoint",
+ "lower_order_final": False,
+ "lambda_min_clipped": -float("inf"),
+ "variance_type": None,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ new_scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output, new_output = sample, sample
+ for t in range(time_step, time_step + scheduler.config.solver_order + 1):
+ t = scheduler.timesteps[t]
+ output = scheduler.step(residual, t, output, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, t, new_output, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ pass
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ # copy over dummy past residuals
+ new_scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residual (must be after setting timesteps)
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def full_loop(self, scheduler=None, **config):
+ if scheduler is None:
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ return sample
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ time_step_0 = scheduler.timesteps[5]
+ time_step_1 = scheduler.timesteps[6]
+
+ output_0 = scheduler.step(residual, time_step_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, time_step_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_timesteps(self):
+ for timesteps in [25, 50, 100, 999, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for order in [1, 2, 3]:
+ for solver_type in ["midpoint", "heun"]:
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ algorithm_type="dpmsolver++",
+ solver_order=order,
+ solver_type=solver_type,
+ )
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_solver_order_and_type(self):
+ for algorithm_type in ["dpmsolver", "dpmsolver++"]:
+ for solver_type in ["midpoint", "heun"]:
+ for order in [1, 2, 3]:
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ sample = self.full_loop(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ assert not torch.isnan(sample).any(), "Samples have nan numbers"
+
+ def test_lower_order_final(self):
+ self.check_over_configs(lower_order_final=True)
+ self.check_over_configs(lower_order_final=False)
+
+ def test_lambda_min_clipped(self):
+ self.check_over_configs(lambda_min_clipped=-float("inf"))
+ self.check_over_configs(lambda_min_clipped=-5.1)
+
+ def test_variance_type(self):
+ self.check_over_configs(variance_type=None)
+ self.check_over_configs(variance_type="learned_range")
+
+ def test_timestep_spacing(self):
+ for timestep_spacing in ["trailing", "leading"]:
+ self.check_over_configs(timestep_spacing=timestep_spacing)
+
+ def test_inference_steps(self):
+ for num_inference_steps in [1, 2, 3, 5, 10, 50, 100, 999, 1000]:
+ self.check_over_forward(num_inference_steps=num_inference_steps, time_step=0)
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.7047) < 1e-3
+
+ def test_full_loop_no_noise_thres(self):
+ sample = self.full_loop(thresholding=True, dynamic_thresholding_ratio=0.87, sample_max_value=0.5)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 19.8933) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 1.5194) < 1e-3
+
+ def test_full_loop_with_karras_and_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction", use_karras_sigmas=True)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 1.7833) < 2e-3
+
+ def test_switch(self):
+ # make sure that switching between schedulers that share the same config
+ # gives the same results for the default config
+ scheduler = DPMSolverMultistepInverseScheduler(**self.get_scheduler_config())
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.7047) < 1e-3
+
+ scheduler = DPMSolverMultistepScheduler.from_config(scheduler.config)
+ scheduler = DPMSolverMultistepInverseScheduler.from_config(scheduler.config)
+
+ sample = self.full_loop(scheduler=scheduler)
+ new_result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(new_result_mean.item() - result_mean.item()) < 1e-3
+
+ def test_fp16_support(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(thresholding=True, dynamic_thresholding_ratio=0)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.half()
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ assert sample.dtype == torch.float16
+
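+ # Even over the full training schedule, the inverse scheduler must produce only unique timesteps.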
+ def test_unique_timesteps(self, **config):
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(scheduler.config.num_train_timesteps)
+ assert len(scheduler.timesteps.unique()) == scheduler.num_inference_steps
diff --git a/tests/schedulers/test_scheduler_dpm_sde.py b/tests/schedulers/test_scheduler_dpm_sde.py
new file mode 100644
index 0000000..253a0a4
--- /dev/null
+++ b/tests/schedulers/test_scheduler_dpm_sde.py
@@ -0,0 +1,167 @@
+import torch
+
+from diffusers import DPMSolverSDEScheduler
+from diffusers.utils.testing_utils import require_torchsde, torch_device
+
+from .test_schedulers import SchedulerCommonTest
+
+
+@require_torchsde
+class DPMSolverSDESchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DPMSolverSDEScheduler,)
+ num_inference_steps = 10
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1100,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "noise_sampler_seed": 0,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.00001, 0.0001, 0.001], [0.0002, 0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_full_loop_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["mps"]:
+ assert abs(result_sum.item() - 167.47821044921875) < 1e-2
+ assert abs(result_mean.item() - 0.2178705964565277) < 1e-3
+ elif torch_device in ["cuda"]:
+ assert abs(result_sum.item() - 171.59352111816406) < 1e-2
+ assert abs(result_mean.item() - 0.22342906892299652) < 1e-3
+ else:
+ assert abs(result_sum.item() - 162.52383422851562) < 1e-2
+ assert abs(result_mean.item() - 0.211619570851326) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["mps"]:
+ assert abs(result_sum.item() - 124.77149200439453) < 1e-2
+ assert abs(result_mean.item() - 0.16226289014816284) < 1e-3
+ elif torch_device in ["cuda"]:
+ assert abs(result_sum.item() - 128.1663360595703) < 1e-2
+ assert abs(result_mean.item() - 0.16688326001167297) < 1e-3
+ else:
+ assert abs(result_sum.item() - 119.8487548828125) < 1e-2
+ assert abs(result_mean.item() - 0.1560530662536621) < 1e-3
+
+ def test_full_loop_device(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.to(torch_device) * scheduler.init_noise_sigma
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["mps"]:
+ assert abs(result_sum.item() - 167.46957397460938) < 1e-2
+ assert abs(result_mean.item() - 0.21805934607982635) < 1e-3
+ elif torch_device in ["cuda"]:
+ assert abs(result_sum.item() - 171.59353637695312) < 1e-2
+ assert abs(result_mean.item() - 0.22342908382415771) < 1e-3
+ else:
+ assert abs(result_sum.item() - 162.52383422851562) < 1e-2
+ assert abs(result_mean.item() - 0.211619570851326) < 1e-3
+
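+ # Same denoising loop as above, but with Karras sigma spacing; the reference values differ per device.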
+ def test_full_loop_device_karras_sigmas(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config, use_karras_sigmas=True)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.to(torch_device) * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["mps"]:
+ assert abs(result_sum.item() - 176.66974135742188) < 1e-2
+ assert abs(result_mean.item() - 0.23003872730981811) < 1e-2
+ elif torch_device in ["cuda"]:
+ assert abs(result_sum.item() - 177.63653564453125) < 1e-2
+ assert abs(result_mean.item() - 0.23003872730981811) < 1e-2
+ else:
+ assert abs(result_sum.item() - 170.3135223388672) < 1e-2
+ assert abs(result_mean.item() - 0.23003872730981811) < 1e-2
diff --git a/tests/schedulers/test_scheduler_dpm_single.py b/tests/schedulers/test_scheduler_dpm_single.py
new file mode 100644
index 0000000..251a150
--- /dev/null
+++ b/tests/schedulers/test_scheduler_dpm_single.py
@@ -0,0 +1,309 @@
+import tempfile
+
+import torch
+
+from diffusers import (
+ DEISMultistepScheduler,
+ DPMSolverMultistepScheduler,
+ DPMSolverSinglestepScheduler,
+ UniPCMultistepScheduler,
+)
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class DPMSolverSinglestepSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (DPMSolverSinglestepScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 25),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "solver_order": 2,
+ "prediction_type": "epsilon",
+ "thresholding": False,
+ "sample_max_value": 1.0,
+ "algorithm_type": "dpmsolver++",
+ "solver_type": "midpoint",
+ "lambda_min_clipped": -float("inf"),
+ "variance_type": None,
+ "final_sigmas_type": "sigma_min",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ new_scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output, new_output = sample, sample
+ for t in range(time_step, time_step + scheduler.config.solver_order + 1):
+ t = scheduler.timesteps[t]
+ output = scheduler.step(residual, t, output, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, t, new_output, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ pass
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ # copy over dummy past residuals
+ new_scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residual (must be after setting timesteps)
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def full_loop(self, scheduler=None, **config):
+ if scheduler is None:
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ return sample
+
+ def test_full_uneven_loop(self):
+ scheduler = DPMSolverSinglestepScheduler(**self.get_scheduler_config())
+ num_inference_steps = 50
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ # make sure that the first t is uneven
+ for i, t in enumerate(scheduler.timesteps[3:]):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2574) < 1e-3
+
+ def test_timesteps(self):
+ for timesteps in [25, 50, 100, 999, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_switch(self):
+ # make sure that switching between schedulers that share the same config
+ # gives the same results for the default config
+ scheduler = DPMSolverSinglestepScheduler(**self.get_scheduler_config())
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2791) < 1e-3
+
+ scheduler = DEISMultistepScheduler.from_config(scheduler.config)
+ scheduler = DPMSolverMultistepScheduler.from_config(scheduler.config)
+ scheduler = UniPCMultistepScheduler.from_config(scheduler.config)
+ scheduler = DPMSolverSinglestepScheduler.from_config(scheduler.config)
+
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2791) < 1e-3
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for order in [1, 2, 3]:
+ for solver_type in ["midpoint", "heun"]:
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ algorithm_type="dpmsolver++",
+ solver_order=order,
+ solver_type=solver_type,
+ )
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_solver_order_and_type(self):
+ for algorithm_type in ["dpmsolver", "dpmsolver++"]:
+ for solver_type in ["midpoint", "heun"]:
+ for order in [1, 2, 3]:
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ sample = self.full_loop(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ assert not torch.isnan(sample).any(), "Samples have nan numbers"
+
+ def test_lower_order_final(self):
+ self.check_over_configs(lower_order_final=True)
+ self.check_over_configs(lower_order_final=False)
+
+ def test_lambda_min_clipped(self):
+ self.check_over_configs(lambda_min_clipped=-float("inf"))
+ self.check_over_configs(lambda_min_clipped=-5.1)
+
+ def test_variance_type(self):
+ self.check_over_configs(variance_type=None)
+ self.check_over_configs(variance_type="learned_range")
+
+ def test_inference_steps(self):
+ for num_inference_steps in [1, 2, 3, 5, 10, 50, 100, 999, 1000]:
+ self.check_over_forward(num_inference_steps=num_inference_steps, time_step=0)
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2791) < 1e-3
+
+ def test_full_loop_with_karras(self):
+ sample = self.full_loop(use_karras_sigmas=True)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2248) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.1453) < 1e-3
+
+ def test_full_loop_with_karras_and_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction", use_karras_sigmas=True)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.0649) < 1e-3
+
+ def test_fp16_support(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(thresholding=True, dynamic_thresholding_ratio=0)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.half()
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ assert sample.dtype == torch.float16
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ time_step_0 = scheduler.timesteps[0]
+ time_step_1 = scheduler.timesteps[1]
+
+ output_0 = scheduler.step(residual, time_step_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, time_step_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ t_start = 5
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 269.2187) < 1e-2, f"expected result sum 269.2187, but got {result_sum}"
+ assert abs(result_mean.item() - 0.3505) < 1e-3, f"expected result mean 0.3505, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_edm_dpmsolver_multistep.py b/tests/schedulers/test_scheduler_edm_dpmsolver_multistep.py
new file mode 100644
index 0000000..b5522f5
--- /dev/null
+++ b/tests/schedulers/test_scheduler_edm_dpmsolver_multistep.py
@@ -0,0 +1,262 @@
+import tempfile
+import unittest
+
+import torch
+
+from diffusers import (
+ EDMDPMSolverMultistepScheduler,
+)
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class EDMDPMSolverMultistepSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (EDMDPMSolverMultistepScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 25),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "sigma_min": 0.002,
+ "sigma_max": 80.0,
+ "sigma_data": 0.5,
+ "num_train_timesteps": 1000,
+ "solver_order": 2,
+ "prediction_type": "epsilon",
+ "thresholding": False,
+ "sample_max_value": 1.0,
+ "algorithm_type": "dpmsolver++",
+ "solver_type": "midpoint",
+ "lower_order_final": False,
+ "euler_at_final": False,
+ "final_sigmas_type": "sigma_min",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ new_scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output, new_output = sample, sample
+ for t in range(time_step, time_step + scheduler.config.solver_order + 1):
+ t = new_scheduler.timesteps[t]
+ output = scheduler.step(residual, t, output, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, t, new_output, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ pass
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ # copy over dummy past residuals
+ new_scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residual (must be after setting timesteps)
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ time_step = new_scheduler.timesteps[time_step]
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def full_loop(self, scheduler=None, **config):
+ if scheduler is None:
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ return sample
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ time_step_0 = scheduler.timesteps[5]
+ time_step_1 = scheduler.timesteps[6]
+
+ output_0 = scheduler.step(residual, time_step_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, time_step_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_timesteps(self):
+ for timesteps in [25, 50, 100, 999, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for order in [1, 2, 3]:
+ for solver_type in ["midpoint", "heun"]:
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ algorithm_type="dpmsolver++",
+ solver_order=order,
+ solver_type=solver_type,
+ )
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ # TODO (patil-suraj): Fix this test
+ @unittest.skip("Skip for now, as it is currently failing but works with the actual model")
+ def test_solver_order_and_type(self):
+ for algorithm_type in ["dpmsolver++", "sde-dpmsolver++"]:
+ for solver_type in ["midpoint", "heun"]:
+ for order in [1, 2, 3]:
+ for prediction_type in ["epsilon", "v_prediction"]:
+ if algorithm_type == "sde-dpmsolver++":
+ if order == 3:
+ continue
+ else:
+ self.check_over_configs(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ sample = self.full_loop(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ algorithm_type=algorithm_type,
+ )
+ assert (
+ not torch.isnan(sample).any()
+ ), f"Samples have nan numbers, {order}, {solver_type}, {prediction_type}, {algorithm_type}"
+
+ def test_lower_order_final(self):
+ self.check_over_configs(lower_order_final=True)
+ self.check_over_configs(lower_order_final=False)
+
+ def test_euler_at_final(self):
+ self.check_over_configs(euler_at_final=True)
+ self.check_over_configs(euler_at_final=False)
+
+ def test_inference_steps(self):
+ for num_inference_steps in [1, 2, 3, 5, 10, 50, 100, 999, 1000]:
+ self.check_over_forward(num_inference_steps=num_inference_steps, time_step=0)
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.0001) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ t_start = 5
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 8.1661) < 1e-2, f"expected result sum 8.1661, but got {result_sum}"
+ assert abs(result_mean.item() - 0.0106) < 1e-3, f"expected result mean 0.0106, but got {result_mean}"
+
+ def test_full_loop_no_noise_thres(self):
+ sample = self.full_loop(thresholding=True, dynamic_thresholding_ratio=0.87, sample_max_value=0.5)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.0080) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.0092) < 1e-3
+
+ def test_duplicated_timesteps(self, **config):
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(scheduler.config.num_train_timesteps)
+ assert len(scheduler.timesteps) == scheduler.num_inference_steps
+
+ def test_trained_betas(self):
+ pass
diff --git a/tests/schedulers/test_scheduler_edm_euler.py b/tests/schedulers/test_scheduler_edm_euler.py
new file mode 100644
index 0000000..9d2adea
--- /dev/null
+++ b/tests/schedulers/test_scheduler_edm_euler.py
@@ -0,0 +1,206 @@
+import inspect
+import tempfile
+import unittest
+from typing import Dict, List, Tuple
+
+import torch
+
+from diffusers import EDMEulerScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class EDMEulerSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (EDMEulerScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 10),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 256,
+ "sigma_min": 0.002,
+ "sigma_max": 80.0,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_full_loop_no_noise(self, num_inference_steps=10, seed=0):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ for i, t in enumerate(scheduler.timesteps):
+ scaled_sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(scaled_sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 34.1855) < 1e-3
+ assert abs(result_mean.item() - 0.044) < 1e-3
+
+ def test_full_loop_device(self, num_inference_steps=10, seed=0):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ for i, t in enumerate(scheduler.timesteps):
+ scaled_sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(scaled_sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 34.1855) < 1e-3
+ assert abs(result_mean.item() - 0.044) < 1e-3
+
+ # Override test_from_save_pretrained to use EDMEulerScheduler-specific logic
+ def test_from_save_pretrained(self):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+
+ scheduler.set_timesteps(num_inference_steps)
+ new_scheduler.set_timesteps(num_inference_steps)
+ timestep = scheduler.timesteps[0]
+
+ sample = self.dummy_sample
+
+ scaled_sample = scheduler.scale_model_input(sample, timestep)
+ residual = 0.1 * scaled_sample
+
+ new_scaled_sample = new_scheduler.scale_model_input(sample, timestep)
+ new_residual = 0.1 * new_scaled_sample
+
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ output = scheduler.step(residual, timestep, sample, **kwargs).prev_sample
+
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ new_output = new_scheduler.step(new_residual, timestep, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ # Override test_step_shape to use EDMEulerScheduler-specific logic
+ def test_step_shape(self):
+ num_inference_steps = 10
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = self.scheduler_classes[0](**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ timestep_0 = scheduler.timesteps[0]
+ timestep_1 = scheduler.timesteps[1]
+
+ sample = self.dummy_sample
+ scaled_sample = scheduler.scale_model_input(sample, timestep_0)
+ residual = 0.1 * scaled_sample
+
+ output_0 = scheduler.step(residual, timestep_0, sample).prev_sample
+ output_1 = scheduler.step(residual, timestep_1, sample).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ # Override test_scheduler_outputs_equivalence to use EDMEulerScheduler-specific logic
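+ # The residual is computed from the scaled model input, matching how EDM schedulers are called, before comparing tuple and dict outputs.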
+ def test_scheduler_outputs_equivalence(self):
+ def set_nan_tensor_to_zero(t):
+ t[t != t] = 0
+ return t
+
+ def recursive_check(tuple_object, dict_object):
+ if isinstance(tuple_object, (List, Tuple)):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif isinstance(tuple_object, Dict):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object.values(), dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif tuple_object is None:
+ return
+ else:
+ self.assertTrue(
+ torch.allclose(
+ set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5
+ ),
+ msg=(
+ "Tuple and dict output are not equal. Difference:"
+ f" {torch.max(torch.abs(tuple_object - dict_object))}. Tuple has `nan`:"
+ f" {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object).any()}. Dict has"
+ f" `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object).any()}."
+ ),
+ )
+
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", 50)
+
+ timestep = 0
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+ timestep = scheduler.timesteps[0]
+
+ sample = self.dummy_sample
+ scaled_sample = scheduler.scale_model_input(sample, timestep)
+ residual = 0.1 * scaled_sample
+
+ # Set the seed before stepping, as some schedulers are stochastic (e.g. EulerAncestralDiscreteScheduler, EulerDiscreteScheduler)
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ outputs_dict = scheduler.step(residual, timestep, sample, **kwargs)
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ scaled_sample = scheduler.scale_model_input(sample, timestep)
+ residual = 0.1 * scaled_sample
+
+ # Set the seed before stepping, as some schedulers are stochastic (e.g. EulerAncestralDiscreteScheduler, EulerDiscreteScheduler)
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ outputs_tuple = scheduler.step(residual, timestep, sample, return_dict=False, **kwargs)
+
+ recursive_check(outputs_tuple, outputs_dict)
+
+ @unittest.skip(reason="EDMEulerScheduler does not support beta schedules.")
+ def test_trained_betas(self):
+ pass
diff --git a/tests/schedulers/test_scheduler_euler.py b/tests/schedulers/test_scheduler_euler.py
new file mode 100644
index 0000000..41c418c
--- /dev/null
+++ b/tests/schedulers/test_scheduler_euler.py
@@ -0,0 +1,191 @@
+import torch
+
+from diffusers import EulerDiscreteScheduler
+from diffusers.utils.testing_utils import torch_device
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class EulerDiscreteSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (EulerDiscreteScheduler,)
+ num_inference_steps = 10
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1100,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.00001, 0.0001, 0.001], [0.0002, 0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_timestep_type(self):
+ timestep_types = ["discrete", "continuous"]
+ for timestep_type in timestep_types:
+ self.check_over_configs(timestep_type=timestep_type)
+
+ def test_karras_sigmas(self):
+ self.check_over_configs(use_karras_sigmas=True, sigma_min=0.02, sigma_max=700.0)
+
+ def test_rescale_betas_zero_snr(self):
+ for rescale_betas_zero_snr in [True, False]:
+ self.check_over_configs(rescale_betas_zero_snr=rescale_betas_zero_snr)
+
+ def test_full_loop_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 10.0807) < 1e-2
+ assert abs(result_mean.item() - 0.0131) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 0.0002) < 1e-2
+ assert abs(result_mean.item() - 2.2676e-06) < 1e-3
+
+ def test_full_loop_device(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma.cpu()
+ sample = sample.to(torch_device)
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 10.0807) < 1e-2
+ assert abs(result_mean.item() - 0.0131) < 1e-3
+
+ def test_full_loop_device_karras_sigmas(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config, use_karras_sigmas=True)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma.cpu()
+ sample = sample.to(torch_device)
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 124.52299499511719) < 1e-2
+ assert abs(result_mean.item() - 0.16213932633399963) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ # add noise
+ t_start = self.num_inference_steps - 2
+ noise = self.dummy_noise_deter
+ noise = noise.to(sample.device)
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 57062.9297) < 1e-2, f"expected result sum 57062.9297, but got {result_sum}"
+ assert abs(result_mean.item() - 74.3007) < 1e-3, f"expected result mean 74.3007, but got {result_mean}"
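
The full-loop tests above all follow the same manual sampling pattern: scale the latents with `scale_model_input`, predict a noise residual, then call `step` with a `generator`, since the Euler schedulers can draw noise inside `step`. The sketch below repeats that pattern outside the test harness; the tiny `unet` function and the tensor shapes are placeholders, not anything defined in this diff.

```python
import torch

from diffusers import EulerDiscreteScheduler

scheduler = EulerDiscreteScheduler(beta_start=0.0001, beta_end=0.02, beta_schedule="linear")
scheduler.set_timesteps(10)

generator = torch.manual_seed(0)
# Start from pure noise scaled to the scheduler's initial sigma.
sample = torch.randn((1, 3, 8, 8), generator=generator) * scheduler.init_noise_sigma


def unet(x, t):
    # Placeholder for a real noise-prediction model such as UNet2DModel.
    return x * t / (t + 1)


for t in scheduler.timesteps:
    model_input = scheduler.scale_model_input(sample, t)
    noise_pred = unet(model_input, t)
    # Passing a generator keeps the stochastic part of the Euler step reproducible.
    sample = scheduler.step(noise_pred, t, sample, generator=generator).prev_sample
```
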
diff --git a/tests/schedulers/test_scheduler_euler_ancestral.py b/tests/schedulers/test_scheduler_euler_ancestral.py
new file mode 100644
index 0000000..9f22ab3
--- /dev/null
+++ b/tests/schedulers/test_scheduler_euler_ancestral.py
@@ -0,0 +1,156 @@
+import torch
+
+from diffusers import EulerAncestralDiscreteScheduler
+from diffusers.utils.testing_utils import torch_device
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class EulerAncestralDiscreteSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (EulerAncestralDiscreteScheduler,)
+ num_inference_steps = 10
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1100,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.00001, 0.0001, 0.001], [0.0002, 0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_rescale_betas_zero_snr(self):
+ for rescale_betas_zero_snr in [True, False]:
+ self.check_over_configs(rescale_betas_zero_snr=rescale_betas_zero_snr)
+
+ def test_full_loop_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma.cpu()
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 152.3192) < 1e-2
+ assert abs(result_mean.item() - 0.1983) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 108.4439) < 1e-2
+ assert abs(result_mean.item() - 0.1412) < 1e-3
+
+ def test_full_loop_device(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma.cpu()
+ sample = sample.to(torch_device)
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 152.3192) < 1e-2
+ assert abs(result_mean.item() - 0.1983) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ t_start = self.num_inference_steps - 2
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ # add noise
+ noise = self.dummy_noise_deter
+ noise = noise.to(sample.device)
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 56163.0508) < 1e-2, f"expected result sum 56163.0508, but got {result_sum}"
+ assert abs(result_mean.item() - 73.1290) < 1e-3, f"expected result mean 73.1290, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_flax.py b/tests/schedulers/test_scheduler_flax.py
new file mode 100644
index 0000000..2855f09
--- /dev/null
+++ b/tests/schedulers/test_scheduler_flax.py
@@ -0,0 +1,919 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import inspect
+import tempfile
+import unittest
+from typing import Dict, List, Tuple
+
+from diffusers import FlaxDDIMScheduler, FlaxDDPMScheduler, FlaxPNDMScheduler
+from diffusers.utils import is_flax_available
+from diffusers.utils.testing_utils import require_flax
+
+
+if is_flax_available():
+ import jax
+ import jax.numpy as jnp
+ from jax import random
+
+ jax_device = jax.default_backend()
+
+
+@require_flax
+class FlaxSchedulerCommonTest(unittest.TestCase):
+ scheduler_classes = ()
+ forward_default_kwargs = ()
+
+ @property
+ def dummy_sample(self):
+ batch_size = 4
+ num_channels = 3
+ height = 8
+ width = 8
+
+ key1, key2 = random.split(random.PRNGKey(0))
+ sample = random.uniform(key1, (batch_size, num_channels, height, width))
+
+ return sample, key2
+
+ @property
+ def dummy_sample_deter(self):
+ batch_size = 4
+ num_channels = 3
+ height = 8
+ width = 8
+
+ num_elems = batch_size * num_channels * height * width
+ sample = jnp.arange(num_elems)
+ sample = sample.reshape(num_channels, height, width, batch_size)
+ sample = sample / num_elems
+ return jnp.transpose(sample, (3, 0, 1, 2))
+
+ def get_scheduler_config(self):
+ raise NotImplementedError
+
+ def dummy_model(self):
+ def model(sample, t, *args):
+ return sample * t / (t + 1)
+
+ return model
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ sample, key = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler, new_state = scheduler_class.from_pretrained(tmpdirname)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ new_state = new_scheduler.set_timesteps(new_state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output = scheduler.step(state, residual, time_step, sample, key, **kwargs).prev_sample
+ new_output = new_scheduler.step(new_state, residual, time_step, sample, key, **kwargs).prev_sample
+
+ assert jnp.sum(jnp.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ kwargs.update(forward_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ sample, key = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler, new_state = scheduler_class.from_pretrained(tmpdirname)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ new_state = new_scheduler.set_timesteps(new_state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output = scheduler.step(state, residual, time_step, sample, key, **kwargs).prev_sample
+ new_output = new_scheduler.step(new_state, residual, time_step, sample, key, **kwargs).prev_sample
+
+ assert jnp.sum(jnp.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ sample, key = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler, new_state = scheduler_class.from_pretrained(tmpdirname)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ new_state = new_scheduler.set_timesteps(new_state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output = scheduler.step(state, residual, 1, sample, key, **kwargs).prev_sample
+ new_output = new_scheduler.step(new_state, residual, 1, sample, key, **kwargs).prev_sample
+
+ assert jnp.sum(jnp.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ sample, key = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output_0 = scheduler.step(state, residual, 0, sample, key, **kwargs).prev_sample
+ output_1 = scheduler.step(state, residual, 1, sample, key, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_scheduler_outputs_equivalence(self):
+ def set_nan_tensor_to_zero(t):
+ return t.at[t != t].set(0)
+
+ def recursive_check(tuple_object, dict_object):
+ if isinstance(tuple_object, (List, Tuple)):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif isinstance(tuple_object, Dict):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object.values(), dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif tuple_object is None:
+ return
+ else:
+ self.assertTrue(
+ jnp.allclose(set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5),
+ msg=(
+ "Tuple and dict output are not equal. Difference:"
+ f" {jnp.max(jnp.abs(tuple_object - dict_object))}. Tuple has `nan`:"
+ f" {jnp.isnan(tuple_object).any()} and `inf`: {jnp.isinf(tuple_object)}. Dict has"
+ f" `nan`: {jnp.isnan(dict_object).any()} and `inf`: {jnp.isinf(dict_object)}."
+ ),
+ )
+
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ sample, key = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ outputs_dict = scheduler.step(state, residual, 0, sample, key, **kwargs)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ outputs_tuple = scheduler.step(state, residual, 0, sample, key, return_dict=False, **kwargs)
+
+ recursive_check(outputs_tuple[0], outputs_dict.prev_sample)
+
+ def test_deprecated_kwargs(self):
+ for scheduler_class in self.scheduler_classes:
+ has_kwarg_in_model_class = "kwargs" in inspect.signature(scheduler_class.__init__).parameters
+ has_deprecated_kwarg = len(scheduler_class._deprecated_kwargs) > 0
+
+ if has_kwarg_in_model_class and not has_deprecated_kwarg:
+ raise ValueError(
+ f"{scheduler_class} has `**kwargs` in its __init__ method but has not defined any deprecated"
+ " kwargs under the `_deprecated_kwargs` class attribute. Make sure to either remove `**kwargs` if"
+ " there are no deprecated arguments or add the deprecated argument with `_deprecated_kwargs ="
+ " []`"
+ )
+
+ if not has_kwarg_in_model_class and has_deprecated_kwarg:
+ raise ValueError(
+ f"{scheduler_class} doesn't have `**kwargs` in its __init__ method but has defined deprecated"
+ " kwargs under the `_deprecated_kwargs` class attribute. Make sure to either add the `**kwargs`"
+ f" argument to {self.model_class}.__init__ if there are deprecated arguments or remove the"
+ " deprecated argument from `_deprecated_kwargs = []`"
+ )
+
+
+@require_flax
+class FlaxDDPMSchedulerTest(FlaxSchedulerCommonTest):
+ scheduler_classes = (FlaxDDPMScheduler,)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "variance_type": "fixed_small",
+ "clip_sample": True,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [1, 5, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_variance_type(self):
+ for variance in ["fixed_small", "fixed_large", "other"]:
+ self.check_over_configs(variance_type=variance)
+
+ def test_clip_sample(self):
+ for clip_sample in [True, False]:
+ self.check_over_configs(clip_sample=clip_sample)
+
+ def test_time_indices(self):
+ for t in [0, 500, 999]:
+ self.check_over_forward(time_step=t)
+
+ def test_variance(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ assert jnp.sum(jnp.abs(scheduler._get_variance(state, 0) - 0.0)) < 1e-5
+ assert jnp.sum(jnp.abs(scheduler._get_variance(state, 487) - 0.00979)) < 1e-5
+ assert jnp.sum(jnp.abs(scheduler._get_variance(state, 999) - 0.02)) < 1e-5
+
+ def test_full_loop_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ num_trained_timesteps = len(scheduler)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ key1, key2 = random.split(random.PRNGKey(0))
+
+ for t in reversed(range(num_trained_timesteps)):
+ # 1. predict noise residual
+ residual = model(sample, t)
+
+ # 2. predict previous mean of sample x_t-1
+ output = scheduler.step(state, residual, t, sample, key1)
+ pred_prev_sample = output.prev_sample
+ state = output.state
+ key1, key2 = random.split(key2)
+
+ # if t > 0:
+ # noise = self.dummy_sample_deter
+ # variance = scheduler.get_variance(t) ** (0.5) * noise
+ #
+ # sample = pred_prev_sample + variance
+ sample = pred_prev_sample
+
+ result_sum = jnp.sum(jnp.abs(sample))
+ result_mean = jnp.mean(jnp.abs(sample))
+
+ if jax_device == "tpu":
+ assert abs(result_sum - 255.0714) < 1e-2
+ assert abs(result_mean - 0.332124) < 1e-3
+ else:
+ assert abs(result_sum - 255.1113) < 1e-2
+ assert abs(result_mean - 0.332176) < 1e-3
+
+
+@require_flax
+class FlaxDDIMSchedulerTest(FlaxSchedulerCommonTest):
+ scheduler_classes = (FlaxDDIMScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 50),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def full_loop(self, **config):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+ key1, key2 = random.split(random.PRNGKey(0))
+
+ num_inference_steps = 10
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+
+ state = scheduler.set_timesteps(state, num_inference_steps)
+
+ for t in state.timesteps:
+ residual = model(sample, t)
+ output = scheduler.step(state, residual, t, sample)
+ sample = output.prev_sample
+ state = output.state
+ key1, key2 = random.split(key2)
+
+ return sample
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler, new_state = scheduler_class.from_pretrained(tmpdirname)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ new_state = new_scheduler.set_timesteps(new_state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output = scheduler.step(state, residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(new_state, residual, time_step, sample, **kwargs).prev_sample
+
+ assert jnp.sum(jnp.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler, new_state = scheduler_class.from_pretrained(tmpdirname)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ new_state = new_scheduler.set_timesteps(new_state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output = scheduler.step(state, residual, 1, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(new_state, residual, 1, sample, **kwargs).prev_sample
+
+ assert jnp.sum(jnp.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ kwargs.update(forward_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler, new_state = scheduler_class.from_pretrained(tmpdirname)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ new_state = new_scheduler.set_timesteps(new_state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output = scheduler.step(state, residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(new_state, residual, time_step, sample, **kwargs).prev_sample
+
+ assert jnp.sum(jnp.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_scheduler_outputs_equivalence(self):
+ def set_nan_tensor_to_zero(t):
+ return t.at[t != t].set(0)
+
+ def recursive_check(tuple_object, dict_object):
+ if isinstance(tuple_object, (List, Tuple)):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif isinstance(tuple_object, Dict):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object.values(), dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif tuple_object is None:
+ return
+ else:
+ self.assertTrue(
+ jnp.allclose(set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5),
+ msg=(
+ "Tuple and dict output are not equal. Difference:"
+ f" {jnp.max(jnp.abs(tuple_object - dict_object))}. Tuple has `nan`:"
+ f" {jnp.isnan(tuple_object).any()} and `inf`: {jnp.isinf(tuple_object)}. Dict has"
+ f" `nan`: {jnp.isnan(dict_object).any()} and `inf`: {jnp.isinf(dict_object)}."
+ ),
+ )
+
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ outputs_dict = scheduler.step(state, residual, 0, sample, **kwargs)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ outputs_tuple = scheduler.step(state, residual, 0, sample, return_dict=False, **kwargs)
+
+ recursive_check(outputs_tuple[0], outputs_dict.prev_sample)
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output_0 = scheduler.step(state, residual, 0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(state, residual, 1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_timesteps(self):
+ for timesteps in [100, 500, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_steps_offset(self):
+ for steps_offset in [0, 1]:
+ self.check_over_configs(steps_offset=steps_offset)
+
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(steps_offset=1)
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+ state = scheduler.set_timesteps(state, 5)
+ assert jnp.equal(state.timesteps, jnp.array([801, 601, 401, 201, 1])).all()
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_time_indices(self):
+ for t in [1, 10, 49]:
+ self.check_over_forward(time_step=t)
+
+ def test_inference_steps(self):
+ for t, num_inference_steps in zip([1, 10, 50], [10, 50, 500]):
+ self.check_over_forward(time_step=t, num_inference_steps=num_inference_steps)
+
+ def test_variance(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ assert jnp.sum(jnp.abs(scheduler._get_variance(state, 0, 0) - 0.0)) < 1e-5
+ assert jnp.sum(jnp.abs(scheduler._get_variance(state, 420, 400) - 0.14771)) < 1e-5
+ assert jnp.sum(jnp.abs(scheduler._get_variance(state, 980, 960) - 0.32460)) < 1e-5
+ assert jnp.sum(jnp.abs(scheduler._get_variance(state, 0, 0) - 0.0)) < 1e-5
+ assert jnp.sum(jnp.abs(scheduler._get_variance(state, 487, 486) - 0.00979)) < 1e-5
+ assert jnp.sum(jnp.abs(scheduler._get_variance(state, 999, 998) - 0.02)) < 1e-5
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+
+ result_sum = jnp.sum(jnp.abs(sample))
+ result_mean = jnp.mean(jnp.abs(sample))
+
+ assert abs(result_sum - 172.0067) < 1e-2
+ assert abs(result_mean - 0.223967) < 1e-3
+
+ def test_full_loop_with_set_alpha_to_one(self):
+ # We specify different beta, so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=True, beta_start=0.01)
+ result_sum = jnp.sum(jnp.abs(sample))
+ result_mean = jnp.mean(jnp.abs(sample))
+
+ if jax_device == "tpu":
+ assert abs(result_sum - 149.8409) < 1e-2
+ assert abs(result_mean - 0.1951) < 1e-3
+ else:
+ assert abs(result_sum - 149.8295) < 1e-2
+ assert abs(result_mean - 0.1951) < 1e-3
+
+ def test_full_loop_with_no_set_alpha_to_one(self):
+ # We specify different beta, so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=False, beta_start=0.01)
+ result_sum = jnp.sum(jnp.abs(sample))
+ result_mean = jnp.mean(jnp.abs(sample))
+
+ if jax_device == "tpu":
+ pass
+ # FIXME: both result_sum and result_mean are nan on TPU
+ # assert jnp.isnan(result_sum)
+ # assert jnp.isnan(result_mean)
+ else:
+ assert abs(result_sum - 149.0784) < 1e-2
+ assert abs(result_mean - 0.1941) < 1e-3
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "sample", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+
+@require_flax
+class FlaxPNDMSchedulerTest(FlaxSchedulerCommonTest):
+ scheduler_classes = (FlaxPNDMScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 50),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = jnp.array([residual + 0.2, residual + 0.15, residual + 0.1, residual + 0.05])
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+ state = scheduler.set_timesteps(state, num_inference_steps, shape=sample.shape)
+ # copy over dummy past residuals
+ state = state.replace(ets=dummy_past_residuals[:])
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler, new_state = scheduler_class.from_pretrained(tmpdirname)
+ new_state = new_scheduler.set_timesteps(new_state, num_inference_steps, shape=sample.shape)
+ # copy over dummy past residuals
+ new_state = new_state.replace(ets=dummy_past_residuals[:])
+
+ (prev_sample, state) = scheduler.step_prk(state, residual, time_step, sample, **kwargs)
+ (new_prev_sample, new_state) = new_scheduler.step_prk(new_state, residual, time_step, sample, **kwargs)
+
+ assert jnp.sum(jnp.abs(prev_sample - new_prev_sample)) < 1e-5, "Scheduler outputs are not identical"
+
+ output, _ = scheduler.step_plms(state, residual, time_step, sample, **kwargs)
+ new_output, _ = new_scheduler.step_plms(new_state, residual, time_step, sample, **kwargs)
+
+ assert jnp.sum(jnp.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ pass
+
+ def test_scheduler_outputs_equivalence(self):
+ def set_nan_tensor_to_zero(t):
+ return t.at[t != t].set(0)
+
+ def recursive_check(tuple_object, dict_object):
+ if isinstance(tuple_object, (List, Tuple)):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif isinstance(tuple_object, Dict):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object.values(), dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif tuple_object is None:
+ return
+ else:
+ self.assertTrue(
+ jnp.allclose(set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5),
+ msg=(
+ "Tuple and dict output are not equal. Difference:"
+ f" {jnp.max(jnp.abs(tuple_object - dict_object))}. Tuple has `nan`:"
+ f" {jnp.isnan(tuple_object).any()} and `inf`: {jnp.isinf(tuple_object)}. Dict has"
+ f" `nan`: {jnp.isnan(dict_object).any()} and `inf`: {jnp.isinf(dict_object)}."
+ ),
+ )
+
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps, shape=sample.shape)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ outputs_dict = scheduler.step(state, residual, 0, sample, **kwargs)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps, shape=sample.shape)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ outputs_tuple = scheduler.step(state, residual, 0, sample, return_dict=False, **kwargs)
+
+ recursive_check(outputs_tuple[0], outputs_dict.prev_sample)
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = jnp.array([residual + 0.2, residual + 0.15, residual + 0.1, residual + 0.05])
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+ state = scheduler.set_timesteps(state, num_inference_steps, shape=sample.shape)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ state = state.replace(ets=dummy_past_residuals[:])
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler, new_state = scheduler_class.from_pretrained(tmpdirname)
+ # copy over dummy past residuals
+ new_state = new_scheduler.set_timesteps(new_state, num_inference_steps, shape=sample.shape)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ new_state = new_state.replace(ets=dummy_past_residuals[:])
+
+ output, state = scheduler.step_prk(state, residual, time_step, sample, **kwargs)
+ new_output, new_state = new_scheduler.step_prk(new_state, residual, time_step, sample, **kwargs)
+
+ assert jnp.sum(jnp.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ output, _ = scheduler.step_plms(state, residual, time_step, sample, **kwargs)
+ new_output, _ = new_scheduler.step_plms(new_state, residual, time_step, sample, **kwargs)
+
+ assert jnp.sum(jnp.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def full_loop(self, **config):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ state = scheduler.set_timesteps(state, num_inference_steps, shape=sample.shape)
+
+ for i, t in enumerate(state.prk_timesteps):
+ residual = model(sample, t)
+ sample, state = scheduler.step_prk(state, residual, t, sample)
+
+ for i, t in enumerate(state.plms_timesteps):
+ residual = model(sample, t)
+ sample, state = scheduler.step_plms(state, residual, t, sample)
+
+ return sample
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ state = scheduler.set_timesteps(state, num_inference_steps, shape=sample.shape)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = jnp.array([residual + 0.2, residual + 0.15, residual + 0.1, residual + 0.05])
+ state = state.replace(ets=dummy_past_residuals[:])
+
+ output_0, state = scheduler.step_prk(state, residual, 0, sample, **kwargs)
+ output_1, state = scheduler.step_prk(state, residual, 1, sample, **kwargs)
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ output_0, state = scheduler.step_plms(state, residual, 0, sample, **kwargs)
+ output_1, state = scheduler.step_plms(state, residual, 1, sample, **kwargs)
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_timesteps(self):
+ for timesteps in [100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_steps_offset(self):
+ for steps_offset in [0, 1]:
+ self.check_over_configs(steps_offset=steps_offset)
+
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(steps_offset=1)
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+ state = scheduler.set_timesteps(state, 10, shape=())
+ assert jnp.equal(
+ state.timesteps,
+ jnp.array([901, 851, 851, 801, 801, 751, 751, 701, 701, 651, 651, 601, 601, 501, 401, 301, 201, 101, 1]),
+ ).all()
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001], [0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_time_indices(self):
+ for t in [1, 5, 10]:
+ self.check_over_forward(time_step=t)
+
+ def test_inference_steps(self):
+ for t, num_inference_steps in zip([1, 5, 10], [10, 50, 100]):
+ self.check_over_forward(num_inference_steps=num_inference_steps)
+
+ def test_pow_of_3_inference_steps(self):
+ # an earlier version of set_timesteps() raised an error indexing alphas when the number of inference steps was a power of 3
+ num_inference_steps = 27
+
+ for scheduler_class in self.scheduler_classes:
+ sample, _ = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ state = scheduler.set_timesteps(state, num_inference_steps, shape=sample.shape)
+
+ # before the power-of-3 fix this errored on the first step, so two steps are enough here
+ for i, t in enumerate(state.prk_timesteps[:2]):
+ sample, state = scheduler.step_prk(state, residual, t, sample)
+
+ def test_inference_plms_no_past_residuals(self):
+ with self.assertRaises(ValueError):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ state = scheduler.create_state()
+
+ scheduler.step_plms(state, self.dummy_sample, 1, self.dummy_sample).prev_sample
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_sum = jnp.sum(jnp.abs(sample))
+ result_mean = jnp.mean(jnp.abs(sample))
+
+ if jax_device == "tpu":
+ assert abs(result_sum - 198.1275) < 1e-2
+ assert abs(result_mean - 0.2580) < 1e-3
+ else:
+ assert abs(result_sum - 198.1318) < 1e-2
+ assert abs(result_mean - 0.2580) < 1e-3
+
+ def test_full_loop_with_set_alpha_to_one(self):
+ # We specify different beta, so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=True, beta_start=0.01)
+ result_sum = jnp.sum(jnp.abs(sample))
+ result_mean = jnp.mean(jnp.abs(sample))
+
+ if jax_device == "tpu":
+ assert abs(result_sum - 186.83226) < 1e-2
+ assert abs(result_mean - 0.24327) < 1e-3
+ else:
+ assert abs(result_sum - 186.9466) < 1e-2
+ assert abs(result_mean - 0.24342) < 1e-3
+
+ def test_full_loop_with_no_set_alpha_to_one(self):
+ # We specify different beta, so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=False, beta_start=0.01)
+ result_sum = jnp.sum(jnp.abs(sample))
+ result_mean = jnp.mean(jnp.abs(sample))
+
+ if jax_device == "tpu":
+ assert abs(result_sum - 186.83226) < 1e-2
+ assert abs(result_mean - 0.24327) < 1e-3
+ else:
+ assert abs(result_sum - 186.9482) < 1e-2
+ assert abs(result_mean - 0.2434) < 1e-3
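
One thing worth noting about the Flax tests above: unlike their PyTorch counterparts, the Flax schedulers are functional. `create_state()` returns an immutable state object, and both `set_timesteps` and `step` hand back a new state instead of mutating the scheduler, which is why the tests keep threading `state` through every call. A condensed sketch of that pattern, reusing the same dummy model the tests define (the shapes and the 10-step schedule are illustrative):

```python
import jax.numpy as jnp
from jax import random

from diffusers import FlaxDDPMScheduler

scheduler = FlaxDDPMScheduler(num_train_timesteps=10, beta_schedule="linear")
state = scheduler.create_state()  # immutable; updated copies are returned, never mutated in place


def model(sample, t):
    # Stand-in for a real Flax UNet apply call.
    return sample * t / (t + 1)


key = random.PRNGKey(0)
sample = jnp.ones((1, 3, 8, 8))

for t in reversed(range(len(scheduler))):
    residual = model(sample, t)
    key, step_key = random.split(key)
    output = scheduler.step(state, residual, t, sample, step_key)
    # step returns both the denoised sample and the next scheduler state.
    sample, state = output.prev_sample, output.state
```
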
diff --git a/tests/schedulers/test_scheduler_heun.py b/tests/schedulers/test_scheduler_heun.py
new file mode 100644
index 0000000..df2b62d
--- /dev/null
+++ b/tests/schedulers/test_scheduler_heun.py
@@ -0,0 +1,191 @@
+import torch
+
+from diffusers import HeunDiscreteScheduler
+from diffusers.utils.testing_utils import torch_device
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class HeunDiscreteSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (HeunDiscreteScheduler,)
+ num_inference_steps = 10
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1100,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.00001, 0.0001, 0.001], [0.0002, 0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear", "exp"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_clip_sample(self):
+ for clip_sample_range in [1.0, 2.0, 3.0]:
+ self.check_over_configs(clip_sample_range=clip_sample_range, clip_sample=True)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction", "sample"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_full_loop_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["cpu", "mps"]:
+ assert abs(result_sum.item() - 0.1233) < 1e-2
+ assert abs(result_mean.item() - 0.0002) < 1e-3
+ else:
+ # CUDA
+ assert abs(result_sum.item() - 0.1233) < 1e-2
+ assert abs(result_mean.item() - 0.0002) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["cpu", "mps"]:
+ assert abs(result_sum.item() - 4.6934e-07) < 1e-2
+ assert abs(result_mean.item() - 6.1112e-10) < 1e-3
+ else:
+ # CUDA
+ assert abs(result_sum.item() - 4.693428650170972e-07) < 1e-2
+ assert abs(result_mean.item() - 0.0002) < 1e-3
+
+ def test_full_loop_device(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.to(torch_device) * scheduler.init_noise_sigma
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if str(torch_device).startswith("cpu"):
+ assert abs(result_sum.item() - 0.1233) < 1e-2
+ assert abs(result_mean.item() - 0.0002) < 1e-3
+ elif str(torch_device).startswith("mps"):
+ # The sum varies between 148 and 156 on mps, so only the mean is checked, with a larger tolerance
+ assert abs(result_mean.item() - 0.0002) < 1e-2
+ else:
+ # CUDA
+ assert abs(result_sum.item() - 0.1233) < 1e-2
+ assert abs(result_mean.item() - 0.0002) < 1e-3
+
+ def test_full_loop_device_karras_sigmas(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config, use_karras_sigmas=True)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.to(torch_device) * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 0.00015) < 1e-2
+ assert abs(result_mean.item() - 1.9869554535034695e-07) < 1e-2
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ t_start = self.num_inference_steps - 2
+ noise = self.dummy_noise_deter
+ noise = noise.to(torch_device)
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 75074.8906) < 1e-2, f"expected result sum 75074.8906, but got {result_sum}"
+ assert abs(result_mean.item() - 97.7538) < 1e-3, f"expected result mean 97.7538, but got {result_mean}"
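
The `test_full_loop_with_noise` cases here and in the Euler files share one recipe: pick an intermediate step, noise a clean sample with `add_noise` at that timestep, and then denoise only the remaining `timesteps[t_start * scheduler.order :]` slice, which is essentially the image-to-image strength mechanic. A minimal sketch of just the noising part (the zero sample and the 10-step schedule are placeholders):

```python
import torch

from diffusers import HeunDiscreteScheduler

scheduler = HeunDiscreteScheduler(beta_start=0.0001, beta_end=0.02, beta_schedule="linear")
scheduler.set_timesteps(10)

clean_sample = torch.zeros((1, 3, 8, 8))
noise = torch.randn((1, 3, 8, 8))

# Heun is a second-order scheduler (scheduler.order == 2), so each inference step
# spans two entries in scheduler.timesteps.
t_start = 8
timesteps = scheduler.timesteps[t_start * scheduler.order :]

# Noise the clean sample up to the first remaining timestep, then denoise over
# `timesteps` exactly as in the loops above.
noisy_sample = scheduler.add_noise(clean_sample, noise, timesteps[:1])
```
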
diff --git a/tests/schedulers/test_scheduler_ipndm.py b/tests/schedulers/test_scheduler_ipndm.py
new file mode 100644
index 0000000..87c8da3
--- /dev/null
+++ b/tests/schedulers/test_scheduler_ipndm.py
@@ -0,0 +1,163 @@
+import tempfile
+
+import torch
+
+from diffusers import IPNDMScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class IPNDMSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (IPNDMScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 50),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {"num_train_timesteps": 1000}
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.1, residual + 0.05]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ scheduler.ets = dummy_past_residuals[:]
+
+ if time_step is None:
+ time_step = scheduler.timesteps[len(scheduler.timesteps) // 2]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ new_scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ new_scheduler.ets = dummy_past_residuals[:]
+
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ pass
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.1, residual + 0.05]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ scheduler.ets = dummy_past_residuals[:]
+
+ if time_step is None:
+ time_step = scheduler.timesteps[len(scheduler.timesteps) // 2]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ # copy over dummy past residuals
+ new_scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ new_scheduler.ets = dummy_past_residuals[:]
+
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def full_loop(self, **config):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
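+ # reset the internal step index so the same timesteps can be run through a second time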
+ scheduler._step_index = None
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ return sample
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.1, residual + 0.05]
+ scheduler.ets = dummy_past_residuals[:]
+
+ time_step_0 = scheduler.timesteps[5]
+ time_step_1 = scheduler.timesteps[6]
+
+ output_0 = scheduler.step(residual, time_step_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, time_step_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ output_0 = scheduler.step(residual, time_step_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, time_step_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_timesteps(self):
+ for timesteps in [100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps, time_step=None)
+
+ def test_inference_steps(self):
+ for t, num_inference_steps in zip([1, 5, 10], [10, 50, 100]):
+ self.check_over_forward(num_inference_steps=num_inference_steps, time_step=None)
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 2540529) < 10
diff --git a/tests/schedulers/test_scheduler_kdpm2_ancestral.py b/tests/schedulers/test_scheduler_kdpm2_ancestral.py
new file mode 100644
index 0000000..f6e8e96
--- /dev/null
+++ b/tests/schedulers/test_scheduler_kdpm2_ancestral.py
@@ -0,0 +1,158 @@
+import torch
+
+from diffusers import KDPM2AncestralDiscreteScheduler
+from diffusers.utils.testing_utils import torch_device
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class KDPM2AncestralDiscreteSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (KDPM2AncestralDiscreteScheduler,)
+ num_inference_steps = 10
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1100,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.00001, 0.0001, 0.001], [0.0002, 0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_full_loop_no_noise(self):
+ if torch_device == "mps":
+ return
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
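+ # the ancestral scheduler draws noise via `generator` inside step(), so a fixed seed keeps the reference sums reproducible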
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 13849.3877) < 1e-2
+ assert abs(result_mean.item() - 18.0331) < 5e-3
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_full_loop_with_v_prediction(self):
+ if torch_device == "mps":
+ return
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ generator = torch.manual_seed(0)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 328.9970) < 1e-2
+ assert abs(result_mean.item() - 0.4284) < 1e-3
+
+ def test_full_loop_device(self):
+ if torch_device == "mps":
+ return
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.to(torch_device) * scheduler.init_noise_sigma
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 13849.3818) < 1e-1
+ assert abs(result_mean.item() - 18.0331) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ if torch_device == "mps":
+ return
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ generator = torch.manual_seed(0)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ # add noise
+ t_start = self.num_inference_steps - 2
+ noise = self.dummy_noise_deter
+ noise = noise.to(sample.device)
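+ # the scheduler performs `scheduler.order` internal steps per inference step, so the remaining schedule is sliced by t_start * order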
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 93087.0312) < 1e-2, f"expected result sum 93087.0312, but got {result_sum}"
+ assert abs(result_mean.item() - 121.2071) < 5e-3, f"expected result mean 121.2071, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_kdpm2_discrete.py b/tests/schedulers/test_scheduler_kdpm2_discrete.py
new file mode 100644
index 0000000..a992edc
--- /dev/null
+++ b/tests/schedulers/test_scheduler_kdpm2_discrete.py
@@ -0,0 +1,166 @@
+import torch
+
+from diffusers import KDPM2DiscreteScheduler
+from diffusers.utils.testing_utils import torch_device
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class KDPM2DiscreteSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (KDPM2DiscreteScheduler,)
+ num_inference_steps = 10
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1100,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.00001, 0.0001, 0.001], [0.0002, 0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_full_loop_with_v_prediction(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["cpu", "mps"]:
+ assert abs(result_sum.item() - 4.6934e-07) < 1e-2
+ assert abs(result_mean.item() - 6.1112e-10) < 1e-3
+ else:
+ # CUDA
+ assert abs(result_sum.item() - 4.693428650170972e-07) < 1e-2
+ assert abs(result_mean.item() - 0.0002) < 1e-3
+
+ def test_full_loop_no_noise(self):
+ if torch_device == "mps":
+ return
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["cpu", "mps"]:
+ assert abs(result_sum.item() - 20.4125) < 1e-2
+ assert abs(result_mean.item() - 0.0266) < 1e-3
+ else:
+ # CUDA
+ assert abs(result_sum.item() - 20.4125) < 1e-2
+ assert abs(result_mean.item() - 0.0266) < 1e-3
+
+ def test_full_loop_device(self):
+ if torch_device == "mps":
+ return
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.to(torch_device) * scheduler.init_noise_sigma
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if str(torch_device).startswith("cpu"):
+ # on mps this sum varied between 148 and 156, which is why mps is skipped above
+ assert abs(result_sum.item() - 20.4125) < 1e-2
+ assert abs(result_mean.item() - 0.0266) < 1e-3
+ else:
+ # CUDA
+ assert abs(result_sum.item() - 20.4125) < 1e-2
+ assert abs(result_mean.item() - 0.0266) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ if torch_device == "mps":
+ return
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ # add noise
+ t_start = self.num_inference_steps - 2
+ noise = self.dummy_noise_deter
+ noise = noise.to(sample.device)
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 70408.4062) < 1e-2, f"expected result sum 70408.4062, but got {result_sum}"
+ assert abs(result_mean.item() - 91.6776) < 1e-3, f"expected result mean 91.6776, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_lcm.py b/tests/schedulers/test_scheduler_lcm.py
new file mode 100644
index 0000000..c2c6530
--- /dev/null
+++ b/tests/schedulers/test_scheduler_lcm.py
@@ -0,0 +1,300 @@
+import tempfile
+from typing import Dict, List, Tuple
+
+import torch
+
+from diffusers import LCMScheduler
+from diffusers.utils.testing_utils import torch_device
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class LCMSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (LCMScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 10),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.00085,
+ "beta_end": 0.0120,
+ "beta_schedule": "scaled_linear",
+ "prediction_type": "epsilon",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ @property
+ def default_valid_timestep(self):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = self.scheduler_classes[0](**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+ timestep = scheduler.timesteps[-1]
+ return timestep
+
+ def test_timesteps(self):
+ for timesteps in [100, 500, 1000]:
+ # 0 is not guaranteed to be in the timestep schedule, but timesteps - 1 is
+ self.check_over_configs(time_step=timesteps - 1, num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]):
+ self.check_over_configs(time_step=self.default_valid_timestep, beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(time_step=self.default_valid_timestep, beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(time_step=self.default_valid_timestep, prediction_type=prediction_type)
+
+ def test_clip_sample(self):
+ for clip_sample in [True, False]:
+ self.check_over_configs(time_step=self.default_valid_timestep, clip_sample=clip_sample)
+
+ def test_thresholding(self):
+ self.check_over_configs(time_step=self.default_valid_timestep, thresholding=False)
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(
+ time_step=self.default_valid_timestep,
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ )
+
+ def test_time_indices(self):
+ # Get default timestep schedule.
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = self.scheduler_classes[0](**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+ timesteps = scheduler.timesteps
+ for t in timesteps:
+ self.check_over_forward(time_step=t)
+
+ def test_inference_steps(self):
+ # Hardcoded for now
+ for t, num_inference_steps in zip([99, 39, 39, 19], [10, 25, 26, 50]):
+ self.check_over_forward(time_step=t, num_inference_steps=num_inference_steps)
+
+ # Override test_add_noise_device because the hardcoded num_inference_steps of 100 doesn't work
+ # for LCMScheduler under default settings
+ def test_add_noise_device(self, num_inference_steps=10):
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+
+ sample = self.dummy_sample.to(torch_device)
+ scaled_sample = scheduler.scale_model_input(sample, 0.0)
+ self.assertEqual(sample.shape, scaled_sample.shape)
+
+ noise = torch.randn_like(scaled_sample).to(torch_device)
+ t = scheduler.timesteps[5][None]
+ noised = scheduler.add_noise(scaled_sample, noise, t)
+ self.assertEqual(noised.shape, scaled_sample.shape)
+
+ # Override test_from_save_pretrained because it hardcodes a timestep of 1
+ def test_from_save_pretrained(self):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ timestep = self.default_valid_timestep
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+
+ scheduler.set_timesteps(num_inference_steps)
+ new_scheduler.set_timesteps(num_inference_steps)
+
+ kwargs["generator"] = torch.manual_seed(0)
+ output = scheduler.step(residual, timestep, sample, **kwargs).prev_sample
+
+ kwargs["generator"] = torch.manual_seed(0)
+ new_output = new_scheduler.step(residual, timestep, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ # Override test_step_shape because the common test uses 0 and 1 as hardcoded timesteps
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ timestep_0 = scheduler.timesteps[-2]
+ timestep_1 = scheduler.timesteps[-1]
+
+ output_0 = scheduler.step(residual, timestep_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, timestep_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ # Override test_scheduler_outputs_equivalence since the common test uses 0 as a hardcoded timestep
+ def test_scheduler_outputs_equivalence(self):
+ def set_nan_tensor_to_zero(t):
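+ # NaN is the only value not equal to itself, so this zeroes out NaNs before the allclose comparison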
+ t[t != t] = 0
+ return t
+
+ def recursive_check(tuple_object, dict_object):
+ if isinstance(tuple_object, (List, Tuple)):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif isinstance(tuple_object, Dict):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object.values(), dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif tuple_object is None:
+ return
+ else:
+ self.assertTrue(
+ torch.allclose(
+ set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5
+ ),
+ msg=(
+ "Tuple and dict output are not equal. Difference:"
+ f" {torch.max(torch.abs(tuple_object - dict_object))}. Tuple has `nan`:"
+ f" {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object).any()}. Dict has"
+ f" `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object)}."
+ ),
+ )
+
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", 50)
+
+ timestep = self.default_valid_timestep
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler.set_timesteps(num_inference_steps)
+ kwargs["generator"] = torch.manual_seed(0)
+ outputs_dict = scheduler.step(residual, timestep, sample, **kwargs)
+
+ scheduler.set_timesteps(num_inference_steps)
+ kwargs["generator"] = torch.manual_seed(0)
+ outputs_tuple = scheduler.step(residual, timestep, sample, return_dict=False, **kwargs)
+
+ recursive_check(outputs_tuple, outputs_dict)
+
+ def full_loop(self, num_inference_steps=10, seed=0, **config):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(seed)
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ for t in scheduler.timesteps:
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample, generator).prev_sample
+
+ return sample
+
+ def test_full_loop_onestep(self):
+ sample = self.full_loop(num_inference_steps=1)
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ # TODO: get expected sum and mean
+ assert abs(result_sum.item() - 18.7097) < 1e-3
+ assert abs(result_mean.item() - 0.0244) < 1e-3
+
+ def test_full_loop_multistep(self):
+ sample = self.full_loop(num_inference_steps=10)
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ # TODO: get expected sum and mean
+ assert abs(result_sum.item() - 197.7616) < 1e-3
+ assert abs(result_mean.item() - 0.2575) < 1e-3
+
+ def test_custom_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 1, 0]
+
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ scheduler_timesteps = scheduler.timesteps
+
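+ # previous_timestep should return the next entry in the custom schedule, and -1 after the final entry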
+ for i, timestep in enumerate(scheduler_timesteps):
+ if i == len(timesteps) - 1:
+ expected_prev_t = -1
+ else:
+ expected_prev_t = timesteps[i + 1]
+
+ prev_t = scheduler.previous_timestep(timestep)
+ prev_t = prev_t.item()
+
+ self.assertEqual(prev_t, expected_prev_t)
+
+ def test_custom_timesteps_increasing_order(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 51, 0]
+
+ with self.assertRaises(ValueError, msg="`custom_timesteps` must be in descending order."):
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ def test_custom_timesteps_passing_both_num_inference_steps_and_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 1, 0]
+ num_inference_steps = len(timesteps)
+
+ with self.assertRaises(ValueError, msg="Can only pass one of `num_inference_steps` or `custom_timesteps`."):
+ scheduler.set_timesteps(num_inference_steps=num_inference_steps, timesteps=timesteps)
+
+ def test_custom_timesteps_too_large(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [scheduler.config.num_train_timesteps]
+
+ with self.assertRaises(
+ ValueError,
+ msg=f"`timesteps` must start before `self.config.train_timesteps`: {scheduler.config.num_train_timesteps}",
+ ):
+ scheduler.set_timesteps(timesteps=timesteps)
diff --git a/tests/schedulers/test_scheduler_lms.py b/tests/schedulers/test_scheduler_lms.py
new file mode 100644
index 0000000..5c163ce
--- /dev/null
+++ b/tests/schedulers/test_scheduler_lms.py
@@ -0,0 +1,170 @@
+import torch
+
+from diffusers import LMSDiscreteScheduler
+from diffusers.utils.testing_utils import torch_device
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class LMSDiscreteSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (LMSDiscreteScheduler,)
+ num_inference_steps = 10
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1100,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.00001, 0.0001, 0.001], [0.0002, 0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_time_indices(self):
+ for t in [0, 500, 800]:
+ self.check_over_forward(time_step=t)
+
+ def test_full_loop_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 1006.388) < 1e-2
+ assert abs(result_mean.item() - 1.31) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 0.0017) < 1e-2
+ assert abs(result_mean.item() - 2.2676e-06) < 1e-3
+
+ def test_full_loop_device(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma.cpu()
+ sample = sample.to(torch_device)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 1006.388) < 1e-2
+ assert abs(result_mean.item() - 1.31) < 1e-3
+
+ def test_full_loop_device_karras_sigmas(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
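+ # use_karras_sigmas re-spaces the noise levels, so the expected sums differ from the default-schedule tests above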
+ scheduler = scheduler_class(**scheduler_config, use_karras_sigmas=True)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.to(torch_device) * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 3812.9927) < 2e-2
+ assert abs(result_mean.item() - 4.9648) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+
+ # add noise
+ t_start = self.num_inference_steps - 2
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 27663.6895) < 1e-2
+ assert abs(result_mean.item() - 36.0204) < 1e-3
diff --git a/tests/schedulers/test_scheduler_pndm.py b/tests/schedulers/test_scheduler_pndm.py
new file mode 100644
index 0000000..c1519f7
--- /dev/null
+++ b/tests/schedulers/test_scheduler_pndm.py
@@ -0,0 +1,242 @@
+import tempfile
+
+import torch
+
+from diffusers import PNDMScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class PNDMSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (PNDMScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 50),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.1, residual + 0.05]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ scheduler.ets = dummy_past_residuals[:]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ new_scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ new_scheduler.ets = dummy_past_residuals[:]
+
+ output = scheduler.step_prk(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step_prk(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ output = scheduler.step_plms(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step_plms(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ pass
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.1, residual + 0.05]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ scheduler.ets = dummy_past_residuals[:]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ # restore the timestep schedule on the reloaded scheduler
+ new_scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be done after setting timesteps)
+ new_scheduler.ets = dummy_past_residuals[:]
+
+ output = scheduler.step_prk(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step_prk(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ output = scheduler.step_plms(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step_plms(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def full_loop(self, **config):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
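+ # PNDM first runs the Runge-Kutta warmup steps (prk_timesteps), then the linear multistep updates (plms_timesteps)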
+ for i, t in enumerate(scheduler.prk_timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step_prk(residual, t, sample).prev_sample
+
+ for i, t in enumerate(scheduler.plms_timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step_plms(residual, t, sample).prev_sample
+
+ return sample
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.1, residual + 0.05]
+ scheduler.ets = dummy_past_residuals[:]
+
+ output_0 = scheduler.step_prk(residual, 0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step_prk(residual, 1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ output_0 = scheduler.step_plms(residual, 0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step_plms(residual, 1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_timesteps(self):
+ for timesteps in [100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_steps_offset(self):
+ for steps_offset in [0, 1]:
+ self.check_over_configs(steps_offset=steps_offset)
+
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(steps_offset=1)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(10)
+ assert torch.equal(
+ scheduler.timesteps,
+ torch.LongTensor(
+ [901, 851, 851, 801, 801, 751, 751, 701, 701, 651, 651, 601, 601, 501, 401, 301, 201, 101, 1]
+ ),
+ )
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001], [0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_time_indices(self):
+ for t in [1, 5, 10]:
+ self.check_over_forward(time_step=t)
+
+ def test_inference_steps(self):
+ for num_inference_steps in [10, 50, 100]:
+ self.check_over_forward(num_inference_steps=num_inference_steps)
+
+ def test_pow_of_3_inference_steps(self):
+ # an earlier version of set_timesteps() raised an indexing error into the alphas when num_inference_steps was a power of 3
+ num_inference_steps = 27
+
+ for scheduler_class in self.scheduler_classes:
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+
+ # before the power-of-3 fix this errored on the first step, so running just two steps is enough
+ for i, t in enumerate(scheduler.prk_timesteps[:2]):
+ sample = scheduler.step_prk(residual, t, sample).prev_sample
+
+ def test_inference_plms_no_past_residuals(self):
+ with self.assertRaises(ValueError):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.step_plms(self.dummy_sample, 1, self.dummy_sample).prev_sample
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 198.1318) < 1e-2
+ assert abs(result_mean.item() - 0.2580) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 67.3986) < 1e-2
+ assert abs(result_mean.item() - 0.0878) < 1e-3
+
+ def test_full_loop_with_set_alpha_to_one(self):
+ # We specify a different beta schedule so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=True, beta_start=0.01)
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 230.0399) < 1e-2
+ assert abs(result_mean.item() - 0.2995) < 1e-3
+
+ def test_full_loop_with_no_set_alpha_to_one(self):
+ # We specify a different beta schedule so that the first alpha is 0.99
+ sample = self.full_loop(set_alpha_to_one=False, beta_start=0.01)
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 186.9482) < 1e-2
+ assert abs(result_mean.item() - 0.2434) < 1e-3
diff --git a/tests/schedulers/test_scheduler_sasolver.py b/tests/schedulers/test_scheduler_sasolver.py
new file mode 100644
index 0000000..5741946
--- /dev/null
+++ b/tests/schedulers/test_scheduler_sasolver.py
@@ -0,0 +1,202 @@
+import torch
+
+from diffusers import SASolverScheduler
+from diffusers.utils.testing_utils import require_torchsde, torch_device
+
+from .test_schedulers import SchedulerCommonTest
+
+
+@require_torchsde
+class SASolverSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (SASolverScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 10),)
+ num_inference_steps = 10
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1100,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
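+ # SA-Solver keeps a buffer of past model outputs; the test seeds it with as many dummy residuals as the predictor/corrector orders require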
+ scheduler.model_outputs = dummy_past_residuals[
+ : max(
+ scheduler.config.predictor_order,
+ scheduler.config.corrector_order - 1,
+ )
+ ]
+
+ time_step_0 = scheduler.timesteps[5]
+ time_step_1 = scheduler.timesteps[6]
+
+ output_0 = scheduler.step(residual, time_step_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, time_step_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_timesteps(self):
+ for timesteps in [10, 50, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.00001, 0.0001, 0.001], [0.0002, 0.002, 0.02]):
+ self.check_over_configs(beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear"]:
+ self.check_over_configs(beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_full_loop_no_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+ generator = torch.manual_seed(0)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t, generator=generator)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["cpu"]:
+ assert abs(result_sum.item() - 337.394287109375) < 1e-2
+ assert abs(result_mean.item() - 0.43931546807289124) < 1e-3
+ elif torch_device in ["cuda"]:
+ assert abs(result_sum.item() - 329.1999816894531) < 1e-2
+ assert abs(result_mean.item() - 0.4286458194255829) < 1e-3
+ else:
+ print("None")
+
+ def test_full_loop_with_v_prediction(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(prediction_type="v_prediction")
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+ generator = torch.manual_seed(0)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sample = scheduler.scale_model_input(sample, t, generator=generator)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["cpu"]:
+ assert abs(result_sum.item() - 193.1467742919922) < 1e-2
+ assert abs(result_mean.item() - 0.2514931857585907) < 1e-3
+ elif torch_device in ["cuda"]:
+ assert abs(result_sum.item() - 193.4154052734375) < 1e-2
+ assert abs(result_mean.item() - 0.2518429756164551) < 1e-3
+ else:
+ print("None")
+
+ def test_full_loop_device(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.to(torch_device) * scheduler.init_noise_sigma
+ generator = torch.manual_seed(0)
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["cpu"]:
+ assert abs(result_sum.item() - 337.394287109375) < 1e-2
+ assert abs(result_mean.item() - 0.43931546807289124) < 1e-3
+ elif torch_device in ["cuda"]:
+ assert abs(result_sum.item() - 337.394287109375) < 1e-2
+ assert abs(result_mean.item() - 0.4393154978752136) < 1e-3
+ else:
+ print("None")
+
+ def test_full_loop_device_karras_sigmas(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config, use_karras_sigmas=True)
+
+ scheduler.set_timesteps(self.num_inference_steps, device=torch_device)
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.to(torch_device) * scheduler.init_noise_sigma
+ sample = sample.to(torch_device)
+ generator = torch.manual_seed(0)
+
+ for t in scheduler.timesteps:
+ sample = scheduler.scale_model_input(sample, t)
+
+ model_output = model(sample, t)
+
+ output = scheduler.step(model_output, t, sample, generator=generator)
+ sample = output.prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ if torch_device in ["cpu"]:
+ assert abs(result_sum.item() - 837.2554931640625) < 1e-2
+ assert abs(result_mean.item() - 1.0901764631271362) < 1e-2
+ elif torch_device in ["cuda"]:
+ assert abs(result_sum.item() - 837.25537109375) < 1e-2
+ assert abs(result_mean.item() - 1.0901763439178467) < 1e-2
+ else:
+ print("None")
diff --git a/tests/schedulers/test_scheduler_score_sde_ve.py b/tests/schedulers/test_scheduler_score_sde_ve.py
new file mode 100644
index 0000000..08c30f9
--- /dev/null
+++ b/tests/schedulers/test_scheduler_score_sde_ve.py
@@ -0,0 +1,189 @@
+import tempfile
+import unittest
+
+import numpy as np
+import torch
+
+from diffusers import ScoreSdeVeScheduler
+
+
+class ScoreSdeVeSchedulerTest(unittest.TestCase):
+ # TODO: adapt to SchedulerCommonTest (the scheduler needs NumPy integration)
+ scheduler_classes = (ScoreSdeVeScheduler,)
+ forward_default_kwargs = ()
+
+ @property
+ def dummy_sample(self):
+ batch_size = 4
+ num_channels = 3
+ height = 8
+ width = 8
+
+ sample = torch.rand((batch_size, num_channels, height, width))
+
+ return sample
+
+ @property
+ def dummy_sample_deter(self):
+ batch_size = 4
+ num_channels = 3
+ height = 8
+ width = 8
+
+ num_elems = batch_size * num_channels * height * width
+ sample = torch.arange(num_elems)
+ sample = sample.reshape(num_channels, height, width, batch_size)
+ sample = sample / num_elems
+ sample = sample.permute(3, 0, 1, 2)
+
+ return sample
+
+ def dummy_model(self):
+ def model(sample, t, *args):
+ return sample * t / (t + 1)
+
+ return model
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 2000,
+ "snr": 0.15,
+ "sigma_min": 0.01,
+ "sigma_max": 1348,
+ "sampling_eps": 1e-5,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+
+ for scheduler_class in self.scheduler_classes:
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+
+ output = scheduler.step_pred(
+ residual, time_step, sample, generator=torch.manual_seed(0), **kwargs
+ ).prev_sample
+ new_output = new_scheduler.step_pred(
+ residual, time_step, sample, generator=torch.manual_seed(0), **kwargs
+ ).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ output = scheduler.step_correct(residual, sample, generator=torch.manual_seed(0), **kwargs).prev_sample
+ new_output = new_scheduler.step_correct(
+ residual, sample, generator=torch.manual_seed(0), **kwargs
+ ).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler corrections are not identical"
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ kwargs.update(forward_kwargs)
+
+ for scheduler_class in self.scheduler_classes:
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+
+ output = scheduler.step_pred(
+ residual, time_step, sample, generator=torch.manual_seed(0), **kwargs
+ ).prev_sample
+ new_output = new_scheduler.step_pred(
+ residual, time_step, sample, generator=torch.manual_seed(0), **kwargs
+ ).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ output = scheduler.step_correct(residual, sample, generator=torch.manual_seed(0), **kwargs).prev_sample
+ new_output = new_scheduler.step_correct(
+ residual, sample, generator=torch.manual_seed(0), **kwargs
+ ).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler corrections are not identical"
+
+ def test_timesteps(self):
+ for timesteps in [10, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_sigmas(self):
+ for sigma_min, sigma_max in zip([0.0001, 0.001, 0.01], [1, 100, 1000]):
+ self.check_over_configs(sigma_min=sigma_min, sigma_max=sigma_max)
+
+ def test_time_indices(self):
+ for t in [0.1, 0.5, 0.75]:
+ self.check_over_forward(time_step=t)
+
+ def test_full_loop_no_noise(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 3
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+
+ scheduler.set_sigmas(num_inference_steps)
+ scheduler.set_timesteps(num_inference_steps)
+ generator = torch.manual_seed(0)
+
+ for i, t in enumerate(scheduler.timesteps):
+ sigma_t = scheduler.sigmas[i]
+
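+ # run the configured number of corrector steps before the predictor step at this sigma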
+ for _ in range(scheduler.config.correct_steps):
+ with torch.no_grad():
+ model_output = model(sample, sigma_t)
+ sample = scheduler.step_correct(model_output, sample, generator=generator, **kwargs).prev_sample
+
+ with torch.no_grad():
+ model_output = model(sample, sigma_t)
+
+ output = scheduler.step_pred(model_output, t, sample, generator=generator, **kwargs)
+ sample, _ = output.prev_sample, output.prev_sample_mean
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert np.isclose(result_sum.item(), 14372758528.0)
+ assert np.isclose(result_mean.item(), 18714530.0)
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output_0 = scheduler.step_pred(residual, 0, sample, generator=torch.manual_seed(0), **kwargs).prev_sample
+ output_1 = scheduler.step_pred(residual, 1, sample, generator=torch.manual_seed(0), **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
diff --git a/tests/schedulers/test_scheduler_tcd.py b/tests/schedulers/test_scheduler_tcd.py
new file mode 100644
index 0000000..e95c536
--- /dev/null
+++ b/tests/schedulers/test_scheduler_tcd.py
@@ -0,0 +1,180 @@
+import torch
+
+from diffusers import TCDScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class TCDSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (TCDScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 10),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.00085,
+ "beta_end": 0.0120,
+ "beta_schedule": "scaled_linear",
+ "prediction_type": "epsilon",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ @property
+ def default_num_inference_steps(self):
+ return 10
+
+ @property
+ def default_valid_timestep(self):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = self.scheduler_classes[0](**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+ timestep = scheduler.timesteps[-1]
+ return timestep
+
+ def test_timesteps(self):
+ for timesteps in [100, 500, 1000]:
+ # 0 is not guaranteed to be in the timestep schedule, but timesteps - 1 is
+ self.check_over_configs(time_step=timesteps - 1, num_train_timesteps=timesteps)
+
+ def test_betas(self):
+ for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]):
+ self.check_over_configs(time_step=self.default_valid_timestep, beta_start=beta_start, beta_end=beta_end)
+
+ def test_schedules(self):
+ for schedule in ["linear", "scaled_linear", "squaredcos_cap_v2"]:
+ self.check_over_configs(time_step=self.default_valid_timestep, beta_schedule=schedule)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(time_step=self.default_valid_timestep, prediction_type=prediction_type)
+
+ def test_clip_sample(self):
+ for clip_sample in [True, False]:
+ self.check_over_configs(time_step=self.default_valid_timestep, clip_sample=clip_sample)
+
+ def test_thresholding(self):
+ self.check_over_configs(time_step=self.default_valid_timestep, thresholding=False)
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(
+ time_step=self.default_valid_timestep,
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ )
+
+ def test_time_indices(self):
+ # Get default timestep schedule.
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = self.scheduler_classes[0](**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+ timesteps = scheduler.timesteps
+ for t in timesteps:
+ self.check_over_forward(time_step=t)
+
+ def test_inference_steps(self):
+ # Hardcoded for now
+ for t, num_inference_steps in zip([99, 39, 39, 19], [10, 25, 26, 50]):
+ self.check_over_forward(time_step=t, num_inference_steps=num_inference_steps)
+
+ def full_loop(self, num_inference_steps=10, seed=0, **config):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ eta = 0.0 # corresponds to gamma in the paper
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(seed)
+ scheduler.set_timesteps(num_inference_steps)
+
+ for t in scheduler.timesteps:
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample, eta, generator).prev_sample
+
+ return sample
+
+ def test_full_loop_onestep_deter(self):
+ sample = self.full_loop(num_inference_steps=1)
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 29.8715) < 1e-3 # 0.0778918
+ assert abs(result_mean.item() - 0.0389) < 1e-3
+
+ def test_full_loop_multistep_deter(self):
+ sample = self.full_loop(num_inference_steps=10)
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 181.2040) < 1e-3
+ assert abs(result_mean.item() - 0.2359) < 1e-3
+
+ def test_custom_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 1, 0]
+
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ scheduler_timesteps = scheduler.timesteps
+
+ for i, timestep in enumerate(scheduler_timesteps):
+ if i == len(timesteps) - 1:
+ expected_prev_t = -1
+ else:
+ expected_prev_t = timesteps[i + 1]
+
+ prev_t = scheduler.previous_timestep(timestep)
+ prev_t = prev_t.item()
+
+ self.assertEqual(prev_t, expected_prev_t)
+
+ def test_custom_timesteps_increasing_order(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 51, 0]
+
+ with self.assertRaises(ValueError, msg="`custom_timesteps` must be in descending order."):
+ scheduler.set_timesteps(timesteps=timesteps)
+
+ def test_custom_timesteps_passing_both_num_inference_steps_and_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [100, 87, 50, 1, 0]
+ num_inference_steps = len(timesteps)
+
+ with self.assertRaises(ValueError, msg="Can only pass one of `num_inference_steps` or `custom_timesteps`."):
+ scheduler.set_timesteps(num_inference_steps=num_inference_steps, timesteps=timesteps)
+
+ def test_custom_timesteps_too_large(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = [scheduler.config.num_train_timesteps]
+
+ with self.assertRaises(
+ ValueError,
+ msg=f"`timesteps` must start before `self.config.train_timesteps`: {scheduler.config.num_train_timesteps}",
+ ):
+ scheduler.set_timesteps(timesteps=timesteps)
diff --git a/tests/schedulers/test_scheduler_unclip.py b/tests/schedulers/test_scheduler_unclip.py
new file mode 100644
index 0000000..b0ce131
--- /dev/null
+++ b/tests/schedulers/test_scheduler_unclip.py
@@ -0,0 +1,137 @@
+import torch
+
+from diffusers import UnCLIPScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+# UnCLIPScheduler is a modified DDPMScheduler with a subset of the configuration.
+class UnCLIPSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (UnCLIPScheduler,)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "variance_type": "fixed_small_log",
+ "clip_sample": True,
+ "clip_sample_range": 1.0,
+ "prediction_type": "epsilon",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def test_timesteps(self):
+ for timesteps in [1, 5, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_variance_type(self):
+ for variance in ["fixed_small_log", "learned_range"]:
+ self.check_over_configs(variance_type=variance)
+
+ def test_clip_sample(self):
+ for clip_sample in [True, False]:
+ self.check_over_configs(clip_sample=clip_sample)
+
+ def test_clip_sample_range(self):
+ for clip_sample_range in [1, 5, 10, 20]:
+ self.check_over_configs(clip_sample_range=clip_sample_range)
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_time_indices(self):
+ for time_step in [0, 500, 999]:
+ for prev_timestep in [None, 5, 100, 250, 500, 750]:
+ if prev_timestep is not None and prev_timestep >= time_step:
+ continue
+
+ self.check_over_forward(time_step=time_step, prev_timestep=prev_timestep)
+
+ def test_variance_fixed_small_log(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(variance_type="fixed_small_log")
+ scheduler = scheduler_class(**scheduler_config)
+
+ assert torch.sum(torch.abs(scheduler._get_variance(0) - 1.0000e-10)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(487) - 0.0549625)) < 1e-5
+ assert torch.sum(torch.abs(scheduler._get_variance(999) - 0.9994987)) < 1e-5
+
+ def test_variance_learned_range(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(variance_type="learned_range")
+ scheduler = scheduler_class(**scheduler_config)
+
+ predicted_variance = 0.5
+
+        assert abs(scheduler._get_variance(1, predicted_variance=predicted_variance) - -10.1712790) < 1e-5
+        assert abs(scheduler._get_variance(487, predicted_variance=predicted_variance) - -5.7998052) < 1e-5
+        assert abs(scheduler._get_variance(999, predicted_variance=predicted_variance) - -0.0010011) < 1e-5
+
+ def test_full_loop(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ timesteps = scheduler.timesteps
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(0)
+
+ for i, t in enumerate(timesteps):
+ # 1. predict noise residual
+ residual = model(sample, t)
+
+ # 2. predict previous mean of sample x_t-1
+ pred_prev_sample = scheduler.step(residual, t, sample, generator=generator).prev_sample
+
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 252.2682495) < 1e-2
+ assert abs(result_mean.item() - 0.3284743) < 1e-3
+
+ def test_full_loop_skip_timesteps(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ scheduler.set_timesteps(25)
+
+ timesteps = scheduler.timesteps
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ generator = torch.manual_seed(0)
+
+ for i, t in enumerate(timesteps):
+ # 1. predict noise residual
+ residual = model(sample, t)
+
+ if i + 1 == timesteps.shape[0]:
+ prev_timestep = None
+ else:
+ prev_timestep = timesteps[i + 1]
+
+ # 2. predict previous mean of sample x_t-1
+ pred_prev_sample = scheduler.step(
+ residual, t, sample, prev_timestep=prev_timestep, generator=generator
+ ).prev_sample
+
+ sample = pred_prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_sum.item() - 258.2044983) < 1e-2
+ assert abs(result_mean.item() - 0.3362038) < 1e-3
+
+ def test_trained_betas(self):
+ pass
+
+ def test_add_noise_device(self):
+ pass
diff --git a/tests/schedulers/test_scheduler_unipc.py b/tests/schedulers/test_scheduler_unipc.py
new file mode 100644
index 0000000..be41cea
--- /dev/null
+++ b/tests/schedulers/test_scheduler_unipc.py
@@ -0,0 +1,381 @@
+import tempfile
+
+import torch
+
+from diffusers import (
+ DEISMultistepScheduler,
+ DPMSolverMultistepScheduler,
+ DPMSolverSinglestepScheduler,
+ UniPCMultistepScheduler,
+)
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class UniPCMultistepSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (UniPCMultistepScheduler,)
+ forward_default_kwargs = (("num_inference_steps", 25),)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_train_timesteps": 1000,
+ "beta_start": 0.0001,
+ "beta_end": 0.02,
+ "beta_schedule": "linear",
+ "solver_order": 2,
+ "solver_type": "bh2",
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def check_over_configs(self, time_step=0, **config):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ new_scheduler.set_timesteps(num_inference_steps)
+ # copy over dummy past residuals
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output, new_output = sample, sample
+ for t in range(time_step, time_step + scheduler.config.solver_order + 1):
+ t = scheduler.timesteps[t]
+ output = scheduler.step(residual, t, output, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, t, new_output, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residuals (must be after setting timesteps)
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+ # copy over dummy past residuals
+ new_scheduler.set_timesteps(num_inference_steps)
+
+ # copy over dummy past residual (must be after setting timesteps)
+ new_scheduler.model_outputs = dummy_past_residuals[: new_scheduler.config.solver_order]
+
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def full_loop(self, scheduler=None, **config):
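+        # Helper: run a short 10-step denoising loop with the dummy model and return the final sample.
+        # If no scheduler is passed, one is built from the default config plus any overrides.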
+ if scheduler is None:
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ return sample
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # copy over dummy past residuals (must be done after set_timesteps)
+ dummy_past_residuals = [residual + 0.2, residual + 0.15, residual + 0.10]
+ scheduler.model_outputs = dummy_past_residuals[: scheduler.config.solver_order]
+
+ time_step_0 = scheduler.timesteps[5]
+ time_step_1 = scheduler.timesteps[6]
+
+ output_0 = scheduler.step(residual, time_step_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, time_step_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_switch(self):
+        # make sure that cycling through compatible schedulers with the same config
+        # gives the same results for the defaults
+ scheduler = UniPCMultistepScheduler(**self.get_scheduler_config())
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2464) < 1e-3
+
+ scheduler = DPMSolverSinglestepScheduler.from_config(scheduler.config)
+ scheduler = DEISMultistepScheduler.from_config(scheduler.config)
+ scheduler = DPMSolverMultistepScheduler.from_config(scheduler.config)
+ scheduler = UniPCMultistepScheduler.from_config(scheduler.config)
+
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2464) < 1e-3
+
+ def test_timesteps(self):
+ for timesteps in [25, 50, 100, 999, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_thresholding(self):
+ self.check_over_configs(thresholding=False)
+ for order in [1, 2, 3]:
+ for solver_type in ["bh1", "bh2"]:
+ for threshold in [0.5, 1.0, 2.0]:
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(
+ thresholding=True,
+ prediction_type=prediction_type,
+ sample_max_value=threshold,
+ solver_order=order,
+ solver_type=solver_type,
+ )
+
+ def test_prediction_type(self):
+ for prediction_type in ["epsilon", "v_prediction"]:
+ self.check_over_configs(prediction_type=prediction_type)
+
+ def test_solver_order_and_type(self):
+ for solver_type in ["bh1", "bh2"]:
+ for order in [1, 2, 3]:
+ for prediction_type in ["epsilon", "sample"]:
+ self.check_over_configs(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ )
+ sample = self.full_loop(
+ solver_order=order,
+ solver_type=solver_type,
+ prediction_type=prediction_type,
+ )
+ assert not torch.isnan(sample).any(), "Samples have nan numbers"
+
+ def test_lower_order_final(self):
+ self.check_over_configs(lower_order_final=True)
+ self.check_over_configs(lower_order_final=False)
+
+ def test_inference_steps(self):
+ for num_inference_steps in [1, 2, 3, 5, 10, 50, 100, 999, 1000]:
+ self.check_over_forward(num_inference_steps=num_inference_steps, time_step=0)
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2464) < 1e-3
+
+ def test_full_loop_with_karras(self):
+ sample = self.full_loop(use_karras_sigmas=True)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2925) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.1014) < 1e-3
+
+ def test_full_loop_with_karras_and_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction", use_karras_sigmas=True)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.1966) < 1e-3
+
+ def test_fp16_support(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config(thresholding=True, dynamic_thresholding_ratio=0)
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter.half()
+ scheduler.set_timesteps(num_inference_steps)
+
+ for i, t in enumerate(scheduler.timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ assert sample.dtype == torch.float16
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ t_start = 8
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+        assert abs(result_sum.item() - 315.5757) < 1e-2, f"expected result sum 315.5757, but got {result_sum}"
+        assert abs(result_mean.item() - 0.4109) < 1e-3, f"expected result mean 0.4109, but got {result_mean}"
+
+
+class UniPCMultistepScheduler1DTest(UniPCMultistepSchedulerTest):
+ @property
+ def dummy_sample(self):
+ batch_size = 4
+ num_channels = 3
+ width = 8
+
+ sample = torch.rand((batch_size, num_channels, width))
+
+ return sample
+
+ @property
+ def dummy_noise_deter(self):
+ batch_size = 4
+ num_channels = 3
+ width = 8
+
+ num_elems = batch_size * num_channels * width
+ sample = torch.arange(num_elems).flip(-1)
+ sample = sample.reshape(num_channels, width, batch_size)
+ sample = sample / num_elems
+ sample = sample.permute(2, 0, 1)
+
+ return sample
+
+ @property
+ def dummy_sample_deter(self):
+ batch_size = 4
+ num_channels = 3
+ width = 8
+
+ num_elems = batch_size * num_channels * width
+ sample = torch.arange(num_elems)
+ sample = sample.reshape(num_channels, width, batch_size)
+ sample = sample / num_elems
+ sample = sample.permute(2, 0, 1)
+
+ return sample
+
+ def test_switch(self):
+        # make sure that cycling through compatible schedulers with the same config
+        # gives the same results for the defaults
+ scheduler = UniPCMultistepScheduler(**self.get_scheduler_config())
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2441) < 1e-3
+
+ scheduler = DPMSolverSinglestepScheduler.from_config(scheduler.config)
+ scheduler = DEISMultistepScheduler.from_config(scheduler.config)
+ scheduler = DPMSolverMultistepScheduler.from_config(scheduler.config)
+ scheduler = UniPCMultistepScheduler.from_config(scheduler.config)
+
+ sample = self.full_loop(scheduler=scheduler)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2441) < 1e-3
+
+ def test_full_loop_no_noise(self):
+ sample = self.full_loop()
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2441) < 1e-3
+
+ def test_full_loop_with_karras(self):
+ sample = self.full_loop(use_karras_sigmas=True)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.2898) < 1e-3
+
+ def test_full_loop_with_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction")
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.1014) < 1e-3
+
+ def test_full_loop_with_karras_and_v_prediction(self):
+ sample = self.full_loop(prediction_type="v_prediction", use_karras_sigmas=True)
+ result_mean = torch.mean(torch.abs(sample))
+
+ assert abs(result_mean.item() - 0.1944) < 1e-3
+
+ def test_full_loop_with_noise(self):
+ scheduler_class = self.scheduler_classes[0]
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ num_inference_steps = 10
+ t_start = 8
+
+ model = self.dummy_model()
+ sample = self.dummy_sample_deter
+ scheduler.set_timesteps(num_inference_steps)
+
+ # add noise
+ noise = self.dummy_noise_deter
+ timesteps = scheduler.timesteps[t_start * scheduler.order :]
+ sample = scheduler.add_noise(sample, noise, timesteps[:1])
+
+ for i, t in enumerate(timesteps):
+ residual = model(sample, t)
+ sample = scheduler.step(residual, t, sample).prev_sample
+
+ result_sum = torch.sum(torch.abs(sample))
+ result_mean = torch.mean(torch.abs(sample))
+
+        assert abs(result_sum.item() - 39.0870) < 1e-2, f"expected result sum 39.0870, but got {result_sum}"
+        assert abs(result_mean.item() - 0.4072) < 1e-3, f"expected result mean 0.4072, but got {result_mean}"
diff --git a/tests/schedulers/test_scheduler_vq_diffusion.py b/tests/schedulers/test_scheduler_vq_diffusion.py
new file mode 100644
index 0000000..74437ad
--- /dev/null
+++ b/tests/schedulers/test_scheduler_vq_diffusion.py
@@ -0,0 +1,56 @@
+import torch
+import torch.nn.functional as F
+
+from diffusers import VQDiffusionScheduler
+
+from .test_schedulers import SchedulerCommonTest
+
+
+class VQDiffusionSchedulerTest(SchedulerCommonTest):
+ scheduler_classes = (VQDiffusionScheduler,)
+
+ def get_scheduler_config(self, **kwargs):
+ config = {
+ "num_vec_classes": 4097,
+ "num_train_timesteps": 100,
+ }
+
+ config.update(**kwargs)
+ return config
+
+ def dummy_sample(self, num_vec_classes):
+ batch_size = 4
+ height = 8
+ width = 8
+
+ sample = torch.randint(0, num_vec_classes, (batch_size, height * width))
+
+ return sample
+
+ @property
+ def dummy_sample_deter(self):
+ assert False
+
+ def dummy_model(self, num_vec_classes):
+ def model(sample, t, *args):
+ batch_size, num_latent_pixels = sample.shape
+ logits = torch.rand((batch_size, num_vec_classes - 1, num_latent_pixels))
+ return_value = F.log_softmax(logits.double(), dim=1).float()
+ return return_value
+
+ return model
+
+ def test_timesteps(self):
+ for timesteps in [2, 5, 100, 1000]:
+ self.check_over_configs(num_train_timesteps=timesteps)
+
+ def test_num_vec_classes(self):
+ for num_vec_classes in [5, 100, 1000, 4000]:
+ self.check_over_configs(num_vec_classes=num_vec_classes)
+
+ def test_time_indices(self):
+ for t in [0, 50, 99]:
+ self.check_over_forward(time_step=t)
+
+ def test_add_noise_device(self):
+ pass
diff --git a/tests/schedulers/test_schedulers.py b/tests/schedulers/test_schedulers.py
new file mode 100755
index 0000000..9982db7
--- /dev/null
+++ b/tests/schedulers/test_schedulers.py
@@ -0,0 +1,868 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import inspect
+import json
+import os
+import tempfile
+import unittest
+import uuid
+from typing import Dict, List, Tuple
+
+import numpy as np
+import torch
+from huggingface_hub import delete_repo
+
+import diffusers
+from diffusers import (
+ CMStochasticIterativeScheduler,
+ DDIMScheduler,
+ DEISMultistepScheduler,
+ DiffusionPipeline,
+ EDMEulerScheduler,
+ EulerAncestralDiscreteScheduler,
+ EulerDiscreteScheduler,
+ IPNDMScheduler,
+ LMSDiscreteScheduler,
+ UniPCMultistepScheduler,
+ VQDiffusionScheduler,
+)
+from diffusers.configuration_utils import ConfigMixin, register_to_config
+from diffusers.schedulers.scheduling_utils import SchedulerMixin
+from diffusers.utils import logging
+from diffusers.utils.testing_utils import CaptureLogger, torch_device
+
+from ..others.test_utils import TOKEN, USER, is_staging_test
+
+
+torch.backends.cuda.matmul.allow_tf32 = False
+
+
+logger = logging.get_logger(__name__) # pylint: disable=invalid-name
+
+
+class SchedulerObject(SchedulerMixin, ConfigMixin):
+ config_name = "config.json"
+
+ @register_to_config
+ def __init__(
+ self,
+ a=2,
+ b=5,
+ c=(2, 5),
+ d="for diffusion",
+ e=[1, 3],
+ ):
+ pass
+
+
+class SchedulerObject2(SchedulerMixin, ConfigMixin):
+ config_name = "config.json"
+
+ @register_to_config
+ def __init__(
+ self,
+ a=2,
+ b=5,
+ c=(2, 5),
+ d="for diffusion",
+ f=[1, 3],
+ ):
+ pass
+
+
+class SchedulerObject3(SchedulerMixin, ConfigMixin):
+ config_name = "config.json"
+
+ @register_to_config
+ def __init__(
+ self,
+ a=2,
+ b=5,
+ c=(2, 5),
+ d="for diffusion",
+ e=[1, 3],
+ f=[1, 3],
+ ):
+ pass
+
+
+class SchedulerBaseTests(unittest.TestCase):
+ def test_save_load_from_different_config(self):
+ obj = SchedulerObject()
+
+ # mock add obj class to `diffusers`
+ setattr(diffusers, "SchedulerObject", SchedulerObject)
+ logger = logging.get_logger("diffusers.configuration_utils")
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ obj.save_config(tmpdirname)
+ with CaptureLogger(logger) as cap_logger_1:
+ config = SchedulerObject2.load_config(tmpdirname)
+ new_obj_1 = SchedulerObject2.from_config(config)
+
+ # now save a config parameter that is not expected
+ with open(os.path.join(tmpdirname, SchedulerObject.config_name), "r") as f:
+ data = json.load(f)
+ data["unexpected"] = True
+
+ with open(os.path.join(tmpdirname, SchedulerObject.config_name), "w") as f:
+ json.dump(data, f)
+
+ with CaptureLogger(logger) as cap_logger_2:
+ config = SchedulerObject.load_config(tmpdirname)
+ new_obj_2 = SchedulerObject.from_config(config)
+
+ with CaptureLogger(logger) as cap_logger_3:
+ config = SchedulerObject2.load_config(tmpdirname)
+ new_obj_3 = SchedulerObject2.from_config(config)
+
+ assert new_obj_1.__class__ == SchedulerObject2
+ assert new_obj_2.__class__ == SchedulerObject
+ assert new_obj_3.__class__ == SchedulerObject2
+
+ assert cap_logger_1.out == ""
+ assert (
+ cap_logger_2.out
+ == "The config attributes {'unexpected': True} were passed to SchedulerObject, but are not expected and"
+ " will"
+ " be ignored. Please verify your config.json configuration file.\n"
+ )
+ assert cap_logger_2.out.replace("SchedulerObject", "SchedulerObject2") == cap_logger_3.out
+
+ def test_save_load_compatible_schedulers(self):
+ SchedulerObject2._compatibles = ["SchedulerObject"]
+ SchedulerObject._compatibles = ["SchedulerObject2"]
+
+ obj = SchedulerObject()
+
+ # mock add obj class to `diffusers`
+ setattr(diffusers, "SchedulerObject", SchedulerObject)
+ setattr(diffusers, "SchedulerObject2", SchedulerObject2)
+ logger = logging.get_logger("diffusers.configuration_utils")
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ obj.save_config(tmpdirname)
+
+ # now save a config parameter that is expected by another class, but not origin class
+ with open(os.path.join(tmpdirname, SchedulerObject.config_name), "r") as f:
+ data = json.load(f)
+ data["f"] = [0, 0]
+ data["unexpected"] = True
+
+ with open(os.path.join(tmpdirname, SchedulerObject.config_name), "w") as f:
+ json.dump(data, f)
+
+ with CaptureLogger(logger) as cap_logger:
+ config = SchedulerObject.load_config(tmpdirname)
+ new_obj = SchedulerObject.from_config(config)
+
+ assert new_obj.__class__ == SchedulerObject
+
+ assert (
+ cap_logger.out
+ == "The config attributes {'unexpected': True} were passed to SchedulerObject, but are not expected and"
+ " will"
+ " be ignored. Please verify your config.json configuration file.\n"
+ )
+
+ def test_save_load_from_different_config_comp_schedulers(self):
+ SchedulerObject3._compatibles = ["SchedulerObject", "SchedulerObject2"]
+ SchedulerObject2._compatibles = ["SchedulerObject", "SchedulerObject3"]
+ SchedulerObject._compatibles = ["SchedulerObject2", "SchedulerObject3"]
+
+ obj = SchedulerObject()
+
+ # mock add obj class to `diffusers`
+ setattr(diffusers, "SchedulerObject", SchedulerObject)
+ setattr(diffusers, "SchedulerObject2", SchedulerObject2)
+ setattr(diffusers, "SchedulerObject3", SchedulerObject3)
+ logger = logging.get_logger("diffusers.configuration_utils")
+ logger.setLevel(diffusers.logging.INFO)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ obj.save_config(tmpdirname)
+
+ with CaptureLogger(logger) as cap_logger_1:
+ config = SchedulerObject.load_config(tmpdirname)
+ new_obj_1 = SchedulerObject.from_config(config)
+
+ with CaptureLogger(logger) as cap_logger_2:
+ config = SchedulerObject2.load_config(tmpdirname)
+ new_obj_2 = SchedulerObject2.from_config(config)
+
+ with CaptureLogger(logger) as cap_logger_3:
+ config = SchedulerObject3.load_config(tmpdirname)
+ new_obj_3 = SchedulerObject3.from_config(config)
+
+ assert new_obj_1.__class__ == SchedulerObject
+ assert new_obj_2.__class__ == SchedulerObject2
+ assert new_obj_3.__class__ == SchedulerObject3
+
+ assert cap_logger_1.out == ""
+ assert cap_logger_2.out == "{'f'} was not found in config. Values will be initialized to default values.\n"
+ assert cap_logger_3.out == "{'f'} was not found in config. Values will be initialized to default values.\n"
+
+ def test_default_arguments_not_in_config(self):
+ pipe = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", torch_dtype=torch.float16
+ )
+ assert pipe.scheduler.__class__ == DDIMScheduler
+
+ # Default for DDIMScheduler
+ assert pipe.scheduler.config.timestep_spacing == "leading"
+
+ # Switch to a different one, verify we use the default for that class
+ pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
+ assert pipe.scheduler.config.timestep_spacing == "linspace"
+
+ # Override with kwargs
+ pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
+ assert pipe.scheduler.config.timestep_spacing == "trailing"
+
+ # Verify overridden kwargs stick
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ assert pipe.scheduler.config.timestep_spacing == "trailing"
+
+ # And stick
+ pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+ assert pipe.scheduler.config.timestep_spacing == "trailing"
+
+ def test_default_solver_type_after_switch(self):
+ pipe = DiffusionPipeline.from_pretrained(
+ "hf-internal-testing/tiny-stable-diffusion-pipe", torch_dtype=torch.float16
+ )
+ assert pipe.scheduler.__class__ == DDIMScheduler
+
+ pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config)
+ assert pipe.scheduler.config.solver_type == "logrho"
+
+ # Switch to UniPC, verify the solver is the default
+ pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+ assert pipe.scheduler.config.solver_type == "bh2"
+
+
+class SchedulerCommonTest(unittest.TestCase):
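+    # Shared harness for all scheduler tests: subclasses set `scheduler_classes` and implement
+    # `get_scheduler_config`, and inherit the generic save/load, step-shape, and output checks below.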
+ scheduler_classes = ()
+ forward_default_kwargs = ()
+
+ @property
+ def default_num_inference_steps(self):
+ return 50
+
+ @property
+ def default_timestep(self):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.get("num_inference_steps", self.default_num_inference_steps)
+
+ try:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = self.scheduler_classes[0](**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+ timestep = scheduler.timesteps[0]
+ except NotImplementedError:
+ logger.warning(
+ f"The scheduler {self.__class__.__name__} does not implement a `get_scheduler_config` method."
+ f" `default_timestep` will be set to the default value of 1."
+ )
+ timestep = 1
+
+ return timestep
+
+ # NOTE: currently taking the convention that default_timestep > default_timestep_2 (alternatively,
+ # default_timestep comes earlier in the timestep schedule than default_timestep_2)
+ @property
+ def default_timestep_2(self):
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.get("num_inference_steps", self.default_num_inference_steps)
+
+ try:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = self.scheduler_classes[0](**scheduler_config)
+
+ scheduler.set_timesteps(num_inference_steps)
+ if len(scheduler.timesteps) >= 2:
+ timestep_2 = scheduler.timesteps[1]
+ else:
+ logger.warning(
+ f"Using num_inference_steps from the scheduler testing class's default config leads to a timestep"
+ f" scheduler of length {len(scheduler.timesteps)} < 2. The default `default_timestep_2` value of 0"
+ f" will be used."
+ )
+ timestep_2 = 0
+ except NotImplementedError:
+ logger.warning(
+ f"The scheduler {self.__class__.__name__} does not implement a `get_scheduler_config` method."
+ f" `default_timestep_2` will be set to the default value of 0."
+ )
+ timestep_2 = 0
+
+ return timestep_2
+
+ @property
+ def dummy_sample(self):
+ batch_size = 4
+ num_channels = 3
+ height = 8
+ width = 8
+
+ sample = torch.rand((batch_size, num_channels, height, width))
+
+ return sample
+
+ @property
+ def dummy_noise_deter(self):
+ batch_size = 4
+ num_channels = 3
+ height = 8
+ width = 8
+
+ num_elems = batch_size * num_channels * height * width
+ sample = torch.arange(num_elems).flip(-1)
+ sample = sample.reshape(num_channels, height, width, batch_size)
+ sample = sample / num_elems
+ sample = sample.permute(3, 0, 1, 2)
+
+ return sample
+
+ @property
+ def dummy_sample_deter(self):
+ batch_size = 4
+ num_channels = 3
+ height = 8
+ width = 8
+
+ num_elems = batch_size * num_channels * height * width
+ sample = torch.arange(num_elems)
+ sample = sample.reshape(num_channels, height, width, batch_size)
+ sample = sample / num_elems
+ sample = sample.permute(3, 0, 1, 2)
+
+ return sample
+
+ def get_scheduler_config(self):
+ raise NotImplementedError
+
+ def dummy_model(self):
+ def model(sample, t, *args):
+ # if t is a tensor, match the number of dimensions of sample
+ if isinstance(t, torch.Tensor):
+ num_dims = len(sample.shape)
+ # pad t with 1s to match num_dims
+ t = t.reshape(-1, *(1,) * (num_dims - 1)).to(sample.device).to(sample.dtype)
+
+ return sample * t / (t + 1)
+
+ return model
+
+ def check_over_configs(self, time_step=0, **config):
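+        # Build each scheduler with the given config overrides, save and reload it, and check that
+        # `step()` produces identical outputs before and after the round-trip.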
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ time_step = time_step if time_step is not None else self.default_timestep
+
+ for scheduler_class in self.scheduler_classes:
+            # TODO(Suraj) - delete the following two lines once DDPM, DDIM, and PNDM have timesteps cast to float by default
+ if scheduler_class in (EulerAncestralDiscreteScheduler, EulerDiscreteScheduler, LMSDiscreteScheduler):
+ time_step = float(time_step)
+
+ scheduler_config = self.get_scheduler_config(**config)
+ scheduler = scheduler_class(**scheduler_config)
+
+ if scheduler_class == CMStochasticIterativeScheduler:
+ # Get valid timestep based on sigma_max, which should always be in timestep schedule.
+ scaled_sigma_max = scheduler.sigma_to_t(scheduler.config.sigma_max)
+ time_step = scaled_sigma_max
+
+ if scheduler_class == EDMEulerScheduler:
+ time_step = scheduler.timesteps[-1]
+
+ if scheduler_class == VQDiffusionScheduler:
+ num_vec_classes = scheduler_config["num_vec_classes"]
+ sample = self.dummy_sample(num_vec_classes)
+ model = self.dummy_model(num_vec_classes)
+ residual = model(sample, time_step)
+ else:
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ new_scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ # Make sure `scale_model_input` is invoked to prevent a warning
+ if scheduler_class == CMStochasticIterativeScheduler:
+ # Get valid timestep based on sigma_max, which should always be in timestep schedule.
+ _ = scheduler.scale_model_input(sample, scaled_sigma_max)
+ _ = new_scheduler.scale_model_input(sample, scaled_sigma_max)
+ elif scheduler_class != VQDiffusionScheduler:
+ _ = scheduler.scale_model_input(sample, scheduler.timesteps[-1])
+ _ = new_scheduler.scale_model_input(sample, scheduler.timesteps[-1])
+
+ # Set the seed before step() as some schedulers are stochastic like EulerAncestralDiscreteScheduler, EulerDiscreteScheduler
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def check_over_forward(self, time_step=0, **forward_kwargs):
+ kwargs = dict(self.forward_default_kwargs)
+ kwargs.update(forward_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", None)
+ time_step = time_step if time_step is not None else self.default_timestep
+
+ for scheduler_class in self.scheduler_classes:
+ if scheduler_class in (EulerAncestralDiscreteScheduler, EulerDiscreteScheduler, LMSDiscreteScheduler):
+ time_step = float(time_step)
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ if scheduler_class == VQDiffusionScheduler:
+ num_vec_classes = scheduler_config["num_vec_classes"]
+ sample = self.dummy_sample(num_vec_classes)
+ model = self.dummy_model(num_vec_classes)
+ residual = model(sample, time_step)
+ else:
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ new_scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ output = scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ new_output = new_scheduler.step(residual, time_step, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_from_save_pretrained(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", self.default_num_inference_steps)
+
+ for scheduler_class in self.scheduler_classes:
+ timestep = self.default_timestep
+ if scheduler_class in (EulerAncestralDiscreteScheduler, EulerDiscreteScheduler, LMSDiscreteScheduler):
+ timestep = float(timestep)
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ if scheduler_class == CMStochasticIterativeScheduler:
+ # Get valid timestep based on sigma_max, which should always be in timestep schedule.
+ timestep = scheduler.sigma_to_t(scheduler.config.sigma_max)
+
+ if scheduler_class == VQDiffusionScheduler:
+ num_vec_classes = scheduler_config["num_vec_classes"]
+ sample = self.dummy_sample(num_vec_classes)
+ model = self.dummy_model(num_vec_classes)
+ residual = model(sample, timestep)
+ else:
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_config(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ new_scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ output = scheduler.step(residual, timestep, sample, **kwargs).prev_sample
+
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ new_output = new_scheduler.step(residual, timestep, sample, **kwargs).prev_sample
+
+ assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical"
+
+ def test_compatibles(self):
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+
+ scheduler = scheduler_class(**scheduler_config)
+
+ assert all(c is not None for c in scheduler.compatibles)
+
+ for comp_scheduler_cls in scheduler.compatibles:
+ comp_scheduler = comp_scheduler_cls.from_config(scheduler.config)
+ assert comp_scheduler is not None
+
+ new_scheduler = scheduler_class.from_config(comp_scheduler.config)
+
+ new_scheduler_config = {k: v for k, v in new_scheduler.config.items() if k in scheduler.config}
+ scheduler_diff = {k: v for k, v in new_scheduler.config.items() if k not in scheduler.config}
+
+ # make sure that configs are essentially identical
+ assert new_scheduler_config == dict(scheduler.config)
+
+ # make sure that only differences are for configs that are not in init
+ init_keys = inspect.signature(scheduler_class.__init__).parameters.keys()
+ assert set(scheduler_diff.keys()).intersection(set(init_keys)) == set()
+
+ def test_from_pretrained(self):
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+
+ scheduler = scheduler_class(**scheduler_config)
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_pretrained(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+
+ # `_use_default_values` should not exist for just saved & loaded scheduler
+ scheduler_config = dict(scheduler.config)
+ del scheduler_config["_use_default_values"]
+
+ assert scheduler_config == new_scheduler.config
+
+ def test_step_shape(self):
+ kwargs = dict(self.forward_default_kwargs)
+
+ num_inference_steps = kwargs.pop("num_inference_steps", self.default_num_inference_steps)
+
+ timestep_0 = self.default_timestep
+ timestep_1 = self.default_timestep_2
+
+ for scheduler_class in self.scheduler_classes:
+ if scheduler_class in (EulerAncestralDiscreteScheduler, EulerDiscreteScheduler, LMSDiscreteScheduler):
+ timestep_0 = float(timestep_0)
+ timestep_1 = float(timestep_1)
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ if scheduler_class == VQDiffusionScheduler:
+ num_vec_classes = scheduler_config["num_vec_classes"]
+ sample = self.dummy_sample(num_vec_classes)
+ model = self.dummy_model(num_vec_classes)
+ residual = model(sample, timestep_0)
+ else:
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+ output_0 = scheduler.step(residual, timestep_0, sample, **kwargs).prev_sample
+ output_1 = scheduler.step(residual, timestep_1, sample, **kwargs).prev_sample
+
+ self.assertEqual(output_0.shape, sample.shape)
+ self.assertEqual(output_0.shape, output_1.shape)
+
+ def test_scheduler_outputs_equivalence(self):
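+        # `step()` can return either an output dataclass or, with `return_dict=False`, a plain tuple;
+        # both forms should contain numerically identical tensors.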
+ def set_nan_tensor_to_zero(t):
+ t[t != t] = 0
+ return t
+
+ def recursive_check(tuple_object, dict_object):
+ if isinstance(tuple_object, (List, Tuple)):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif isinstance(tuple_object, Dict):
+ for tuple_iterable_value, dict_iterable_value in zip(tuple_object.values(), dict_object.values()):
+ recursive_check(tuple_iterable_value, dict_iterable_value)
+ elif tuple_object is None:
+ return
+ else:
+ self.assertTrue(
+ torch.allclose(
+ set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5
+ ),
+ msg=(
+ "Tuple and dict output are not equal. Difference:"
+ f" {torch.max(torch.abs(tuple_object - dict_object))}. Tuple has `nan`:"
+ f" {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object)}. Dict has"
+ f" `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object)}."
+ ),
+ )
+
+ kwargs = dict(self.forward_default_kwargs)
+ num_inference_steps = kwargs.pop("num_inference_steps", self.default_num_inference_steps)
+
+ timestep = self.default_timestep
+ if len(self.scheduler_classes) > 0 and self.scheduler_classes[0] == IPNDMScheduler:
+ timestep = 1
+
+ for scheduler_class in self.scheduler_classes:
+ if scheduler_class in (EulerAncestralDiscreteScheduler, EulerDiscreteScheduler, LMSDiscreteScheduler):
+ timestep = float(timestep)
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ if scheduler_class == CMStochasticIterativeScheduler:
+ # Get valid timestep based on sigma_max, which should always be in timestep schedule.
+ timestep = scheduler.sigma_to_t(scheduler.config.sigma_max)
+
+ if scheduler_class == VQDiffusionScheduler:
+ num_vec_classes = scheduler_config["num_vec_classes"]
+ sample = self.dummy_sample(num_vec_classes)
+ model = self.dummy_model(num_vec_classes)
+ residual = model(sample, timestep)
+ else:
+ sample = self.dummy_sample
+ residual = 0.1 * sample
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+            # Set the seed before step() as some schedulers are stochastic like EulerAncestralDiscreteScheduler, EulerDiscreteScheduler
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ outputs_dict = scheduler.step(residual, timestep, sample, **kwargs)
+
+ if num_inference_steps is not None and hasattr(scheduler, "set_timesteps"):
+ scheduler.set_timesteps(num_inference_steps)
+ elif num_inference_steps is not None and not hasattr(scheduler, "set_timesteps"):
+ kwargs["num_inference_steps"] = num_inference_steps
+
+            # Set the seed before step() as some schedulers are stochastic like EulerAncestralDiscreteScheduler, EulerDiscreteScheduler
+ if "generator" in set(inspect.signature(scheduler.step).parameters.keys()):
+ kwargs["generator"] = torch.manual_seed(0)
+ outputs_tuple = scheduler.step(residual, timestep, sample, return_dict=False, **kwargs)
+
+ recursive_check(outputs_tuple, outputs_dict)
+
+ def test_scheduler_public_api(self):
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ if scheduler_class != VQDiffusionScheduler:
+ self.assertTrue(
+ hasattr(scheduler, "init_noise_sigma"),
+ f"{scheduler_class} does not implement a required attribute `init_noise_sigma`",
+ )
+ self.assertTrue(
+ hasattr(scheduler, "scale_model_input"),
+ (
+ f"{scheduler_class} does not implement a required class method `scale_model_input(sample,"
+ " timestep)`"
+ ),
+ )
+ self.assertTrue(
+ hasattr(scheduler, "step"),
+ f"{scheduler_class} does not implement a required class method `step(...)`",
+ )
+
+ if scheduler_class != VQDiffusionScheduler:
+ sample = self.dummy_sample
+ if scheduler_class == CMStochasticIterativeScheduler:
+ # Get valid timestep based on sigma_max, which should always be in timestep schedule.
+ scaled_sigma_max = scheduler.sigma_to_t(scheduler.config.sigma_max)
+ scaled_sample = scheduler.scale_model_input(sample, scaled_sigma_max)
+ elif scheduler_class == EDMEulerScheduler:
+ scaled_sample = scheduler.scale_model_input(sample, scheduler.timesteps[-1])
+ else:
+ scaled_sample = scheduler.scale_model_input(sample, 0.0)
+ self.assertEqual(sample.shape, scaled_sample.shape)
+
+ def test_add_noise_device(self):
+ for scheduler_class in self.scheduler_classes:
+ if scheduler_class == IPNDMScheduler:
+ continue
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+ scheduler.set_timesteps(self.default_num_inference_steps)
+
+ sample = self.dummy_sample.to(torch_device)
+ if scheduler_class == CMStochasticIterativeScheduler:
+ # Get valid timestep based on sigma_max, which should always be in timestep schedule.
+ scaled_sigma_max = scheduler.sigma_to_t(scheduler.config.sigma_max)
+ scaled_sample = scheduler.scale_model_input(sample, scaled_sigma_max)
+            elif scheduler_class == EDMEulerScheduler:
+ scaled_sample = scheduler.scale_model_input(sample, scheduler.timesteps[-1])
+ else:
+ scaled_sample = scheduler.scale_model_input(sample, 0.0)
+ self.assertEqual(sample.shape, scaled_sample.shape)
+
+ noise = torch.randn_like(scaled_sample).to(torch_device)
+ t = scheduler.timesteps[5][None]
+ noised = scheduler.add_noise(scaled_sample, noise, t)
+ self.assertEqual(noised.shape, scaled_sample.shape)
+
+ def test_deprecated_kwargs(self):
+ for scheduler_class in self.scheduler_classes:
+ has_kwarg_in_model_class = "kwargs" in inspect.signature(scheduler_class.__init__).parameters
+ has_deprecated_kwarg = len(scheduler_class._deprecated_kwargs) > 0
+
+ if has_kwarg_in_model_class and not has_deprecated_kwarg:
+ raise ValueError(
+ f"{scheduler_class} has `**kwargs` in its __init__ method but has not defined any deprecated"
+ " kwargs under the `_deprecated_kwargs` class attribute. Make sure to either remove `**kwargs` if"
+ " there are no deprecated arguments or add the deprecated argument with `_deprecated_kwargs ="
+ " []`"
+ )
+
+ if not has_kwarg_in_model_class and has_deprecated_kwarg:
+ raise ValueError(
+ f"{scheduler_class} doesn't have `**kwargs` in its __init__ method but has defined deprecated"
+ " kwargs under the `_deprecated_kwargs` class attribute. Make sure to either add the `**kwargs`"
+ f" argument to {self.model_class}.__init__ if there are deprecated arguments or remove the"
+ " deprecated argument from `_deprecated_kwargs = []`"
+ )
+
+ def test_trained_betas(self):
+ for scheduler_class in self.scheduler_classes:
+ if scheduler_class in (VQDiffusionScheduler, CMStochasticIterativeScheduler):
+ continue
+
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config, trained_betas=np.array([0.1, 0.3]))
+
+ with tempfile.TemporaryDirectory() as tmpdirname:
+ scheduler.save_pretrained(tmpdirname)
+ new_scheduler = scheduler_class.from_pretrained(tmpdirname)
+
+ assert scheduler.betas.tolist() == new_scheduler.betas.tolist()
+
+ def test_getattr_is_correct(self):
+ for scheduler_class in self.scheduler_classes:
+ scheduler_config = self.get_scheduler_config()
+ scheduler = scheduler_class(**scheduler_config)
+
+ # save some things to test
+ scheduler.dummy_attribute = 5
+ scheduler.register_to_config(test_attribute=5)
+
+ logger = logging.get_logger("diffusers.configuration_utils")
+ # 30 for warning
+ logger.setLevel(30)
+ with CaptureLogger(logger) as cap_logger:
+ assert hasattr(scheduler, "dummy_attribute")
+ assert getattr(scheduler, "dummy_attribute") == 5
+ assert scheduler.dummy_attribute == 5
+
+ # no warning should be thrown
+ assert cap_logger.out == ""
+
+            logger = logging.get_logger("diffusers.schedulers.scheduling_utils")
+ # 30 for warning
+ logger.setLevel(30)
+ with CaptureLogger(logger) as cap_logger:
+ assert hasattr(scheduler, "save_pretrained")
+ fn = scheduler.save_pretrained
+ fn_1 = getattr(scheduler, "save_pretrained")
+
+ assert fn == fn_1
+ # no warning should be thrown
+ assert cap_logger.out == ""
+
+ # warning should be thrown
+ with self.assertWarns(FutureWarning):
+ assert scheduler.test_attribute == 5
+
+ with self.assertWarns(FutureWarning):
+ assert getattr(scheduler, "test_attribute") == 5
+
+ with self.assertRaises(AttributeError) as error:
+ scheduler.does_not_exist
+
+ assert str(error.exception) == f"'{type(scheduler).__name__}' object has no attribute 'does_not_exist'"
+
+
+@is_staging_test
+class SchedulerPushToHubTester(unittest.TestCase):
+ identifier = uuid.uuid4()
+ repo_id = f"test-scheduler-{identifier}"
+ org_repo_id = f"valid_org/{repo_id}-org"
+
+ def test_push_to_hub(self):
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ scheduler.push_to_hub(self.repo_id, token=TOKEN)
+ scheduler_loaded = DDIMScheduler.from_pretrained(f"{USER}/{self.repo_id}")
+
+ assert type(scheduler) == type(scheduler_loaded)
+
+ # Reset repo
+ delete_repo(token=TOKEN, repo_id=self.repo_id)
+
+ # Push to hub via save_config
+ with tempfile.TemporaryDirectory() as tmp_dir:
+ scheduler.save_config(tmp_dir, repo_id=self.repo_id, push_to_hub=True, token=TOKEN)
+
+ scheduler_loaded = DDIMScheduler.from_pretrained(f"{USER}/{self.repo_id}")
+
+ assert type(scheduler) == type(scheduler_loaded)
+
+ # Reset repo
+ delete_repo(token=TOKEN, repo_id=self.repo_id)
+
+ def test_push_to_hub_in_organization(self):
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ set_alpha_to_one=False,
+ )
+ scheduler.push_to_hub(self.org_repo_id, token=TOKEN)
+ scheduler_loaded = DDIMScheduler.from_pretrained(self.org_repo_id)
+
+ assert type(scheduler) == type(scheduler_loaded)
+
+ # Reset repo
+ delete_repo(token=TOKEN, repo_id=self.org_repo_id)
+
+ # Push to hub via save_config
+ with tempfile.TemporaryDirectory() as tmp_dir:
+ scheduler.save_config(tmp_dir, repo_id=self.org_repo_id, push_to_hub=True, token=TOKEN)
+
+ scheduler_loaded = DDIMScheduler.from_pretrained(self.org_repo_id)
+
+ assert type(scheduler) == type(scheduler_loaded)
+
+ # Reset repo
+ delete_repo(token=TOKEN, repo_id=self.org_repo_id)
diff --git a/utils/check_config_docstrings.py b/utils/check_config_docstrings.py
new file mode 100644
index 0000000..626a9a4
--- /dev/null
+++ b/utils/check_config_docstrings.py
@@ -0,0 +1,84 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import importlib
+import inspect
+import os
+import re
+
+
+# All paths are set with the intent you should run this script from the root of the repo with the command
+# python utils/check_config_docstrings.py
+PATH_TO_TRANSFORMERS = "src/transformers"
+
+
+# This is to make sure the transformers module imported is the one in the repo.
+spec = importlib.util.spec_from_file_location(
+ "transformers",
+ os.path.join(PATH_TO_TRANSFORMERS, "__init__.py"),
+ submodule_search_locations=[PATH_TO_TRANSFORMERS],
+)
+transformers = spec.loader.load_module()
+
+CONFIG_MAPPING = transformers.models.auto.configuration_auto.CONFIG_MAPPING
+
+# Regex pattern used to find the checkpoint mentioned in the docstring of `config_class`.
+# For example, `[bert-base-uncased](https://huggingface.co/bert-base-uncased)`
+_re_checkpoint = re.compile(r"\[(.+?)\]\((https://huggingface\.co/.+?)\)")
+
+
+CONFIG_CLASSES_TO_IGNORE_FOR_DOCSTRING_CHECKPOINT_CHECK = {
+ "CLIPConfigMixin",
+ "DecisionTransformerConfigMixin",
+ "EncoderDecoderConfigMixin",
+ "RagConfigMixin",
+ "SpeechEncoderDecoderConfigMixin",
+ "VisionEncoderDecoderConfigMixin",
+ "VisionTextDualEncoderConfigMixin",
+}
+
+
+def check_config_docstrings_have_checkpoints():
+ configs_without_checkpoint = []
+
+ for config_class in list(CONFIG_MAPPING.values()):
+ checkpoint_found = False
+
+ # source code of `config_class`
+ config_source = inspect.getsource(config_class)
+ checkpoints = _re_checkpoint.findall(config_source)
+
+ for checkpoint in checkpoints:
+ # Each `checkpoint` is a tuple of a checkpoint name and a checkpoint link.
+ # For example, `('bert-base-uncased', 'https://huggingface.co/bert-base-uncased')`
+ ckpt_name, ckpt_link = checkpoint
+
+ # verify the checkpoint name corresponds to the checkpoint link
+ ckpt_link_from_name = f"https://huggingface.co/{ckpt_name}"
+ if ckpt_link == ckpt_link_from_name:
+ checkpoint_found = True
+ break
+
+ name = config_class.__name__
+ if not checkpoint_found and name not in CONFIG_CLASSES_TO_IGNORE_FOR_DOCSTRING_CHECKPOINT_CHECK:
+ configs_without_checkpoint.append(name)
+
+ if len(configs_without_checkpoint) > 0:
+ message = "\n".join(sorted(configs_without_checkpoint))
+ raise ValueError(f"The following configurations don't contain any valid checkpoint:\n{message}")
+
+
+if __name__ == "__main__":
+ check_config_docstrings_have_checkpoints()
diff --git a/utils/check_copies.py b/utils/check_copies.py
new file mode 100644
index 0000000..20449e7
--- /dev/null
+++ b/utils/check_copies.py
@@ -0,0 +1,222 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import glob
+import os
+import re
+import subprocess
+
+
+# All paths are set with the intent you should run this script from the root of the repo with the command
+# python utils/check_copies.py
+DIFFUSERS_PATH = "src/diffusers"
+REPO_PATH = "."
+
+
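+# A line is still considered part of the current block if it keeps the indentation, is (nearly)
+# empty, or closes a multi-line signature such as `) -> Foo:`.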
+def _should_continue(line, indent):
+ return line.startswith(indent) or len(line) <= 1 or re.search(r"^\s*\)(\s*->.*:|:)\s*$", line) is not None
+
+
+def find_code_in_diffusers(object_name):
+ """Find and return the code source code of `object_name`."""
+ parts = object_name.split(".")
+ i = 0
+
+ # First let's find the module where our object lives.
+ module = parts[i]
+ while i < len(parts) and not os.path.isfile(os.path.join(DIFFUSERS_PATH, f"{module}.py")):
+ i += 1
+ if i < len(parts):
+ module = os.path.join(module, parts[i])
+ if i >= len(parts):
+ raise ValueError(f"`object_name` should begin with the name of a module of diffusers but got {object_name}.")
+
+ with open(
+ os.path.join(DIFFUSERS_PATH, f"{module}.py"),
+ "r",
+ encoding="utf-8",
+ newline="\n",
+ ) as f:
+ lines = f.readlines()
+
+ # Now let's find the class / func in the code!
+ indent = ""
+ line_index = 0
+ for name in parts[i + 1 :]:
+ while (
+ line_index < len(lines) and re.search(rf"^{indent}(class|def)\s+{name}(\(|\:)", lines[line_index]) is None
+ ):
+ line_index += 1
+ indent += " "
+ line_index += 1
+
+ if line_index >= len(lines):
+ raise ValueError(f" {object_name} does not match any function or class in {module}.")
+
+ # We found the beginning of the class / func, now let's find the end (when the indent diminishes).
+ start_index = line_index
+ while line_index < len(lines) and _should_continue(lines[line_index], indent):
+ line_index += 1
+ # Clean up empty lines at the end (if any).
+ while len(lines[line_index - 1]) <= 1:
+ line_index -= 1
+
+ code_lines = lines[start_index:line_index]
+ return "".join(code_lines)
+
+
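+# `_re_copy_warning` matches `# Copied from diffusers.<module>.<object> [with X->Y ...]` markers,
+# capturing the indentation, the fully qualified object name, and the optional replacement pattern.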
+_re_copy_warning = re.compile(r"^(\s*)#\s*Copied from\s+diffusers\.(\S+\.\S+)\s*($|\S.*$)")
+_re_replace_pattern = re.compile(r"^\s*(\S+)->(\S+)(\s+.*|$)")
+_re_fill_pattern = re.compile(r"<FILL\s+[^>]*>")
+
+
+def get_indent(code):
+ lines = code.split("\n")
+ idx = 0
+ while idx < len(lines) and len(lines[idx]) == 0:
+ idx += 1
+ if idx < len(lines):
+ return re.search(r"^(\s*)\S", lines[idx]).groups()[0]
+ return ""
+
+
+def run_ruff(code):
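+    # Pipe the code through `ruff format` via stdin and return the formatted output.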
+ command = ["ruff", "format", "-", "--config", "pyproject.toml", "--silent"]
+ process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE)
+ stdout, _ = process.communicate(input=code.encode())
+ return stdout.decode()
+
+
+def stylify(code: str) -> str:
+ """
+ Applies the ruff part of our `make style` command to some code. This formats the code using `ruff format`.
+ As `ruff` does not provide a python api this cannot be done on the fly.
+
+ Args:
+ code (`str`): The code to format.
+
+ Returns:
+ `str`: The formatted code.
+ """
+ has_indent = len(get_indent(code)) > 0
+ if has_indent:
+ code = f"class Bla:\n{code}"
+ formatted_code = run_ruff(code)
+ return formatted_code[len("class Bla:\n") :] if has_indent else formatted_code
+
+
+def is_copy_consistent(filename, overwrite=False):
+ """
+ Check if the code commented as a copy in `filename` matches the original.
+ Return the differences or overwrites the content depending on `overwrite`.
+ """
+ with open(filename, "r", encoding="utf-8", newline="\n") as f:
+ lines = f.readlines()
+ diffs = []
+ line_index = 0
+    # Not a for loop because `lines` is going to change (if `overwrite=True`).
+ while line_index < len(lines):
+ search = _re_copy_warning.search(lines[line_index])
+ if search is None:
+ line_index += 1
+ continue
+
+ # There is some copied code here, let's retrieve the original.
+ indent, object_name, replace_pattern = search.groups()
+ theoretical_code = find_code_in_diffusers(object_name)
+ theoretical_indent = get_indent(theoretical_code)
+
+ start_index = line_index + 1 if indent == theoretical_indent else line_index + 2
+ indent = theoretical_indent
+ line_index = start_index
+
+        # Loop to check the observed code; stop when the indentation diminishes or we see an `# End copy` comment.
+ should_continue = True
+ while line_index < len(lines) and should_continue:
+ line_index += 1
+ if line_index >= len(lines):
+ break
+ line = lines[line_index]
+ should_continue = _should_continue(line, indent) and re.search(f"^{indent}# End copy", line) is None
+ # Clean up empty lines at the end (if any).
+ while len(lines[line_index - 1]) <= 1:
+ line_index -= 1
+
+ observed_code_lines = lines[start_index:line_index]
+ observed_code = "".join(observed_code_lines)
+
+ # Remove any nested `Copied from` comments to avoid circular copies
+ theoretical_code = [line for line in theoretical_code.split("\n") if _re_copy_warning.search(line) is None]
+ theoretical_code = "\n".join(theoretical_code)
+
+ # Before comparing, use the `replace_pattern` on the original code.
+ if len(replace_pattern) > 0:
+ patterns = replace_pattern.replace("with", "").split(",")
+ patterns = [_re_replace_pattern.search(p) for p in patterns]
+ for pattern in patterns:
+ if pattern is None:
+ continue
+ obj1, obj2, option = pattern.groups()
+ theoretical_code = re.sub(obj1, obj2, theoretical_code)
+ if option.strip() == "all-casing":
+ theoretical_code = re.sub(obj1.lower(), obj2.lower(), theoretical_code)
+ theoretical_code = re.sub(obj1.upper(), obj2.upper(), theoretical_code)
+
+            # Stylify after replacement. To be able to do that, we need the header (class or function
+            # definition) from the previous line.
+ theoretical_code = stylify(lines[start_index - 1] + theoretical_code)
+ theoretical_code = theoretical_code[len(lines[start_index - 1]) :]
+
+ # Test for a diff and act accordingly.
+ if observed_code != theoretical_code:
+ diffs.append([object_name, start_index])
+ if overwrite:
+ lines = lines[:start_index] + [theoretical_code] + lines[line_index:]
+ line_index = start_index + 1
+
+ if overwrite and len(diffs) > 0:
+ # Warn the user a file has been modified.
+ print(f"Detected changes, rewriting {filename}.")
+ with open(filename, "w", encoding="utf-8", newline="\n") as f:
+ f.writelines(lines)
+ return diffs
+
+
+def check_copies(overwrite: bool = False):
+ all_files = glob.glob(os.path.join(DIFFUSERS_PATH, "**/*.py"), recursive=True)
+ diffs = []
+ for filename in all_files:
+ new_diffs = is_copy_consistent(filename, overwrite)
+ diffs += [f"- {filename}: copy does not match {d[0]} at line {d[1]}" for d in new_diffs]
+ if not overwrite and len(diffs) > 0:
+ diff = "\n".join(diffs)
+ raise Exception(
+ "Found the following copy inconsistencies:\n"
+ + diff
+ + "\nRun `make fix-copies` or `python utils/check_copies.py --fix_and_overwrite` to fix them."
+ )
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument(
+ "--fix_and_overwrite",
+ action="store_true",
+ help="Whether to fix inconsistencies.",
+ )
+ args = parser.parse_args()
+
+ check_copies(args.fix_and_overwrite)
diff --git a/utils/check_doc_toc.py b/utils/check_doc_toc.py
new file mode 100644
index 0000000..35ded93
--- /dev/null
+++ b/utils/check_doc_toc.py
@@ -0,0 +1,158 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+from collections import defaultdict
+
+import yaml
+
+
+PATH_TO_TOC = "docs/source/en/_toctree.yml"
+
+
+def clean_doc_toc(doc_list):
+ """
+    Cleans the table of contents of the model documentation by removing duplicates and sorting entries alphabetically.
+ """
+ counts = defaultdict(int)
+ overview_doc = []
+ new_doc_list = []
+ for doc in doc_list:
+ if "local" in doc:
+ counts[doc["local"]] += 1
+
+ if doc["title"].lower() == "overview":
+ overview_doc.append({"local": doc["local"], "title": doc["title"]})
+ else:
+ new_doc_list.append(doc)
+
+ doc_list = new_doc_list
+ duplicates = [key for key, value in counts.items() if value > 1]
+
+ new_doc = []
+ for duplicate_key in duplicates:
+ titles = list({doc["title"] for doc in doc_list if doc["local"] == duplicate_key})
+ if len(titles) > 1:
+ raise ValueError(
+ f"{duplicate_key} is present several times in the documentation table of content at "
+ "`docs/source/en/_toctree.yml` with different *Title* values. Choose one of those and remove the "
+ "others."
+ )
+ # Only add this once
+ new_doc.append({"local": duplicate_key, "title": titles[0]})
+
+    # Add non-duplicate keys
+ new_doc.extend([doc for doc in doc_list if "local" not in counts or counts[doc["local"]] == 1])
+ new_doc = sorted(new_doc, key=lambda s: s["title"].lower())
+
+ # "overview" gets special treatment and is always first
+ if len(overview_doc) > 1:
+ raise ValueError("{doc_list} has two 'overview' docs which is not allowed.")
+
+ overview_doc.extend(new_doc)
+
+ # Sort
+ return overview_doc
+
+
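+# Both checks below walk the toctree the same way: find the top-level "API" section, then the
+# "Schedulers" / "Pipelines" subsection, and verify that its entries are cleaned and sorted.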
+def check_scheduler_doc(overwrite=False):
+ with open(PATH_TO_TOC, encoding="utf-8") as f:
+ content = yaml.safe_load(f.read())
+
+ # Get to the API doc
+ api_idx = 0
+ while content[api_idx]["title"] != "API":
+ api_idx += 1
+ api_doc = content[api_idx]["sections"]
+
+    # Then to the scheduler doc
+ scheduler_idx = 0
+ while api_doc[scheduler_idx]["title"] != "Schedulers":
+ scheduler_idx += 1
+
+ scheduler_doc = api_doc[scheduler_idx]["sections"]
+ new_scheduler_doc = clean_doc_toc(scheduler_doc)
+
+ diff = False
+ if new_scheduler_doc != scheduler_doc:
+ diff = True
+ if overwrite:
+ api_doc[scheduler_idx]["sections"] = new_scheduler_doc
+
+ if diff:
+ if overwrite:
+ content[api_idx]["sections"] = api_doc
+ with open(PATH_TO_TOC, "w", encoding="utf-8") as f:
+ f.write(yaml.dump(content, allow_unicode=True))
+ else:
+ raise ValueError(
+ "The model doc part of the table of content is not properly sorted, run `make style` to fix this."
+ )
+
+
+def check_pipeline_doc(overwrite=False):
+ with open(PATH_TO_TOC, encoding="utf-8") as f:
+ content = yaml.safe_load(f.read())
+
+ # Get to the API doc
+ api_idx = 0
+ while content[api_idx]["title"] != "API":
+ api_idx += 1
+ api_doc = content[api_idx]["sections"]
+
+    # Then to the pipeline doc
+ pipeline_idx = 0
+ while api_doc[pipeline_idx]["title"] != "Pipelines":
+ pipeline_idx += 1
+
+ diff = False
+ pipeline_docs = api_doc[pipeline_idx]["sections"]
+ new_pipeline_docs = []
+
+ # sort sub pipeline docs
+ for pipeline_doc in pipeline_docs:
+ if "section" in pipeline_doc:
+ sub_pipeline_doc = pipeline_doc["section"]
+ new_sub_pipeline_doc = clean_doc_toc(sub_pipeline_doc)
+ if overwrite:
+ pipeline_doc["section"] = new_sub_pipeline_doc
+ new_pipeline_docs.append(pipeline_doc)
+
+ # sort overall pipeline doc
+ new_pipeline_docs = clean_doc_toc(new_pipeline_docs)
+
+ if new_pipeline_docs != pipeline_docs:
+ diff = True
+ if overwrite:
+ api_doc[pipeline_idx]["sections"] = new_pipeline_docs
+
+ if diff:
+ if overwrite:
+ content[api_idx]["sections"] = api_doc
+ with open(PATH_TO_TOC, "w", encoding="utf-8") as f:
+ f.write(yaml.dump(content, allow_unicode=True))
+ else:
+ raise ValueError(
+ "The model doc part of the table of content is not properly sorted, run `make style` to fix this."
+ )
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--fix_and_overwrite", action="store_true", help="Whether to fix inconsistencies.")
+ args = parser.parse_args()
+
+ check_scheduler_doc(args.fix_and_overwrite)
+ check_pipeline_doc(args.fix_and_overwrite)
diff --git a/utils/check_dummies.py b/utils/check_dummies.py
new file mode 100644
index 0000000..af99eeb
--- /dev/null
+++ b/utils/check_dummies.py
@@ -0,0 +1,175 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import os
+import re
+
+
+# All paths are set with the intent you should run this script from the root of the repo with the command
+# python utils/check_dummies.py
+PATH_TO_DIFFUSERS = "src/diffusers"
+
+# Matches is_xxx_available()
+_re_backend = re.compile(r"is\_([a-z_]*)_available\(\)")
+# Matches from xxx import bla
+_re_single_line_import = re.compile(r"\s+from\s+\S*\s+import\s+([^\(\s].*)\n")
+
+
+DUMMY_CONSTANT = """
+{0} = None
+"""
+
+DUMMY_CLASS = """
+class {0}(metaclass=DummyObject):
+ _backends = {1}
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, {1})
+
+ @classmethod
+ def from_config(cls, *args, **kwargs):
+ requires_backends(cls, {1})
+
+ @classmethod
+ def from_pretrained(cls, *args, **kwargs):
+ requires_backends(cls, {1})
+"""
+
+
+DUMMY_FUNCTION = """
+def {0}(*args, **kwargs):
+ requires_backends({0}, {1})
+"""
+
+
+def find_backend(line):
+ """Find one (or multiple) backend in a code line of the init."""
+ backends = _re_backend.findall(line)
+ if len(backends) == 0:
+ return None
+
+ return "_and_".join(backends)
+
+
+def read_init():
+ """Read the init and extracts PyTorch, TensorFlow, SentencePiece and Tokenizers objects."""
+ with open(os.path.join(PATH_TO_DIFFUSERS, "__init__.py"), "r", encoding="utf-8", newline="\n") as f:
+ lines = f.readlines()
+
+    # Get to the point where we do the actual imports for type checking
+ line_index = 0
+ while not lines[line_index].startswith("if TYPE_CHECKING"):
+ line_index += 1
+
+ backend_specific_objects = {}
+ # Go through the end of the file
+ while line_index < len(lines):
+ # If the line contains is_backend_available, we grab all objects associated with the `else` block
+ backend = find_backend(lines[line_index])
+ if backend is not None:
+ while not lines[line_index].startswith(" else:"):
+ line_index += 1
+ line_index += 1
+ objects = []
+ # Until we unindent, add backend objects to the list
+ while len(lines[line_index]) <= 1 or lines[line_index].startswith(" " * 8):
+ line = lines[line_index]
+ single_line_import_search = _re_single_line_import.search(line)
+ if single_line_import_search is not None:
+ objects.extend(single_line_import_search.groups()[0].split(", "))
+ elif line.startswith(" " * 12):
+ objects.append(line[12:-2])
+ line_index += 1
+
+ if len(objects) > 0:
+ backend_specific_objects[backend] = objects
+ else:
+ line_index += 1
+
+ return backend_specific_objects
+
+
+def create_dummy_object(name, backend_name):
+ """Create the code for the dummy object corresponding to `name`."""
+ if name.isupper():
+ return DUMMY_CONSTANT.format(name)
+ elif name.islower():
+ return DUMMY_FUNCTION.format(name, backend_name)
+ else:
+ return DUMMY_CLASS.format(name, backend_name)
+
+
+def create_dummy_files(backend_specific_objects=None):
+ """Create the content of the dummy files."""
+ if backend_specific_objects is None:
+ backend_specific_objects = read_init()
+    # The backend list built below is what the generated dummies pass to `requires_backends`.
+ dummy_files = {}
+
+ for backend, objects in backend_specific_objects.items():
+ backend_name = "[" + ", ".join(f'"{b}"' for b in backend.split("_and_")) + "]"
+ dummy_file = "# This file is autogenerated by the command `make fix-copies`, do not edit.\n"
+ dummy_file += "from ..utils import DummyObject, requires_backends\n\n"
+ dummy_file += "\n".join([create_dummy_object(o, backend_name) for o in objects])
+ dummy_files[backend] = dummy_file
+
+ return dummy_files
+
+
+def check_dummies(overwrite=False):
+ """Check if the dummy files are up to date and maybe `overwrite` with the right content."""
+ dummy_files = create_dummy_files()
+    # Special correspondence from backend name to the shortcut used in utils/dummy_xxx_objects.py filenames
+ short_names = {"torch": "pt"}
+
+ # Locate actual dummy modules and read their content.
+ path = os.path.join(PATH_TO_DIFFUSERS, "utils")
+ dummy_file_paths = {
+ backend: os.path.join(path, f"dummy_{short_names.get(backend, backend)}_objects.py")
+ for backend in dummy_files.keys()
+ }
+
+ actual_dummies = {}
+ for backend, file_path in dummy_file_paths.items():
+ if os.path.isfile(file_path):
+ with open(file_path, "r", encoding="utf-8", newline="\n") as f:
+ actual_dummies[backend] = f.read()
+ else:
+ actual_dummies[backend] = ""
+
+ for backend in dummy_files.keys():
+ if dummy_files[backend] != actual_dummies[backend]:
+ if overwrite:
+ print(
+ f"Updating diffusers.utils.dummy_{short_names.get(backend, backend)}_objects.py as the main "
+ "__init__ has new objects."
+ )
+ with open(dummy_file_paths[backend], "w", encoding="utf-8", newline="\n") as f:
+ f.write(dummy_files[backend])
+ else:
+ raise ValueError(
+ "The main __init__ has objects that are not present in "
+ f"diffusers.utils.dummy_{short_names.get(backend, backend)}_objects.py. Run `make fix-copies` "
+ "to fix this."
+ )
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--fix_and_overwrite", action="store_true", help="Whether to fix inconsistencies.")
+ args = parser.parse_args()
+
+ check_dummies(args.fix_and_overwrite)
diff --git a/utils/check_inits.py b/utils/check_inits.py
new file mode 100644
index 0000000..2c51404
--- /dev/null
+++ b/utils/check_inits.py
@@ -0,0 +1,299 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import collections
+import importlib.util
+import os
+import re
+from pathlib import Path
+
+
+PATH_TO_TRANSFORMERS = "src/transformers"
+
+
+# Matches is_xxx_available()
+_re_backend = re.compile(r"is\_([a-z_]*)_available()")
+# Catches a one-line _import_struct = {xxx}
+_re_one_line_import_struct = re.compile(r"^_import_structure\s+=\s+\{([^\}]+)\}")
+# Catches a line with a key-values pattern: "bla": ["foo", "bar"]
+_re_import_struct_key_value = re.compile(r'\s+"\S*":\s+\[([^\]]*)\]')
+# Catches a line if not is_foo_available
+_re_test_backend = re.compile(r"^\s*if\s+not\s+is\_[a-z_]*\_available\(\)")
+# Catches a line _import_struct["bla"].append("foo")
+_re_import_struct_add_one = re.compile(r'^\s*_import_structure\["\S*"\]\.append\("(\S*)"\)')
+# Catches a line _import_struct["bla"].extend(["foo", "bar"]) or _import_struct["bla"] = ["foo", "bar"]
+_re_import_struct_add_many = re.compile(r"^\s*_import_structure\[\S*\](?:\.extend\(|\s*=\s+)\[([^\]]*)\]")
+# Catches a line with an object between quotes and a comma: "MyModel",
+_re_quote_object = re.compile(r'^\s+"([^"]+)",')
+# Catches a line with objects between brackets only: ["foo", "bar"],
+_re_between_brackets = re.compile(r"^\s+\[([^\]]+)\]")
+# Catches a line with from foo import bar, bla, boo
+_re_import = re.compile(r"\s+from\s+\S*\s+import\s+([^\(\s].*)\n")
+# Catches a line with try:
+_re_try = re.compile(r"^\s*try:")
+# Catches a line with else:
+_re_else = re.compile(r"^\s*else:")
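+# Examples of lines the `_import_structure` regexes above catch (illustrative):
+#   _import_structure["models.unet_2d"].append("UNet2DModel")
+#   _import_structure["schedulers"].extend(["DDPMScheduler", "DDIMScheduler"])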
+
+
+def find_backend(line):
+ """Find one (or multiple) backend in a code line of the init."""
+ if _re_test_backend.search(line) is None:
+ return None
+ backends = [b[0] for b in _re_backend.findall(line)]
+ backends.sort()
+ return "_and_".join(backends)
+
+
+def parse_init(init_file):
+ """
+ Read an init_file and parse (per backend) the _import_structure objects defined and the TYPE_CHECKING objects
+ defined
+ """
+ with open(init_file, "r", encoding="utf-8", newline="\n") as f:
+ lines = f.readlines()
+
+ line_index = 0
+ while line_index < len(lines) and not lines[line_index].startswith("_import_structure = {"):
+ line_index += 1
+
+ # If this is a traditional init, just return.
+ if line_index >= len(lines):
+ return None
+
+ # First grab the objects without a specific backend in _import_structure
+ objects = []
+ while not lines[line_index].startswith("if TYPE_CHECKING") and find_backend(lines[line_index]) is None:
+ line = lines[line_index]
+ # If we have everything on a single line, let's deal with it.
+ if _re_one_line_import_struct.search(line):
+ content = _re_one_line_import_struct.search(line).groups()[0]
+ imports = re.findall(r"\[([^\]]+)\]", content)
+ for imp in imports:
+ objects.extend([obj[1:-1] for obj in imp.split(", ")])
+ line_index += 1
+ continue
+ single_line_import_search = _re_import_struct_key_value.search(line)
+ if single_line_import_search is not None:
+ imports = [obj[1:-1] for obj in single_line_import_search.groups()[0].split(", ") if len(obj) > 0]
+ objects.extend(imports)
+ elif line.startswith(" " * 8 + '"'):
+ objects.append(line[9:-3])
+ line_index += 1
+
+ import_dict_objects = {"none": objects}
+ # Let's continue with backend-specific objects in _import_structure
+ while not lines[line_index].startswith("if TYPE_CHECKING"):
+ # If the line is an if not is_backend_available, we grab all objects associated.
+ backend = find_backend(lines[line_index])
+ # Check if the backend declaration is inside a try block:
+ if _re_try.search(lines[line_index - 1]) is None:
+ backend = None
+
+ if backend is not None:
+ line_index += 1
+
+ # Scroll until we hit the else block of try-except-else
+ while _re_else.search(lines[line_index]) is None:
+ line_index += 1
+
+ line_index += 1
+
+ objects = []
+ # Until we unindent, add backend objects to the list
+ while len(lines[line_index]) <= 1 or lines[line_index].startswith(" " * 4):
+ line = lines[line_index]
+ if _re_import_struct_add_one.search(line) is not None:
+ objects.append(_re_import_struct_add_one.search(line).groups()[0])
+ elif _re_import_struct_add_many.search(line) is not None:
+ imports = _re_import_struct_add_many.search(line).groups()[0].split(", ")
+ imports = [obj[1:-1] for obj in imports if len(obj) > 0]
+ objects.extend(imports)
+ elif _re_between_brackets.search(line) is not None:
+ imports = _re_between_brackets.search(line).groups()[0].split(", ")
+ imports = [obj[1:-1] for obj in imports if len(obj) > 0]
+ objects.extend(imports)
+ elif _re_quote_object.search(line) is not None:
+ objects.append(_re_quote_object.search(line).groups()[0])
+ elif line.startswith(" " * 8 + '"'):
+ objects.append(line[9:-3])
+ elif line.startswith(" " * 12 + '"'):
+ objects.append(line[13:-3])
+ line_index += 1
+
+ import_dict_objects[backend] = objects
+ else:
+ line_index += 1
+
+ # At this stage we are in the TYPE_CHECKING part, first grab the objects without a specific backend
+ objects = []
+ while (
+ line_index < len(lines)
+ and find_backend(lines[line_index]) is None
+ and not lines[line_index].startswith("else")
+ ):
+ line = lines[line_index]
+ single_line_import_search = _re_import.search(line)
+ if single_line_import_search is not None:
+ objects.extend(single_line_import_search.groups()[0].split(", "))
+ elif line.startswith(" " * 8):
+ objects.append(line[8:-2])
+ line_index += 1
+
+ type_hint_objects = {"none": objects}
+ # Let's continue with backend-specific objects
+ while line_index < len(lines):
+ # If the line is an if is_backend_available, we grab all objects associated.
+ backend = find_backend(lines[line_index])
+ # Check if the backend declaration is inside a try block:
+ if _re_try.search(lines[line_index - 1]) is None:
+ backend = None
+
+ if backend is not None:
+ line_index += 1
+
+ # Scroll until we hit the else block of try-except-else
+ while _re_else.search(lines[line_index]) is None:
+ line_index += 1
+
+ line_index += 1
+
+ objects = []
+ # Until we unindent, add backend objects to the list
+ while len(lines[line_index]) <= 1 or lines[line_index].startswith(" " * 8):
+ line = lines[line_index]
+ single_line_import_search = _re_import.search(line)
+ if single_line_import_search is not None:
+ objects.extend(single_line_import_search.groups()[0].split(", "))
+ elif line.startswith(" " * 12):
+ objects.append(line[12:-2])
+ line_index += 1
+
+ type_hint_objects[backend] = objects
+ else:
+ line_index += 1
+
+ return import_dict_objects, type_hint_objects
+
+
+def analyze_results(import_dict_objects, type_hint_objects):
+ """
+ Analyze the differences between _import_structure objects and TYPE_CHECKING objects found in an init.
+ """
+
+ def find_duplicates(seq):
+ return [k for k, v in collections.Counter(seq).items() if v > 1]
+
+ if list(import_dict_objects.keys()) != list(type_hint_objects.keys()):
+ return ["Both sides of the init do not have the same backends!"]
+
+ errors = []
+ for key in import_dict_objects.keys():
+ duplicate_imports = find_duplicates(import_dict_objects[key])
+ if duplicate_imports:
+ errors.append(f"Duplicate _import_structure definitions for: {duplicate_imports}")
+ duplicate_type_hints = find_duplicates(type_hint_objects[key])
+ if duplicate_type_hints:
+ errors.append(f"Duplicate TYPE_CHECKING objects for: {duplicate_type_hints}")
+
+ if sorted(set(import_dict_objects[key])) != sorted(set(type_hint_objects[key])):
+ name = "base imports" if key == "none" else f"{key} backend"
+ errors.append(f"Differences for {name}:")
+ for a in type_hint_objects[key]:
+ if a not in import_dict_objects[key]:
+ errors.append(f" {a} in TYPE_HINT but not in _import_structure.")
+ for a in import_dict_objects[key]:
+ if a not in type_hint_objects[key]:
+ errors.append(f" {a} in _import_structure but not in TYPE_HINT.")
+ return errors
+
+
+def check_all_inits():
+ """
+ Check all inits in the transformers repo and raise an error if at least one does not define the same objects in
+ both halves.
+ """
+ failures = []
+ for root, _, files in os.walk(PATH_TO_TRANSFORMERS):
+ if "__init__.py" in files:
+ fname = os.path.join(root, "__init__.py")
+ objects = parse_init(fname)
+ if objects is not None:
+ errors = analyze_results(*objects)
+ if len(errors) > 0:
+ errors[0] = f"Problem in {fname}, both halves do not define the same objects.\n{errors[0]}"
+ failures.append("\n".join(errors))
+ if len(failures) > 0:
+ raise ValueError("\n\n".join(failures))
+
+
+def get_transformers_submodules():
+ """
+ Returns the list of Transformers submodules.
+ """
+ submodules = []
+ for path, directories, files in os.walk(PATH_TO_TRANSFORMERS):
+ for folder in directories:
+ # Ignore private modules
+ if folder.startswith("_"):
+ directories.remove(folder)
+ continue
+ # Ignore leftovers from branches (empty folders apart from pycache)
+ if len(list((Path(path) / folder).glob("*.py"))) == 0:
+ continue
+ short_path = str((Path(path) / folder).relative_to(PATH_TO_TRANSFORMERS))
+ submodule = short_path.replace(os.path.sep, ".")
+ submodules.append(submodule)
+ for fname in files:
+ if fname == "__init__.py":
+ continue
+ short_path = str((Path(path) / fname).relative_to(PATH_TO_TRANSFORMERS))
+ submodule = short_path.replace(".py", "").replace(os.path.sep, ".")
+ if len(submodule.split(".")) == 1:
+ submodules.append(submodule)
+ return submodules
+
+
+IGNORE_SUBMODULES = [
+ "convert_pytorch_checkpoint_to_tf2",
+ "modeling_flax_pytorch_utils",
+]
+
+
+def check_submodules():
+ # This is to make sure the transformers module imported is the one in the repo.
+ spec = importlib.util.spec_from_file_location(
+ "transformers",
+ os.path.join(PATH_TO_TRANSFORMERS, "__init__.py"),
+ submodule_search_locations=[PATH_TO_TRANSFORMERS],
+ )
+ transformers = spec.loader.load_module()
+
+ module_not_registered = [
+ module
+ for module in get_transformers_submodules()
+ if module not in IGNORE_SUBMODULES and module not in transformers._import_structure.keys()
+ ]
+ if len(module_not_registered) > 0:
+ list_of_modules = "\n".join(f"- {module}" for module in module_not_registered)
+ raise ValueError(
+ "The following submodules are not properly registered in the main init of Transformers:\n"
+ f"{list_of_modules}\n"
+ "Make sure they appear somewhere in the keys of `_import_structure` with an empty list as value."
+ )
+
+
+if __name__ == "__main__":
+ check_all_inits()
+ check_submodules()
diff --git a/utils/check_repo.py b/utils/check_repo.py
new file mode 100644
index 0000000..597893f
--- /dev/null
+++ b/utils/check_repo.py
@@ -0,0 +1,755 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import importlib
+import inspect
+import os
+import re
+import warnings
+from collections import OrderedDict
+from difflib import get_close_matches
+from pathlib import Path
+
+from diffusers.models.auto import get_values
+from diffusers.utils import ENV_VARS_TRUE_VALUES, is_flax_available, is_torch_available
+
+
+# All paths are set with the intent you should run this script from the root of the repo with the command
+# python utils/check_repo.py
+PATH_TO_DIFFUSERS = "src/diffusers"
+PATH_TO_TESTS = "tests"
+PATH_TO_DOC = "docs/source/en"
+
+# Update this list with models that are supposed to be private.
+PRIVATE_MODELS = [
+ "DPRSpanPredictor",
+ "RealmBertModel",
+ "T5Stack",
+ "TFDPRSpanPredictor",
+]
+
+# Update this list for models that are not tested, with a comment explaining the reason they should not be.
+# Being in this list is an exception and should **not** be the rule.
+IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
+ # models to ignore for not tested
+ "OPTDecoder", # Building part of bigger (tested) model.
+ "DecisionTransformerGPT2Model", # Building part of bigger (tested) model.
+ "SegformerDecodeHead", # Building part of bigger (tested) model.
+ "PLBartEncoder", # Building part of bigger (tested) model.
+ "PLBartDecoder", # Building part of bigger (tested) model.
+ "PLBartDecoderWrapper", # Building part of bigger (tested) model.
+ "BigBirdPegasusEncoder", # Building part of bigger (tested) model.
+ "BigBirdPegasusDecoder", # Building part of bigger (tested) model.
+ "BigBirdPegasusDecoderWrapper", # Building part of bigger (tested) model.
+ "DetrEncoder", # Building part of bigger (tested) model.
+ "DetrDecoder", # Building part of bigger (tested) model.
+ "DetrDecoderWrapper", # Building part of bigger (tested) model.
+ "M2M100Encoder", # Building part of bigger (tested) model.
+ "M2M100Decoder", # Building part of bigger (tested) model.
+ "Speech2TextEncoder", # Building part of bigger (tested) model.
+ "Speech2TextDecoder", # Building part of bigger (tested) model.
+ "LEDEncoder", # Building part of bigger (tested) model.
+ "LEDDecoder", # Building part of bigger (tested) model.
+ "BartDecoderWrapper", # Building part of bigger (tested) model.
+ "BartEncoder", # Building part of bigger (tested) model.
+ "BertLMHeadModel", # Needs to be setup as decoder.
+ "BlenderbotSmallEncoder", # Building part of bigger (tested) model.
+ "BlenderbotSmallDecoderWrapper", # Building part of bigger (tested) model.
+ "BlenderbotEncoder", # Building part of bigger (tested) model.
+ "BlenderbotDecoderWrapper", # Building part of bigger (tested) model.
+ "MBartEncoder", # Building part of bigger (tested) model.
+ "MBartDecoderWrapper", # Building part of bigger (tested) model.
+ "MegatronBertLMHeadModel", # Building part of bigger (tested) model.
+ "MegatronBertEncoder", # Building part of bigger (tested) model.
+ "MegatronBertDecoder", # Building part of bigger (tested) model.
+ "MegatronBertDecoderWrapper", # Building part of bigger (tested) model.
+ "PegasusEncoder", # Building part of bigger (tested) model.
+ "PegasusDecoderWrapper", # Building part of bigger (tested) model.
+ "DPREncoder", # Building part of bigger (tested) model.
+ "ProphetNetDecoderWrapper", # Building part of bigger (tested) model.
+ "RealmBertModel", # Building part of bigger (tested) model.
+ "RealmReader", # Not regular model.
+ "RealmScorer", # Not regular model.
+ "RealmForOpenQA", # Not regular model.
+ "ReformerForMaskedLM", # Needs to be setup as decoder.
+ "Speech2Text2DecoderWrapper", # Building part of bigger (tested) model.
+ "TFDPREncoder", # Building part of bigger (tested) model.
+ "TFElectraMainLayer", # Building part of bigger (tested) model (should it be a TFModelMixin ?)
+ "TFRobertaForMultipleChoice", # TODO: fix
+ "TrOCRDecoderWrapper", # Building part of bigger (tested) model.
+ "SeparableConv1D", # Building part of bigger (tested) model.
+ "FlaxBartForCausalLM", # Building part of bigger (tested) model.
+ "FlaxBertForCausalLM", # Building part of bigger (tested) model. Tested implicitly through FlaxRobertaForCausalLM.
+ "OPTDecoderWrapper",
+]
+
+# Update this list with test files that don't have a tester with an `all_model_classes` variable and which don't
+# trigger the common tests.
+TEST_FILES_WITH_NO_COMMON_TESTS = [
+ "models/decision_transformer/test_modeling_decision_transformer.py",
+ "models/camembert/test_modeling_camembert.py",
+ "models/mt5/test_modeling_flax_mt5.py",
+ "models/mbart/test_modeling_mbart.py",
+ "models/mt5/test_modeling_mt5.py",
+ "models/pegasus/test_modeling_pegasus.py",
+ "models/camembert/test_modeling_tf_camembert.py",
+ "models/mt5/test_modeling_tf_mt5.py",
+ "models/xlm_roberta/test_modeling_tf_xlm_roberta.py",
+ "models/xlm_roberta/test_modeling_flax_xlm_roberta.py",
+ "models/xlm_prophetnet/test_modeling_xlm_prophetnet.py",
+ "models/xlm_roberta/test_modeling_xlm_roberta.py",
+ "models/vision_text_dual_encoder/test_modeling_vision_text_dual_encoder.py",
+ "models/vision_text_dual_encoder/test_modeling_flax_vision_text_dual_encoder.py",
+ "models/decision_transformer/test_modeling_decision_transformer.py",
+]
+
+# Update this list for models that are not in any of the auto MODEL_XXX_MAPPING. Being in this list is an exception and
+# should **not** be the rule.
+IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
+ # models to ignore for model xxx mapping
+ "DPTForDepthEstimation",
+ "DecisionTransformerGPT2Model",
+ "GLPNForDepthEstimation",
+ "ViltForQuestionAnswering",
+ "ViltForImagesAndTextClassification",
+ "ViltForImageAndTextRetrieval",
+ "ViltForMaskedLM",
+ "XGLMEncoder",
+ "XGLMDecoder",
+ "XGLMDecoderWrapper",
+ "PerceiverForMultimodalAutoencoding",
+ "PerceiverForOpticalFlow",
+ "SegformerDecodeHead",
+ "FlaxBeitForMaskedImageModeling",
+ "PLBartEncoder",
+ "PLBartDecoder",
+ "PLBartDecoderWrapper",
+ "BeitForMaskedImageModeling",
+ "CLIPTextModel",
+ "CLIPVisionModel",
+ "TFCLIPTextModel",
+ "TFCLIPVisionModel",
+ "FlaxCLIPTextModel",
+ "FlaxCLIPVisionModel",
+ "FlaxWav2Vec2ForCTC",
+ "DetrForSegmentation",
+ "DPRReader",
+ "FlaubertForQuestionAnswering",
+ "FlavaImageCodebook",
+ "FlavaTextModel",
+ "FlavaImageModel",
+ "FlavaMultimodalModel",
+ "GPT2DoubleHeadsModel",
+ "LukeForMaskedLM",
+ "LukeForEntityClassification",
+ "LukeForEntityPairClassification",
+ "LukeForEntitySpanClassification",
+ "OpenAIGPTDoubleHeadsModel",
+ "RagModel",
+ "RagSequenceForGeneration",
+ "RagTokenForGeneration",
+ "RealmEmbedder",
+ "RealmForOpenQA",
+ "RealmScorer",
+ "RealmReader",
+ "TFDPRReader",
+ "TFGPT2DoubleHeadsModel",
+ "TFOpenAIGPTDoubleHeadsModel",
+ "TFRagModel",
+ "TFRagSequenceForGeneration",
+ "TFRagTokenForGeneration",
+ "Wav2Vec2ForCTC",
+ "HubertForCTC",
+ "SEWForCTC",
+ "SEWDForCTC",
+ "XLMForQuestionAnswering",
+ "XLNetForQuestionAnswering",
+ "SeparableConv1D",
+ "VisualBertForRegionToPhraseAlignment",
+ "VisualBertForVisualReasoning",
+ "VisualBertForQuestionAnswering",
+ "VisualBertForMultipleChoice",
+ "TFWav2Vec2ForCTC",
+ "TFHubertForCTC",
+ "MaskFormerForInstanceSegmentation",
+]
+
+# Update this list for models that have multiple model types for the same
+# model doc
+MODEL_TYPE_TO_DOC_MAPPING = OrderedDict(
+ [
+ ("data2vec-text", "data2vec"),
+ ("data2vec-audio", "data2vec"),
+ ("data2vec-vision", "data2vec"),
+ ]
+)
+
+
+# This is to make sure the diffusers module imported is the one in the repo.
+spec = importlib.util.spec_from_file_location(
+ "diffusers",
+ os.path.join(PATH_TO_DIFFUSERS, "__init__.py"),
+ submodule_search_locations=[PATH_TO_DIFFUSERS],
+)
+diffusers = spec.loader.load_module()
+
+
+def check_model_list():
+ """Check the model list inside the transformers library."""
+ # Get the models from the directory structure of `src/diffusers/models/`
+ models_dir = os.path.join(PATH_TO_DIFFUSERS, "models")
+ _models = []
+ for model in os.listdir(models_dir):
+ model_dir = os.path.join(models_dir, model)
+ if os.path.isdir(model_dir) and "__init__.py" in os.listdir(model_dir):
+ _models.append(model)
+
+    # Get the models exposed as attributes of `diffusers.models`
+ models = [model for model in dir(diffusers.models) if not model.startswith("__")]
+
+ missing_models = sorted(set(_models).difference(models))
+ if missing_models:
+ raise Exception(
+ f"The following models should be included in {models_dir}/__init__.py: {','.join(missing_models)}."
+ )
+
+
+# If some modeling modules should be ignored for all checks, they should be added in the nested list
+# _ignore_modules of this function.
+def get_model_modules():
+ """Get the model modules inside the transformers library."""
+ _ignore_modules = [
+ "modeling_auto",
+ "modeling_encoder_decoder",
+ "modeling_marian",
+ "modeling_mmbt",
+ "modeling_outputs",
+ "modeling_retribert",
+ "modeling_utils",
+ "modeling_flax_auto",
+ "modeling_flax_encoder_decoder",
+ "modeling_flax_utils",
+ "modeling_speech_encoder_decoder",
+ "modeling_flax_speech_encoder_decoder",
+ "modeling_flax_vision_encoder_decoder",
+ "modeling_transfo_xl_utilities",
+ "modeling_tf_auto",
+ "modeling_tf_encoder_decoder",
+ "modeling_tf_outputs",
+ "modeling_tf_pytorch_utils",
+ "modeling_tf_utils",
+ "modeling_tf_transfo_xl_utilities",
+ "modeling_tf_vision_encoder_decoder",
+ "modeling_vision_encoder_decoder",
+ ]
+ modules = []
+ for model in dir(diffusers.models):
+ # There are some magic dunder attributes in the dir, we ignore them
+ if not model.startswith("__"):
+ model_module = getattr(diffusers.models, model)
+ for submodule in dir(model_module):
+ if submodule.startswith("modeling") and submodule not in _ignore_modules:
+ modeling_module = getattr(model_module, submodule)
+ if inspect.ismodule(modeling_module):
+ modules.append(modeling_module)
+ return modules
+
+
+def get_models(module, include_pretrained=False):
+ """Get the objects in module that are models."""
+ models = []
+ model_classes = (diffusers.ModelMixin, diffusers.TFModelMixin, diffusers.FlaxModelMixin)
+ for attr_name in dir(module):
+ if not include_pretrained and ("Pretrained" in attr_name or "PreTrained" in attr_name):
+ continue
+ attr = getattr(module, attr_name)
+ if isinstance(attr, type) and issubclass(attr, model_classes) and attr.__module__ == module.__name__:
+ models.append((attr_name, attr))
+ return models
+
+
+def is_a_private_model(model):
+ """Returns True if the model should not be in the main init."""
+ if model in PRIVATE_MODELS:
+ return True
+
+ # Wrapper, Encoder and Decoder are all privates
+ if model.endswith("Wrapper"):
+ return True
+ if model.endswith("Encoder"):
+ return True
+ if model.endswith("Decoder"):
+ return True
+ return False
+
+
+def check_models_are_in_init():
+ """Checks all models defined in the library are in the main init."""
+ models_not_in_init = []
+ dir_transformers = dir(diffusers)
+ for module in get_model_modules():
+ models_not_in_init += [
+ model[0] for model in get_models(module, include_pretrained=True) if model[0] not in dir_transformers
+ ]
+
+ # Remove private models
+ models_not_in_init = [model for model in models_not_in_init if not is_a_private_model(model)]
+ if len(models_not_in_init) > 0:
+ raise Exception(f"The following models should be in the main init: {','.join(models_not_in_init)}.")
+
+
+# If some test_modeling files should be ignored when checking models are all tested, they should be added in the
+# nested list _ignore_files of this function.
+def get_model_test_files():
+ """Get the model test files.
+
+    The returned files should NOT contain the `tests` prefix (i.e. `PATH_TO_TESTS` defined in this script). They are
+    considered as paths relative to `tests`. A caller has to use `os.path.join(PATH_TO_TESTS, ...)` to access the files.
+ """
+
+ _ignore_files = [
+ "test_modeling_common",
+ "test_modeling_encoder_decoder",
+ "test_modeling_flax_encoder_decoder",
+ "test_modeling_flax_speech_encoder_decoder",
+ "test_modeling_marian",
+ "test_modeling_tf_common",
+ "test_modeling_tf_encoder_decoder",
+ ]
+ test_files = []
+ # Check both `PATH_TO_TESTS` and `PATH_TO_TESTS/models`
+ model_test_root = os.path.join(PATH_TO_TESTS, "models")
+ model_test_dirs = []
+ for x in os.listdir(model_test_root):
+ x = os.path.join(model_test_root, x)
+ if os.path.isdir(x):
+ model_test_dirs.append(x)
+
+ for target_dir in [PATH_TO_TESTS] + model_test_dirs:
+ for file_or_dir in os.listdir(target_dir):
+ path = os.path.join(target_dir, file_or_dir)
+ if os.path.isfile(path):
+ filename = os.path.split(path)[-1]
+ if "test_modeling" in filename and os.path.splitext(filename)[0] not in _ignore_files:
+ file = os.path.join(*path.split(os.sep)[1:])
+ test_files.append(file)
+
+ return test_files
+
+
+# This is a bit hacky but I didn't find a way to import the test_file as a module and read inside the tester class
+# for the all_model_classes variable.
+def find_tested_models(test_file):
+ """Parse the content of test_file to detect what's in all_model_classes"""
+ # This is a bit hacky but I didn't find a way to import the test_file as a module and read inside the class
+ with open(os.path.join(PATH_TO_TESTS, test_file), "r", encoding="utf-8", newline="\n") as f:
+ content = f.read()
+ all_models = re.findall(r"all_model_classes\s+=\s+\(\s*\(([^\)]*)\)", content)
+ # Check with one less parenthesis as well
+ all_models += re.findall(r"all_model_classes\s+=\s+\(([^\)]*)\)", content)
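+    # e.g. catches: all_model_classes = (UNet2DModel, UNet2DConditionModel) if is_torch_available() else ()
+    # (illustrative; only the names inside the first pair of parentheses are captured)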
+ if len(all_models) > 0:
+ model_tested = []
+ for entry in all_models:
+ for line in entry.split(","):
+ name = line.strip()
+ if len(name) > 0:
+ model_tested.append(name)
+ return model_tested
+
+
+def check_models_are_tested(module, test_file):
+ """Check models defined in module are tested in test_file."""
+ # XxxModelMixin are not tested
+ defined_models = get_models(module)
+ tested_models = find_tested_models(test_file)
+ if tested_models is None:
+ if test_file.replace(os.path.sep, "/") in TEST_FILES_WITH_NO_COMMON_TESTS:
+ return
+ return [
+ f"{test_file} should define `all_model_classes` to apply common tests to the models it tests. "
+ + "If this intentional, add the test filename to `TEST_FILES_WITH_NO_COMMON_TESTS` in the file "
+ + "`utils/check_repo.py`."
+ ]
+ failures = []
+ for model_name, _ in defined_models:
+ if model_name not in tested_models and model_name not in IGNORE_NON_TESTED:
+ failures.append(
+ f"{model_name} is defined in {module.__name__} but is not tested in "
+ + f"{os.path.join(PATH_TO_TESTS, test_file)}. Add it to the all_model_classes in that file."
+ + "If common tests should not applied to that model, add its name to `IGNORE_NON_TESTED`"
+ + "in the file `utils/check_repo.py`."
+ )
+ return failures
+
+
+def check_all_models_are_tested():
+ """Check all models are properly tested."""
+ modules = get_model_modules()
+ test_files = get_model_test_files()
+ failures = []
+ for module in modules:
+ test_file = [file for file in test_files if f"test_{module.__name__.split('.')[-1]}.py" in file]
+ if len(test_file) == 0:
+ failures.append(f"{module.__name__} does not have its corresponding test file {test_file}.")
+ elif len(test_file) > 1:
+ failures.append(f"{module.__name__} has several test files: {test_file}.")
+ else:
+ test_file = test_file[0]
+ new_failures = check_models_are_tested(module, test_file)
+ if new_failures is not None:
+ failures += new_failures
+ if len(failures) > 0:
+ raise Exception(f"There were {len(failures)} failures:\n" + "\n".join(failures))
+
+
+def get_all_auto_configured_models():
+ """Return the list of all models in at least one auto class."""
+ result = set() # To avoid duplicates we concatenate all model classes in a set.
+ if is_torch_available():
+ for attr_name in dir(diffusers.models.auto.modeling_auto):
+ if attr_name.startswith("MODEL_") and attr_name.endswith("MAPPING_NAMES"):
+ result = result | set(get_values(getattr(diffusers.models.auto.modeling_auto, attr_name)))
+ if is_flax_available():
+ for attr_name in dir(diffusers.models.auto.modeling_flax_auto):
+ if attr_name.startswith("FLAX_MODEL_") and attr_name.endswith("MAPPING_NAMES"):
+ result = result | set(get_values(getattr(diffusers.models.auto.modeling_flax_auto, attr_name)))
+ return list(result)
+
+
+def ignore_unautoclassed(model_name):
+ """Rules to determine if `name` should be in an auto class."""
+ # Special white list
+ if model_name in IGNORE_NON_AUTO_CONFIGURED:
+ return True
+ # Encoder and Decoder should be ignored
+ if "Encoder" in model_name or "Decoder" in model_name:
+ return True
+ return False
+
+
+def check_models_are_auto_configured(module, all_auto_models):
+ """Check models defined in module are each in an auto class."""
+ defined_models = get_models(module)
+ failures = []
+ for model_name, _ in defined_models:
+ if model_name not in all_auto_models and not ignore_unautoclassed(model_name):
+ failures.append(
+ f"{model_name} is defined in {module.__name__} but is not present in any of the auto mapping. "
+ "If that is intended behavior, add its name to `IGNORE_NON_AUTO_CONFIGURED` in the file "
+ "`utils/check_repo.py`."
+ )
+ return failures
+
+
+def check_all_models_are_auto_configured():
+ """Check all models are each in an auto class."""
+ missing_backends = []
+ if not is_torch_available():
+ missing_backends.append("PyTorch")
+ if not is_flax_available():
+ missing_backends.append("Flax")
+ if len(missing_backends) > 0:
+ missing = ", ".join(missing_backends)
+ if os.getenv("TRANSFORMERS_IS_CI", "").upper() in ENV_VARS_TRUE_VALUES:
+ raise Exception(
+ "Full quality checks require all backends to be installed (with `pip install -e .[dev]` in the "
+ f"Transformers repo, the following are missing: {missing}."
+ )
+ else:
+ warnings.warn(
+ "Full quality checks require all backends to be installed (with `pip install -e .[dev]` in the "
+ f"Transformers repo, the following are missing: {missing}. While it's probably fine as long as you "
+ "didn't make any change in one of those backends modeling files, you should probably execute the "
+ "command above to be on the safe side."
+ )
+ modules = get_model_modules()
+ all_auto_models = get_all_auto_configured_models()
+ failures = []
+ for module in modules:
+ new_failures = check_models_are_auto_configured(module, all_auto_models)
+ if new_failures is not None:
+ failures += new_failures
+ if len(failures) > 0:
+ raise Exception(f"There were {len(failures)} failures:\n" + "\n".join(failures))
+
+
+_re_decorator = re.compile(r"^\s*@(\S+)\s+$")
+
+
+def check_decorator_order(filename):
+ """Check that in the test file `filename` the slow decorator is always last."""
+ with open(filename, "r", encoding="utf-8", newline="\n") as f:
+ lines = f.readlines()
+ decorator_before = None
+ errors = []
+ for i, line in enumerate(lines):
+ search = _re_decorator.search(line)
+ if search is not None:
+ decorator_name = search.groups()[0]
+ if decorator_before is not None and decorator_name.startswith("parameterized"):
+ errors.append(i)
+ decorator_before = decorator_name
+ elif decorator_before is not None:
+ decorator_before = None
+ return errors
+
+
+def check_all_decorator_order():
+ """Check that in all test files, the slow decorator is always last."""
+ errors = []
+ for fname in os.listdir(PATH_TO_TESTS):
+ if fname.endswith(".py"):
+ filename = os.path.join(PATH_TO_TESTS, fname)
+ new_errors = check_decorator_order(filename)
+ errors += [f"- {filename}, line {i}" for i in new_errors]
+ if len(errors) > 0:
+ msg = "\n".join(errors)
+ raise ValueError(
+ "The parameterized decorator (and its variants) should always be first, but this is not the case in the"
+ f" following files:\n{msg}"
+ )
+
+
+def find_all_documented_objects():
+ """Parse the content of all doc files to detect which classes and functions it documents"""
+ documented_obj = []
+ for doc_file in Path(PATH_TO_DOC).glob("**/*.rst"):
+ with open(doc_file, "r", encoding="utf-8", newline="\n") as f:
+ content = f.read()
+ raw_doc_objs = re.findall(r"(?:autoclass|autofunction):: transformers.(\S+)\s+", content)
+ documented_obj += [obj.split(".")[-1] for obj in raw_doc_objs]
+ for doc_file in Path(PATH_TO_DOC).glob("**/*.md"):
+ with open(doc_file, "r", encoding="utf-8", newline="\n") as f:
+ content = f.read()
+ raw_doc_objs = re.findall(r"\[\[autodoc\]\]\s+(\S+)\s+", content)
+ documented_obj += [obj.split(".")[-1] for obj in raw_doc_objs]
+ return documented_obj
+
+
+# One good reason for not being documented is to be deprecated. Put deprecated objects in this list.
+DEPRECATED_OBJECTS = [
+ "AutoModelWithLMHead",
+ "BartPretrainedModel",
+ "DataCollator",
+ "DataCollatorForSOP",
+ "GlueDataset",
+ "GlueDataTrainingArguments",
+ "LineByLineTextDataset",
+ "LineByLineWithRefDataset",
+ "LineByLineWithSOPTextDataset",
+ "PretrainedBartModel",
+ "PretrainedFSMTModel",
+ "SingleSentenceClassificationProcessor",
+ "SquadDataTrainingArguments",
+ "SquadDataset",
+ "SquadExample",
+ "SquadFeatures",
+ "SquadV1Processor",
+ "SquadV2Processor",
+ "TFAutoModelWithLMHead",
+ "TFBartPretrainedModel",
+ "TextDataset",
+ "TextDatasetForNextSentencePrediction",
+ "Wav2Vec2ForMaskedLM",
+ "Wav2Vec2Tokenizer",
+ "glue_compute_metrics",
+ "glue_convert_examples_to_features",
+ "glue_output_modes",
+ "glue_processors",
+ "glue_tasks_num_labels",
+ "squad_convert_examples_to_features",
+ "xnli_compute_metrics",
+ "xnli_output_modes",
+ "xnli_processors",
+ "xnli_tasks_num_labels",
+ "TFTrainer",
+ "TFTrainingArguments",
+]
+
+# Exceptionally, some objects should not be documented after all rules passed.
+# ONLY PUT SOMETHING IN THIS LIST AS A LAST RESORT!
+UNDOCUMENTED_OBJECTS = [
+ "AddedToken", # This is a tokenizers class.
+ "BasicTokenizer", # Internal, should never have been in the main init.
+ "CharacterTokenizer", # Internal, should never have been in the main init.
+ "DPRPretrainedReader", # Like an Encoder.
+ "DummyObject", # Just picked by mistake sometimes.
+ "MecabTokenizer", # Internal, should never have been in the main init.
+ "ModelCard", # Internal type.
+ "SqueezeBertModule", # Internal building block (should have been called SqueezeBertLayer)
+ "TFDPRPretrainedReader", # Like an Encoder.
+ "TransfoXLCorpus", # Internal type.
+ "WordpieceTokenizer", # Internal, should never have been in the main init.
+ "absl", # External module
+ "add_end_docstrings", # Internal, should never have been in the main init.
+ "add_start_docstrings", # Internal, should never have been in the main init.
+ "cached_path", # Internal used for downloading models.
+ "convert_tf_weight_name_to_pt_weight_name", # Internal used to convert model weights
+ "logger", # Internal logger
+ "logging", # External module
+ "requires_backends", # Internal function
+]
+
+# This list should be empty. Objects in it should get their own doc page.
+SHOULD_HAVE_THEIR_OWN_PAGE = [
+ # Benchmarks
+ "PyTorchBenchmark",
+ "PyTorchBenchmarkArguments",
+ "TensorFlowBenchmark",
+ "TensorFlowBenchmarkArguments",
+]
+
+
+def ignore_undocumented(name):
+ """Rules to determine if `name` should be undocumented."""
+ # NOT DOCUMENTED ON PURPOSE.
+ # Constants uppercase are not documented.
+ if name.isupper():
+ return True
+ # ModelMixins / Encoders / Decoders / Layers / Embeddings / Attention are not documented.
+ if (
+ name.endswith("ModelMixin")
+ or name.endswith("Decoder")
+ or name.endswith("Encoder")
+ or name.endswith("Layer")
+ or name.endswith("Embeddings")
+ or name.endswith("Attention")
+ ):
+ return True
+ # Submodules are not documented.
+ if os.path.isdir(os.path.join(PATH_TO_DIFFUSERS, name)) or os.path.isfile(
+ os.path.join(PATH_TO_DIFFUSERS, f"{name}.py")
+ ):
+ return True
+ # All load functions are not documented.
+ if name.startswith("load_tf") or name.startswith("load_pytorch"):
+ return True
+ # is_xxx_available functions are not documented.
+ if name.startswith("is_") and name.endswith("_available"):
+ return True
+ # Deprecated objects are not documented.
+ if name in DEPRECATED_OBJECTS or name in UNDOCUMENTED_OBJECTS:
+ return True
+ # MMBT model does not really work.
+ if name.startswith("MMBT"):
+ return True
+ if name in SHOULD_HAVE_THEIR_OWN_PAGE:
+ return True
+ return False
+
+
+def check_all_objects_are_documented():
+ """Check all models are properly documented."""
+ documented_objs = find_all_documented_objects()
+ modules = diffusers._modules
+ objects = [c for c in dir(diffusers) if c not in modules and not c.startswith("_")]
+ undocumented_objs = [c for c in objects if c not in documented_objs and not ignore_undocumented(c)]
+ if len(undocumented_objs) > 0:
+ raise Exception(
+ "The following objects are in the public init so should be documented:\n - "
+ + "\n - ".join(undocumented_objs)
+ )
+ check_docstrings_are_in_md()
+ check_model_type_doc_match()
+
+
+def check_model_type_doc_match():
+ """Check all doc pages have a corresponding model type."""
+ model_doc_folder = Path(PATH_TO_DOC) / "model_doc"
+ model_docs = [m.stem for m in model_doc_folder.glob("*.md")]
+
+ model_types = list(diffusers.models.auto.configuration_auto.MODEL_NAMES_MAPPING.keys())
+ model_types = [MODEL_TYPE_TO_DOC_MAPPING[m] if m in MODEL_TYPE_TO_DOC_MAPPING else m for m in model_types]
+
+ errors = []
+ for m in model_docs:
+ if m not in model_types and m != "auto":
+ close_matches = get_close_matches(m, model_types)
+ error_message = f"{m} is not a proper model identifier."
+ if len(close_matches) > 0:
+ close_matches = "/".join(close_matches)
+ error_message += f" Did you mean {close_matches}?"
+ errors.append(error_message)
+
+ if len(errors) > 0:
+ raise ValueError(
+ "Some model doc pages do not match any existing model type:\n"
+ + "\n".join(errors)
+ + "\nYou can add any missing model type to the `MODEL_NAMES_MAPPING` constant in "
+ "models/auto/configuration_auto.py."
+ )
+
+
+# Re pattern to catch :obj:`xx`, :class:`xx`, :func:`xx` or :meth:`xx`.
+_re_rst_special_words = re.compile(r":(?:obj|func|class|meth):`([^`]+)`")
+# Re pattern to catch things between double backquotes.
+_re_double_backquotes = re.compile(r"(^|[^`])``([^`]+)``([^`]|$)")
+# Re pattern to catch example introduction.
+_re_rst_example = re.compile(r"^\s*Example.*::\s*$", flags=re.MULTILINE)
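+# e.g. matched rst markers (illustrative): ":class:`~transformers.BertModel`", "``torch.Tensor``",
+# or an example introduction line such as "    Example::".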
+
+
+def is_rst_docstring(docstring):
+ """
+ Returns `True` if `docstring` is written in rst.
+ """
+ if _re_rst_special_words.search(docstring) is not None:
+ return True
+ if _re_double_backquotes.search(docstring) is not None:
+ return True
+ if _re_rst_example.search(docstring) is not None:
+ return True
+ return False
+
+
+def check_docstrings_are_in_md():
+ """Check all docstrings are in md"""
+ files_with_rst = []
+ for file in Path(PATH_TO_DIFFUSERS).glob("**/*.py"):
+ with open(file, "r") as f:
+ code = f.read()
+ docstrings = code.split('"""')
+
+ for idx, docstring in enumerate(docstrings):
+ if idx % 2 == 0 or not is_rst_docstring(docstring):
+ continue
+ files_with_rst.append(file)
+ break
+
+ if len(files_with_rst) > 0:
+ raise ValueError(
+ "The following files have docstrings written in rst:\n"
+ + "\n".join([f"- {f}" for f in files_with_rst])
+ + "\nTo fix this run `doc-builder convert path_to_py_file` after installing `doc-builder`\n"
+ "(`pip install git+https://github.com/huggingface/doc-builder`)"
+ )
+
+
+def check_repo_quality():
+ """Check all models are properly tested and documented."""
+ print("Checking all models are included.")
+ check_model_list()
+ print("Checking all models are public.")
+ check_models_are_in_init()
+ print("Checking all models are properly tested.")
+ check_all_decorator_order()
+ check_all_models_are_tested()
+ print("Checking all objects are properly documented.")
+ check_all_objects_are_documented()
+ print("Checking all models are in at least one auto class.")
+ check_all_models_are_auto_configured()
+
+
+if __name__ == "__main__":
+ check_repo_quality()
diff --git a/utils/check_table.py b/utils/check_table.py
new file mode 100644
index 0000000..aa7554c
--- /dev/null
+++ b/utils/check_table.py
@@ -0,0 +1,185 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import collections
+import importlib.util
+import os
+import re
+
+
+# All paths are set with the intent you should run this script from the root of the repo with the command
+# python utils/check_table.py
+TRANSFORMERS_PATH = "src/diffusers"
+PATH_TO_DOCS = "docs/source/en"
+REPO_PATH = "."
+
+
+def _find_text_in_file(filename, start_prompt, end_prompt):
+ """
+ Find the text in `filename` between a line beginning with `start_prompt` and before `end_prompt`, removing empty
+ lines.
+ """
+ with open(filename, "r", encoding="utf-8", newline="\n") as f:
+ lines = f.readlines()
+ # Find the start prompt.
+ start_index = 0
+ while not lines[start_index].startswith(start_prompt):
+ start_index += 1
+ start_index += 1
+
+ end_index = start_index
+ while not lines[end_index].startswith(end_prompt):
+ end_index += 1
+ end_index -= 1
+
+ while len(lines[start_index]) <= 1:
+ start_index += 1
+ while len(lines[end_index]) <= 1:
+ end_index -= 1
+ end_index += 1
+ return "".join(lines[start_index:end_index]), start_index, end_index, lines
+
+
+# Add here suffixes that are used to identify models, separated by |
+ALLOWED_MODEL_SUFFIXES = "Model|Encoder|Decoder|ForConditionalGeneration"
+# Regexes that match TF/Flax/PT model names.
+_re_tf_models = re.compile(r"TF(.*)(?:Model|Encoder|Decoder|ForConditionalGeneration)")
+_re_flax_models = re.compile(r"Flax(.*)(?:Model|Encoder|Decoder|ForConditionalGeneration)")
+# Will match any TF or Flax model too, so it needs to be in an else branch after the two previous regexes.
+_re_pt_models = re.compile(r"(.*)(?:Model|Encoder|Decoder|ForConditionalGeneration)")
+
+
+# This is to make sure the diffusers module imported is the one in the repo.
+spec = importlib.util.spec_from_file_location(
+ "diffusers",
+ os.path.join(TRANSFORMERS_PATH, "__init__.py"),
+ submodule_search_locations=[TRANSFORMERS_PATH],
+)
+diffusers_module = spec.loader.load_module()
+
+
+# Thanks to https://stackoverflow.com/questions/29916065/how-to-do-camelcase-split-in-python
+def camel_case_split(identifier):
+ "Split a camelcased `identifier` into words."
+ matches = re.finditer(".+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)", identifier)
+ return [m.group(0) for m in matches]
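+
+
+# Illustrative example of camel_case_split (an assumed example, not part of the original utility):
+#   camel_case_split("StableDiffusion")  # -> ["Stable", "Diffusion"]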
+
+
+def _center_text(text, width):
+ text_length = 2 if text == "✅" or text == "❌" else len(text)
+ left_indent = (width - text_length) // 2
+ right_indent = width - text_length - left_indent
+ return " " * left_indent + text + " " * right_indent
+
+
+def get_model_table_from_auto_modules():
+ """Generates an up-to-date model table from the content of the auto modules."""
+ # Dictionary model names to config.
+ config_mapping_names = diffusers_module.models.auto.configuration_auto.CONFIG_MAPPING_NAMES
+ model_name_to_config = {
+ name: config_mapping_names[code]
+ for code, name in diffusers_module.MODEL_NAMES_MAPPING.items()
+ if code in config_mapping_names
+ }
+ model_name_to_prefix = {name: config.replace("ConfigMixin", "") for name, config in model_name_to_config.items()}
+
+ # Dictionaries flagging if each model prefix has a slow/fast tokenizer, backend in PT/TF/Flax.
+ slow_tokenizers = collections.defaultdict(bool)
+ fast_tokenizers = collections.defaultdict(bool)
+ pt_models = collections.defaultdict(bool)
+ tf_models = collections.defaultdict(bool)
+ flax_models = collections.defaultdict(bool)
+
+ # Let's look through all diffusers objects (once).
+ for attr_name in dir(diffusers_module):
+ lookup_dict = None
+ if attr_name.endswith("Tokenizer"):
+ lookup_dict = slow_tokenizers
+ attr_name = attr_name[:-9]
+ elif attr_name.endswith("TokenizerFast"):
+ lookup_dict = fast_tokenizers
+ attr_name = attr_name[:-13]
+ elif _re_tf_models.match(attr_name) is not None:
+ lookup_dict = tf_models
+ attr_name = _re_tf_models.match(attr_name).groups()[0]
+ elif _re_flax_models.match(attr_name) is not None:
+ lookup_dict = flax_models
+ attr_name = _re_flax_models.match(attr_name).groups()[0]
+ elif _re_pt_models.match(attr_name) is not None:
+ lookup_dict = pt_models
+ attr_name = _re_pt_models.match(attr_name).groups()[0]
+
+ if lookup_dict is not None:
+ while len(attr_name) > 0:
+ if attr_name in model_name_to_prefix.values():
+ lookup_dict[attr_name] = True
+ break
+ # Try again after removing the last word in the name
+ attr_name = "".join(camel_case_split(attr_name)[:-1])
+
+ # Let's build that table!
+ model_names = list(model_name_to_config.keys())
+ model_names.sort(key=str.lower)
+ columns = ["Model", "Tokenizer slow", "Tokenizer fast", "PyTorch support", "TensorFlow support", "Flax Support"]
+ # We'll need widths to properly display everything in the center (+2 is to leave one extra space on each side).
+ widths = [len(c) + 2 for c in columns]
+ widths[0] = max([len(name) for name in model_names]) + 2
+
+ # Build the table per se
+ table = "|" + "|".join([_center_text(c, w) for c, w in zip(columns, widths)]) + "|\n"
+ # Use ":-----:" format to center-aligned table cell texts
+ table += "|" + "|".join([":" + "-" * (w - 2) + ":" for w in widths]) + "|\n"
+
+ check = {True: "✅", False: "❌"}
+ for name in model_names:
+ prefix = model_name_to_prefix[name]
+ line = [
+ name,
+ check[slow_tokenizers[prefix]],
+ check[fast_tokenizers[prefix]],
+ check[pt_models[prefix]],
+ check[tf_models[prefix]],
+ check[flax_models[prefix]],
+ ]
+ table += "|" + "|".join([_center_text(l, w) for l, w in zip(line, widths)]) + "|\n"
+ return table
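+
+
+# get_model_table_from_auto_modules() produces Markdown roughly like this (an illustrative sketch;
+# the model name and support flags below are hypothetical):
+#   |    Model    | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support |
+#   |:-----------:|:--------------:|:--------------:|:---------------:|:------------------:|:------------:|
+#   | Some Model  |       ❌       |       ❌       |        ✅       |         ❌         |      ❌      |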
+
+
+def check_model_table(overwrite=False):
+ """Check the model table in the index.rst is consistent with the state of the lib and maybe `overwrite`."""
+ current_table, start_index, end_index, lines = _find_text_in_file(
+ filename=os.path.join(PATH_TO_DOCS, "index.md"),
+ start_prompt="",
+ )
+ new_table = get_model_table_from_auto_modules()
+
+ if current_table != new_table:
+ if overwrite:
+ with open(os.path.join(PATH_TO_DOCS, "index.md"), "w", encoding="utf-8", newline="\n") as f:
+ f.writelines(lines[:start_index] + [new_table] + lines[end_index:])
+ else:
+ raise ValueError(
+ "The model table in the `index.md` has not been updated. Run `make fix-copies` to fix this."
+ )
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--fix_and_overwrite", action="store_true", help="Whether to fix inconsistencies.")
+ args = parser.parse_args()
+
+ check_model_table(args.fix_and_overwrite)
diff --git a/utils/custom_init_isort.py b/utils/custom_init_isort.py
new file mode 100644
index 0000000..acba69a
--- /dev/null
+++ b/utils/custom_init_isort.py
@@ -0,0 +1,329 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Utility that sorts the imports in the custom inits of Diffusers. Diffusers uses init files that delay the
+import of an object to when it's actually needed. This is to avoid the main init importing all models, which would
+make the line `import diffusers` very slow when the user has all optional dependencies installed. The inits with
+delayed imports have two halves: one defining a dictionary `_import_structure` which maps modules to the name of the
+objects in each module, and one in `TYPE_CHECKING` which looks like a normal init for type-checkers. `isort` or `ruff`
+properly sort the second half, which looks like traditional imports; the goal of this script is to sort the first half.
+
+Use from the root of the repo with:
+
+```bash
+python utils/custom_init_isort.py
+```
+
+which will auto-sort the imports (used in `make style`).
+
+For a check only (as used in `make quality`) run:
+
+```bash
+python utils/custom_init_isort.py --check_only
+```
+"""
+import argparse
+import os
+import re
+from typing import Any, Callable, List, Optional
+
+
+# Path is defined with the intent you should run this script from the root of the repo.
+PATH_TO_TRANSFORMERS = "src/diffusers"
+
+# Pattern that looks at the indentation in a line.
+_re_indent = re.compile(r"^(\s*)\S")
+# Pattern that matches `"key":` and puts `key` in group 0.
+_re_direct_key = re.compile(r'^\s*"([^"]+)":')
+# Pattern that matches `_import_structure["key"]` and puts `key` in group 0.
+_re_indirect_key = re.compile(r'^\s*_import_structure\["([^"]+)"\]')
+# Pattern that matches `"key",` and puts `key` in group 0.
+_re_strip_line = re.compile(r'^\s*"([^"]+)",\s*$')
+# Pattern that matches any `[stuff]` and puts `stuff` in group 0.
+_re_bracket_content = re.compile(r"\[([^\]]+)\]")
+
+
+def get_indent(line: str) -> str:
+ """Returns the indent in given line (as string)."""
+ search = _re_indent.search(line)
+ return "" if search is None else search.groups()[0]
+
+
+def split_code_in_indented_blocks(
+ code: str, indent_level: str = "", start_prompt: Optional[str] = None, end_prompt: Optional[str] = None
+) -> List[str]:
+ """
+ Split some code into its indented blocks, starting at a given level.
+
+ Args:
+ code (`str`): The code to split.
+ indent_level (`str`): The indent level (as string) to use for identifying the blocks to split.
+ start_prompt (`str`, *optional*): If provided, only starts splitting at the line where this text is.
+ end_prompt (`str`, *optional*): If provided, stops splitting at a line where this text is.
+
+ Warning:
+ The text before `start_prompt` or after `end_prompt` (if provided) is not ignored, just not split. The input `code`
+ can thus be retrieved by joining the result.
+
+ Returns:
+ `List[str]`: The list of blocks.
+ """
+ # Let's split the code into lines and move to start_index.
+ index = 0
+ lines = code.split("\n")
+ if start_prompt is not None:
+ while not lines[index].startswith(start_prompt):
+ index += 1
+ blocks = ["\n".join(lines[:index])]
+ else:
+ blocks = []
+
+ # This variable contains the block treated at a given time.
+ current_block = [lines[index]]
+ index += 1
+ # We split into blocks until we get to the `end_prompt` (or the end of the file).
+ while index < len(lines) and (end_prompt is None or not lines[index].startswith(end_prompt)):
+ # We have a non-empty line with the proper indent -> start of a new block
+ if len(lines[index]) > 0 and get_indent(lines[index]) == indent_level:
+ # Store the current block in the result and reset. There are two cases: the line is part of the block (like
+ # a closing parenthesis) or not.
+ if len(current_block) > 0 and get_indent(current_block[-1]).startswith(indent_level + " "):
+ # Line is part of the current block
+ current_block.append(lines[index])
+ blocks.append("\n".join(current_block))
+ if index < len(lines) - 1:
+ current_block = [lines[index + 1]]
+ index += 1
+ else:
+ current_block = []
+ else:
+ # Line is not part of the current block
+ blocks.append("\n".join(current_block))
+ current_block = [lines[index]]
+ else:
+ # Just add the line to the current block
+ current_block.append(lines[index])
+ index += 1
+
+ # Adds current block if it's nonempty.
+ if len(current_block) > 0:
+ blocks.append("\n".join(current_block))
+
+ # Add final block after end_prompt if provided.
+ if end_prompt is not None and index < len(lines):
+ blocks.append("\n".join(lines[index:]))
+
+ return blocks
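+
+
+# Illustrative example of split_code_in_indented_blocks (an assumed example): splitting
+#     a = {
+#         "x": 1,
+#     }
+#     b = 2
+# at indent_level="" returns two blocks: the whole `a = {...}` literal and the line `b = 2`.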
+
+
+def ignore_underscore_and_lowercase(key: Callable[[Any], str]) -> Callable[[Any], str]:
+ """
+ Wraps a key function (as used in a sort) to lowercase and ignore underscores.
+ """
+
+ def _inner(x):
+ return key(x).lower().replace("_", "")
+
+ return _inner
+
+
+def sort_objects(objects: List[Any], key: Optional[Callable[[Any], str]] = None) -> List[Any]:
+ """
+ Sort a list of objects following the rules of isort (all uppercased first, camel-cased second and lower-cased
+ last).
+
+ Args:
+ objects (`List[Any]`):
+ The list of objects to sort.
+ key (`Callable[[Any], str]`, *optional*):
+ A function taking an object as input and returning a string, used to sort them by alphabetical order.
+ If not provided, will default to noop (so a `key` must be provided if the `objects` are not of type string).
+
+ Returns:
+ `List[Any]`: The sorted list with the same elements as in the inputs
+ """
+
+ # If no key is provided, we use a noop.
+ def noop(x):
+ return x
+
+ if key is None:
+ key = noop
+ # Constants are all uppercase, they go first.
+ constants = [obj for obj in objects if key(obj).isupper()]
+ # Classes are not all uppercase but start with a capital, they go second.
+ classes = [obj for obj in objects if key(obj)[0].isupper() and not key(obj).isupper()]
+ # Functions begin with a lowercase, they go last.
+ functions = [obj for obj in objects if not key(obj)[0].isupper()]
+
+ # Then we sort each group.
+ key1 = ignore_underscore_and_lowercase(key)
+ return sorted(constants, key=key1) + sorted(classes, key=key1) + sorted(functions, key=key1)
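+
+
+# Illustrative example of sort_objects (assumed, for clarity): constants first, classes second,
+# functions last, each group sorted alphabetically while ignoring case and underscores:
+#   sort_objects(["load_config", "UNet2DModel", "WEIGHTS_NAME"])
+#   # -> ["WEIGHTS_NAME", "UNet2DModel", "load_config"]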
+
+
+def sort_objects_in_import(import_statement: str) -> str:
+ """
+ Sorts the imports in a single import statement.
+
+ Args:
+ import_statement (`str`): The import statement in which to sort the imports.
+
+ Returns:
+ `str`: The same as the input, but with objects properly sorted.
+ """
+
+ # This inner function sorts imports between [ ].
+ def _replace(match):
+ imports = match.groups()[0]
+ # If there is one import only, nothing to do.
+ if "," not in imports:
+ return f"[{imports}]"
+ keys = [part.strip().replace('"', "") for part in imports.split(",")]
+ # We will have a final empty element if the line finished with a comma.
+ if len(keys[-1]) == 0:
+ keys = keys[:-1]
+ return "[" + ", ".join([f'"{k}"' for k in sort_objects(keys)]) + "]"
+
+ lines = import_statement.split("\n")
+ if len(lines) > 3:
+ # Here we have to sort internal imports that are on several lines (one per name):
+ # key: [
+ # "object1",
+ # "object2",
+ # ...
+ # ]
+
+ # We may have to ignore one or two lines on each side.
+ idx = 2 if lines[1].strip() == "[" else 1
+ keys_to_sort = [(i, _re_strip_line.search(line).groups()[0]) for i, line in enumerate(lines[idx:-idx])]
+ sorted_indices = sort_objects(keys_to_sort, key=lambda x: x[1])
+ sorted_lines = [lines[x[0] + idx] for x in sorted_indices]
+ return "\n".join(lines[:idx] + sorted_lines + lines[-idx:])
+ elif len(lines) == 3:
+ # Here we have to sort internal imports that are on one separate line:
+ # key: [
+ # "object1", "object2", ...
+ # ]
+ if _re_bracket_content.search(lines[1]) is not None:
+ lines[1] = _re_bracket_content.sub(_replace, lines[1])
+ else:
+ keys = [part.strip().replace('"', "") for part in lines[1].split(",")]
+ # We will have a final empty element if the line finished with a comma.
+ if len(keys[-1]) == 0:
+ keys = keys[:-1]
+ lines[1] = get_indent(lines[1]) + ", ".join([f'"{k}"' for k in sort_objects(keys)])
+ return "\n".join(lines)
+ else:
+ # Finally we have to deal with imports fitting on one line
+ import_statement = _re_bracket_content.sub(_replace, import_statement)
+ return import_statement
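+
+
+# Illustrative example of sort_objects_in_import (the key name is hypothetical):
+#   sort_objects_in_import('    "scheduling_ddpm": ["DDPMSchedulerOutput", "DDPMScheduler"],')
+#   # -> '    "scheduling_ddpm": ["DDPMScheduler", "DDPMSchedulerOutput"],'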
+
+
+def sort_imports(file: str, check_only: bool = True):
+ """
+ Sort the imports defined in the `_import_structure` of a given init.
+
+ Args:
+ file (`str`): The path to the init to check/fix.
+ check_only (`bool`, *optional*, defaults to `True`): Whether or not to just check (and not auto-fix) the init.
+ """
+ with open(file, encoding="utf-8") as f:
+ code = f.read()
+
+ # If the file is not a custom init, there is nothing to do.
+ if "_import_structure" not in code:
+ return
+
+ # Blocks of indent level 0
+ main_blocks = split_code_in_indented_blocks(
+ code, start_prompt="_import_structure = {", end_prompt="if TYPE_CHECKING:"
+ )
+
+ # We ignore block 0 (everything until start_prompt) and the last block (everything after end_prompt).
+ for block_idx in range(1, len(main_blocks) - 1):
+ # Check if the block contains some `_import_structure` entries to sort.
+ block = main_blocks[block_idx]
+ block_lines = block.split("\n")
+
+ # Get to the start of the imports.
+ line_idx = 0
+ while line_idx < len(block_lines) and "_import_structure" not in block_lines[line_idx]:
+ # Skip dummy import blocks
+ if "import dummy" in block_lines[line_idx]:
+ line_idx = len(block_lines)
+ else:
+ line_idx += 1
+ if line_idx >= len(block_lines):
+ continue
+
+ # Ignore beginning and last line: they don't contain anything.
+ internal_block_code = "\n".join(block_lines[line_idx:-1])
+ indent = get_indent(block_lines[1])
+ # Split the internal block into blocks of indent level 1.
+ internal_blocks = split_code_in_indented_blocks(internal_block_code, indent_level=indent)
+ # We have two categories of import key: list or _import_structure[key].append/extend
+ pattern = _re_direct_key if "_import_structure = {" in block_lines[0] else _re_indirect_key
+ # Grab the keys, but there is a trap: some lines are empty or just comments.
+ keys = [(pattern.search(b).groups()[0] if pattern.search(b) is not None else None) for b in internal_blocks]
+ # We only sort the lines with a key.
+ keys_to_sort = [(i, key) for i, key in enumerate(keys) if key is not None]
+ sorted_indices = [x[0] for x in sorted(keys_to_sort, key=lambda x: x[1])]
+
+ # We reorder the blocks by leaving empty lines/comments where they were and reordering the rest.
+ count = 0
+ reordered_blocks = []
+ for i in range(len(internal_blocks)):
+ if keys[i] is None:
+ reordered_blocks.append(internal_blocks[i])
+ else:
+ block = sort_objects_in_import(internal_blocks[sorted_indices[count]])
+ reordered_blocks.append(block)
+ count += 1
+
+ # And we put our main block back together with its first and last line.
+ main_blocks[block_idx] = "\n".join(block_lines[:line_idx] + reordered_blocks + [block_lines[-1]])
+
+ if code != "\n".join(main_blocks):
+ if check_only:
+ return True
+ else:
+ print(f"Overwriting {file}.")
+ with open(file, "w", encoding="utf-8") as f:
+ f.write("\n".join(main_blocks))
+
+
+def sort_imports_in_all_inits(check_only=True):
+ """
+ Sort the imports defined in the `_import_structure` of all inits in the repo.
+
+ Args:
+ check_only (`bool`, *optional*, defaults to `True`): Whether or not to just check (and not auto-fix) the init.
+ """
+ failures = []
+ for root, _, files in os.walk(PATH_TO_TRANSFORMERS):
+ if "__init__.py" in files:
+ result = sort_imports(os.path.join(root, "__init__.py"), check_only=check_only)
+ if result:
+ # Collect every failing init so the count reported in the error message is accurate.
+ failures.append(os.path.join(root, "__init__.py"))
+ if len(failures) > 0:
+ raise ValueError(f"Would overwrite {len(failures)} files, run `make style`.")
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--check_only", action="store_true", help="Whether to only check or fix style.")
+ args = parser.parse_args()
+
+ sort_imports_in_all_inits(check_only=args.check_only)
diff --git a/utils/fetch_torch_cuda_pipeline_test_matrix.py b/utils/fetch_torch_cuda_pipeline_test_matrix.py
new file mode 100644
index 0000000..744201c
--- /dev/null
+++ b/utils/fetch_torch_cuda_pipeline_test_matrix.py
@@ -0,0 +1,102 @@
+import json
+import logging
+import os
+from collections import defaultdict
+from pathlib import Path
+
+from huggingface_hub import HfApi, ModelFilter
+
+import diffusers
+
+
+PATH_TO_REPO = Path(__file__).parent.parent.resolve()
+ALWAYS_TEST_PIPELINE_MODULES = [
+ "controlnet",
+ "stable_diffusion",
+ "stable_diffusion_2",
+ "stable_diffusion_xl",
+ "stable_diffusion_adapter",
+ "deepfloyd_if",
+ "ip_adapters",
+ "kandinsky",
+ "kandinsky2_2",
+ "text_to_video_synthesis",
+ "wuerstchen",
+]
+PIPELINE_USAGE_CUTOFF = int(os.getenv("PIPELINE_USAGE_CUTOFF", 50000))
+
+logger = logging.getLogger(__name__)
+api = HfApi()
+filter = ModelFilter(library="diffusers")
+
+
+def filter_pipelines(usage_dict, usage_cutoff=10000):
+ output = []
+ for diffusers_object, usage in usage_dict.items():
+ if usage < usage_cutoff:
+ continue
+
+ is_diffusers_pipeline = hasattr(diffusers.pipelines, diffusers_object)
+ if not is_diffusers_pipeline:
+ continue
+
+ output.append(diffusers_object)
+
+ return output
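+
+
+# Illustrative example of filter_pipelines (download counts are hypothetical):
+#   filter_pipelines({"StableDiffusionPipeline": 120_000, "other": 3_000}, usage_cutoff=50_000)
+#   # -> ["StableDiffusionPipeline"], assuming that class is exposed under `diffusers.pipelines`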
+
+
+def fetch_pipeline_objects():
+ models = api.list_models(filter=filter)
+ downloads = defaultdict(int)
+
+ for model in models:
+ is_counted = False
+ for tag in model.tags:
+ if tag.startswith("diffusers:"):
+ is_counted = True
+ downloads[tag[len("diffusers:") :]] += model.downloads
+
+ if not is_counted:
+ downloads["other"] += model.downloads
+
+ # Remove 0 downloads
+ downloads = {k: v for k, v in downloads.items() if v > 0}
+ pipeline_objects = filter_pipelines(downloads, PIPELINE_USAGE_CUTOFF)
+
+ return pipeline_objects
+
+
+def fetch_pipeline_modules_to_test():
+ try:
+ pipeline_objects = fetch_pipeline_objects()
+ except Exception as e:
+ logger.error(e)
+ raise RuntimeError("Unable to fetch model list from HuggingFace Hub.")
+
+ test_modules = []
+ for pipeline_name in pipeline_objects:
+ module = getattr(diffusers, pipeline_name)
+
+ test_module = module.__module__.split(".")[-2].strip()
+ test_modules.append(test_module)
+
+ return test_modules
+
+
+def main():
+ test_modules = fetch_pipeline_modules_to_test()
+ test_modules.extend(ALWAYS_TEST_PIPELINE_MODULES)
+
+ # Get unique modules
+ test_modules = list(set(test_modules))
+ print(json.dumps(test_modules))
+
+ save_path = f"{PATH_TO_REPO}/reports"
+ os.makedirs(save_path, exist_ok=True)
+
+ with open(f"{save_path}/test-pipelines.json", "w") as f:
+ json.dump({"pipeline_test_modules": test_modules}, f)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/utils/get_modified_files.py b/utils/get_modified_files.py
new file mode 100644
index 0000000..a252bc6
--- /dev/null
+++ b/utils/get_modified_files.py
@@ -0,0 +1,34 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# this script reports modified .py files under the desired list of top-level sub-dirs passed as a list of arguments, e.g.:
+# python ./utils/get_modified_files.py utils src tests examples
+#
+# it uses git to find the forking point and which files were modified - i.e. files not under git won't be considered
+# since the output of this script is fed into Makefile commands it doesn't print a newline after the results
+
+import re
+import subprocess
+import sys
+
+
+fork_point_sha = subprocess.check_output("git merge-base main HEAD".split()).decode("utf-8")
+modified_files = subprocess.check_output(f"git diff --name-only {fork_point_sha}".split()).decode("utf-8").split()
+
+joined_dirs = "|".join(sys.argv[1:])
+regex = re.compile(rf"^({joined_dirs}).*?\.py$")
+
+relevant_modified_files = [x for x in modified_files if regex.match(x)]
+print(" ".join(relevant_modified_files), end="")
diff --git a/utils/overwrite_expected_slice.py b/utils/overwrite_expected_slice.py
new file mode 100644
index 0000000..57177a4
--- /dev/null
+++ b/utils/overwrite_expected_slice.py
@@ -0,0 +1,90 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+from collections import defaultdict
+
+
+def overwrite_file(file, class_name, test_name, correct_line, done_test):
+ _id = f"{file}_{class_name}_{test_name}"
+ done_test[_id] += 1
+
+ with open(file, "r") as f:
+ lines = f.readlines()
+
+ class_regex = f"class {class_name}("
+ test_regex = f"{4 * ' '}def {test_name}("
+ line_begin_regex = f"{8 * ' '}{correct_line.split()[0]}"
+ another_line_begin_regex = f"{16 * ' '}{correct_line.split()[0]}"
+ in_class = False
+ in_func = False
+ in_line = False
+ insert_line = False
+ count = 0
+ spaces = 0
+
+ new_lines = []
+ for line in lines:
+ if line.startswith(class_regex):
+ in_class = True
+ elif in_class and line.startswith(test_regex):
+ in_func = True
+ elif in_class and in_func and (line.startswith(line_begin_regex) or line.startswith(another_line_begin_regex)):
+ spaces = len(line.split(correct_line.split()[0])[0])
+ count += 1
+
+ if count == done_test[_id]:
+ in_line = True
+
+ if in_class and in_func and in_line:
+ if ")" not in line:
+ continue
+ else:
+ insert_line = True
+
+ if in_class and in_func and in_line and insert_line:
+ new_lines.append(f"{spaces * ' '}{correct_line}")
+ in_class = in_func = in_line = insert_line = False
+ else:
+ new_lines.append(line)
+
+ with open(file, "w") as f:
+ for line in new_lines:
+ f.write(line)
+
+
+def main(correct, fail=None):
+ if fail is not None:
+ with open(fail, "r") as f:
+ test_failures = {l.strip() for l in f.readlines()}
+ else:
+ test_failures = None
+
+ with open(correct, "r") as f:
+ correct_lines = f.readlines()
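+ # Each line of the "correct" file is expected to be semicolon-separated, e.g. (a hypothetical example):
+ #   tests/pipelines/test_foo.py;FooPipelineFastTests;test_inference;expected_slice = np.array([0.1, 0.2])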
+
+ done_tests = defaultdict(int)
+ for line in correct_lines:
+ file, class_name, test_name, correct_line = line.split(";")
+ if test_failures is None or "::".join([file, class_name, test_name]) in test_failures:
+ overwrite_file(file, class_name, test_name, correct_line, done_tests)
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--correct_filename", help="filename of tests with expected result")
+ parser.add_argument("--fail_filename", help="filename of test failures", type=str, default=None)
+ args = parser.parse_args()
+
+ main(args.correct_filename, args.fail_filename)
diff --git a/utils/print_env.py b/utils/print_env.py
new file mode 100644
index 0000000..3e4495c
--- /dev/null
+++ b/utils/print_env.py
@@ -0,0 +1,48 @@
+#!/usr/bin/env python3
+
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# this script dumps information about the environment
+
+import os
+import platform
+import sys
+
+
+os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
+
+print("Python version:", sys.version)
+
+print("OS platform:", platform.platform())
+print("OS architecture:", platform.machine())
+
+try:
+ import torch
+
+ print("Torch version:", torch.__version__)
+ print("Cuda available:", torch.cuda.is_available())
+ print("Cuda version:", torch.version.cuda)
+ print("CuDNN version:", torch.backends.cudnn.version())
+ print("Number of GPUs available:", torch.cuda.device_count())
+except ImportError:
+ print("Torch version:", None)
+
+try:
+ import transformers
+
+ print("transformers version:", transformers.__version__)
+except ImportError:
+ print("transformers version:", None)
diff --git a/utils/release.py b/utils/release.py
new file mode 100644
index 0000000..a0800b9
--- /dev/null
+++ b/utils/release.py
@@ -0,0 +1,162 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import os
+import re
+
+import packaging.version
+
+
+PATH_TO_EXAMPLES = "examples/"
+REPLACE_PATTERNS = {
+ "examples": (re.compile(r'^check_min_version\("[^"]+"\)\s*$', re.MULTILINE), 'check_min_version("VERSION")\n'),
+ "init": (re.compile(r'^__version__\s+=\s+"([^"]+)"\s*$', re.MULTILINE), '__version__ = "VERSION"\n'),
+ "setup": (re.compile(r'^(\s*)version\s*=\s*"[^"]+",', re.MULTILINE), r'\1version="VERSION",'),
+ "doc": (re.compile(r'^(\s*)release\s*=\s*"[^"]+"$', re.MULTILINE), 'release = "VERSION"\n'),
+}
+REPLACE_FILES = {
+ "init": "src/diffusers/__init__.py",
+ "setup": "setup.py",
+}
+README_FILE = "README.md"
+
+
+def update_version_in_file(fname, version, pattern):
+ """Update the version in one file using a specific pattern."""
+ with open(fname, "r", encoding="utf-8", newline="\n") as f:
+ code = f.read()
+ re_pattern, replace = REPLACE_PATTERNS[pattern]
+ replace = replace.replace("VERSION", version)
+ code = re_pattern.sub(replace, code)
+ with open(fname, "w", encoding="utf-8", newline="\n") as f:
+ f.write(code)
+
+
+def update_version_in_examples(version):
+ """Update the version in all examples files."""
+ for folder, directories, fnames in os.walk(PATH_TO_EXAMPLES):
+ # Removing some of the folders with non-actively maintained examples from the walk
+ if "research_projects" in directories:
+ directories.remove("research_projects")
+ if "legacy" in directories:
+ directories.remove("legacy")
+ for fname in fnames:
+ if fname.endswith(".py"):
+ update_version_in_file(os.path.join(folder, fname), version, pattern="examples")
+
+
+def global_version_update(version, patch=False):
+ """Update the version in all needed files."""
+ for pattern, fname in REPLACE_FILES.items():
+ update_version_in_file(fname, version, pattern)
+ if not patch:
+ update_version_in_examples(version)
+
+
+def clean_main_ref_in_model_list():
+ """Replace the links from main doc tp stable doc in the model list of the README."""
+ # If the introduction or the conclusion of the list change, the prompts may need to be updated.
+ _start_prompt = "๐ค Transformers currently provides the following architectures"
+ _end_prompt = "1. Want to contribute a new model?"
+ with open(README_FILE, "r", encoding="utf-8", newline="\n") as f:
+ lines = f.readlines()
+
+ # Find the start of the list.
+ start_index = 0
+ while not lines[start_index].startswith(_start_prompt):
+ start_index += 1
+ start_index += 1
+
+ index = start_index
+ # Update the lines in the model list.
+ while not lines[index].startswith(_end_prompt):
+ if lines[index].startswith("1."):
+ lines[index] = lines[index].replace(
+ "https://huggingface.co/docs/diffusers/main/model_doc",
+ "https://huggingface.co/docs/diffusers/model_doc",
+ )
+ index += 1
+
+ with open(README_FILE, "w", encoding="utf-8", newline="\n") as f:
+ f.writelines(lines)
+
+
+def get_version():
+ """Reads the current version in the __init__."""
+ with open(REPLACE_FILES["init"], "r") as f:
+ code = f.read()
+ default_version = REPLACE_PATTERNS["init"][0].search(code).groups()[0]
+ return packaging.version.parse(default_version)
+
+
+def pre_release_work(patch=False):
+ """Do all the necessary pre-release steps."""
+ # First let's get the default version: base version if we are in dev, bump minor otherwise.
+ default_version = get_version()
+ if patch and default_version.is_devrelease:
+ raise ValueError("Can't create a patch version from the dev branch, checkout a released version!")
+ if default_version.is_devrelease:
+ default_version = default_version.base_version
+ elif patch:
+ default_version = f"{default_version.major}.{default_version.minor}.{default_version.micro + 1}"
+ else:
+ default_version = f"{default_version.major}.{default_version.minor + 1}.0"
+
+ # Now let's ask nicely if that's the right one.
+ version = input(f"Which version are you releasing? [{default_version}]")
+ if len(version) == 0:
+ version = default_version
+
+ print(f"Updating version to {version}.")
+ global_version_update(version, patch=patch)
+
+
+# if not patch:
+# print("Cleaning main README, don't forget to run `make fix-copies`.")
+# clean_main_ref_in_model_list()
+
+
+def post_release_work():
+ """Do all the necessary post-release steps."""
+ # First let's get the current version
+ current_version = get_version()
+ dev_version = f"{current_version.major}.{current_version.minor + 1}.0.dev0"
+ current_version = current_version.base_version
+
+ # Check with the user we got that right.
+ version = input(f"Which version are we developing now? [{dev_version}]")
+ if len(version) == 0:
+ version = dev_version
+
+ print(f"Updating version to {version}.")
+ global_version_update(version)
+
+
+# print("Cleaning main README, don't forget to run `make fix-copies`.")
+# clean_main_ref_in_model_list()
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--post_release", action="store_true", help="Whether this is pre or post release.")
+ parser.add_argument("--patch", action="store_true", help="Whether or not this is a patch release.")
+ args = parser.parse_args()
+ if not args.post_release:
+ pre_release_work(patch=args.patch)
+ elif args.patch:
+ print("Nothing to do after a patch :-)")
+ else:
+ post_release_work()
diff --git a/utils/stale.py b/utils/stale.py
new file mode 100644
index 0000000..3c0f8ab
--- /dev/null
+++ b/utils/stale.py
@@ -0,0 +1,67 @@
+# Copyright 2024 The HuggingFace Team, the AllenNLP library authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Script to close stale issues. Taken in part from the AllenNLP repository.
+https://github.com/allenai/allennlp.
+"""
+import os
+from datetime import datetime as dt
+from datetime import timezone
+
+from github import Github
+
+
+LABELS_TO_EXEMPT = [
+ "good first issue",
+ "good second issue",
+ "good difficult issue",
+ "enhancement",
+ "new pipeline/model",
+ "new scheduler",
+ "wip",
+]
+
+
+def main():
+ g = Github(os.environ["GITHUB_TOKEN"])
+ repo = g.get_repo("huggingface/diffusers")
+ open_issues = repo.get_issues(state="open")
+
+ for issue in open_issues:
+ labels = [label.name.lower() for label in issue.get_labels()]
+ if "stale" in labels:
+ comments = sorted(issue.get_comments(), key=lambda i: i.created_at, reverse=True)
+ last_comment = comments[0] if len(comments) > 0 else None
+ if last_comment is not None and last_comment.user.login != "github-actions[bot]":
+ # Opens the issue if someone other than Stalebot commented.
+ issue.edit(state="open")
+ issue.remove_from_labels("stale")
+ elif (
+ (dt.now(timezone.utc) - issue.updated_at).days > 23
+ and (dt.now(timezone.utc) - issue.created_at).days >= 30
+ and not any(label in LABELS_TO_EXEMPT for label in labels)
+ ):
+ # Post a Stalebot notification after 23 days of inactivity.
+ issue.create_comment(
+ "This issue has been automatically marked as stale because it has not had "
+ "recent activity. If you think this still needs to be addressed "
+ "please comment on this thread.\n\nPlease note that issues that do not follow the "
+ "[contributing guidelines](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md) "
+ "are likely to be ignored."
+ )
+ issue.add_to_labels("stale")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/utils/tests_fetcher.py b/utils/tests_fetcher.py
new file mode 100644
index 0000000..dfa8f90
--- /dev/null
+++ b/utils/tests_fetcher.py
@@ -0,0 +1,1128 @@
+# coding=utf-8
+# Copyright 2021 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Welcome to tests_fetcher V2.
+
+This util is designed to fetch tests to run on a PR so that only the tests impacted by the modifications are run, and
+when too many models are being impacted, only run the tests of a subset of core models. It works like this.
+
+Stage 1: Identify the modified files. For jobs that run on the main branch, it's just the diff with the last commit.
+On a PR, this takes all the files from the branching point to the current commit (so all modifications in a PR, not
+just the last commit) but excludes modifications that are on docstrings or comments only.
+
+Stage 2: Extract the tests to run. This is done by looking at the imports in each module and test file: if module A
+imports module B, then changing module B impacts module A, so the tests using module A should be run. We thus get the
+dependencies of each module and then recursively build the 'reverse' map of dependencies to get all modules and tests
+impacted by a given file. We then only keep the tests (and only the core models tests if there are too many modules).
+
+Caveats:
+ - This module only filters tests by files (not individual tests) so it's better to have tests for different things
+ in different files.
+ - This module assumes inits are just importing things, not really building objects, so it's better to structure
+ them this way and move objects building in separate submodules.
+
+Usage:
+
+Base use to fetch the tests in a pull request
+
+```bash
+python utils/tests_fetcher.py
+```
+
+Base use to fetch the tests on the main branch (with diff from the last commit):
+
+```bash
+python utils/tests_fetcher.py --diff_with_last_commit
+```
+"""
+
+import argparse
+import collections
+import json
+import os
+import re
+from contextlib import contextmanager
+from pathlib import Path
+from typing import Dict, List, Optional, Tuple, Union
+
+from git import Repo
+
+
+PATH_TO_REPO = Path(__file__).parent.parent.resolve()
+PATH_TO_EXAMPLES = PATH_TO_REPO / "examples"
+PATH_TO_DIFFUSERS = PATH_TO_REPO / "src/diffusers"
+PATH_TO_TESTS = PATH_TO_REPO / "tests"
+
+# Ignore fixtures in tests folder
+# Ignore lora since they are always tested
+MODULES_TO_IGNORE = ["fixtures", "lora"]
+
+IMPORTANT_PIPELINES = [
+ "controlnet",
+ "stable_diffusion",
+ "stable_diffusion_2",
+ "stable_diffusion_xl",
+ "stable_video_diffusion",
+ "deepfloyd_if",
+ "kandinsky",
+ "kandinsky2_2",
+ "text_to_video_synthesis",
+ "wuerstchen",
+]
+
+
+@contextmanager
+def checkout_commit(repo: Repo, commit_id: str):
+ """
+ Context manager that checks out a given commit when entered, but gets back to the reference it was at on exit.
+
+ Args:
+ repo (`git.Repo`): A git repository (for instance the Transformers repo).
+ commit_id (`str`): The commit reference to checkout inside the context manager.
+ """
+ current_head = repo.head.commit if repo.head.is_detached else repo.head.ref
+
+ try:
+ repo.git.checkout(commit_id)
+ yield
+
+ finally:
+ repo.git.checkout(current_head)
+
+
+def clean_code(content: str) -> str:
+ """
+ Remove docstrings, empty lines, and comments from some code (used to detect if a diff is real or only concerns
+ comments or docstrings).
+
+ Args:
+ content (`str`): The code to clean
+
+ Returns:
+ `str`: The cleaned code.
+ """
+ # We need to deactivate autoformatting here to write escaped triple quotes (we cannot use real triple quotes or
+ # this would mess up the result if this function applied to this particular file).
+ # fmt: off
+ # Remove docstrings by splitting on triple " then triple ':
+ splits = content.split('\"\"\"')
+ content = "".join(splits[::2])
+ splits = content.split("\'\'\'")
+ # fmt: on
+ content = "".join(splits[::2])
+
+ # Remove empty lines and comments
+ lines_to_keep = []
+ for line in content.split("\n"):
+ # remove anything that is after a # sign.
+ line = re.sub("#.*$", "", line)
+ # remove white lines
+ if len(line) != 0 and not line.isspace():
+ lines_to_keep.append(line)
+ return "\n".join(lines_to_keep)
+
+
+def keep_doc_examples_only(content: str) -> str:
+ """
+ Remove everything from the code content except the doc examples (used to determine if a diff should trigger doc
+ tests or not).
+
+ Args:
+ content (`str`): The code to clean
+
+ Returns:
+ `str`: The cleaned code.
+ """
+ # Keep doc examples only by splitting on triple "`"
+ splits = content.split("```")
+ # Add leading and trailing "```" so the navigation is easier when compared to the original input `content`
+ content = "```" + "```".join(splits[1::2]) + "```"
+
+ # Remove empty lines and comments
+ lines_to_keep = []
+ for line in content.split("\n"):
+ # remove anything that is after a # sign.
+ line = re.sub("#.*$", "", line)
+ # remove white lines
+ if len(line) != 0 and not line.isspace():
+ lines_to_keep.append(line)
+ return "\n".join(lines_to_keep)
+
+
+def get_all_tests() -> List[str]:
+ """
+ Walks the `tests` folder to return a list of files/subfolders. This is used to split the tests to run when using
+ parallelism. The split is:
+
+ - folders under `tests` (`tokenization`, `pipelines`, etc.), with the subfolder `models` excluded.
+ - folders under `tests/models`: `bert`, `gpt2`, etc.
+ - test files under `tests`: `test_modeling_common.py`, `test_tokenization_common.py`, etc.
+ """
+
+ # test folders/files directly under `tests` folder
+ tests = os.listdir(PATH_TO_TESTS)
+ tests = [f"tests/{f}" for f in tests if "__pycache__" not in f]
+ tests = sorted([f for f in tests if (PATH_TO_REPO / f).is_dir() or f.startswith("tests/test_")])
+
+ return tests
+
+
+def diff_is_docstring_only(repo: Repo, branching_point: str, filename: str) -> bool:
+ """
+ Check if the diff is only in docstrings (or comments and whitespace) in a filename.
+
+ Args:
+ repo (`git.Repo`): A git repository (for instance the Transformers repo).
+ branching_point (`str`): The commit reference of where to compare for the diff.
+ filename (`str`): The filename where we want to know if the diff is only in docstrings/comments.
+
+ Returns:
+ `bool`: Whether the diff is docstring/comments only or not.
+ """
+ folder = Path(repo.working_dir)
+ with checkout_commit(repo, branching_point):
+ with open(folder / filename, "r", encoding="utf-8") as f:
+ old_content = f.read()
+
+ with open(folder / filename, "r", encoding="utf-8") as f:
+ new_content = f.read()
+
+ old_content_clean = clean_code(old_content)
+ new_content_clean = clean_code(new_content)
+
+ return old_content_clean == new_content_clean
+
+
+def diff_contains_doc_examples(repo: Repo, branching_point: str, filename: str) -> bool:
+ """
+ Check if the diff is only in code examples of the doc in a filename.
+
+ Args:
+ repo (`git.Repo`): A git repository (for instance the Transformers repo).
+ branching_point (`str`): The commit reference of where to compare for the diff.
+ filename (`str`): The filename where we want to know if the diff is only in code examples.
+
+ Returns:
+ `bool`: Whether the diff is only in code examples of the doc or not.
+ """
+ folder = Path(repo.working_dir)
+ with checkout_commit(repo, branching_point):
+ with open(folder / filename, "r", encoding="utf-8") as f:
+ old_content = f.read()
+
+ with open(folder / filename, "r", encoding="utf-8") as f:
+ new_content = f.read()
+
+ old_content_clean = keep_doc_examples_only(old_content)
+ new_content_clean = keep_doc_examples_only(new_content)
+
+ return old_content_clean != new_content_clean
+
+
+def get_diff(repo: Repo, base_commit: str, commits: List[str]) -> List[str]:
+ """
+ Get the diff between a base commit and one or several commits.
+
+ Args:
+ repo (`git.Repo`):
+ A git repository (for instance the Transformers repo).
+ base_commit (`str`):
+ The commit reference of where to compare for the diff. This is the current commit, not the branching point!
+ commits (`List[str]`):
+ The list of commits with which to compare the repo at `base_commit` (so the branching point).
+
+ Returns:
+ `List[str]`: The list of Python files with a diff (files added, renamed or deleted are always returned, files
+ modified are returned if the diff in the file is not only in docstrings or comments, see
+ `diff_is_docstring_only`).
+ """
+ print("\n### DIFF ###\n")
+ code_diff = []
+ for commit in commits:
+ for diff_obj in commit.diff(base_commit):
+ # We always add new python files
+ if diff_obj.change_type == "A" and diff_obj.b_path.endswith(".py"):
+ code_diff.append(diff_obj.b_path)
+ # We check that deleted python files won't break corresponding tests.
+ elif diff_obj.change_type == "D" and diff_obj.a_path.endswith(".py"):
+ code_diff.append(diff_obj.a_path)
+ # Now for modified files
+ elif diff_obj.change_type in ["M", "R"] and diff_obj.b_path.endswith(".py"):
+ # In case of renames, we'll look at the tests using both the old and new name.
+ if diff_obj.a_path != diff_obj.b_path:
+ code_diff.extend([diff_obj.a_path, diff_obj.b_path])
+ else:
+ # Otherwise, we check modifications are in code and not docstrings.
+ if diff_is_docstring_only(repo, commit, diff_obj.b_path):
+ print(f"Ignoring diff in {diff_obj.b_path} as it only concerns docstrings or comments.")
+ else:
+ code_diff.append(diff_obj.a_path)
+
+ return code_diff
+
+
+def get_modified_python_files(diff_with_last_commit: bool = False) -> List[str]:
+ """
+ Return a list of python files that have been modified between:
+
+ - the current head and the main branch if `diff_with_last_commit=False` (default)
+ - the current head and its parent commit otherwise.
+
+ Returns:
+ `List[str]`: The list of Python files with a diff (files added, renamed or deleted are always returned, files
+ modified are returned if the diff in the file is not only in docstrings or comments, see
+ `diff_is_docstring_only`).
+ """
+ repo = Repo(PATH_TO_REPO)
+
+ if not diff_with_last_commit:
+ # Need to fetch refs for main using remotes when running with github actions.
+ upstream_main = repo.remotes.origin.refs.main
+
+ print(f"main is at {upstream_main.commit}")
+ print(f"Current head is at {repo.head.commit}")
+
+ branching_commits = repo.merge_base(upstream_main, repo.head)
+ for commit in branching_commits:
+ print(f"Branching commit: {commit}")
+ return get_diff(repo, repo.head.commit, branching_commits)
+ else:
+ print(f"main is at {repo.head.commit}")
+ parent_commits = repo.head.commit.parents
+ for commit in parent_commits:
+ print(f"Parent commit: {commit}")
+ return get_diff(repo, repo.head.commit, parent_commits)
+
+
+def get_diff_for_doctesting(repo: Repo, base_commit: str, commits: List[str]) -> List[str]:
+ """
+ Get the diff in doc examples between a base commit and one or several commits.
+
+ Args:
+ repo (`git.Repo`):
+ A git repository (for instance the Transformers repo).
+ base_commit (`str`):
+ The commit reference of where to compare for the diff. This is the current commit, not the branching point!
+ commits (`List[str]`):
+ The list of commits with which to compare the repo at `base_commit` (so the branching point).
+
+ Returns:
+ `List[str]`: The list of Python and Markdown files with a diff (files added or renamed are always returned, files
+ modified are returned if the diff in the file is only in doctest examples).
+ """
+ print("\n### DIFF ###\n")
+ code_diff = []
+ for commit in commits:
+ for diff_obj in commit.diff(base_commit):
+ # We only consider Python files and doc files.
+ if not diff_obj.b_path.endswith(".py") and not diff_obj.b_path.endswith(".md"):
+ continue
+ # We always add new python/md files
+ if diff_obj.change_type in ["A"]:
+ code_diff.append(diff_obj.b_path)
+ # Now for modified files
+ elif diff_obj.change_type in ["M", "R"]:
+ # In case of renames, we'll look at the tests using both the old and new name.
+ if diff_obj.a_path != diff_obj.b_path:
+ code_diff.extend([diff_obj.a_path, diff_obj.b_path])
+ else:
+ # Otherwise, we check modifications contain some doc example(s).
+ if diff_contains_doc_examples(repo, commit, diff_obj.b_path):
+ code_diff.append(diff_obj.a_path)
+ else:
+ print(f"Ignoring diff in {diff_obj.b_path} as it doesn't contain any doc example.")
+
+ return code_diff
+
+
+def get_all_doctest_files() -> List[str]:
+ """
+ Return the complete list of python and Markdown files on which we run doctest.
+
+ At this moment, we restrict this to only take files from `src/` or `docs/source/en/` that are not in `utils/not_doctested.txt`.
+
+ Returns:
+ `List[str]`: The complete list of Python and Markdown files on which we run doctest.
+ """
+ py_files = [str(x.relative_to(PATH_TO_REPO)) for x in PATH_TO_REPO.glob("**/*.py")]
+ md_files = [str(x.relative_to(PATH_TO_REPO)) for x in PATH_TO_REPO.glob("**/*.md")]
+ test_files_to_run = py_files + md_files
+
+ # only include files in `src` or `docs/source/en/`
+ test_files_to_run = [x for x in test_files_to_run if x.startswith(("src/", "docs/source/en/"))]
+ # do not include init files
+ test_files_to_run = [x for x in test_files_to_run if not x.endswith(("__init__.py",))]
+
+ # These are files not doctested yet.
+ with open("utils/not_doctested.txt") as fp:
+ not_doctested = {x.split(" ")[0] for x in fp.read().strip().split("\n")}
+
+ # So far we don't have 100% coverage for doctest. This line will be removed once we achieve 100%.
+ test_files_to_run = [x for x in test_files_to_run if x not in not_doctested]
+
+ return sorted(test_files_to_run)
+
+
+def get_new_doctest_files(repo, base_commit, branching_commit) -> List[str]:
+ """
+ Get the list of files that were removed from "utils/not_doctested.txt", between `base_commit` and
+ `branching_commit`.
+
+ Returns:
+ `List[str]`: List of files that were removed from "utils/not_doctested.txt".
+ """
+ for diff_obj in branching_commit.diff(base_commit):
+ # Ignores all but the "utils/not_doctested.txt" file.
+ if diff_obj.a_path != "utils/not_doctested.txt":
+ continue
+ # Loads the two versions
+ folder = Path(repo.working_dir)
+ with checkout_commit(repo, branching_commit):
+ with open(folder / "utils/not_doctested.txt", "r", encoding="utf-8") as f:
+ old_content = f.read()
+ with open(folder / "utils/not_doctested.txt", "r", encoding="utf-8") as f:
+ new_content = f.read()
+ # Compute the removed lines and return them
+ removed_content = {x.split(" ")[0] for x in old_content.split("\n")} - {
+ x.split(" ")[0] for x in new_content.split("\n")
+ }
+ return sorted(removed_content)
+ return []
+
+
+def get_doctest_files(diff_with_last_commit: bool = False) -> List[str]:
+ """
+ Return a list of python and Markdown files where doc examples have been modified between:
+
+ - the current head and the main branch if `diff_with_last_commit=False` (default)
+ - the current head and its parent commit otherwise.
+
+ Returns:
+ `List[str]`: The list of Python and Markdown files with a diff (files added or renamed are always returned, files
+ modified are returned if the diff in the file is only in doctest examples).
+ """
+ repo = Repo(PATH_TO_REPO)
+
+ test_files_to_run = [] # noqa
+ if not diff_with_last_commit:
+ upstream_main = repo.remotes.origin.refs.main
+ print(f"main is at {upstream_main.commit}")
+ print(f"Current head is at {repo.head.commit}")
+
+ branching_commits = repo.merge_base(upstream_main, repo.head)
+ for commit in branching_commits:
+ print(f"Branching commit: {commit}")
+ test_files_to_run = get_diff_for_doctesting(repo, repo.head.commit, branching_commits)
+ else:
+ print(f"main is at {repo.head.commit}")
+ parent_commits = repo.head.commit.parents
+ for commit in parent_commits:
+ print(f"Parent commit: {commit}")
+ test_files_to_run = get_diff_for_doctesting(repo, repo.head.commit, parent_commits)
+
+ all_test_files_to_run = get_all_doctest_files()
+
+ # Add to the test files to run any removed entry from "utils/not_doctested.txt".
+ new_test_files = get_new_doctest_files(repo, repo.head.commit, upstream_main.commit)
+ test_files_to_run = list(set(test_files_to_run + new_test_files))
+
+ # Do not run slow doctest tests on CircleCI
+ with open("utils/slow_documentation_tests.txt") as fp:
+ slow_documentation_tests = set(fp.read().strip().split("\n"))
+ test_files_to_run = [
+ x for x in test_files_to_run if x in all_test_files_to_run and x not in slow_documentation_tests
+ ]
+
+ # Make sure we did not end up with a test file that was removed
+ test_files_to_run = [f for f in test_files_to_run if (PATH_TO_REPO / f).exists()]
+
+ return sorted(test_files_to_run)
+
+
+# (?:^|\n) -> Non-capturing group for the beginning of the doc or a new line.
+# \s*from\s+(\.+\S+)\s+import\s+([^\n]+) -> Line only contains from .xxx import yyy and we catch .xxx and yyy
+# (?=\n) -> Look-ahead to a new line. We can't just put \n here or using find_all on this re will only catch every
+# other import.
+_re_single_line_relative_imports = re.compile(r"(?:^|\n)\s*from\s+(\.+\S+)\s+import\s+([^\n]+)(?=\n)")
+# (?:^|\n) -> Non-capturing group for the beginning of the doc or a new line.
+# \s*from\s+(\.+\S+)\s+import\s+\(([^\)]+)\) -> Line continues with from .xxx import (yyy) and we catch .xxx and yyy
+# yyy will take multiple lines otherwise there wouldn't be parentheses.
+_re_multi_line_relative_imports = re.compile(r"(?:^|\n)\s*from\s+(\.+\S+)\s+import\s+\(([^\)]+)\)")
+# (?:^|\n) -> Non-capturing group for the beginning of the doc or a new line.
+# \s*from\s+diffusers(\S*)\s+import\s+([^\n]+) -> Line only contains from diffusers.xxx import yyy and we catch
+# .xxx and yyy
+# (?=\n) -> Look-ahead to a new line. We can't just put \n here or using find_all on this re will only catch every
+# other import.
+_re_single_line_direct_imports = re.compile(r"(?:^|\n)\s*from\s+diffusers(\S*)\s+import\s+([^\n]+)(?=\n)")
+# (?:^|\n) -> Non-capturing group for the beginning of the doc or a new line.
+# \s*from\s+diffusers(\S*)\s+import\s+\(([^\)]+)\) -> Line continues with from diffusers.xxx import (yyy) and we
+# catch .xxx and yyy. yyy will take multiple lines otherwise there wouldn't be parentheses.
+_re_multi_line_direct_imports = re.compile(r"(?:^|\n)\s*from\s+diffusers(\S*)\s+import\s+\(([^\)]+)\)")
+
+
+def extract_imports(module_fname: str, cache: Dict[str, List[str]] = None) -> List[str]:
+ """
+ Get the imports a given module makes.
+
+ Args:
+ module_fname (`str`):
+ The name of the file of the module where we want to look at the imports (given relative to the root of
+ the repo).
+ cache (Dictionary `str` to `List[str]`, *optional*):
+ To speed up this function if it was previously called on `module_fname`, the cache of all previously
+ computed results.
+
+ Returns:
+ `List[str]`: The list of module filenames imported in the input `module_fname` (a submodule we import from that
+ is a subfolder will give its init file).
+ """
+ if cache is not None and module_fname in cache:
+ return cache[module_fname]
+
+ with open(PATH_TO_REPO / module_fname, "r", encoding="utf-8") as f:
+ content = f.read()
+
+ # Filter out all docstrings to not get imports in code examples. As before we need to deactivate formatting to
+ # keep this as escaped quotes and avoid this function failing on this file.
+ # fmt: off
+ splits = content.split('\"\"\"')
+ # fmt: on
+ content = "".join(splits[::2])
+
+ module_parts = str(module_fname).split(os.path.sep)
+ imported_modules = []
+
+ # Let's start with relative imports
+ relative_imports = _re_single_line_relative_imports.findall(content)
+ relative_imports = [
+ (mod, imp) for mod, imp in relative_imports if "# tests_ignore" not in imp and imp.strip() != "("
+ ]
+ multiline_relative_imports = _re_multi_line_relative_imports.findall(content)
+ relative_imports += [(mod, imp) for mod, imp in multiline_relative_imports if "# tests_ignore" not in imp]
+
+ # We need to remove parts of the module name depending on the depth of the relative imports.
+ for module, imports in relative_imports:
+ level = 0
+ while module.startswith("."):
+ module = module[1:]
+ level += 1
+
+ if len(module) > 0:
+ dep_parts = module_parts[: len(module_parts) - level] + module.split(".")
+ else:
+ dep_parts = module_parts[: len(module_parts) - level]
+ imported_module = os.path.sep.join(dep_parts)
+ imported_modules.append((imported_module, [imp.strip() for imp in imports.split(",")]))
+
+ # Let's continue with direct imports
+ direct_imports = _re_single_line_direct_imports.findall(content)
+ direct_imports = [(mod, imp) for mod, imp in direct_imports if "# tests_ignore" not in imp and imp.strip() != "("]
+ multiline_direct_imports = _re_multi_line_direct_imports.findall(content)
+ direct_imports += [(mod, imp) for mod, imp in multiline_direct_imports if "# tests_ignore" not in imp]
+
+ # We need to find the relative path of those imports.
+ for module, imports in direct_imports:
+ import_parts = module.split(".")[1:] # ignore the name of the repo since we add it below.
+ dep_parts = ["src", "diffusers"] + import_parts
+ imported_module = os.path.sep.join(dep_parts)
+ imported_modules.append((imported_module, [imp.strip() for imp in imports.split(",")]))
+
+ result = []
+ # Double check we get proper modules (either a python file or a folder with an init).
+ for module_file, imports in imported_modules:
+ if (PATH_TO_REPO / f"{module_file}.py").is_file():
+ module_file = f"{module_file}.py"
+ elif (PATH_TO_REPO / module_file).is_dir() and (PATH_TO_REPO / module_file / "__init__.py").is_file():
+ module_file = os.path.sep.join([module_file, "__init__.py"])
+ imports = [imp for imp in imports if len(imp) > 0 and re.match("^[A-Za-z0-9_]*$", imp)]
+ if len(imports) > 0:
+ result.append((module_file, imports))
+
+ if cache is not None:
+ cache[module_fname] = result
+
+ return result
+
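+# Illustrative sketch of `extract_imports` (not executed; the exact paths depend on the repository layout):
+# assuming `src/diffusers/models/unet_2d.py` contains `from ..configuration_utils import ConfigMixin`,
+# `extract_imports("src/diffusers/models/unet_2d.py")` would include an entry like
+# `("src/diffusers/configuration_utils.py", ["ConfigMixin"])`.
+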
+
+def get_module_dependencies(module_fname: str, cache: Dict[str, List[str]] = None) -> List[str]:
+ """
+    Refines the result of `extract_imports` to remove subfolders and get a proper list of module filenames: if a file
+    has an import `from utils import Foo, Bar`, with `utils` being a subfolder containing many files, this will traverse
+    the `utils` init file to check where those dependencies come from: for instance the files utils/foo.py and utils/bar.py.
+
+ Warning: This presupposes that all intermediate inits are properly built (with imports from the respective
+    submodules) and works better if objects are defined in submodules and not the intermediate init (otherwise the
+ intermediate init is added, and inits usually have a lot of dependencies).
+
+ Args:
+ module_fname (`str`):
+ The name of the file of the module where we want to look at the imports (given relative to the root of
+ the repo).
+ cache (Dictionary `str` to `List[str]`, *optional*):
+ To speed up this function if it was previously called on `module_fname`, the cache of all previously
+ computed results.
+
+ Returns:
+ `List[str]`: The list of module filenames imported in the input `module_fname` (with submodule imports refined).
+ """
+ dependencies = []
+ imported_modules = extract_imports(module_fname, cache=cache)
+ # The while loop is to recursively traverse all inits we may encounter: we will add things as we go.
+ while len(imported_modules) > 0:
+ new_modules = []
+ for module, imports in imported_modules:
+ # If we end up in an __init__ we are often not actually importing from this init (except in the case where
+ # the object is fully defined in the __init__)
+ if module.endswith("__init__.py"):
+ # So we get the imports from that init then try to find where our objects come from.
+ new_imported_modules = extract_imports(module, cache=cache)
+ for new_module, new_imports in new_imported_modules:
+ if any(i in new_imports for i in imports):
+ if new_module not in dependencies:
+ new_modules.append((new_module, [i for i in new_imports if i in imports]))
+ imports = [i for i in imports if i not in new_imports]
+ if len(imports) > 0:
+                    # If there are any objects left, they may be submodules (i.e. files named after them).
+ path_to_module = PATH_TO_REPO / module.replace("__init__.py", "")
+ dependencies.extend(
+ [
+ os.path.join(module.replace("__init__.py", ""), f"{i}.py")
+ for i in imports
+ if (path_to_module / f"{i}.py").is_file()
+ ]
+ )
+ imports = [i for i in imports if not (path_to_module / f"{i}.py").is_file()]
+ if len(imports) > 0:
+ # Then if there are still objects left, they are fully defined in the init, so we keep it as a
+ # dependency.
+ dependencies.append(module)
+ else:
+ dependencies.append(module)
+
+ imported_modules = new_modules
+
+ return dependencies
+
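+# Illustrative sketch of `get_module_dependencies` (not executed, hypothetical import): for a file containing
+# `from ..models import UNet2DModel`, `extract_imports` points at `src/diffusers/models/__init__.py`; this
+# function then follows that init to find the submodule actually defining `UNet2DModel` (something like
+# `src/diffusers/models/unet_2d.py`), and returns that file instead of the init.
+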
+
+def create_reverse_dependency_tree() -> List[Tuple[str, str]]:
+ """
+    Create the list of all edges (a, b), meaning that modifying a impacts b, with a and b ranging over all module and
+    test files.
+ """
+ cache = {}
+ all_modules = list(PATH_TO_DIFFUSERS.glob("**/*.py")) + list(PATH_TO_TESTS.glob("**/*.py"))
+ all_modules = [str(mod.relative_to(PATH_TO_REPO)) for mod in all_modules]
+ edges = [(dep, mod) for mod in all_modules for dep in get_module_dependencies(mod, cache=cache)]
+
+ return list(set(edges))
+
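+# Illustrative sketch of the edges built by `create_reverse_dependency_tree` (hypothetical paths): an edge
+# like ("src/diffusers/schedulers/scheduling_ddpm.py", "tests/schedulers/test_schedulers.py") means that
+# modifying the first file impacts the second one.
+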
+
+def get_tree_starting_at(module: str, edges: List[Tuple[str, str]]) -> List[Union[str, List[str]]]:
+ """
+ Returns the tree starting at a given module following all edges.
+
+ Args:
+ module (`str`): The module that will be the root of the subtree we want.
+        edges (`List[Tuple[str, str]]`): The list of all edges of the tree.
+
+ Returns:
+ `List[Union[str, List[str]]]`: The tree to print in the following format: [module, [list of edges
+ starting at module], [list of edges starting at the preceding level], ...]
+ """
+ vertices_seen = [module]
+ new_edges = [edge for edge in edges if edge[0] == module and edge[1] != module and "__init__.py" not in edge[1]]
+ tree = [module]
+ while len(new_edges) > 0:
+ tree.append(new_edges)
+ final_vertices = list({edge[1] for edge in new_edges})
+ vertices_seen.extend(final_vertices)
+ new_edges = [
+ edge
+ for edge in edges
+ if edge[0] in final_vertices and edge[1] not in vertices_seen and "__init__.py" not in edge[1]
+ ]
+
+ return tree
+
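+# Illustrative sketch of the structure returned by `get_tree_starting_at` (hypothetical filenames):
+#   ["a.py", [("a.py", "b.py"), ("a.py", "c.py")], [("b.py", "d.py")]]
+# i.e. the root module followed by one list of edges per level of the traversal.
+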
+
+def print_tree_deps_of(module, all_edges=None):
+ """
+ Prints the tree of modules depending on a given module.
+
+ Args:
+ module (`str`): The module that will be the root of the subtree we want.
+        all_edges (`List[Tuple[str, str]]`, *optional*):
+ The list of all edges of the tree. Will be set to `create_reverse_dependency_tree()` if not passed.
+ """
+ if all_edges is None:
+ all_edges = create_reverse_dependency_tree()
+ tree = get_tree_starting_at(module, all_edges)
+
+ # The list of lines is a list of tuples (line_to_be_printed, module)
+    # Keeping the modules lets us know where to insert each new line in the list.
+ lines = [(tree[0], tree[0])]
+ for index in range(1, len(tree)):
+ edges = tree[index]
+ start_edges = {edge[0] for edge in edges}
+
+ for start in start_edges:
+ end_edges = {edge[1] for edge in edges if edge[0] == start}
+ # We will insert all those edges just after the line showing start.
+ pos = 0
+ while lines[pos][1] != start:
+ pos += 1
+ lines = lines[: pos + 1] + [(" " * (2 * index) + end, end) for end in end_edges] + lines[pos + 1 :]
+
+ for line in lines:
+        # We don't print the refs that were only there to help build the lines.
+ print(line[0])
+
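+# Illustrative sketch of the output printed by `print_tree_deps_of` for the hypothetical tree above:
+#   a.py
+#     b.py
+#       d.py
+#     c.py
+# with one extra level of indentation per level of the dependency tree.
+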
+
+def init_test_examples_dependencies() -> Tuple[Dict[str, List[str]], List[str]]:
+ """
+    The test examples do not import from the examples (which are just scripts, not modules) so we need some extra
+    care when initializing the dependency map, which is the goal of this function. It initializes the dependency map
+    for example files by linking each example to the example test file for its framework.
+
+ Returns:
+ `Tuple[Dict[str, List[str]], List[str]]`: A tuple with two elements: the initialized dependency map which is a
+ dict test example file to list of example files potentially tested by that test file, and the list of all
+ example files (to avoid recomputing it later).
+ """
+ test_example_deps = {}
+ all_examples = []
+ for framework in ["flax", "pytorch", "tensorflow"]:
+ test_files = list((PATH_TO_EXAMPLES / framework).glob("test_*.py"))
+ all_examples.extend(test_files)
+        # Remove the files at the root of examples/framework since they are not proper examples (they are either
+        # utils or example test files).
+ examples = [
+ f for f in (PATH_TO_EXAMPLES / framework).glob("**/*.py") if f.parent != PATH_TO_EXAMPLES / framework
+ ]
+ all_examples.extend(examples)
+ for test_file in test_files:
+ with open(test_file, "r", encoding="utf-8") as f:
+ content = f.read()
+ # Map all examples to the test files found in examples/framework.
+ test_example_deps[str(test_file.relative_to(PATH_TO_REPO))] = [
+ str(e.relative_to(PATH_TO_REPO)) for e in examples if e.name in content
+ ]
+ # Also map the test files to themselves.
+ test_example_deps[str(test_file.relative_to(PATH_TO_REPO))].append(
+ str(test_file.relative_to(PATH_TO_REPO))
+ )
+ return test_example_deps, all_examples
+
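+# Illustrative sketch of `init_test_examples_dependencies` with hypothetical filenames: if
+# `examples/pytorch/test_examples.py` mentions `train_demo.py`, the returned map would contain an entry like
+#   {"examples/pytorch/test_examples.py": ["examples/pytorch/demo/train_demo.py", "examples/pytorch/test_examples.py"]}
+# and both files would appear in the returned list of examples.
+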
+
+def create_reverse_dependency_map() -> Dict[str, List[str]]:
+ """
+ Create the dependency map from module/test filename to the list of modules/tests that depend on it recursively.
+
+ Returns:
+ `Dict[str, List[str]]`: The reverse dependency map as a dictionary mapping filenames to all the filenames
+ depending on it recursively. This way the tests impacted by a change in file A are the test files in the list
+ corresponding to key A in this result.
+ """
+ cache = {}
+ # Start from the example deps init.
+ example_deps, examples = init_test_examples_dependencies()
+    # Gather all module files, test files and example files.
+ all_modules = list(PATH_TO_DIFFUSERS.glob("**/*.py")) + list(PATH_TO_TESTS.glob("**/*.py")) + examples
+ all_modules = [str(mod.relative_to(PATH_TO_REPO)) for mod in all_modules]
+ # Compute the direct dependencies of all modules.
+ direct_deps = {m: get_module_dependencies(m, cache=cache) for m in all_modules}
+ direct_deps.update(example_deps)
+
+    # Recursively expand the dependencies until nothing changes.
+ something_changed = True
+ while something_changed:
+ something_changed = False
+ for m in all_modules:
+ for d in direct_deps[m]:
+                # We stop recursing at an init (because we always end up in the main init and we don't want to add
+                # all the files which the main init imports).
+ if d.endswith("__init__.py"):
+ continue
+ if d not in direct_deps:
+ raise ValueError(f"KeyError:{d}. From {m}")
+ new_deps = set(direct_deps[d]) - set(direct_deps[m])
+ if len(new_deps) > 0:
+ direct_deps[m].extend(list(new_deps))
+ something_changed = True
+
+ # Finally we can build the reverse map.
+ reverse_map = collections.defaultdict(list)
+ for m in all_modules:
+ for d in direct_deps[m]:
+ reverse_map[d].append(m)
+
+ # For inits, we don't do the reverse deps but the direct deps: if modifying an init, we want to make sure we test
+ # all the modules impacted by that init.
+ for m in [f for f in all_modules if f.endswith("__init__.py")]:
+ direct_deps = get_module_dependencies(m, cache=cache)
+ deps = sum([reverse_map[d] for d in direct_deps if not d.endswith("__init__.py")], direct_deps)
+ reverse_map[m] = list(set(deps) - {m})
+
+ return reverse_map
+
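+# Illustrative sketch of the map built by `create_reverse_dependency_map` (hypothetical paths): an entry like
+# reverse_map["src/diffusers/schedulers/scheduling_ddpm.py"] lists every module, test and example file that
+# recursively depends on that scheduler, i.e. the files to re-check when it is modified.
+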
+
+def create_module_to_test_map(reverse_map: Dict[str, List[str]] = None) -> Dict[str, List[str]]:
+ """
+    Extracts the tests from the reverse dependency map.
+
+ Args:
+ reverse_map (`Dict[str, List[str]]`, *optional*):
+ The reverse dependency map as created by `create_reverse_dependency_map`. Will default to the result of
+ that function if not provided.
+
+ Returns:
+ `Dict[str, List[str]]`: A dictionary that maps each file to the tests to execute if that file was modified.
+ """
+ if reverse_map is None:
+ reverse_map = create_reverse_dependency_map()
+
+ # Utility that tells us if a given file is a test (taking test examples into account)
+ def is_test(fname):
+ if fname.startswith("tests"):
+ return True
+ if fname.startswith("examples") and fname.split(os.path.sep)[-1].startswith("test"):
+ return True
+ return False
+
+ # Build the test map
+ test_map = {module: [f for f in deps if is_test(f)] for module, deps in reverse_map.items()}
+
+ return test_map
+
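+# Illustrative sketch of `create_module_to_test_map` (hypothetical paths): the entry for
+# "src/diffusers/models/unet_2d.py" would keep only the test files among its reverse dependencies, e.g.
+# something like ["tests/models/test_models_unet_2d.py", "tests/pipelines/test_pipelines_common.py"].
+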
+
+def check_imports_all_exist():
+ """
+    Isn't used per se by the test fetcher but might be used later as a quality check. Putting this here for now so the
+    code is not lost. This checks that all imports in a given file actually exist.
+ """
+ cache = {}
+ all_modules = list(PATH_TO_DIFFUSERS.glob("**/*.py")) + list(PATH_TO_TESTS.glob("**/*.py"))
+ all_modules = [str(mod.relative_to(PATH_TO_REPO)) for mod in all_modules]
+ direct_deps = {m: get_module_dependencies(m, cache=cache) for m in all_modules}
+
+ for module, deps in direct_deps.items():
+ for dep in deps:
+ if not (PATH_TO_REPO / dep).is_file():
+ print(f"{module} has dependency on {dep} which does not exist.")
+
+
+def _print_list(l) -> str:
+ """
+ Pretty print a list of elements with one line per element and a - starting each line.
+ """
+ return "\n".join([f"- {f}" for f in l])
+
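+# Quick illustration of `_print_list`: `_print_list(["a.py", "b.py"])` returns "- a.py\n- b.py".
+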
+
+def update_test_map_with_core_pipelines(json_output_file: str):
+ print(f"\n### ADD CORE PIPELINE TESTS ###\n{_print_list(IMPORTANT_PIPELINES)}")
+ with open(json_output_file, "rb") as fp:
+ test_map = json.load(fp)
+
+ # Add core pipelines as their own test group
+ test_map["core_pipelines"] = " ".join(
+ sorted([str(PATH_TO_TESTS / f"pipelines/{pipe}") for pipe in IMPORTANT_PIPELINES])
+ )
+
+    # If there are no existing pipeline tests, save the map and return early.
+    if "pipelines" not in test_map:
+        with open(json_output_file, "w", encoding="UTF-8") as fp:
+            json.dump(test_map, fp, ensure_ascii=False)
+        return
+
+ pipeline_tests = test_map.pop("pipelines")
+ pipeline_tests = pipeline_tests.split(" ")
+
+ # Remove core pipeline tests from the fetched pipeline tests
+ updated_pipeline_tests = []
+ for pipe in pipeline_tests:
+ if pipe == "tests/pipelines" or Path(pipe).parts[2] in IMPORTANT_PIPELINES:
+ continue
+ updated_pipeline_tests.append(pipe)
+
+ if len(updated_pipeline_tests) > 0:
+ test_map["pipelines"] = " ".join(sorted(updated_pipeline_tests))
+
+ with open(json_output_file, "w", encoding="UTF-8") as fp:
+ json.dump(test_map, fp, ensure_ascii=False)
+
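+# Illustrative sketch of `update_test_map_with_core_pipelines` (hypothetical content): if the fetched map had
+# {"pipelines": "tests/pipelines/stable_diffusion tests/pipelines/ddpm"} and `IMPORTANT_PIPELINES` contained
+# "stable_diffusion", the updated map would keep "tests/pipelines/ddpm" under "pipelines" and move the
+# stable_diffusion tests into their own "core_pipelines" group.
+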
+
+def create_json_map(test_files_to_run: List[str], json_output_file: Optional[str] = None):
+ """
+    Creates a map from a list of tests to run so they can easily be split by category, e.g. when running the slow
+    tests in parallel.
+
+ Args:
+ test_files_to_run (`List[str]`): The list of tests to run.
+ json_output_file (`str`): The path where to store the built json map.
+ """
+ if json_output_file is None:
+ return
+
+ test_map = {}
+ for test_file in test_files_to_run:
+        # `test_file` is a path to a test folder/file, starting with `tests/`. For example,
+        # - `tests/pipelines/stable_diffusion/test_stable_diffusion.py` or `tests/pipelines`
+        # - `tests/schedulers/test_scheduler_ddpm.py` or `tests/schedulers`
+        # - a common test file directly under `tests/`
+ names = test_file.split(os.path.sep)
+ module = names[1]
+ if module in MODULES_TO_IGNORE:
+ continue
+
+ if len(names) > 2 or not test_file.endswith(".py"):
+ # test folders under `tests` or python files under them
+            # take the part like `models`, `pipelines`, etc. as the test category
+ key = os.path.sep.join(names[1:2])
+ else:
+ # common test files directly under `tests/`
+ key = "common"
+
+ if key not in test_map:
+ test_map[key] = []
+ test_map[key].append(test_file)
+
+ # sort the keys & values
+ keys = sorted(test_map.keys())
+ test_map = {k: " ".join(sorted(test_map[k])) for k in keys}
+
+ with open(json_output_file, "w", encoding="UTF-8") as fp:
+ json.dump(test_map, fp, ensure_ascii=False)
+
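+# Illustrative sketch of `create_json_map` (hypothetical paths): ["tests/models/test_models_unet_2d.py",
+# "tests/test_config.py"] would be written as {"common": "tests/test_config.py",
+# "models": "tests/models/test_models_unet_2d.py"}, assuming neither category is in `MODULES_TO_IGNORE`.
+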
+
+def infer_tests_to_run(
+ output_file: str,
+ diff_with_last_commit: bool = False,
+ json_output_file: Optional[str] = None,
+):
+ """
+ The main function called by the test fetcher. Determines the tests to run from the diff.
+
+ Args:
+ output_file (`str`):
+            The path where to store the summary of the test fetcher analysis. One other file will be stored in the
+            same folder:
+
+            - examples_test_list.txt: The list of examples tests to run.
+
+ diff_with_last_commit (`bool`, *optional*, defaults to `False`):
+ Whether to analyze the diff with the last commit (for use on the main branch after a PR is merged) or with
+ the branching point from main (for use on each PR).
+ json_output_file (`str`, *optional*):
+ The path where to store the json file mapping categories of tests to tests to run (used for parallelism or
+ the slow tests).
+ """
+ modified_files = get_modified_python_files(diff_with_last_commit=diff_with_last_commit)
+ print(f"\n### MODIFIED FILES ###\n{_print_list(modified_files)}")
+ # Create the map that will give us all impacted modules.
+ reverse_map = create_reverse_dependency_map()
+ impacted_files = modified_files.copy()
+ for f in modified_files:
+ if f in reverse_map:
+ impacted_files.extend(reverse_map[f])
+
+ # Remove duplicates
+ impacted_files = sorted(set(impacted_files))
+ print(f"\n### IMPACTED FILES ###\n{_print_list(impacted_files)}")
+
+ # Grab the corresponding test files:
+ if any(x in modified_files for x in ["setup.py"]):
+ test_files_to_run = ["tests", "examples"]
+
+ # in order to trigger pipeline tests even if no code change at all
+ if "tests/utils/tiny_model_summary.json" in modified_files:
+ test_files_to_run = ["tests"]
+ any(f.split(os.path.sep)[0] == "utils" for f in modified_files)
+ else:
+ # All modified tests need to be run.
+ test_files_to_run = [
+ f for f in modified_files if f.startswith("tests") and f.split(os.path.sep)[-1].startswith("test")
+ ]
+ # Then we grab the corresponding test files.
+ test_map = create_module_to_test_map(reverse_map=reverse_map)
+ for f in modified_files:
+ if f in test_map:
+ test_files_to_run.extend(test_map[f])
+ test_files_to_run = sorted(set(test_files_to_run))
+ # Make sure we did not end up with a test file that was removed
+ test_files_to_run = [f for f in test_files_to_run if (PATH_TO_REPO / f).exists()]
+
+ any(f.split(os.path.sep)[0] == "utils" for f in modified_files)
+
+ examples_tests_to_run = [f for f in test_files_to_run if f.startswith("examples")]
+ test_files_to_run = [f for f in test_files_to_run if not f.startswith("examples")]
+ print(f"\n### TEST TO RUN ###\n{_print_list(test_files_to_run)}")
+ if len(test_files_to_run) > 0:
+ with open(output_file, "w", encoding="utf-8") as f:
+ f.write(" ".join(test_files_to_run))
+
+    # Create a map that maps test categories to test files, i.e. `pipelines` -> [...test files under tests/pipelines...]
+
+    # Get all test directories (and some common test files) under `tests` if `test_files_to_run`
+    # contains `tests` (i.e. when `setup.py` is changed).
+ if "tests" in test_files_to_run:
+ test_files_to_run = get_all_tests()
+
+ create_json_map(test_files_to_run, json_output_file)
+
+ print(f"\n### EXAMPLES TEST TO RUN ###\n{_print_list(examples_tests_to_run)}")
+ if len(examples_tests_to_run) > 0:
+ # We use `all` in the case `commit_flags["test_all"]` as well as in `create_circleci_config.py` for processing
+ if examples_tests_to_run == ["examples"]:
+ examples_tests_to_run = ["all"]
+ example_file = Path(output_file).parent / "examples_test_list.txt"
+ with open(example_file, "w", encoding="utf-8") as f:
+ f.write(" ".join(examples_tests_to_run))
+
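+# Illustrative sketch of calling `infer_tests_to_run` directly (the `__main__` block below does this for CI):
+#   infer_tests_to_run("test_list.txt", diff_with_last_commit=False, json_output_file="test_map.json")
+# writes the space-separated list of tests to `test_list.txt` (when there is anything to run), the examples
+# tests to `examples_test_list.txt` next to it, and the per-category map to `test_map.json`.
+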
+
+def filter_tests(output_file: str, filters: List[str]):
+ """
+ Reads the content of the output file and filters out all the tests in a list of given folders.
+
+ Args:
+ output_file (`str` or `os.PathLike`): The path to the output file of the tests fetcher.
+ filters (`List[str]`): A list of folders to filter.
+ """
+ if not os.path.isfile(output_file):
+ print("No test file found.")
+ return
+ with open(output_file, "r", encoding="utf-8") as f:
+ test_files = f.read().split(" ")
+
+ if len(test_files) == 0 or test_files == [""]:
+ print("No tests to filter.")
+ return
+
+ if test_files == ["tests"]:
+ test_files = [os.path.join("tests", f) for f in os.listdir("tests") if f not in ["__init__.py"] + filters]
+ else:
+ test_files = [f for f in test_files if f.split(os.path.sep)[1] not in filters]
+
+ with open(output_file, "w", encoding="utf-8") as f:
+ f.write(" ".join(test_files))
+
+
+def parse_commit_message(commit_message: str) -> Dict[str, bool]:
+ """
+ Parses the commit message to detect if a command is there to skip, force all or part of the CI.
+
+ Args:
+ commit_message (`str`): The commit message of the current commit.
+
+ Returns:
+        `Dict[str, bool]`: A dictionary of strings to bools with the following keys: `"skip"`, `"no_filter"` and
+        `"test_all"`.
+ """
+ if commit_message is None:
+ return {"skip": False, "no_filter": False, "test_all": False}
+
+ command_search = re.search(r"\[([^\]]*)\]", commit_message)
+ if command_search is not None:
+ command = command_search.groups()[0]
+ command = command.lower().replace("-", " ").replace("_", " ")
+ skip = command in ["ci skip", "skip ci", "circleci skip", "skip circleci"]
+ no_filter = set(command.split(" ")) == {"no", "filter"}
+ test_all = set(command.split(" ")) == {"test", "all"}
+ return {"skip": skip, "no_filter": no_filter, "test_all": test_all}
+ else:
+ return {"skip": False, "no_filter": False, "test_all": False}
+
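+# For instance, `parse_commit_message("Fix typo [skip ci]")` returns
+# {"skip": True, "no_filter": False, "test_all": False}, while a message containing "[test all]" sets
+# "test_all" to True.
+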
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument(
+ "--output_file", type=str, default="test_list.txt", help="Where to store the list of tests to run"
+ )
+ parser.add_argument(
+ "--json_output_file",
+ type=str,
+ default="test_map.json",
+ help="Where to store the tests to run in a dictionary format mapping test categories to test files",
+ )
+ parser.add_argument(
+ "--diff_with_last_commit",
+ action="store_true",
+ help="To fetch the tests between the current commit and the last commit",
+ )
+ parser.add_argument(
+ "--filter_tests",
+ action="store_true",
+ help="Will filter the pipeline/repo utils tests outside of the generated list of tests.",
+ )
+ parser.add_argument(
+ "--print_dependencies_of",
+ type=str,
+ help="Will only print the tree of modules depending on the file passed.",
+ default=None,
+ )
+ parser.add_argument(
+ "--commit_message",
+ type=str,
+ help="The commit message (which could contain a command to force all tests or skip the CI).",
+ default=None,
+ )
+ args = parser.parse_args()
+ if args.print_dependencies_of is not None:
+ print_tree_deps_of(args.print_dependencies_of)
+ else:
+ repo = Repo(PATH_TO_REPO)
+ commit_message = repo.head.commit.message
+ commit_flags = parse_commit_message(commit_message)
+ if commit_flags["skip"]:
+ print("Force-skipping the CI")
+ quit()
+ if commit_flags["no_filter"]:
+ print("Running all tests fetched without filtering.")
+ if commit_flags["test_all"]:
+ print("Force-launching all tests")
+
+ diff_with_last_commit = args.diff_with_last_commit
+ if not diff_with_last_commit and not repo.head.is_detached and repo.head.ref == repo.refs.main:
+ print("main branch detected, fetching tests against last commit.")
+ diff_with_last_commit = True
+
+ if not commit_flags["test_all"]:
+ try:
+ infer_tests_to_run(
+ args.output_file,
+ diff_with_last_commit=diff_with_last_commit,
+ json_output_file=args.json_output_file,
+ )
+ filter_tests(args.output_file, ["repo_utils"])
+ update_test_map_with_core_pipelines(json_output_file=args.json_output_file)
+
+ except Exception as e:
+ print(f"\nError when trying to grab the relevant tests: {e}\n\nRunning all tests.")
+ commit_flags["test_all"] = True
+
+ if commit_flags["test_all"]:
+ with open(args.output_file, "w", encoding="utf-8") as f:
+ f.write("tests")
+ example_file = Path(args.output_file).parent / "examples_test_list.txt"
+ with open(example_file, "w", encoding="utf-8") as f:
+ f.write("all")
+
+ test_files_to_run = get_all_tests()
+ create_json_map(test_files_to_run, args.json_output_file)
+ update_test_map_with_core_pipelines(json_output_file=args.json_output_file)