diff --git a/README.md b/README.md index a17cf11f5..e9034d35e 100644 --- a/README.md +++ b/README.md @@ -83,62 +83,62 @@ The GUI allows you to set the training parameters and generate and run the requi ### About SDXL training -The feature of SDXL training is now available in sdxl branch as an experimental feature. +The feature of SDXL training is now available in sdxl branch as an experimental feature. -Sep 3, 2023: The feature will be merged into the main branch soon. Following are the changes from the previous version. +Sep 3, 2023: The feature will be merged into the main branch soon. Following are the changes from the previous version. - ControlNet-LLLite is added. See [documentation](./docs/train_lllite_README.md) for details. -- JPEG XL is supported. [#786](https://github.com/kohya-ss/sd-scripts/pull/786) +- JPEG XL is supported. [#786](https://github.com/kohya-ss/sd-scripts/pull/786) - Peak memory usage is reduced. [#791](https://github.com/kohya-ss/sd-scripts/pull/791) - Input perturbation noise is added. See [#798](https://github.com/kohya-ss/sd-scripts/pull/798) for details. - Dataset subset now has `caption_prefix` and `caption_suffix` options. The strings are added to the beginning and the end of the captions before shuffling. You can specify the options in `.toml`. - Other minor changes. - Thanks for contributions from Isotr0py, vvern999, lansing and others! -Aug 13, 2023: +Aug 13, 2023: - LoRA-FA is added experimentally. Specify `--network_module networks.lora_fa` option instead of `--network_module networks.lora`. The trained model can be used as a normal LoRA model. -Aug 12, 2023: +Aug 12, 2023: - The default value of noise offset when omitted has been changed to 0 from 0.0357. - The different learning rates for each U-Net block are now supported. Specify with `--block_lr` option. Specify 23 values separated by commas like `--block_lr 1e-3,1e-3 ... 1e-3`. 
- 23 values correspond to `0: time/label embed, 1-9: input blocks 0-8, 10-12: mid blocks 0-2, 13-21: output blocks 0-8, 22: out`. -Aug 6, 2023: +Aug 6, 2023: - [SAI Model Spec](https://github.com/Stability-AI/ModelSpec) metadata is now supported partially. `hash_sha256` is not supported yet. - - The main items are set automatically. + - The main items are set automatically. - You can set title, author, description, license and tags with `--metadata_xxx` options in each training script. - Merging scripts also support minimum SAI Model Spec metadata. See the help message for the usage. - Metadata editor will be available soon. - SDXL LoRA has `sdxl_base_v1-0` now for `ss_base_model_version` metadata item, instead of `v0-9`. -Aug 4, 2023: +Aug 4, 2023: -- `bitsandbytes` is now optional. Please install it if you want to use it. The insructions are in the later section. -- `albumentations` is not required anymore. +- `bitsandbytes` is now optional. Please install it if you want to use it. The instructions are in the later section. +- `albumentations` is not required any more. - An issue for pooled output for Textual Inversion training is fixed. - `--v_pred_like_loss ratio` option is added. This option adds the loss like v-prediction loss in SDXL training. `0.1` means that the loss is added 10% of the v-prediction loss. The default value is None (disabled). - In v-prediction, the loss is higher in the early timesteps (near the noise). This option can be used to increase the loss in the early timesteps. - Arbitrary options can be used for Diffusers' schedulers. For example `--lr_scheduler_args "lr_end=1e-8"`. - `sdxl_gen_imgs.py` supports batch size > 1. -- Fix ControlNet to work with attention couple and reginal LoRA in `gen_img_diffusers.py`. +- Fix ControlNet to work with attention couple and regional LoRA in `gen_img_diffusers.py`. Summary of the feature: -- `tools/cache_latents.py` is added. This script can be used to cache the latents to disk in advance. 
+- `tools/cache_latents.py` is added. This script can be used to cache the latents to disk in advance. - The options are almost the same as `sdxl_train.py'. See the help message for the usage. - Please launch the script as follows: `accelerate launch --num_cpu_threads_per_process 1 tools/cache_latents.py ...` - This script should work with multi-GPU, but it is not tested in my environment. -- `tools/cache_text_encoder_outputs.py` is added. This script can be used to cache the text encoder outputs to disk in advance. +- `tools/cache_text_encoder_outputs.py` is added. This script can be used to cache the text encoder outputs to disk in advance. - The options are almost the same as `cache_latents.py' and `sdxl_train.py'. See the help message for the usage. - `sdxl_train.py` is a script for SDXL fine-tuning. The usage is almost the same as `fine_tune.py`, but it also supports DreamBooth dataset. - `--full_bf16` option is added. Thanks to KohakuBlueleaf! - - This option enables the full bfloat16 training (includes gradients). This option is useful to reduce the GPU memory usage. + - This option enables the full bfloat16 training (includes gradients). This option is useful to reduce the GPU memory usage. - However, bitsandbytes==0.35 doesn't seem to support this. Please use a newer version of bitsandbytes or another optimizer. - I cannot find bitsandbytes>0.35.0 that works correctly on Windows. - In addition, the full bfloat16 training might be unstable. Please use it at your own risk. @@ -159,11 +159,11 @@ Summary of the feature: 1. Training with captions. All captions must include the token string. The token string is replaced with multiple tokens. 2. Use `--use_object_template` or `--use_style_template` option. The captions are generated from the template. The existing captions are ignored. - See below for the format of the embeddings. - + - `sdxl_gen_img.py` is added. This script can be used to generate images with SDXL, including LoRA. 
See the help message for the usage. - Textual Inversion is supported, but the name for the embeds in the caption becomes alphabet only. For example, `neg_hand_v1.safetensors` can be activated with `neghandv`. -`requirements.txt` is updated to support SDXL training. +`requirements.txt` is updated to support SDXL training. #### Tips for SDXL training @@ -184,6 +184,7 @@ Summary of the feature: - `--bucket_reso_steps` can be set to 32 instead of the default value 64. Smaller values than 32 will not work for SDXL training. Example of the optimizer settings for Adafactor with the fixed learning rate: + ```toml optimizer_type = "adafactor" optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False" ] @@ -204,7 +205,6 @@ I would like to express my gratitude to camendutu for their valuable contributio | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------- | | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/camenduru/kohya_ss-colab/blob/main/kohya_ss_colab.ipynb) | kohya_ss_gui_colab | - ## Installation ### Windows @@ -227,17 +227,17 @@ To set up the project, follow these steps: 1. Open a terminal and navigate to the desired installation directory. 2. Clone the repository by running the following command: - ``` + ```shell git clone https://github.com/bmaltais/kohya_ss.git ``` 3. Change into the `kohya_ss` directory: - ``` + ```shell cd kohya_ss ``` 4. Run the setup script by executing the following command: - ``` + ```shell .\setup.bat ``` @@ -260,7 +260,7 @@ Please note that the CUDNN 8.6 DLLs needed for this process cannot be hosted on To install the necessary dependencies on a Linux system, ensure that you fulfill the following requirements: - Ensure that `venv` support is pre-installed. 
You can install it on Ubuntu 22.04 using the command: - ``` + ```shell apt install python3.10-venv ``` @@ -269,7 +269,7 @@ To install the necessary dependencies on a Linux system, ensure that you fulfill - Make sure you have Python version 3.10.6 or higher (but lower than 3.11.0) installed on your system. - If you are using WSL2, set the `LD_LIBRARY_PATH` environment variable by executing the following command: - ``` + ```shell export LD_LIBRARY_PATH=/usr/lib/wsl/lib/ ``` @@ -280,22 +280,22 @@ To set up the project on Linux or macOS, perform the following steps: 1. Open a terminal and navigate to the desired installation directory. 2. Clone the repository by running the following command: - ``` + ```shell git clone https://github.com/bmaltais/kohya_ss.git ``` 3. Change into the `kohya_ss` directory: - ``` + ```shell cd kohya_ss ``` 4. If you encounter permission issues, make the `setup.sh` script executable by running the following command: - ``` + ```shell chmod +x ./setup.sh ``` 5. Run the setup script by executing the following command: - ``` + ```shell ./setup.sh ``` @@ -310,6 +310,7 @@ For macOS and other non-Linux systems, the installation process will attempt to If you choose to use the interactive mode, the default values for the accelerate configuration screen will be "This machine," "None," and "No" for the remaining questions. These default answers are the same as the Windows installation. ### Runpod + #### Manual installation To install the necessary components for Runpod and run kohya_ss, follow these steps: @@ -319,25 +320,25 @@ To install the necessary components for Runpod and run kohya_ss, follow these st 2. SSH into the Runpod. 3. Clone the repository by running the following command: - ``` + ```shell cd /workspace git clone https://github.com/bmaltais/kohya_ss.git ``` 4. Run the setup script: - ``` + ```shell cd kohya_ss ./setup-runpod.sh ``` 5. 
Run the gui with: - ``` + ```shell ./gui.sh --share --headless ``` or with this if you expose 7860 directly via the runpod configuration - ``` + ```shell ./gui.sh --listen=0.0.0.0 --headless ``` @@ -355,6 +356,7 @@ To run from a pre-built Runpod template you can: ### Docker + #### Local docker build If you prefer to use Docker, follow the instructions below: @@ -546,7 +548,7 @@ The documentation in this section will be moved to a separate document later. - `sdxl_train.py` is a script for SDXL fine-tuning. The usage is almost the same as `fine_tune.py`, but it also supports DreamBooth dataset. - `--full_bf16` option is added. Thanks to KohakuBlueleaf! - - This option enables the full bfloat16 training (includes gradients). This option is useful to reduce the GPU memory usage. + - This option enables the full bfloat16 training (includes gradients). This option is useful to reduce the GPU memory usage. - The full bfloat16 training might be unstable. Please use it at your own risk. - The different learning rates for each U-Net block are now supported in sdxl_train.py. Specify with `--block_lr` option. Specify 23 values separated by commas like `--block_lr 1e-3,1e-3 ... 1e-3`. - 23 values correspond to `0: time/label embed, 1-9: input blocks 0-8, 10-12: mid blocks 0-2, 13-21: output blocks 0-8, 22: out`. @@ -571,13 +573,13 @@ The documentation in this section will be moved to a separate document later. ### Utility scripts for SDXL -- `tools/cache_latents.py` is added. This script can be used to cache the latents to disk in advance. +- `tools/cache_latents.py` is added. This script can be used to cache the latents to disk in advance. - The options are almost the same as `sdxl_train.py'. See the help message for the usage. - Please launch the script as follows: `accelerate launch --num_cpu_threads_per_process 1 tools/cache_latents.py ...` - This script should work with multi-GPU, but it is not tested in my environment. -- `tools/cache_text_encoder_outputs.py` is added. 
This script can be used to cache the text encoder outputs to disk in advance. +- `tools/cache_text_encoder_outputs.py` is added. This script can be used to cache the text encoder outputs to disk in advance. - The options are almost the same as `cache_latents.py` and `sdxl_train.py`. See the help message for the usage. - `sdxl_gen_img.py` is added. This script can be used to generate images with SDXL, including LoRA, Textual Inversion and ControlNet-LLLite. See the help message for the usage. @@ -601,6 +603,7 @@ The documentation in this section will be moved to a separate document later. - `--bucket_reso_steps` can be set to 32 instead of the default value 64. Smaller values than 32 will not work for SDXL training. Example of the optimizer settings for Adafactor with the fixed learning rate: + ```toml optimizer_type = "adafactor" optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False" ] @@ -622,7 +625,6 @@ save_file(state_dict, file) ControlNet-LLLite, a novel method for ControlNet with SDXL, is added. See [documentation](./docs/train_lllite_README.md) for details. - ## Change History * 2023/10/10 (v22.1.0) @@ -654,7 +656,7 @@ ControlNet-LLLite, a novel method for ControlNet with SDXL, is added. See [docum * 2023/10/01 (v22.0.0) - Merging main branch of sd-scripts: - [SAI Model Spec](https://github.com/Stability-AI/ModelSpec) metadata is now supported partially. `hash_sha256` is not supported yet. - - The main items are set automatically. + - The main items are set automatically. - You can set title, author, description, license and tags with `--metadata_xxx` options in each training script. - Merging scripts also support minimum SAI Model Spec metadata. See the help message for the usage. - Metadata editor will be available soon. @@ -665,7 +667,7 @@ ControlNet-LLLite, a novel method for ControlNet with SDXL, is added. See [docum - Arbitrary options can be used for Diffusers' schedulers. 
For example `--lr_scheduler_args "lr_end=1e-8"`. - LoRA-FA is added experimentally. Specify `--network_module networks.lora_fa` option instead of `--network_module networks.lora`. The trained model can be used as a normal LoRA model. - - JPEG XL is supported. [#786](https://github.com/kohya-ss/sd-scripts/pull/786) + - JPEG XL is supported. [#786](https://github.com/kohya-ss/sd-scripts/pull/786) - Input perturbation noise is added. See [#798](https://github.com/kohya-ss/sd-scripts/pull/798) for details. - Dataset subset now has `caption_prefix` and `caption_suffix` options. The strings are added to the beginning and the end of the captions before shuffling. You can specify the options in `.toml`. - Intel ARC support with IPEX is added. [#825](https://github.com/kohya-ss/sd-scripts/pull/825) diff --git a/converted_markdown.md b/converted_markdown.md index 23dc53753..684af19ad 100644 --- a/converted_markdown.md +++ b/converted_markdown.md @@ -903,7 +903,7 @@ US&client=webapp&u=https://d.hatena.ne.jp/keyword/%25A5%25CB%25A5%25E5%25A1%25BC このパラメータ値は常に25個の数字を指定しなければいけませんが、LoRAはAttentionブロックを学習対象としているので、Attentionブロックの存在しないIN0、IN3、IN6、IN9、IN10、IN11、OUT0、IN1、IN2に対する設定(1、4、7、11、12、14、15、16番目の数字)は学習時は無視されます。 -※上級者向け設定です。こだわりがないなら空欄のままで構いません。ここを指定しない場合は「Network Rank(Dimention)」値と「Network +※上級者向け設定です。こだわりがないなら空欄のままで構いません。ここを指定しない場合は「Network Rank(Dimension)」値と「Network Alpha」値がすべてのブロックに適応されます。 diff --git a/fine_tune_README.md b/fine_tune_README.md index 7ffd05d4a..696360a90 100644 --- a/fine_tune_README.md +++ b/fine_tune_README.md @@ -1,6 +1,9 @@ +# Fine tuning + It is a fine tuning that corresponds to NovelAI's proposed learning method, automatic captioning, tagging, Windows + VRAM 12GB (for v1.4/1.5) environment, etc. -## overview +## Overview + Fine tuning of U-Net of Stable Diffusion using Diffusers. It corresponds to the following improvements in NovelAI's article (For Aspect Ratio Bucketing, I referred to NovelAI's code, but the final code is all original). 
* Use the output of the penultimate layer instead of the last layer of CLIP (Text Encoder). @@ -14,18 +17,22 @@ Fine tuning of U-Net of Stable Diffusion using Diffusers. It corresponds to the Text Encoder is not trained by default. For fine tuning of the whole model, it seems common to learn only U-Net (NovelAI seems to be the same). Text Encoder can also be learned as an option. ## Additional features + ### Change CLIP output + CLIP (Text Encoder) converts the text into features in order to reflect the prompt in the image. Stable diffusion uses the output of the last layer of CLIP, but you can change it to use the output of the penultimate layer. According to NovelAI, this will reflect prompts more accurately. It is also possible to use the output of the last layer as is. *Stable Diffusion 2.0 uses the penultimate layer by default. Do not specify the clip_skip option. ### Training in non-square resolutions + Stable Diffusion is trained at 512\*512, but also at resolutions such as 256\*1024 and 384\*640. It is expected that this will reduce the cropped portion and learn the relationship between prompts and images more correctly. The learning resolution is adjusted vertically and horizontally in units of 64 pixels within a range that does not exceed the resolution area (= memory usage) given as a parameter. In machine learning, it is common to unify all input sizes, but there are no particular restrictions, and in fact it is okay as long as they are unified within the same batch. NovelAI's bucketing seems to refer to classifying training data in advance for each learning resolution according to the aspect ratio. And by creating a batch with the images in each bucket, the image size of the batch is unified. ### Extending token length from 75 to 225 + Stable diffusion has a maximum of 75 tokens (77 tokens including the start and end), but we will extend it to 225 tokens. 
However, the maximum length that CLIP accepts is 75 tokens, so in the case of 225 tokens, we simply divide it into thirds, call CLIP, and then concatenate the results. @@ -49,6 +56,7 @@ For example, store an image like this: ![Teacher data folder screenshot](https://user-images.githubusercontent.com/52813779/208907739-8e89d5fa-6ca8-4b60-8927-f484d2a9ae04.png) ## Automatic captioning + Skip if you just want to learn tags without captions. Also, when preparing captions manually, prepare them in the same directory as the teacher data image, with the same file name, extension .caption, etc. Each file should be a text file with only one line. @@ -59,13 +67,13 @@ The latest version no longer requires BLIP downloads, weight downloads, and addi Run make_captions.py in the finetune folder. -``` +```shell python finetune\make_captions.py --batch_size ``` If the batch size is 8 and the training data is placed in the parent folder train_data, it will be as follows. -``` +```shell python finetune\make_captions.py --batch_size 8 ..\train_data ``` @@ -90,11 +98,13 @@ For example, with captions like: ![captions and images](https://user-images.githubusercontent.com/52813779/208908947-af936957-5d73-4339-b6c8-945a52857373.png) ## Tagged by DeepDanbooru + If you do not want to tag the danbooru tag itself, please proceed to "Preprocessing of caption and tag information". Tagging is done with DeepDanbooru or WD14Tagger. WD14Tagger seems to be more accurate. If you want to tag with WD14Tagger, skip to the next chapter. ### Environmental arrangement + Clone DeepDanbooru https://github.com/KichangKim/DeepDanbooru into your working folder, or download the zip and extract it. I unzipped it. Also, download deepdanbooru-v3-20211112-sgd-e28.zip from Assets of "DeepDanbooru Pretrained Model v3-20211112-sgd-e28" on the DeepDanbooru Releases page https://github.com/KichangKim/DeepDanbooru/releases and extract it to the DeepDanbooru folder. 
@@ -108,28 +118,29 @@ Make a directory structure like this Install the necessary libraries for the Diffusers environment. Go to the DeepDanbooru folder and install it (I think it's actually just adding tensorflow-io). -``` +```shell pip install -r requirements.txt ``` Next, install DeepDanbooru itself. -``` +```shell pip install . ``` This completes the preparation of the environment for tagging. ### Implementing tagging + Go to DeepDanbooru's folder and run deepdanbooru to tag. -``` +```shell deepdanbooru evaluate --project-path deepdanbooru-v3-20211112-sgd-e28 --allow-folder --save-txt ``` If you put the training data in the parent folder train_data, it will be as follows. -``` +```shell deepdanbooru evaluate ../train_data --project-path deepdanbooru-v3-20211112-sgd-e28 --allow-folder --save-txt ``` @@ -146,6 +157,7 @@ A tag is attached like this (great amount of information...). ![Deep Danbooru tag and image](https://user-images.githubusercontent.com/52813779/208909908-a7920174-266e-48d5-aaef-940aba709519.png) ## Tagging with WD14Tagger + This procedure uses WD14Tagger instead of DeepDanbooru. Use the tagger used in Mr. Automatic1111's WebUI. I referred to the information on this github page (https://github.com/toriato/stable-diffusion-webui-wd14-tagger#mrsmilingwolfs-model-aka-waifu-diffusion-14-tagger). @@ -153,13 +165,16 @@ Use the tagger used in Mr. Automatic1111's WebUI. I referred to the information The modules required for the initial environment maintenance have already been installed. Weights are automatically downloaded from Hugging Face. ### Implementing tagging + Run the script to do the tagging. -``` + +```shell python tag_images_by_wd14_tagger.py --batch_size ``` If you put the training data in the parent folder train_data, it will be as follows. 
-``` + +```shell python tag_images_by_wd14_tagger.py --batch_size 4 ..\train_data ``` @@ -188,7 +203,7 @@ Combine captions and tags into a single file as metadata for easy processing fro To put captions into the metadata, run the following in your working folder (if you don't use captions for learning, you don't need to run this) (it's actually a single line, and so on). -``` +```shell python merge_captions_to_metadata.py --in_json @@ -197,7 +212,7 @@ python merge_captions_to_metadata.py The metadata file name is an arbitrary name. If the training data is train_data, there is no metadata file to read, and the metadata file is meta_cap.json, it will be as follows. -``` +```shell python merge_captions_to_metadata.py train_data meta_cap.json ``` @@ -205,7 +220,7 @@ You can specify the caption extension with the caption_extension option. If there are multiple teacher data folders, please specify the full_path argument (metadata will have full path information). Then run it for each folder. -``` +```shell python merge_captions_to_metadata.py --full_path train_data1 meta_cap1.json python merge_captions_to_metadata.py --full_path --in_json meta_cap1.json @@ -219,20 +234,22 @@ __*It is safe to rewrite the in_json option and the write destination each time ### Tag preprocessing Similarly, tags are also collected in metadata (no need to do this if tags are not used for learning). -``` + +```shell python merge_dd_tags_to_metadata.py --in_json ``` With the same directory structure as above, when reading meta_cap.json and writing to meta_cap_dd.json, it will be as follows. -``` + +```shell python merge_dd_tags_to_metadata.py train_data --in_json meta_cap.json meta_cap_dd.json ``` If you have multiple teacher data folders, please specify the full_path argument. Then run it for each folder. 
-``` +```shell python merge_dd_tags_to_metadata.py --full_path --in_json meta_cap2.json train_data1 meta_cap_dd1.json python merge_dd_tags_to_metadata.py --full_path --in_json meta_cap_dd1.json @@ -244,6 +261,7 @@ If in_json is omitted, if there is a write destination metadata file, it will be __*It is safe to rewrite the in_json option and the write destination each time and write to a separate metadata file. __ ### Cleaning captions and tags + Up to this point, captions and DeepDanbooru tags have been put together in the metadata file. However, captions with automatic captioning are subtle due to spelling variations (*), and tags include underscores and ratings (in the case of DeepDanbooru), so the editor's replacement function etc. You should use it to clean your captions and tags. *For example, when learning a girl in an anime picture, there are variations in captions such as girl/girls/woman/women. Also, it may be more appropriate to simply use "girl" for things like "anime girl". @@ -252,13 +270,13 @@ A script for cleaning is provided, so please edit the contents of the script acc (It is no longer necessary to specify the teacher data folder. All data in the metadata will be cleaned.) -``` +```shell python clean_captions_and_tags.py ``` Please note that --in_json is not included. For example: -``` +```shell python clean_captions_and_tags.py meta_cap_dd.json meta_clean.json ``` @@ -269,7 +287,8 @@ Preprocessing of captions and tags is now complete. In order to speed up the learning, we acquire the latent representation of the image in advance and save it to disk. At the same time, bucketing (classifying the training data according to the aspect ratio) is performed. 
In your working folder, type: -``` + +```shell python prepare_buckets_latents.py @@ -280,7 +299,7 @@ python prepare_buckets_latents.py If the model is model.ckpt, batch size 4, training resolution is 512\*512, precision is no (float32), read metadata from meta_clean.json and write to meta_lat.json: -``` +```shell python prepare_buckets_latents.py train_data meta_clean.json meta_lat.json model.ckpt --batch_size 4 --max_resolution 512,512 --mixed_precision no @@ -294,7 +313,7 @@ You can specify the minimum resolution size with the --min_bucket_reso option an If you increase the resolution to something like 768\*768, you should specify something like 1280 for the maximum size. If you specify the --flip_aug option, it will perform horizontal flip augmentation (data augmentation). You can artificially double the amount of data, but if you specify it when the data is not left-right symmetrical (for example, character appearance, hairstyle, etc.), learning will not go well. -(This is a simple implementation that acquires the latents for the flipped image and saves the \*\_flip.npz file. No options are required for fline_tune.py. If there is a file with \_flip, Randomly load a file without +(This is a simple implementation that acquires the latents for the flipped image and saves the \*\_flip.npz file. No options are required for fine_tune.py. If there is a file with \_flip, Randomly load a file without The batch size may be increased a little more even with 12GB of VRAM. The resolution is a number divisible by 64, and is specified by "width, height". The resolution is directly linked to the memory size during fine tuning. 512,512 seems to be the limit with VRAM 12GB (*). 16GB may be raised to 512,704 or 512,768. Even with 256, 256, etc., it seems to be difficult with 8GB of VRAM (because parameters and optimizers require a certain amount of memory regardless of resolution). @@ -306,24 +325,26 @@ The result of bucketing is displayed as follows. 
![bucketing result](https://user-images.githubusercontent.com/52813779/208911419-71c00fbb-2ce6-49d5-89b5-b78d7715e441.png) If you have multiple teacher data folders, please specify the full_path argument. Then run it for each folder. -``` + +```shell python prepare_buckets_latents.py --full_path train_data1 meta_clean.json meta_lat1.json model.ckpt --batch_size 4 --max_resolution 512,512 --mixed_precision no python prepare_buckets_latents.py --full_path train_data2 meta_lat1.json meta_lat2.json model.ckpt - --batch_size 4 --max_resolution 512,512 --mixed_precision no - + --batch_size 4 --max_resolution 512,512 --mixed_precision no\ ``` + It is possible to make the read source and write destination the same, but separate is safer. __*It is safe to rewrite the argument each time and write it to a separate metadata file. __ - ## Run training + For example: Below are the settings for saving memory. -``` + +```shell accelerate launch --num_cpu_threads_per_process 8 fine_tune.py --pretrained_model_name_or_path=model.ckpt --in_json meta_lat.json @@ -364,19 +385,22 @@ Specifies whether to use mixed precision with mixed_precision. Specifying "fp16" "fp16" and "bf16" use almost the same amount of memory, and it is said that bf16 has better learning results (I didn't feel much difference in the range I tried). If "no" is specified, it will not be used (it will be float32). -* It seems that an error will occur when reading checkpoints learned with bf16 with Mr. AUTOMATIC1111's Web UI. This seems to be because the data type bfloat16 causes an error in the Web UI model safety checker. Save in fp16 or float32 format with the save_precision option. Or it seems to be good to store it in safetytensors format. +* It seems that an error will occur when reading checkpoints learned with bf16 with Mr. AUTOMATIC1111's Web UI. This seems to be because the data type bfloat16 causes an error in the Web UI model safety checker. Save in fp16 or float32 format with the save_precision option. 
Or it seems to be good to store it in safetensors format. Specifying save_every_n_epochs will save the model being trained every time that many epochs have passed. ### Supports Stable Diffusion 2.0 + Specify the --v2 option when using Hugging Face's stable-diffusion-2-base, and specify both --v2 and --v_parameterization options when using stable-diffusion-2 or 768-v-ema.ckpt please. ### Increase accuracy and speed when memory is available + First, removing gradient_checkpointing will speed it up. However, the batch size that can be set is reduced, so please set while looking at the balance between accuracy and speed. Increasing the batch size increases speed and accuracy. Increase the speed while checking the speed per data within the range where the memory is sufficient (the speed may actually decrease when the memory is at the limit). ### Change CLIP output used + Specifying 2 for the clip_skip option uses the output of the next-to-last layer. If 1 or option is omitted, the last layer is used. The learned model should be able to be inferred by Automatic1111's web UI. @@ -387,26 +411,31 @@ If the model being trained was originally trained to use the second layer, 2 is If you were using the last layer instead, the entire model would have been trained on that assumption. Therefore, if you train again using the second layer, you may need a certain number of teacher data and longer learning to obtain the desired learning result. ### Extending Token Length + You can learn by extending the token length by specifying 150 or 225 for max_token_length. The learned model should be able to be inferred by Automatic1111's web UI. As with clip_skip, learning with a length different from the learning state of the model may require a certain amount of teacher data and a longer learning time. ### Save learning log + Specify the log save destination folder in the logging_dir option. Logs in TensorBoard format are saved. 
For example, if you specify --logging_dir=logs, a logs folder will be created in your working folder, and logs will be saved in the date/time folder. Also, if you specify the --log_prefix option, the specified string will be added before the date and time. Use "--logging_dir=logs --log_prefix=fine_tune_style1" for identification. To check the log with TensorBoard, open another command prompt and enter the following in the working folder (I think tensorboard is installed when Diffusers is installed, but if it is not installed, pip install Please put it in tensorboard). -``` + +```shell tensorboard --logdir=logs ``` ### Learning Hypernetworks + It will be explained in another article. ### Learning with fp16 gradient (experimental feature) + The full_fp16 option will change the gradient from normal float32 to float16 (fp16) and learn (it seems to be full fp16 learning instead of mixed precision). As a result, it seems that the SD1.x 512*512 size can be learned with a VRAM usage of less than 8GB, and the SD2.x 512*512 size can be learned with a VRAM usage of less than 12GB. Specify fp16 in advance in accelerate config and optionally set mixed_precision="fp16" (does not work with bf16). @@ -419,32 +448,39 @@ It is realized by patching the PyTorch source (confirmed with PyTorch 1.12.1 and ### Other Options #### keep_tokens + If a number is specified, the specified number of tokens (comma-separated strings) from the beginning of the caption are fixed without being shuffled. If there are both captions and tags, the prompts during learning will be concatenated like "caption, tag 1, tag 2...", so if you set "--keep_tokens=1", the caption will always be at the beginning during learning. will come. #### dataset_repeats + If the number of data sets is extremely small, the epoch will end soon (it will take some time at the epoch break), so please specify a numerical value and multiply the data by some to make the epoch longer. 
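As a rough illustration of why repeating a small data set lengthens the epoch, the sketch below estimates the number of steps per epoch. It assumes that every image is seen `dataset_repeats` times per epoch and that batches are simply filled in order; bucketing can change the real count. `steps_per_epoch` is a hypothetical helper for this example, not part of the training scripts.

```shell
# Hypothetical helper: steps_per_epoch IMAGES REPEATS BATCH_SIZE
# Assumption: steps = ceil(IMAGES * REPEATS / BATCH_SIZE); bucketing is ignored.
steps_per_epoch() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

steps_per_epoch 20 1 4    # 20 images, no repeats: prints 5
steps_per_epoch 20 10 4   # same images repeated 10 times: prints 50
```

With only 20 images and a batch size of 4, an epoch would end after about 5 steps. Repeating the data 10 times stretches it to about 50.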
#### train_text_encoder + Text Encoder is also a learning target. Slightly increased memory usage. In normal fine tuning, the Text Encoder is not targeted for training (probably because U-Net is trained to follow the output of the Text Encoder), but if the number of training data is small, the Text Encoder is trained like DreamBooth. also seems to be valid. #### save_precision + The data format when saving checkpoints can be specified from float, fp16, and bf16 (if not specified, it is the same as the data format during learning). It saves disk space, but the model produces different results. Also, if you specify float or fp16, you should be able to read it on Mr. 1111's Web UI. *For VAE, the data format of the original checkpoint will remain, so the model size may not be reduced to a little over 2GB even with fp16. #### save_model_as + Specify the save format of the model. Specify one of ckpt, safetensors, diffusers, diffusers_safetensors. When reading Stable Diffusion format (ckpt or safetensors) and saving in Diffusers format, missing information is supplemented by dropping v1.5 or v2.1 information from Hugging Face. #### use_safetensors -This option saves checkpoints in safetyensors format. The save format will be the default (same format as loaded). + +This option saves checkpoints in safetensors format. The save format will be the default (same format as loaded). #### save_state and resume + The save_state option saves the learning state of the optimizer, etc. in addition to the checkpoint in the folder when saving midway and at the final save. This avoids a decrease in accuracy when learning is resumed after being interrupted (since the optimizer optimizes while having a state, if the state is reset, the optimization must be performed again from the initial state. not). Note that the number of steps is not saved due to Accelerate specifications. When starting the script, you can resume by specifying the folder where the state is saved with the resume option. 
@@ -452,14 +488,17 @@ When starting the script, you can resume by specifying the folder where the stat Please note that the learning state will be about 5 GB per save, so please be careful of the disk capacity. #### gradient_accumulation_steps + Updates the gradient in batches for the specified number of steps. Has a similar effect to increasing the batch size, but consumes slightly more memory. *The Accelerate specification does not support multiple learning models, so if you set Text Encoder as the learning target and specify a value of 2 or more for this option, an error may occur. #### lr_scheduler / lr_warmup_steps + You can choose the learning rate scheduler from linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup with the lr_scheduler option. Default is constant. With lr_warmup_steps, you can specify the number of steps to warm up the scheduler (gradually changing the learning rate). Please do your own research for details. #### diffusers_xformers -Uses Diffusers' xformers feature rather than the script's own xformers replacement feature. Hypernetwork learning is no longer possible. \ No newline at end of file + +Uses Diffusers' xformers feature rather than the script's own xformers replacement feature. Hypernetwork learning is no longer possible. 
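Reviewer note on the `lr_scheduler` / `lr_warmup_steps` text in the README hunk above: the phrase "gradually changing the learning rate" can be illustrated with a minimal sketch of a constant schedule with linear warmup. The function name `constant_with_warmup_lr` and its parameters are hypothetical and written only for this note; the training scripts rely on library scheduler code, whose behaviour may differ in detail.

```python
def constant_with_warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Hypothetical helper: learning rate to use at a given 0-based step."""
    if warmup_steps > 0 and step < warmup_steps:
        # During warmup, ramp linearly from 0 towards base_lr.
        return base_lr * step / warmup_steps
    # After warmup (or when warmup is disabled), hold the rate constant.
    return base_lr


if __name__ == "__main__":
    # Print the rate at a few steps, assuming 100 warmup steps.
    for step in (0, 50, 100, 200):
        print(step, constant_with_warmup_lr(step, base_lr=1e-4, warmup_steps=100))
```

Under these assumptions the rate rises linearly for the first `warmup_steps` steps and stays at `base_lr` afterwards, which is the behaviour the `constant_with_warmup` name suggests.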
diff --git a/library/class_source_model.py b/library/class_source_model.py index 938c61fe1..041ed647d 100644 --- a/library/class_source_model.py +++ b/library/class_source_model.py @@ -33,8 +33,8 @@ def __init__( label='Model Quick Pick', choices=[ 'custom', - # 'stabilityai/stable-diffusion-xl-base-0.9', - # 'stabilityai/stable-diffusion-xl-refiner-0.9', + 'stabilityai/stable-diffusion-xl-base-1.0', + 'stabilityai/stable-diffusion-xl-refiner-1.0', 'stabilityai/stable-diffusion-2-1-base/blob/main/v2-1_512-ema-pruned', 'stabilityai/stable-diffusion-2-1-base', 'stabilityai/stable-diffusion-2-base', diff --git a/library/common_gui.py b/library/common_gui.py index 5d9183229..8393642c3 100644 --- a/library/common_gui.py +++ b/library/common_gui.py @@ -41,8 +41,8 @@ # define a list of substrings to search for SDXL base models SDXL_MODELS = [ - 'stabilityai/stable-diffusion-xl-base-0.9', - 'stabilityai/stable-diffusion-xl-refiner-0.9', + 'stabilityai/stable-diffusion-xl-base-1.0', + 'stabilityai/stable-diffusion-xl-refiner-1.0', ] # define a list of substrings to search for diff --git a/library/svd_merge_lora_gui.py b/library/svd_merge_lora_gui.py index 781c1dcfc..27d670328 100644 --- a/library/svd_merge_lora_gui.py +++ b/library/svd_merge_lora_gui.py @@ -41,7 +41,7 @@ def svd_merge_lora( print(f"Output file '{save_to}' already exists. Aborting.") return - # Check if the ratio total is equal to one. If not mormalise to 1 + # Check if the ratio total is equal to one. 
If not normalise to 1 total_ratio = ratio_a + ratio_b + ratio_c + ratio_d if total_ratio != 1: ratio_a /= total_ratio diff --git a/localizations/en-GB.json b/localizations/en-GB.json new file mode 100644 index 000000000..9238bcb94 --- /dev/null +++ b/localizations/en-GB.json @@ -0,0 +1,24 @@ +{ + "analyze": "analyse", + "behavior": "behaviour", + "color": "colour", + "flavor": "flavour", + "honor": "honour", + "humor": "humour", + "localization": "localisation", + "localize": "localise", + "neighbor": "neighbour", + "offense": "offence", + "oriented": "orientated", + "practice": "practise", + "pretense": "pretence", + "program": "programme", + "recognize": "recognise", + "regularization": "regularisation", + "savior": "saviour", + "signaling": "signalling", + "specialization": "specialisation", + "stabilization": "stabilisation", + "standardization": "standardisation", + "utilize": "utilise" +} \ No newline at end of file diff --git a/networks/extract_lora_from_models.py b/networks/extract_lora_from_models.py index 7bdfceafb..c948d5b15 100644 --- a/networks/extract_lora_from_models.py +++ b/networks/extract_lora_from_models.py @@ -252,13 +252,13 @@ def setup_parser() -> argparse.ArgumentParser: "--clamp_quantile", type=float, default=1, - help="Quantile clamping value, float, (0-1). Defailt = 1", + help="Quantile clamping value, float, (0-1). Default = 1", ) parser.add_argument( "--min_diff", type=float, default=1, - help="Minimum difference betwen finetuned model and base to consider them different enough to extract, float, (0-1). Defailt = 0.01", + help="Minimum difference between finetuned model and base to consider them different enough to extract, float, (0-1). 
Default = 0.01", ) parser.add_argument( "--no_metadata", diff --git a/setup/setup_common.py b/setup/setup_common.py index 9a0ecdef3..a5dcca27e 100644 --- a/setup/setup_common.py +++ b/setup/setup_common.py @@ -394,7 +394,7 @@ def process_requirements_line(line, show_stdout: bool = False): def install_requirements(requirements_file, check_no_verify_flag=False, show_stdout: bool = False): if check_no_verify_flag: - log.info(f'Verifying modules instalation status from {requirements_file}...') + log.info(f'Verifying modules installation status from {requirements_file}...') else: log.info(f'Installing modules from {requirements_file}...') with open(requirements_file, 'r', encoding='utf8') as f: diff --git a/test/config/finetune-AdamW.json b/test/config/finetune-AdamW.json index d3128ae82..c4ddbe235 100644 --- a/test/config/finetune-AdamW.json +++ b/test/config/finetune-AdamW.json @@ -37,7 +37,7 @@ "min_bucket_reso": "256", "min_snr_gamma": 0, "mixed_precision": "bf16", - "model_list": "stabilityai/stable-diffusion-xl-base-0.9", + "model_list": "stabilityai/stable-diffusion-xl-base-1.0", "multires_noise_discount": 0, "multires_noise_iterations": 0, "noise_offset": 0, @@ -48,7 +48,7 @@ "output_dir": "./test/output", "output_name": "test_ft", "persistent_data_loader_workers": false, - "pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-0.9", + "pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0", "random_crop": false, "resume": "", "sample_every_n_epochs": 0, diff --git a/tools/blip2-for-sd/README.md b/tools/blip2-for-sd/README.md index 0d0b074d2..286d28159 100644 --- a/tools/blip2-for-sd/README.md +++ b/tools/blip2-for-sd/README.md @@ -4,7 +4,7 @@ source: https://github.com/Talmendo/blip2-for-sd Simple script to make BLIP2 output image description in a format suitable for Stable Diffusion. 
-Format followd is roughly +Format followed is roughly `[STYLE OF PHOTO] photo of a [SUBJECT], [IMPORTANT FEATURE], [MORE DETAILS], [POSE OR ACTION], [FRAMING], [SETTING/BACKGROUND], [LIGHTING], [CAMERA ANGLE], [CAMERA PROPERTIES],in style of [PHOTOGRAPHER]` ## Usage diff --git a/tools/blip2-for-sd/caption_processor.py b/tools/blip2-for-sd/caption_processor.py index 7652d14c6..8de18c33b 100644 --- a/tools/blip2-for-sd/caption_processor.py +++ b/tools/blip2-for-sd/caption_processor.py @@ -89,7 +89,7 @@ def caption_me(self, initial_prompt, image): p_lighting = self.ask("What is the scene lighting like? For example: soft lighting, studio lighting, natural lighting", image) # print(p_lighting) - p_angle = self.ask("What angle is the picture taken from? Be succint, like: from the side, from below, from front", image) + p_angle = self.ask("What angle is the picture taken from? Be succinct, like: from the side, from below, from front", image) # print(p_angle) p_camera = self.ask("What kind of camera could this picture have been taken with? Be specific and guess a brand with specific camera type", image) diff --git a/train_db_README.md b/train_db_README.md index 2367d29ae..7c3be2e3b 100644 --- a/train_db_README.md +++ b/train_db_README.md @@ -164,6 +164,7 @@ Each yaml file can be found at [https://github.com/Stability-AI/stablediffusion/ # Other study options ## Supports Stable Diffusion 2.0 --v2 / --v_parameterization + Specify the v2 option when using Hugging Face's stable-diffusion-2-base, and specify both the v2 and v_parameterization options when using stable-diffusion-2 or 768-v-ema.ckpt. In addition, learning SD 2.0 seems to be difficult with VRAM 12GB because the Text Encoder is getting bigger. @@ -179,11 +180,13 @@ The following points have changed significantly in Stable Diffusion 2.0. Among these, 1 to 4 are adopted for base, and 1 to 5 are adopted for the one without base (768-v). Enabling 1-4 is the v2 option, and enabling 5 is the v_parameterization option. 
## check training data --debug_dataset + By adding this option, you can check what kind of image data and captions will be learned in advance before learning. Press Esc to exit and return to the command line. *Please note that it seems to hang when executed in an environment where there is no screen such as Colab. ## Stop training Text Encoder --stop_text_encoder_training + If you specify a numerical value for the stop_text_encoder_training option, after that number of steps, only the U-Net will be trained without training the Text Encoder. In some cases, the accuracy may be improved. (Probably only the Text Encoder may overfit first, and I guess that it can be prevented, but the detailed impact is unknown.) @@ -202,14 +205,17 @@ Use the resume option to resume training from a saved training state. Please spe Please note that due to the specifications of Accelerator (?), the number of epochs and global step are not saved, and it will start from 1 even when you resume. ## No tokenizer padding --no_token_padding + The no_token_padding option does not pad the output of the Tokenizer (same behavior as Diffusers version of old DreamBooth). ## Training with arbitrary size images --resolution + You can study outside the square. Please specify "width, height" like "448,640" in resolution. Width and height must be divisible by 64. Match the size of the training image and the regularization image. Personally, I often generate vertically long images, so I sometimes learn with "448, 640". ## Aspect Ratio Bucketing --enable_bucket / --min_bucket_reso / --max_bucket_reso + It is enabled by specifying the enable_bucket option. Stable Diffusion is trained at 512x512, but also at resolutions such as 256x768 and 384x640. If you specify this option, you do not need to unify the training images and regularization images to a specific resolution. Choose from several resolutions (aspect ratios) and learn at that resolution. 
@@ -224,19 +230,23 @@ When Aspect Ratio Bucketing is enabled, it may be better to prepare regularizati (Because the images in one batch are not biased toward training images and regularization images. ## augmentation --color_aug / --flip_aug + Augmentation is a method of improving model performance by dynamically changing data during learning. Learn while subtly changing the hue with color_aug and flipping left and right with flip_aug. Since the data changes dynamically, it cannot be specified together with the cache_latents option. ## Specify data precision when saving --save_precision + Specifying float, fp16, or bf16 as the save_precision option will save the checkpoint in that format (only when saving in Stable Diffusion format). Please use it when you want to reduce the size of checkpoint. ## save in any format --save_model_as + Specify the save format of the model. Specify one of ckpt, safetensors, diffusers, diffusers_safetensors. When reading Stable Diffusion format (ckpt or safetensors) and saving in Diffusers format, missing information is supplemented by dropping v1.5 or v2.1 information from Hugging Face. ## Save learning log --logging_dir / --log_prefix + Specify the log save destination folder in the logging_dir option. Logs in TensorBoard format are saved. For example, if you specify --logging_dir=logs, a logs folder will be created in your working folder, and logs will be saved in the date/time folder. @@ -251,9 +261,11 @@ tensorboard --logdir=logs Then open your browser and go to http://localhost:6006/ to see it. ## scheduler related specification of learning rate --lr_scheduler / --lr_warmup_steps + You can choose the learning rate scheduler from linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup with the lr_scheduler option. Default is constant. With lr_warmup_steps, you can specify the number of steps to warm up the scheduler (gradually changing the learning rate). Please do your own research for details. 
## Training with fp16 gradient (experimental feature) --full_fp16 + The full_fp16 option will change the gradient from normal float32 to float16 (fp16) and learn (it seems to be full fp16 learning instead of mixed precision). As a result, it seems that the SD1.x 512x512 size can be learned with a VRAM usage of less than 8GB, and the SD2.x 512x512 size can be learned with a VRAM usage of less than 12GB. @@ -269,6 +281,7 @@ The setting of the learning rate and the number of steps seems to be severe. Ple # Other learning methods ## Learning multiple classes, multiple identifiers + The method is simple, multiple folders with ``Repetition count_ `` in the training image folder, and a folder with ``Repetition count_`` in the regularization image folder. Please prepare multiple For example, learning "sls frog" and "cpc rabbit" at the same time would look like this: @@ -286,6 +299,7 @@ If you have one class and multiple targets, you can have only one regularized im If the number of data varies, it seems that good results can be obtained by adjusting the number of repetitions to unify the number of sheets for each class and identifier. ## Use captions in DreamBooth + If you put a file with the same file name as the image and the extension .caption (you can change it in the option) in the training image and regularization image folders, the caption will be read from that file and learned as a prompt. * The folder name (identifier class) will no longer be used for training those images. diff --git a/train_network_README.md b/train_network_README.md index b0363a68b..ed62dad8b 100644 --- a/train_network_README.md +++ b/train_network_README.md @@ -1,4 +1,4 @@ -## About learning LoRA +# About learning LoRA [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) (arxiv), [LoRA](https://github.com/microsoft/LoRA) (github) to Stable Applied to Diffusion. 
@@ -96,7 +96,7 @@ Specify the save destination of the model after merging in the --save_to option Specify the LoRA model file learned in --models. It is possible to specify more than one, in which case they will be merged in order. -For --ratios, specify the application rate of each model (how much weight is reflected in the original model) with a numerical value from 0 to 1.0. For example, if it is close to overfitting, it may be better if the application rate is lowered. Specify as many as the number of models. +For --ratios, specify the application rate of each model (how much weight is reflected in the original model) with a numerical value from 0 to 1.0. For example, if it is close to over fitting, it may be better if the application rate is lowered. Specify as many as the number of models. When specifying multiple, it will be as follows. @@ -112,7 +112,7 @@ Applying multiple LoRA models one by one to the SD model and merging multiple Lo For example, a command line like: -``` +```shell python networks\merge_lora.py --save_to ..\lora_train1\model-char1-style1-merged.safetensors --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors --ratios 0.6 0.4 @@ -128,7 +128,6 @@ For --ratios, specify the ratio of each model (how much weight is reflected in t LoRA trained with v1 and LoRA trained with v2, and LoRA with different number of dimensions cannot be merged. U-Net only LoRA and U-Net+Text Encoder LoRA should be able to merge, but the result is unknown. - ### Other Options * precision @@ -151,7 +150,8 @@ LoRA approximates the difference between two models (for example, the original m ### How to run scripts Please specify as follows. 
-``` + +```shell python networks\extract_lora_from_models.py --model_org base-model.ckpt --model_tuned fine-tuned-model.ckpt --save_to lora-weights.safetensors --dim 4 diff --git a/train_ti_README.md b/train_ti_README.md index ba03d5558..e655f8320 100644 --- a/train_ti_README.md +++ b/train_ti_README.md @@ -1,4 +1,4 @@ -## About learning Textual Inversion +# About learning Textual Inversion [Textual Inversion](https://textual-inversion.github.io/). I heavily referenced https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion for the implementation. @@ -16,7 +16,7 @@ Data preparation is exactly the same as ``train_network.py``, so please refer to Below is an example command line (DreamBooth technique). -``` +```shell accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py --pretrained_model_name_or_path=..\models\model.ckpt --train_data_dir=..\data\db\char1 --output_dir=..\ti_train1 @@ -30,7 +30,7 @@ accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py ``--debug_dataset`` will display the token id after substitution, so you can check if the token string after ``49408`` exists as shown below. I can confirm. -``` +```python input ids: tensor([[49406, 49408, 49409, 49410, 49411, 49412, 49413, 49414, 49415, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, @@ -47,7 +47,6 @@ In ``--init_word``, specify the string of the copy source token when initializin ``--num_vectors_per_token`` specifies how many tokens to use for this training. The higher the number, the more expressive it is, but it consumes more tokens. For example, if num_vectors_per_token=8, then the specified token string will consume 8 tokens (out of the 77 token limit for a typical prompt). - In addition, the following options can be specified. * --weights