
Training does not start after clicking "Start Training" #291

synthetai opened this issue Jan 10, 2025 · 0 comments

When I click "Start", the output log freezes at the point shown below and training never starts, no matter how long I wait.
This is my environment:
System: Ubuntu 22.04
CUDA Version: 12.04
Python Version: 3.11
GPU: NVIDIA H100 80GB HBM3
Driver Version: 550.90.07

[2025-01-10 03:47:50] [INFO] INFO create LoRA for Text Encoder 1: (lora_flux.py:741)
[2025-01-10 03:47:50] [INFO] INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set embeddings to torch.bfloat16 (flux_train_network.py:511)
[2025-01-10 03:47:50] [INFO] INFO create LoRA for Text Encoder 1: 72 modules. (lora_flux.py:744)
[2025-01-10 03:47:50] [INFO] INFO create LoRA for FLUX all blocks: 304 modules. (lora_flux.py:765)
[2025-01-10 03:47:50] [INFO] INFO enable LoRA for text encoder: 72 modules (lora_flux.py:911)
[2025-01-10 03:47:50] [INFO] INFO enable LoRA for U-Net: 304 modules (lora_flux.py:916)
[2025-01-10 03:47:50] [INFO] FLUX: Gradient checkpointing enabled. CPU offload: False
[2025-01-10 03:47:50] [INFO] INFO Text Encoder 1 (CLIP-L): 72 modules, LR 0.0008 (lora_flux.py:1018)
[2025-01-10 03:47:50] [INFO] INFO use 8-bit AdamW optimizer | {} (train_util.py:4682)
[2025-01-10 03:47:50] [INFO] INFO set U-Net weight dtype to torch.float8_e4m3fn (train_network.py:631)
[2025-01-10 03:47:50] [INFO] INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set embeddings to torch.bfloat16 (flux_train_network.py:511)
[2025-01-10 03:47:56] [INFO] running training / 学習開始
[2025-01-10 03:47:56] [INFO] num train images * repeats / 学習画像の数×繰り返し回数: 990
[2025-01-10 03:47:56] [INFO] num reg images / 正則化画像の数: 0
[2025-01-10 03:47:56] [INFO] num batches per epoch / 1epochのバッチ数: 495
[2025-01-10 03:47:56] [INFO] num epochs / epoch数: 16
[2025-01-10 03:47:56] [INFO] batch size per device / バッチサイズ: 1
[2025-01-10 03:47:56] [INFO] gradient accumulation steps / 勾配を合計するステップ数 = 1
[2025-01-10 03:47:56] [INFO] total optimization steps / 学習ステップ数: 7920
[2025-01-10 03:48:13] [INFO] INFO unet dtype: torch.float8_e4m3fn, device: cuda:1 (train_network.py:1124)
[2025-01-10 03:48:13] [INFO] INFO text_encoder [0] dtype: torch.float8_e4m3fn, device: cuda:1 (train_network.py:1130)
[2025-01-10 03:48:13] [INFO] INFO text_encoder [1] dtype: torch.bfloat16, device: cpu (train_network.py:1130)
[2025-01-10 03:48:14] [INFO] steps: 0%| | 0/7920 [00:00<?, ?it/s]
[2025-01-10 03:48:14] [INFO] INFO unet dtype: torch.float8_e4m3fn, device: cuda:0 (train_network.py:1124)
[2025-01-10 03:48:14] [INFO] INFO text_encoder [0] dtype: torch.float8_e4m3fn, device: cuda:0 (train_network.py:1130)
[2025-01-10 03:48:14] [INFO] INFO text_encoder [1] dtype: torch.bfloat16, device: cpu (train_network.py:1130)
[2025-01-10 03:48:14] [INFO] 
[2025-01-10 03:48:14] [INFO] epoch 1/16
[2025-01-10 03:48:14] [INFO] huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2025-01-10 03:48:14] [INFO] To disable this warning, you can either:
[2025-01-10 03:48:14] [INFO] - Avoid using `tokenizers` before the fork if possible
[2025-01-10 03:48:14] [INFO] - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[2025-01-10 03:48:14] [INFO] huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2025-01-10 03:48:14] [INFO] To disable this warning, you can either:
[2025-01-10 03:48:14] [INFO] - Avoid using `tokenizers` before the fork if possible
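
For what it's worth, the step arithmetic in the log is consistent with two training processes sharing the dataset, which matches the cuda:0 and cuda:1 lines above: 990 images ÷ (batch size 1 × 2 devices) = 495 batches per epoch, and 495 × 16 epochs = 7920 optimization steps.

The tokenizers warning itself suggests a workaround: set TOKENIZERS_PARALLELISM before the process forks. Below is a minimal sketch of a wrapper that does this, assuming training is launched with accelerate (the wrapper name and launch command are assumptions based on the file names in the log, not part of the project). This only disables tokenizer parallelism and silences the warning; it is not a confirmed fix for the hang.

```python
# run_training.py -- hypothetical wrapper, not part of the project.
# Sets TOKENIZERS_PARALLELISM in the child's environment before any
# fork happens, as the huggingface/tokenizers warning recommends.
import os
import subprocess
import sys

env = dict(os.environ, TOKENIZERS_PARALLELISM="false")

# Replace with your real training invocation; "accelerate launch
# flux_train_network.py" is an assumption based on the log's file names.
cmd = sys.argv[1:] or ["accelerate", "launch", "flux_train_network.py"]
sys.exit(subprocess.call(cmd, env=env))
```

Equivalently, export TOKENIZERS_PARALLELISM=false in the shell before starting the GUI.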