
Training does not start after clicking "Start Training" #291

synthetai opened this issue Jan 10, 2025 · 0 comments

When I click "Start", the output log freezes at the point shown below and training never starts, no matter how long I wait.
This is my environment:
System: Ubuntu 22.04
CUDA Version: 12.04
Python Version: 3.11
GPU: NVIDIA H100 80GB HBM3
Driver Version: 550.90.07

[2025-01-10 03:47:50] [INFO] INFO create LoRA for Text Encoder 1: (lora_flux.py:741)
[2025-01-10 03:47:50] [INFO] INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set embeddings to torch.bfloat16 (flux_train_network.py:511)
[2025-01-10 03:47:50] [INFO] INFO create LoRA for Text Encoder 1: 72 modules. (lora_flux.py:744)
[2025-01-10 03:47:50] [INFO] INFO create LoRA for FLUX all blocks: 304 modules. (lora_flux.py:765)
[2025-01-10 03:47:50] [INFO] INFO enable LoRA for text encoder: 72 modules (lora_flux.py:911)
[2025-01-10 03:47:50] [INFO] INFO enable LoRA for U-Net: 304 modules (lora_flux.py:916)
[2025-01-10 03:47:50] [INFO] FLUX: Gradient checkpointing enabled. CPU offload: False
[2025-01-10 03:47:50] [INFO] INFO Text Encoder 1 (CLIP-L): 72 modules, LR 0.0008 (lora_flux.py:1018)
[2025-01-10 03:47:50] [INFO] INFO use 8-bit AdamW optimizer | {} (train_util.py:4682)
[2025-01-10 03:47:50] [INFO] INFO set U-Net weight dtype to torch.float8_e4m3fn (train_network.py:631)
[2025-01-10 03:47:50] [INFO] INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set embeddings to torch.bfloat16 (flux_train_network.py:511)
[2025-01-10 03:47:56] [INFO] running training / 学習開始
[2025-01-10 03:47:56] [INFO] num train images * repeats / 学習画像の数×繰り返し回数: 990
[2025-01-10 03:47:56] [INFO] num reg images / 正則化画像の数: 0
[2025-01-10 03:47:56] [INFO] num batches per epoch / 1epochのバッチ数: 495
[2025-01-10 03:47:56] [INFO] num epochs / epoch数: 16
[2025-01-10 03:47:56] [INFO] batch size per device / バッチサイズ: 1
[2025-01-10 03:47:56] [INFO] gradient accumulation steps / 勾配を合計するステップ数 = 1
[2025-01-10 03:47:56] [INFO] total optimization steps / 学習ステップ数: 7920
[2025-01-10 03:48:13] [INFO] INFO unet dtype: torch.float8_e4m3fn, device: cuda:1 (train_network.py:1124)
[2025-01-10 03:48:13] [INFO] INFO text_encoder [0] dtype: torch.float8_e4m3fn, device: cuda:1 (train_network.py:1130)
[2025-01-10 03:48:13] [INFO] INFO text_encoder [1] dtype: torch.bfloat16, device: cpu (train_network.py:1130)
[2025-01-10 03:48:14] [INFO] steps: 0%| | 0/7920 [00:00<?, ?it/s]
[2025-01-10 03:48:14] [INFO] INFO unet dtype: torch.float8_e4m3fn, device: cuda:0 (train_network.py:1124)
[2025-01-10 03:48:14] [INFO] INFO text_encoder [0] dtype: torch.float8_e4m3fn, device: cuda:0 (train_network.py:1130)
[2025-01-10 03:48:14] [INFO] INFO text_encoder [1] dtype: torch.bfloat16, device: cpu (train_network.py:1130)
[2025-01-10 03:48:14] [INFO] 
[2025-01-10 03:48:14] [INFO] epoch 1/16
[2025-01-10 03:48:14] [INFO] huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2025-01-10 03:48:14] [INFO] To disable this warning, you can either:
[2025-01-10 03:48:14] [INFO] - Avoid using `tokenizers` before the fork if possible
[2025-01-10 03:48:14] [INFO] - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[2025-01-10 03:48:14] [INFO] huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2025-01-10 03:48:14] [INFO] To disable this warning, you can either:
[2025-01-10 03:48:14] [INFO] - Avoid using `tokenizers` before the fork if possible
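
For what it's worth, the step arithmetic in the log is consistent with two training processes sharing the dataset, which matches the cuda:0 and cuda:1 lines above: 990 images ÷ (batch size 1 × 2 devices) = 495 batches per epoch, and 495 × 16 epochs = 7920 optimization steps.

The tokenizers warning itself suggests a workaround: set TOKENIZERS_PARALLELISM before the process forks. Below is a minimal sketch of a wrapper that does this, assuming training is launched with accelerate (the wrapper name and launch command are assumptions based on the file names in the log, not part of the project). This only disables tokenizer parallelism and silences the warning; it is not a confirmed fix for the hang.

```python
# run_training.py -- hypothetical wrapper, not part of the project.
# Sets TOKENIZERS_PARALLELISM in the child's environment before any
# fork happens, as the huggingface/tokenizers warning recommends.
import os
import subprocess
import sys

env = dict(os.environ, TOKENIZERS_PARALLELISM="false")

# Replace with your real training invocation; "accelerate launch
# flux_train_network.py" is an assumption based on the log's file names.
cmd = sys.argv[1:] or ["accelerate", "launch", "flux_train_network.py"]
sys.exit(subprocess.call(cmd, env=env))
```

Equivalently, export TOKENIZERS_PARALLELISM=false in the shell before starting the GUI.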