Problem launching training with less GPUs #92

AfterHAL · 2024-12-15T22:43:08Z

Hi.
I set up a Sana training session with one 4090 GPU on a PC. Everything was fine, so I moved the config and the checkpoint to a PC with 7 x 4090. Everything was OK on multi-GPU.
Later, I restarted the training session with only 6 GPUs and got this error:
(Note that it restarts fine with 7 GPUs, but not with fewer than that.)

2024-12-15 23:01:07 - [Sana] - INFO - World_size: 6, seed: 1
2024-12-15 23:01:07 - [Sana] - INFO - Initializing: DDP for training
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.46it/s]
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.35it/s]
2024-12-15 23:01:16 - [Sana] - INFO - vae type: dc-ae
2024-12-15 23:01:16 - [Sana] - INFO - v-prediction: True, noise schedule: linear_flow, flow shift: 1.0, flow weighting: logit_normal, logit-mean: 0.0, logit-std: 1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.08it/s]
2024-12-15 23:01:19 - [Sana] - WARNING - use pe: False, position embed interpolation: 1.0, base size: 16
2024-12-15 23:01:19 - [Sana] - WARNING - attention type: linear; ffn type: glumbconv; autocast linear attn: false
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.36it/s]
2024-12-15 23:01:22 - [Sana] - INFO - SanaMS:SanaMS_600M_P1_D28, Model Parameters: 591.75M
2024-12-15 23:01:22 - [Sana] - INFO - Constructing dataset SanaImgDataset...
2024-12-15 23:01:22 - [Sana] - INFO - Dataset is repeat 2000 times for toy dataset
2024-12-15 23:01:22 - [Sana] - INFO - Dataset samples: 107476000
2024-12-15 23:01:22 - [Sana] - INFO - Loading external caption json from: original_filename['', '_InternVL2-26B', '_VILA1-5-13B'].json
2024-12-15 23:01:22 - [Sana] - INFO - Loading external clipscore json from: original_filename['_InternVL2-26B_clip_score', '_VILA1-5-13B_clip_score', '_prompt_clip_score'].json
2024-12-15 23:01:22 - [Sana] - INFO - external caption clipscore threshold: 25.0, temperature: 0.1
2024-12-15 23:01:22 - [Sana] - INFO - T5 max token length: 300
2024-12-15 23:01:22 - [Sana] - INFO - Dataset SanaImgDataset constructed: time: 0.16 s, length (use/ori): 107476000/107476000
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.22it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.65it/s]
2024-12-15 23:01:38 - [Sana] - INFO - Automatically adapt lr to 0.00003 (using sqrt scaling rule).
2024-12-15 23:01:38 - [Sana] - INFO - CAMEWrapper Optimizer: total 436 param groups, 436 are learnable, 0 are fix. Lr group: 436 params with lr 0.00003; Weight decay group: 436 params with weight decay 0.001.
2024-12-15 23:01:38 - [Sana] - INFO - Lr schedule: cosine, num_warmup_steps:6000.
2024-12-15 23:01:38 - [Sana] - WARNING - Basic Setting: lr: 0.00003, bs: 4, gc: True, gc_accum_step: 1, qk norm: False, fp32 attn: True, attn type: linear, ffn type: glumbconv, text encoder: gemma-2-2b-it, captions: {'prompt': 1}, precision: fp16
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[rank4]: Traceback (most recent call last):
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank4]:     main()
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank4]:     response = fn(cfg, *args, **kwargs)
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank4]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank4]:     set_rng_state(state, i)
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank4]:     _lazy_call(cb)
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank4]:     callable()
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank4]:     default_generator = torch.cuda.default_generators[idx]
[rank4]: IndexError: tuple index out of range
[rank5]: Traceback (most recent call last):
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank5]:     main()
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank5]:     response = fn(cfg, *args, **kwargs)
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank5]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank5]:     set_rng_state(state, i)
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank5]:     _lazy_call(cb)
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank5]:     callable()
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank5]:     default_generator = torch.cuda.default_generators[idx]
[rank5]: IndexError: tuple index out of range
[rank2]: Traceback (most recent call last):
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank2]:     main()
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank2]:     response = fn(cfg, *args, **kwargs)
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank2]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank2]:     set_rng_state(state, i)
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank2]:     _lazy_call(cb)
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank2]:     callable()
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank2]:     default_generator = torch.cuda.default_generators[idx]
[rank2]: IndexError: tuple index out of range
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank1]:     main()
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank1]:     response = fn(cfg, *args, **kwargs)
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank1]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank1]:     set_rng_state(state, i)
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank1]:     _lazy_call(cb)
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank1]:     callable()
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank1]:     default_generator = torch.cuda.default_generators[idx]
[rank1]: IndexError: tuple index out of range
[rank3]: Traceback (most recent call last):
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank3]:     main()
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank3]:     response = fn(cfg, *args, **kwargs)
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank3]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank3]:     set_rng_state(state, i)
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank3]:     _lazy_call(cb)
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank3]:     callable()
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank3]:     default_generator = torch.cuda.default_generators[idx]
[rank3]: IndexError: tuple index out of range
2024-12-15 23:03:36 - [Sana] - INFO - Resume checkpoint of epoch 1 from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth. Load ema: False, resume optimizer： True, resume lr scheduler: True.
2024-12-15 23:03:36 - [Sana] - WARNING - Missing keys: ['pos_embed']
2024-12-15 23:03:36 - [Sana] - WARNING - Unexpected keys: []
2024-12-15 23:03:36 - [Sana] - INFO - resuming randomise
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank0]:     response = fn(cfg, *args, **kwargs)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank0]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank0]:     set_rng_state(state, i)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank0]:     _lazy_call(cb)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank0]:     callable()
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank0]:     default_generator = torch.cuda.default_generators[idx]
[rank0]: IndexError: tuple index out of range
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5193 closing signal SIGTERM
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5195 closing signal SIGTERM
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5196 closing signal SIGTERM
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5197 closing signal SIGTERM
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5198 closing signal SIGTERM
E1215 23:03:38.848000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 5194) of binary: /mnt/d/TODAI/apps/SanaLinux/venv/bin/python3
Traceback (most recent call last):
  File "/mnt/d/TODAI/apps/SanaLinux/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-15_23:03:38
  host      : RB-MOLECULEXL0001.
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 5194)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

lawrence-cj · 2024-12-17T08:17:06Z

Oh, this is because we save the seed and random-state information in the checkpoint. So when you resume from a checkpoint saved on a 7-GPU node and run the training on 6 GPUs, it triggers the bug you hit. One solution is to use the --train.load_from xxx.pth arg and save new checkpoints to a new folder. Let me know if it helps. @AfterHAL
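
For readers following along: the traceback shows torch.cuda.set_rng_state_all walking over every saved per-device state, so a checkpoint that stores more CUDA states than there are visible devices ends in the IndexError. Below is a minimal sketch of a guarded restore, assuming the checkpoint holds one state per device. The helper name restore_cuda_rng_states and the stand-in setter are hypothetical and are not taken from the Sana code base or from any fix applied there.

```python
from typing import Any, Callable, List, Sequence, Tuple


def restore_cuda_rng_states(
    saved_states: Sequence[Any],
    device_count: int,
    set_state: Callable[[Any, int], None],
) -> int:
    """Restore saved per-device RNG states only for devices that exist now.

    Returns the number of states actually restored. States saved for devices
    beyond `device_count` are ignored instead of raising IndexError.
    """
    usable = min(len(saved_states), device_count)
    for device_index in range(usable):
        set_state(saved_states[device_index], device_index)
    return usable


if __name__ == "__main__":
    # Stand-in setter that records calls, so the sketch runs without a GPU.
    calls: List[Tuple[Any, int]] = []
    saved = [f"state-{i}" for i in range(7)]  # checkpoint written with 7 GPUs
    restored = restore_cuda_rng_states(saved, 6, lambda s, i: calls.append((s, i)))
    print(restored, len(calls))
```

With PyTorch, the call would presumably look like restore_cuda_rng_states(rng_state["torch_cuda"], torch.cuda.device_count(), torch.cuda.set_rng_state). Dropping the extra states means the resumed run is no longer bit-for-bit reproducible, which is usually acceptable when the GPU count has changed anyway.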

AfterHAL · 2024-12-22T18:36:06Z

Hi. Unfortunately, it doesn't work. Same error:

[rank0]:     default_generator = torch.cuda.default_generators[idx]
[rank0]: IndexError: tuple index out of range

I tried using --train.load_from xxx.pth (the parameter does not exist),
--load_from xxx.pth: same error
--model.load_from xxx.pth: same error
I also tried changing the --train.work_dir, without success.

lawrence-cj · 2024-12-22T18:52:37Z

Oh, it should be --model.load_from xxx.pth

AfterHAL · 2024-12-23T08:05:27Z

Hi.
I already tried with --model.load_from xxx.pth, but it gives another error:

[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 798, in main
[rank0]:     load_ema=config.model.resume_from.get("load_ema", False),
[rank0]: AttributeError: 'NoneType' object has no attribute 'get'

And using --model.load_from xxx.pth & model.resume_from.load_ema: true goes back to the previous error:

[rank0]:     default_generator = torch.cuda.default_generators[idx]
[rank0]: IndexError: tuple index out of range
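
The AttributeError quoted in this comment is what Python raises when .get is called on None, which suggests that config.model.resume_from was empty at that point. A minimal sketch of a None-safe lookup follows; the helper resume_option and the SimpleNamespace stand-in for the parsed config are hypothetical and only illustrate the pattern, not the actual train.py logic.

```python
from types import SimpleNamespace
from typing import Any, Mapping, Optional


def resume_option(resume_from: Optional[Mapping[str, Any]], key: str, default: Any) -> Any:
    """Read one resume option, treating a missing resume section as empty."""
    if resume_from is None:
        return default
    return resume_from.get(key, default)


if __name__ == "__main__":
    # Hypothetical stand-in for the parsed config; only the shape matters here.
    model_cfg = SimpleNamespace(resume_from=None)
    print(resume_option(model_cfg.resume_from, "load_ema", False))
```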

AfterHAL · 2024-12-28T00:59:11Z

Hi. I'm still stuck with this problem and have no clue...
Can someone help me please?

AfterHAL · 2025-01-03T16:47:05Z

Hi @lawrence-cj .
I'm still stuck with it. Some help would be appreciated.
Thanks.

lawrence-cj · 2025-01-05T11:29:28Z

[rank0]: File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 798, in main
[rank0]: load_ema=config.model.resume_from.get("load_ema", False),
[rank0]: AttributeError: 'NoneType' object has no attribute 'get'

What is the output of args.resume_from or config.model.resume_from when you try with --model.load_from xxx.pth, and what's your complete launch command?
@AfterHAL

AfterHAL · 2025-01-05T16:02:02Z

Thank you @lawrence-cj .

This checkpoint is currently at 222k steps trained on 2 GPUs.
Here is the command to (re)start this training session on 1 GPU, after manually changing the torchrun argument --nproc_per_node=1:

CUDA_VISIBLE_DEVICES=7 \
bash train_scripts/train.sh \
/SanaFT/Sana_600M_img512_woman_Test02_v01.yaml \
--work_dir=/SanaFT/output/Sana_600M_img512_woman_Test02 \
--data.type=SanaImgDataset \
--model.multi_scale=false \
--train.train_batch_size=5 \
--data.data_dir="[/Datasets/Women_Cap]" \
--model.load_from=/SanaFT/output/Sana_600M_img512_woman_Test02/checkpoints/epoch_1_step_222000.pth

And here is the output:

2025-01-05 18:18:13 - [Sana] - INFO - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

2025-01-05 18:18:13 - [Sana] - INFO - Config:
{
    "data": {
        "data_dir": [
            "/Datasets/Women_Cap"
        ],
        "caption_proportion": {
            "prompt": 1
        },
        "external_caption_suffixes": [
            "",
            "_InternVL2-26B",
            "_VILA1-5-13B"
        ],
        "external_clipscore_suffixes": [
            "_InternVL2-26B_clip_score",
            "_VILA1-5-13B_clip_score",
            "_prompt_clip_score"
        ],
        "clip_thr_temperature": 0.1,
        "clip_thr": 25.0,
        "sort_dataset": false,
        "load_text_feat": false,
        "load_vae_feat": false,
        "transform": "default_train",
        "type": "SanaImgDataset",
        "image_size": 512,
        "hq_only": false,
        "valid_num": 0,
        "data": null,
        "extra": null
    },
    "model": {
        "model": "SanaMS_600M_P1_D28",
        "image_size": 512,
        "mixed_precision": "fp16",
        "fp32_attention": true,
        "load_from": "/SanaFT/output/Sana_600M_img512_woman_Test02/checkpoints/epoch_1_step_222000.pth",
        "resume_from": {
            "checkpoint": "latest",
            "load_ema": false,
            "resume_optimizer": true,
            "resume_lr_scheduler": true
        },
        "aspect_ratio_type": "ASPECT_RATIO_512",
        "multi_scale": false,
        "pe_interpolation": 1.0,
        "micro_condition": false,
        "attn_type": "linear",
        "autocast_linear_attn": false,
        "ffn_type": "glumbconv",
        "mlp_acts": [
            "silu",
            "silu",
            null
        ],
        "mlp_ratio": 2.5,
        "use_pe": false,
        "qk_norm": false,
        "class_dropout_prob": 0.1,
        "linear_head_dim": 32,
        "cross_norm": false,
        "cfg_scale": 4,
        "guidance_type": "classifier-free",
        "pag_applied_layers": [
            14
        ],
        "extra": null
    },
    "vae": {
        "vae_type": "dc-ae",
        "vae_pretrained": "mit-han-lab/dc-ae-f32c32-sana-1.0",
        "weight_dtype": "bfloat16",
        "scale_factor": 0.41407,
        "vae_latent_dim": 32,
        "vae_downsample_rate": 32,
        "sample_posterior": true,
        "extra": null
    },
    "text_encoder": {
        "text_encoder_name": "gemma-2-2b-it",
        "caption_channels": 2304,
        "y_norm": true,
        "y_norm_scale_factor": 0.01,
        "model_max_length": 300,
        "chi_prompt": [],
        "extra": null
    },
    "scheduler": {
        "train_sampling_steps": 1000,
        "predict_v": true,
        "noise_schedule": "linear_flow",
        "pred_sigma": false,
        "learn_sigma": true,
        "vis_sampler": "flow_dpm-solver",
        "flow_shift": 1.0,
        "weighting_scheme": "logit_normal",
        "logit_mean": 0.0,
        "logit_std": 1.0,
        "extra": null
    },
    "train": {
        "num_workers": 4,
        "seed": 1,
        "train_batch_size": 5,
        "num_epochs": 40,
        "gradient_accumulation_steps": 1,
        "grad_checkpointing": true,
        "gradient_clip": 0.1,
        "gc_step": 1,
        "optimizer": {
            "betas": [
                0.9,
                0.998,
                0.9999
            ],
            "eps": [
                1e-30,
                1e-16
            ],
            "lr": 0.0001,
            "type": "CAMEWrapper",
            "weight_decay": 0.0
        },
        "lr_schedule": "constant",
        "lr_schedule_args": {
            "num_warmup_steps": 200
        },
        "auto_lr": {
            "rule": "sqrt"
        },
        "ema_rate": 0.9999,
        "eval_batch_size": 16,
        "use_fsdp": false,
        "use_flash_attn": false,
        "eval_sampling_steps": 250,
        "lora_rank": 4,
        "log_interval": 50,
        "mask_type": "null",
        "mask_loss_coef": 0.0,
        "load_mask_index": false,
        "snr_loss": false,
        "real_prompt_ratio": 1.0,
        "training_hours": 10000.0,
        "save_image_epochs": 1,
        "save_model_epochs": 2,
        "save_model_steps": 500,
        "visualize": true,
        "null_embed_root": "output/pretrained_models/",
        "valid_prompt_embed_root": "output/tmp_embed/",
        "validation_prompts": [
            "a woman",
            "a woman wearing lingerie",
            "a sexy woman wearing lingerie and stiletto heels.",
            "a woman smiling.",
            "a full-length woman in underwear."
        ],
        "local_save_vis": true,
        "deterministic_validation": true,
        "online_metric": false,
        "eval_metric_step": 2000,
        "online_metric_dir": "metric_helper",
        "work_dir": "output/debug",
        "skip_step": 0,
        "loss_type": "huber",
        "huber_c": 0.001,
        "num_ddim_timesteps": 50,
        "w_max": 15.0,
        "w_min": 3.0,
        "ema_decay": 0.99,
        "debug_nan": false,
        "extra": null
    },
    "work_dir": "/SanaFT/output/Sana_600M_img512_woman_Test02",
    "resume_from": "latest",
    "load_from": null,
    "debug": false,
    "caching": false,
    "report_to": "tensorboard",
    "tracker_project_name": "t2i-evit-baseline",
    "name": "tmp",
    "loss_report_name": "loss"
}
2025-01-05 18:18:13 - [Sana] - INFO - World_size: 1, seed: 1
2025-01-05 18:18:13 - [Sana] - INFO - Initializing: DDP for training
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.85it/s]
2025-01-05 18:18:21 - [Sana] - INFO - vae type: dc-ae
2025-01-05 18:18:21 - [Sana] - INFO - v-prediction: True, noise schedule: linear_flow, flow shift: 1.0, flow weighting: logit_normal, logit-mean: 0.0, logit-std: 1.0
2025-01-05 18:18:24 - [Sana] - WARNING - use pe: False, position embed interpolation: 1.0, base size: 16
2025-01-05 18:18:24 - [Sana] - WARNING - attention type: linear; ffn type: glumbconv; autocast linear attn: false
2025-01-05 18:18:27 - [Sana] - INFO - SanaMS:SanaMS_600M_P1_D28, Model Parameters: 591.75M
2025-01-05 18:18:27 - [Sana] - INFO - Constructing dataset SanaImgDataset...
2025-01-05 18:18:27 - [Sana] - INFO - Dataset is repeat 2000 times for toy dataset
2025-01-05 18:18:27 - [Sana] - INFO - Dataset samples: 17970000
2025-01-05 18:18:27 - [Sana] - INFO - Loading external caption json from: original_filename['', '_InternVL2-26B', '_VILA1-5-13B'].json
2025-01-05 18:18:27 - [Sana] - INFO - Loading external clipscore json from: original_filename['_InternVL2-26B_clip_score', '_VILA1-5-13B_clip_score', '_prompt_clip_score'].json
2025-01-05 18:18:27 - [Sana] - INFO - external caption clipscore threshold: 25.0, temperature: 0.1
2025-01-05 18:18:27 - [Sana] - INFO - Text max token length: 300
2025-01-05 18:18:27 - [Sana] - INFO - Dataset SanaImgDataset constructed: time: 0.03 s, length (use/ori): 17970000/17970000
2025-01-05 18:18:28 - [Sana] - INFO - Automatically adapt lr to 0.00001 (using sqrt scaling rule).
2025-01-05 18:18:28 - [Sana] - INFO - CAMEWrapper Optimizer: total 436 param groups, 436 are learnable, 0 are fix. Lr group: 436 params with lr 0.00001; Weight decay group: 436 params with weight decay 0.0.
2025-01-05 18:18:28 - [Sana] - INFO - Lr schedule: constant, num_warmup_steps:200.
2025-01-05 18:18:28 - [Sana] - WARNING - Basic Setting: lr: 0.00001, bs: 5, gc: True, gc_accum_step: 1, qk norm: False, fp32 attn: True, attn type: linear, ffn type: glumbconv, text encoder: gemma-2-2b-it, captions: {'prompt': 1}, precision: fp16
[Sana] Loading model from /mnt/d/TODAI/SanaFT/output/Sana_600M_img512_woman_FromPreTrained_Test02/checkpoints/epoch_1_step_222000.pth
2025-01-05 18:19:00 - [Sana] - INFO - Resume checkpoint of epoch 1 from /mnt/d/TODAI/SanaFT/output/Sana_600M_img512_woman_FromPreTrained_Test02/checkpoints/epoch_1_step_222000.pth. Load ema: False, resume optimizer： True, resume lr scheduler: True.
2025-01-05 18:19:00 - [Sana] - WARNING - Missing keys: ['pos_embed']
2025-01-05 18:19:00 - [Sana] - WARNING - Unexpected keys: []
2025-01-05 18:19:00 - [Sana] - INFO - resuming randomise
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank0]:     response = fn(cfg, *args, **kwargs)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank0]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank0]:     set_rng_state(state, i)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank0]:     _lazy_call(cb)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank0]:     callable()
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank0]:     default_generator = torch.cuda.default_generators[idx]
[rank0]: IndexError: tuple index out of range
E0105 18:19:03.088000 139642610978944 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2343) of binary: /mnt/d/TODAI/apps/SanaLinux/venv/bin/python3
Traceback (most recent call last):
  File "/mnt/d/TODAI/apps/SanaLinux/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_scripts/train.py FAILED
------------------------------------------------------------

So model.load_from is set correctly, but resume_from is still set to "latest": model.resume_from.checkpoint: "latest"

    "model": {
        "model": "SanaMS_600M_P1_D28",
        "image_size": 512,
        "mixed_precision": "fp16",
        "fp32_attention": true,
        "load_from": "/SanaFT/output/Sana_600M_img512_woman_Test02/checkpoints/epoch_1_step_222000.pth",
        "resume_from": {
            "checkpoint": "latest",
            "load_ema": false,
            "resume_optimizer": true,
            "resume_lr_scheduler": true
        },

And config.resume_from is also set to "latest": "resume_from": "latest"

How can I force it?

lawrence-cj · 2025-01-06T08:29:33Z

#131
This PR would fix the problem @AfterHAL

AfterHAL · 2025-01-08T18:12:32Z

Hi @lawrence-cj .
I had time to try this today, and it works.
Thanks a lot.
I'm not very into coding; I prefer to use the tools and see what we can achieve with them.
Thanks a lot for your help and understanding @lawrence-cj .

lawrence-cj added the Answered label on Dec 18, 2024
