Problem launching training with less GPUs #92

Open
AfterHAL opened this issue Dec 15, 2024 · 10 comments

@AfterHAL

Hi.
I set up a Sana training session with one 4090 GPU on a PC. Everything was fine, so I moved the config and the checkpoint to a PC with 7 x 4090, and everything also worked on multi-GPU.
Later, I restarted the training session with only 6 GPUs and got this error:
(Note that it restarts fine with 7 GPUs, but not with fewer.)

2024-12-15 23:01:07 - [Sana] - INFO - World_size: 6, seed: 1
2024-12-15 23:01:07 - [Sana] - INFO - Initializing: DDP for training
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.46it/s]
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.35it/s]
2024-12-15 23:01:16 - [Sana] - INFO - vae type: dc-ae
2024-12-15 23:01:16 - [Sana] - INFO - v-prediction: True, noise schedule: linear_flow, flow shift: 1.0, flow weighting: logit_normal, logit-mean: 0.0, logit-std: 1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.08it/s]
2024-12-15 23:01:19 - [Sana] - WARNING - use pe: False, position embed interpolation: 1.0, base size: 16
2024-12-15 23:01:19 - [Sana] - WARNING - attention type: linear; ffn type: glumbconv; autocast linear attn: false
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.36it/s]
2024-12-15 23:01:22 - [Sana] - INFO - SanaMS:SanaMS_600M_P1_D28, Model Parameters: 591.75M
2024-12-15 23:01:22 - [Sana] - INFO - Constructing dataset SanaImgDataset...
2024-12-15 23:01:22 - [Sana] - INFO - Dataset is repeat 2000 times for toy dataset
2024-12-15 23:01:22 - [Sana] - INFO - Dataset samples: 107476000
2024-12-15 23:01:22 - [Sana] - INFO - Loading external caption json from: original_filename['', '_InternVL2-26B', '_VILA1-5-13B'].json
2024-12-15 23:01:22 - [Sana] - INFO - Loading external clipscore json from: original_filename['_InternVL2-26B_clip_score', '_VILA1-5-13B_clip_score', '_prompt_clip_score'].json
2024-12-15 23:01:22 - [Sana] - INFO - external caption clipscore threshold: 25.0, temperature: 0.1
2024-12-15 23:01:22 - [Sana] - INFO - T5 max token length: 300
2024-12-15 23:01:22 - [Sana] - INFO - Dataset SanaImgDataset constructed: time: 0.16 s, length (use/ori): 107476000/107476000
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.22it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.65it/s]
2024-12-15 23:01:38 - [Sana] - INFO - Automatically adapt lr to 0.00003 (using sqrt scaling rule).
2024-12-15 23:01:38 - [Sana] - INFO - CAMEWrapper Optimizer: total 436 param groups, 436 are learnable, 0 are fix. Lr group: 436 params with lr 0.00003; Weight decay group: 436 params with weight decay 0.001.
2024-12-15 23:01:38 - [Sana] - INFO - Lr schedule: cosine, num_warmup_steps:6000.
2024-12-15 23:01:38 - [Sana] - WARNING - Basic Setting: lr: 0.00003, bs: 4, gc: True, gc_accum_step: 1, qk norm: False, fp32 attn: True, attn type: linear, ffn type: glumbconv, text encoder: gemma-2-2b-it, captions: {'prompt': 1}, precision: fp16
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[Sana] Loading model from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth
[rank4]: Traceback (most recent call last):
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank4]:     main()
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank4]:     response = fn(cfg, *args, **kwargs)
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank4]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank4]:     set_rng_state(state, i)
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank4]:     _lazy_call(cb)
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank4]:     callable()
[rank4]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank4]:     default_generator = torch.cuda.default_generators[idx]
[rank4]: IndexError: tuple index out of range
[rank5]: Traceback (most recent call last):
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank5]:     main()
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank5]:     response = fn(cfg, *args, **kwargs)
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank5]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank5]:     set_rng_state(state, i)
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank5]:     _lazy_call(cb)
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank5]:     callable()
[rank5]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank5]:     default_generator = torch.cuda.default_generators[idx]
[rank5]: IndexError: tuple index out of range
[rank2]: Traceback (most recent call last):
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank2]:     main()
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank2]:     response = fn(cfg, *args, **kwargs)
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank2]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank2]:     set_rng_state(state, i)
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank2]:     _lazy_call(cb)
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank2]:     callable()
[rank2]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank2]:     default_generator = torch.cuda.default_generators[idx]
[rank2]: IndexError: tuple index out of range
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank1]:     main()
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank1]:     response = fn(cfg, *args, **kwargs)
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank1]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank1]:     set_rng_state(state, i)
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank1]:     _lazy_call(cb)
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank1]:     callable()
[rank1]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank1]:     default_generator = torch.cuda.default_generators[idx]
[rank1]: IndexError: tuple index out of range
[rank3]: Traceback (most recent call last):
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank3]:     main()
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank3]:     response = fn(cfg, *args, **kwargs)
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank3]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank3]:     set_rng_state(state, i)
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank3]:     _lazy_call(cb)
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank3]:     callable()
[rank3]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank3]:     default_generator = torch.cuda.default_generators[idx]
[rank3]: IndexError: tuple index out of range
2024-12-15 23:03:36 - [Sana] - INFO - Resume checkpoint of epoch 1 from /mnt/d/TODAI/SanaFT/Sana_600M_img512_FFT/checkpoints/epoch_1_step_187000.pth. Load ema: False, resume optimizer: True, resume lr scheduler: True.
2024-12-15 23:03:36 - [Sana] - WARNING - Missing keys: ['pos_embed']
2024-12-15 23:03:36 - [Sana] - WARNING - Unexpected keys: []
2024-12-15 23:03:36 - [Sana] - INFO - resuming randomise
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank0]:     response = fn(cfg, *args, **kwargs)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank0]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank0]:     set_rng_state(state, i)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank0]:     _lazy_call(cb)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank0]:     callable()
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank0]:     default_generator = torch.cuda.default_generators[idx]
[rank0]: IndexError: tuple index out of range
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5193 closing signal SIGTERM
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5195 closing signal SIGTERM
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5196 closing signal SIGTERM
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5197 closing signal SIGTERM
W1215 23:03:38.333000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5198 closing signal SIGTERM
E1215 23:03:38.848000 140181982818432 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 5194) of binary: /mnt/d/TODAI/apps/SanaLinux/venv/bin/python3
Traceback (most recent call last):
  File "/mnt/d/TODAI/apps/SanaLinux/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-15_23:03:38
  host      : RB-MOLECULEXL0001.
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 5194)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@lawrence-cj
Collaborator

Oh, this is because we save the seed and random state information in the checkpoint. So when you resume from a ckpt saved on a 7-GPU node and run the training on 6 GPUs, it triggers the bug you hit. One solution is to use the --train.load_from xxx.pth arg and save new ckpts to a new folder. Let me know if it helps. @AfterHAL
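
To illustrate the failure above: the traceback shows torch.cuda.set_rng_state_all walking the saved list of per-device states and indexing torch.cuda.default_generators once per entry, so a list saved with 7 visible GPUs overruns that tuple when only 6 are visible. A minimal sketch of a guard, assuming the resume code may simply skip restoring the CUDA RNG state in that case. pick_cuda_rng_states is a hypothetical helper, not part of Sana or PyTorch, and this is not necessarily how the maintainers addressed it:

```python
from typing import Optional, Sequence


def pick_cuda_rng_states(saved_states: Sequence[bytes], device_count: int) -> Optional[list]:
    """Return the states to restore, or None when the checkpoint holds
    more per-device states than there are visible devices."""
    if len(saved_states) > device_count:
        return None
    return list(saved_states)


# Hypothetical use in a resume path (illustrative names, not Sana's code):
#   states = pick_cuda_rng_states(rng_state["torch_cuda"], torch.cuda.device_count())
#   if states is not None:
#       torch.cuda.set_rng_state_all(states)

saved_on_7_gpus = [b"state"] * 7  # stand-in for 7 saved per-device states
print(pick_cuda_rng_states(saved_on_7_gpus, 6))       # None
print(len(pick_cuda_rng_states(saved_on_7_gpus, 7)))  # 7
```

Skipping the restore trades exact reproducibility of the random stream for the ability to resume on a different number of GPUs.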

@lawrence-cj lawrence-cj added the Answered Answered the question label Dec 18, 2024
@AfterHAL
Author

Oh, this is because we save the seed and random state information in the checkpoint. So when you resume from a ckpt saved on a 7-GPU node and run the training on 6 GPUs, it triggers the bug you hit. One solution is to use the --train.load_from xxx.pth arg and save new ckpts to a new folder. Let me know if it helps. @AfterHAL

Hi. Unfortunately, it doesn't work. Same error:

[rank0]:     default_generator = torch.cuda.default_generators[idx]
[rank0]: IndexError: tuple index out of range

I tried using --train.load_from xxx.pth (the parameter does not exist),
--load_from xxx.pth : same error
--model.load_from xxx.pth : same error
I also tried changing --train.work_dir, without success.

@lawrence-cj
Collaborator

Oh, it should be --model.load_from xxx.pth

@AfterHAL
Author

Oh, it should be --model.load_from xxx.pth

Hi.
I already tried with --model.load_from xxx.pth, but it gives another error:

[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 798, in main
[rank0]:     load_ema=config.model.resume_from.get("load_ema", False),
[rank0]: AttributeError: 'NoneType' object has no attribute 'get'

And using --model.load_from xxx.pth & model.resume_from.load_ema: true goes back to the previous error:

[rank0]:     default_generator = torch.cuda.default_generators[idx]
[rank0]: IndexError: tuple index out of range
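
The AttributeError above means config.model.resume_from was None when train.py called .get on it. A minimal sketch of a defensive lookup, assuming resume_from is either None or a dict-like mapping. resume_option is a hypothetical helper and not code from the repository:

```python
from typing import Any, Mapping, Optional


def resume_option(resume_from: Optional[Mapping[str, Any]], key: str, default: Any = False) -> Any:
    """Read one option from a resume_from block that may be missing (None)."""
    if resume_from is None:
        return default
    return resume_from.get(key, default)


print(resume_option(None, "load_ema"))                # False
print(resume_option({"load_ema": True}, "load_ema"))  # True
```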

@AfterHAL
Author

Hi. I'm still stuck with this problem and have no clue...
Can someone help me please?

@AfterHAL
Author

AfterHAL commented Jan 3, 2025

Hi @lawrence-cj .
I'm still stuck with it. Some help would be appreciated.
Thanks.

@lawrence-cj
Collaborator

lawrence-cj commented Jan 5, 2025

[rank0]: File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 798, in main
[rank0]: load_ema=config.model.resume_from.get("load_ema", False),
[rank0]: AttributeError: 'NoneType' object has no attribute 'get'

What is the output of args.resume_from or config.model.resume_from when you tried with --model.load_from xxx.pth, and what's your complete launch command?
@AfterHAL

@AfterHAL
Author

AfterHAL commented Jan 5, 2025

Thank you @lawrence-cj .

This checkpoint is currently at 222k steps trained on 2 GPUs.
Here is the command to (re)start this training session on 1 GPU, after manually changing the torchrun argument --nproc_per_node=1:

CUDA_VISIBLE_DEVICES=7 \
bash train_scripts/train.sh \
/SanaFT/Sana_600M_img512_woman_Test02_v01.yaml \
--work_dir=/SanaFT/output/Sana_600M_img512_woman_Test02 \
--data.type=SanaImgDataset \
--model.multi_scale=false \
--train.train_batch_size=5 \
--data.data_dir="[/Datasets/Women_Cap]" \
--model.load_from=/SanaFT/output/Sana_600M_img512_woman_Test02/checkpoints/epoch_1_step_222000.pth

And here is the output:

2025-01-05 18:18:13 - [Sana] - INFO - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

2025-01-05 18:18:13 - [Sana] - INFO - Config:
{
    "data": {
        "data_dir": [
            "/Datasets/Women_Cap"
        ],
        "caption_proportion": {
            "prompt": 1
        },
        "external_caption_suffixes": [
            "",
            "_InternVL2-26B",
            "_VILA1-5-13B"
        ],
        "external_clipscore_suffixes": [
            "_InternVL2-26B_clip_score",
            "_VILA1-5-13B_clip_score",
            "_prompt_clip_score"
        ],
        "clip_thr_temperature": 0.1,
        "clip_thr": 25.0,
        "sort_dataset": false,
        "load_text_feat": false,
        "load_vae_feat": false,
        "transform": "default_train",
        "type": "SanaImgDataset",
        "image_size": 512,
        "hq_only": false,
        "valid_num": 0,
        "data": null,
        "extra": null
    },
    "model": {
        "model": "SanaMS_600M_P1_D28",
        "image_size": 512,
        "mixed_precision": "fp16",
        "fp32_attention": true,
        "load_from": "/SanaFT/output/Sana_600M_img512_woman_Test02/checkpoints/epoch_1_step_222000.pth",
        "resume_from": {
            "checkpoint": "latest",
            "load_ema": false,
            "resume_optimizer": true,
            "resume_lr_scheduler": true
        },
        "aspect_ratio_type": "ASPECT_RATIO_512",
        "multi_scale": false,
        "pe_interpolation": 1.0,
        "micro_condition": false,
        "attn_type": "linear",
        "autocast_linear_attn": false,
        "ffn_type": "glumbconv",
        "mlp_acts": [
            "silu",
            "silu",
            null
        ],
        "mlp_ratio": 2.5,
        "use_pe": false,
        "qk_norm": false,
        "class_dropout_prob": 0.1,
        "linear_head_dim": 32,
        "cross_norm": false,
        "cfg_scale": 4,
        "guidance_type": "classifier-free",
        "pag_applied_layers": [
            14
        ],
        "extra": null
    },
    "vae": {
        "vae_type": "dc-ae",
        "vae_pretrained": "mit-han-lab/dc-ae-f32c32-sana-1.0",
        "weight_dtype": "bfloat16",
        "scale_factor": 0.41407,
        "vae_latent_dim": 32,
        "vae_downsample_rate": 32,
        "sample_posterior": true,
        "extra": null
    },
    "text_encoder": {
        "text_encoder_name": "gemma-2-2b-it",
        "caption_channels": 2304,
        "y_norm": true,
        "y_norm_scale_factor": 0.01,
        "model_max_length": 300,
        "chi_prompt": [],
        "extra": null
    },
    "scheduler": {
        "train_sampling_steps": 1000,
        "predict_v": true,
        "noise_schedule": "linear_flow",
        "pred_sigma": false,
        "learn_sigma": true,
        "vis_sampler": "flow_dpm-solver",
        "flow_shift": 1.0,
        "weighting_scheme": "logit_normal",
        "logit_mean": 0.0,
        "logit_std": 1.0,
        "extra": null
    },
    "train": {
        "num_workers": 4,
        "seed": 1,
        "train_batch_size": 5,
        "num_epochs": 40,
        "gradient_accumulation_steps": 1,
        "grad_checkpointing": true,
        "gradient_clip": 0.1,
        "gc_step": 1,
        "optimizer": {
            "betas": [
                0.9,
                0.998,
                0.9999
            ],
            "eps": [
                1e-30,
                1e-16
            ],
            "lr": 0.0001,
            "type": "CAMEWrapper",
            "weight_decay": 0.0
        },
        "lr_schedule": "constant",
        "lr_schedule_args": {
            "num_warmup_steps": 200
        },
        "auto_lr": {
            "rule": "sqrt"
        },
        "ema_rate": 0.9999,
        "eval_batch_size": 16,
        "use_fsdp": false,
        "use_flash_attn": false,
        "eval_sampling_steps": 250,
        "lora_rank": 4,
        "log_interval": 50,
        "mask_type": "null",
        "mask_loss_coef": 0.0,
        "load_mask_index": false,
        "snr_loss": false,
        "real_prompt_ratio": 1.0,
        "training_hours": 10000.0,
        "save_image_epochs": 1,
        "save_model_epochs": 2,
        "save_model_steps": 500,
        "visualize": true,
        "null_embed_root": "output/pretrained_models/",
        "valid_prompt_embed_root": "output/tmp_embed/",
        "validation_prompts": [
            "a woman",
            "a woman wearing lingerie",
            "a sexy woman wearing lingerie and stiletto heels.",
            "a woman smiling.",
            "a full-length woman in underwear."
        ],
        "local_save_vis": true,
        "deterministic_validation": true,
        "online_metric": false,
        "eval_metric_step": 2000,
        "online_metric_dir": "metric_helper",
        "work_dir": "output/debug",
        "skip_step": 0,
        "loss_type": "huber",
        "huber_c": 0.001,
        "num_ddim_timesteps": 50,
        "w_max": 15.0,
        "w_min": 3.0,
        "ema_decay": 0.99,
        "debug_nan": false,
        "extra": null
    },
    "work_dir": "/SanaFT/output/Sana_600M_img512_woman_Test02",
    "resume_from": "latest",
    "load_from": null,
    "debug": false,
    "caching": false,
    "report_to": "tensorboard",
    "tracker_project_name": "t2i-evit-baseline",
    "name": "tmp",
    "loss_report_name": "loss"
}
2025-01-05 18:18:13 - [Sana] - INFO - World_size: 1, seed: 1
2025-01-05 18:18:13 - [Sana] - INFO - Initializing: DDP for training
[DC-AE] Loading model from mit-han-lab/dc-ae-f32c32-sana-1.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.85it/s]
2025-01-05 18:18:21 - [Sana] - INFO - vae type: dc-ae
2025-01-05 18:18:21 - [Sana] - INFO - v-prediction: True, noise schedule: linear_flow, flow shift: 1.0, flow weighting: logit_normal, logit-mean: 0.0, logit-std: 1.0
2025-01-05 18:18:24 - [Sana] - WARNING - use pe: False, position embed interpolation: 1.0, base size: 16
2025-01-05 18:18:24 - [Sana] - WARNING - attention type: linear; ffn type: glumbconv; autocast linear attn: false
2025-01-05 18:18:27 - [Sana] - INFO - SanaMS:SanaMS_600M_P1_D28, Model Parameters: 591.75M
2025-01-05 18:18:27 - [Sana] - INFO - Constructing dataset SanaImgDataset...
2025-01-05 18:18:27 - [Sana] - INFO - Dataset is repeat 2000 times for toy dataset
2025-01-05 18:18:27 - [Sana] - INFO - Dataset samples: 17970000
2025-01-05 18:18:27 - [Sana] - INFO - Loading external caption json from: original_filename['', '_InternVL2-26B', '_VILA1-5-13B'].json
2025-01-05 18:18:27 - [Sana] - INFO - Loading external clipscore json from: original_filename['_InternVL2-26B_clip_score', '_VILA1-5-13B_clip_score', '_prompt_clip_score'].json
2025-01-05 18:18:27 - [Sana] - INFO - external caption clipscore threshold: 25.0, temperature: 0.1
2025-01-05 18:18:27 - [Sana] - INFO - Text max token length: 300
2025-01-05 18:18:27 - [Sana] - INFO - Dataset SanaImgDataset constructed: time: 0.03 s, length (use/ori): 17970000/17970000
2025-01-05 18:18:28 - [Sana] - INFO - Automatically adapt lr to 0.00001 (using sqrt scaling rule).
2025-01-05 18:18:28 - [Sana] - INFO - CAMEWrapper Optimizer: total 436 param groups, 436 are learnable, 0 are fix. Lr group: 436 params with lr 0.00001; Weight decay group: 436 params with weight decay 0.0.
2025-01-05 18:18:28 - [Sana] - INFO - Lr schedule: constant, num_warmup_steps:200.
2025-01-05 18:18:28 - [Sana] - WARNING - Basic Setting: lr: 0.00001, bs: 5, gc: True, gc_accum_step: 1, qk norm: False, fp32 attn: True, attn type: linear, ffn type: glumbconv, text encoder: gemma-2-2b-it, captions: {'prompt': 1}, precision: fp16
[Sana] Loading model from /mnt/d/TODAI/SanaFT/output/Sana_600M_img512_woman_FromPreTrained_Test02/checkpoints/epoch_1_step_222000.pth
2025-01-05 18:19:00 - [Sana] - INFO - Resume checkpoint of epoch 1 from /mnt/d/TODAI/SanaFT/output/Sana_600M_img512_woman_FromPreTrained_Test02/checkpoints/epoch_1_step_222000.pth. Load ema: False, resume optimizer: True, resume lr scheduler: True.
2025-01-05 18:19:00 - [Sana] - WARNING - Missing keys: ['pos_embed']
2025-01-05 18:19:00 - [Sana] - WARNING - Unexpected keys: []
2025-01-05 18:19:00 - [Sana] - INFO - resuming randomise
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 977, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank0]:     response = fn(cfg, *args, **kwargs)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/Sana/train_scripts/train.py", line 950, in main
[rank0]:     torch.cuda.set_rng_state_all(rng_state["torch_cuda"])
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank0]:     set_rng_state(state, i)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank0]:     _lazy_call(cb)
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank0]:     callable()
[rank0]:   File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/cuda/random.py", line 72, in cb
[rank0]:     default_generator = torch.cuda.default_generators[idx]
[rank0]: IndexError: tuple index out of range
E0105 18:19:03.088000 139642610978944 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2343) of binary: /mnt/d/TODAI/apps/SanaLinux/venv/bin/python3
Traceback (most recent call last):
  File "/mnt/d/TODAI/apps/SanaLinux/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/d/TODAI/apps/SanaLinux/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_scripts/train.py FAILED
------------------------------------------------------------

So model.load_from is set correctly, but resume_from is still set to "latest": model.resume_from.checkpoint: "latest"

    "model": {
        "model": "SanaMS_600M_P1_D28",
        "image_size": 512,
        "mixed_precision": "fp16",
        "fp32_attention": true,
        "load_from": "/SanaFT/output/Sana_600M_img512_woman_Test02/checkpoints/epoch_1_step_222000.pth",
        "resume_from": {
            "checkpoint": "latest",
            "load_ema": false,
            "resume_optimizer": true,
            "resume_lr_scheduler": true
        },

And config.resume_from is also set to latest: "resume_from": "latest"

How can I force it?
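
The config dump above has both model.load_from and model.resume_from.checkpoint set, and the log line "Resume checkpoint of epoch 1 from ..." suggests the resume path still ran. A minimal sketch for spotting that combination before launching, assuming the config has been loaded into a plain dict shaped like the dump. has_load_resume_conflict is a hypothetical helper, and what train.py actually does with each field is not confirmed here:

```python
from typing import Any, Mapping


def has_load_resume_conflict(model_cfg: Mapping[str, Any]) -> bool:
    """True when both load_from and resume_from.checkpoint are set in the model block."""
    resume = model_cfg.get("resume_from") or {}
    return bool(model_cfg.get("load_from")) and bool(resume.get("checkpoint"))


# Values copied from the "model" block of the config dump in this comment.
model_cfg = {
    "load_from": "/SanaFT/output/Sana_600M_img512_woman_Test02/checkpoints/epoch_1_step_222000.pth",
    "resume_from": {"checkpoint": "latest", "load_ema": False},
}
print(has_load_resume_conflict(model_cfg))  # True
```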

@lawrence-cj
Collaborator

#131
This PR would fix the problem @AfterHAL

@AfterHAL
Author

AfterHAL commented Jan 8, 2025

#131 This PR would fix the problem @AfterHAL

Hi @lawrence-cj .
I had time to try this today, and it works.
Thanks a lot.
I'm not very into coding; I prefer to use the tools and see what we can achieve with them.
Thanks a lot for your help and understanding @lawrence-cj .
