: logit_normal, logit-mean: 0.0, logit-std: 1.0
2025-01-03 23:03:17 - [Sana] - WARNING - use pe: False, position embed interpolation: 1.0, base size: 32
2025-01-03 23:03:17 - [Sana] - WARNING - attention type: linear; ffn type: glumbconv; autocast linear attn: false
2025-01-03 23:03:35 - [Sana] - INFO - SanaMS:SanaMS_1600M_P1_D20, Model Parameters: 1604.46M
2025-01-03 23:03:35 - [Sana] - INFO - Constructing dataset SanaWebDatasetMS...
2025-01-03 23:03:35 - [Sana] - INFO - loading from /home/linjl/zzc/dataset/wids-meta.json
2025-01-03 23:03:35 - [Sana] - INFO - [SimplyInternal] Loading meta information /home/linjl/zzc/dataset/wids-meta.json
2025-01-03 23:03:35 - [Sana] - INFO - [WebShardedList] /home/linjl/zzc/dataset/wids-meta.json, base: ('/home/linjl/zzc/dataset',), name: , nfiles: 3nbytes: 0, samples: ('22842',), cache: /home/linjl/.cache/_wids_cache/linjl-0777c725
2025-01-03 23:03:35 - [Sana] - INFO - Loading external caption json from: original_filename[''].json
2025-01-03 23:03:35 - [Sana] - INFO - Loading external clipscore json from: original_filename['_InternVL2-26B_clip_score', '_VILA1-5-13B_clip_score', '_prompt_clip_score'].json
2025-01-03 23:03:35 - [Sana] - INFO - external caption clipscore threshold: 25.0, temperature: 0.1
2025-01-03 23:03:35 - [Sana] - INFO - Text max token length: 300
2025-01-03 23:03:35 - [Sana] - WARNING - Sort the dataset: False
2025-01-03 23:03:35 - [Sana] - INFO - Dataset SanaWebDatasetMS constructed: time: 0.00 s, length (use/ori): 22842/22842
2025-01-03 23:03:37 - [Sana] - WARNING - Using valid_num=0 in config file. Available 40 aspect_ratios: ['0.25', '0.26', '0.27', '0.28', '0.32', '0.33', '0.35', '0.4', '0.42', '0.48', '0.5', '0.52', '0.57', '0.6', '0.68', '0.72', '0.78', '0.82', '0.88', '0.94', '1.0', '1.07', '1.13', '1.21', '1.29', '1.38', '1.46', '1.67', '1.75', '2.0', '2.09', '2.4', '2.5', '2.89', '3.0', '3.11', '3.62', '3.75', '3.88', '4.0']
2025-01-03 23:03:37 - [Sana] - INFO - No cached file is found, dataloader is slow: /home/linjl/.cache/_wids_batchsampler_cache/linjl-8b140d22-sort_datasetFalse-hq_onlyFalse-valid_num0-aspect_ratio40-droplastTruedataset_len22842-num_replicas4-rank0-/home/linjl/zzc/dataset.json
2025-01-03 23:03:37 - [Sana] - INFO - rank-0 Cached file len: 0
2025-01-03 23:03:37 - [Sana] - INFO - Automatically adapt lr to 0.00001 (using sqrt scaling rule).
[Sana] Loading model from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth
[Sana] Loading model from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth
2025-01-03 23:03:37 - [Sana] - INFO - CAMEWrapper Optimizer: total 316 param groups, 316 are learnable, 0 are fix. Lr group: 316 params with lr 0.00001; Weight decay group: 316 params with weight decay 0.0.
2025-01-03 23:03:37 - [Sana] - INFO - Lr schedule: constant, num_warmup_steps:1600.
2025-01-03 23:03:37 - [Sana] - WARNING - Basic Setting: lr: 0.00001, bs: 1, gc: True, gc_accum_step: 1, qk norm: False, fp32 attn: True, attn type: linear, ffn type: glumbconv, text encoder: gemma-2-2b-it, captions: {'prompt': 1}, precision: fp16
[Sana] Loading model from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth
[Sana] Loading model from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth
2025-01-03 23:03:42 - [Sana] - INFO - Load checkpoint from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth. Load ema: False.
2025-01-03 23:03:42 - [Sana] - WARNING - Missing keys: ['pos_embed']
2025-01-03 23:03:42 - [Sana] - WARNING - Unexpected keys: []
W0103 23:03:57.870000 140372742742400 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2597124 closing signal SIGTERM
W0103 23:03:57.871000 140372742742400 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2597125 closing signal SIGTERM
W0103 23:03:57.871000 140372742742400 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2597127 closing signal SIGTERM
E0103 23:04:01.008000 140372742742400 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -7) local_rank: 2 (pid: 2597126) of binary: /home/linjl/anaconda3/envs/sana/bin/python
Traceback (most recent call last):
  File "/home/linjl/anaconda3/envs/sana/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_scripts/train.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-03_23:03:57
host : bme-server
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 2597126)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 2597126
=======================================================
The accelerate test already passes, and both RAM and virtual memory are sufficient.
The environment was set up following this project's official steps, on 4x 3090 GPUs.
Pevernow changed the title from "Received signal 7 (SIGBUS) during training with multiple GPUs" to "[Bug]Received signal 7 (SIGBUS) during training with multiple GPUs" on Jan 3, 2025
@lawrence-cj
Solved.
This was caused by a missing "file_name" field in the webdataset JSON, but the program fails without any hint of that. I hope the exception handling can be improved.
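A pre-flight check along the following lines could catch this kind of problem before launching training. This is only a minimal sketch: the exact layout of the webdataset JSON is not shown in this thread, so the code assumes the file holds either a list of per-sample records or a dict whose values are such records. The helper names `load_records` and `find_records_missing_keys` are hypothetical and not part of Sana; only the "file_name" key comes from the report above.

```python
import json
from pathlib import Path

# Only "file_name" is known from this thread; extend the tuple as needed.
REQUIRED_KEYS = ("file_name",)


def load_records(path):
    """Load per-sample records from a JSON file.

    Assumption: the file is either a list of dicts or a dict whose
    values are dicts. Adjust this to the real layout of your dataset.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return list(data.values())
    return list(data)


def find_records_missing_keys(records, required=REQUIRED_KEYS):
    """Return a list of (index, missing_keys) for every bad record."""
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            # A non-dict entry cannot carry any of the required keys.
            problems.append((index, list(required)))
            continue
        missing = [key for key in required if key not in record]
        if missing:
            problems.append((index, missing))
    return problems
```

Running `find_records_missing_keys(load_records("your_shard.json"))` before `torchrun` (the file name here is a placeholder) would list the offending entries instead of leaving the job to die without a message.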
I'm not sure why there is no error message here. 😂 At the very least, there should be a Python error message, right?
@lawrence-cj My guess is that some call triggered a SIGBUS directly, which killed the process immediately and left Python no chance to report an error.
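For anyone debugging a similar silent crash, Python's standard `faulthandler` module can dump the Python traceback when the process receives a fatal signal such as SIGBUS. The sketch below shows how it might be enabled near the top of a training script, plus a small helper that maps a negative exit code like the `-7` in the log to a signal name. Both function names are hypothetical, and whether the dumped traceback would have pointed at the missing "file_name" field in this case is not something this thread confirms.

```python
import faulthandler
import signal
import sys


def enable_crash_tracebacks(stream=None):
    """Ask Python to dump tracebacks on fatal signals (SIGSEGV, SIGBUS, ...).

    `stream` must be a real file with a file descriptor and must stay
    open for as long as the handler is enabled. Defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


def describe_exitcode(exitcode):
    """Translate a process exit code into a readable description.

    Negative exit codes, as reported by torchrun above, mean the child
    was killed by the signal with that number.
    """
    if exitcode < 0:
        try:
            return signal.Signals(-exitcode).name
        except ValueError:
            return f"unknown signal {-exitcode}"
    return f"exit status {exitcode}"
```

Setting the environment variable `PYTHONFAULTHANDLER=1` before launching has the same effect as calling `faulthandler.enable()` at startup, without modifying the script.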