[Bug]Received signal 7 (SIGBUS) during training with multiple GPUs #126

Open
Pevernow opened this issue Jan 3, 2025 · 4 comments
Labels: Answered (Answered the question), fixed (fix a bug)

Comments

@Pevernow

Pevernow commented Jan 3, 2025

: logit_normal, logit-mean: 0.0, logit-std: 1.0
2025-01-03 23:03:17 - [Sana] - WARNING - use pe: False, position embed interpolation: 1.0, base size: 32
2025-01-03 23:03:17 - [Sana] - WARNING - attention type: linear; ffn type: glumbconv; autocast linear attn: false
2025-01-03 23:03:35 - [Sana] - INFO - SanaMS:SanaMS_1600M_P1_D20, Model Parameters: 1604.46M
2025-01-03 23:03:35 - [Sana] - INFO - Constructing dataset SanaWebDatasetMS...
2025-01-03 23:03:35 - [Sana] - INFO - loading from /home/linjl/zzc/dataset/wids-meta.json
2025-01-03 23:03:35 - [Sana] - INFO - [SimplyInternal] Loading meta information /home/linjl/zzc/dataset/wids-meta.json
2025-01-03 23:03:35 - [Sana] - INFO - [WebShardedList] /home/linjl/zzc/dataset/wids-meta.json, base: ('/home/linjl/zzc/dataset',), name: , nfiles: 3nbytes: 0, samples: ('22842',), cache: /home/linjl/.cache/_wids_cache/linjl-0777c725
2025-01-03 23:03:35 - [Sana] - INFO - Loading external caption json from: original_filename[''].json
2025-01-03 23:03:35 - [Sana] - INFO - Loading external clipscore json from: original_filename['_InternVL2-26B_clip_score', '_VILA1-5-13B_clip_score', '_prompt_clip_score'].json
2025-01-03 23:03:35 - [Sana] - INFO - external caption clipscore threshold: 25.0, temperature: 0.1
2025-01-03 23:03:35 - [Sana] - INFO - Text max token length: 300
2025-01-03 23:03:35 - [Sana] - WARNING - Sort the dataset: False
2025-01-03 23:03:35 - [Sana] - INFO - Dataset SanaWebDatasetMS constructed: time: 0.00 s, length (use/ori): 22842/22842
2025-01-03 23:03:37 - [Sana] - WARNING - Using valid_num=0 in config file. Available 40 aspect_ratios: ['0.25', '0.26', '0.27', '0.28', '0.32', '0.33', '0.35', '0.4', '0.42', '0.48', '0.5', '0.52', '0.57', '0.6', '0.68', '0.72', '0.78', '0.82', '0.88', '0.94', '1.0', '1.07', '1.13', '1.21', '1.29', '1.38', '1.46', '1.67', '1.75', '2.0', '2.09', '2.4', '2.5', '2.89', '3.0', '3.11', '3.62', '3.75', '3.88', '4.0']
2025-01-03 23:03:37 - [Sana] - INFO - No cached file is found, dataloader is slow: /home/linjl/.cache/_wids_batchsampler_cache/linjl-8b140d22-sort_datasetFalse-hq_onlyFalse-valid_num0-aspect_ratio40-droplastTruedataset_len22842-num_replicas4-rank0-/home/linjl/zzc/dataset.json
2025-01-03 23:03:37 - [Sana] - INFO - rank-0 Cached file len: 0
2025-01-03 23:03:37 - [Sana] - INFO - Automatically adapt lr to 0.00001 (using sqrt scaling rule).
[Sana] Loading model from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth
[Sana] Loading model from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth
2025-01-03 23:03:37 - [Sana] - INFO - CAMEWrapper Optimizer: total 316 param groups, 316 are learnable, 0 are fix. Lr group: 316 params with lr 0.00001; Weight decay group: 316 params with weight decay 0.0.
2025-01-03 23:03:37 - [Sana] - INFO - Lr schedule: constant, num_warmup_steps:1600.
2025-01-03 23:03:37 - [Sana] - WARNING - Basic Setting: lr: 0.00001, bs: 1, gc: True, gc_accum_step: 1, qk norm: False, fp32 attn: True, attn type: linear, ffn type: glumbconv, text encoder: gemma-2-2b-it, captions: {'prompt': 1}, precision: fp16
[Sana] Loading model from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth
[Sana] Loading model from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth
2025-01-03 23:03:42 - [Sana] - INFO - Load checkpoint from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth. Load ema: False.
2025-01-03 23:03:42 - [Sana] - WARNING - Missing keys: ['pos_embed']
2025-01-03 23:03:42 - [Sana] - WARNING - Unexpected keys: []
W0103 23:03:57.870000 140372742742400 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2597124 closing signal SIGTERM
W0103 23:03:57.871000 140372742742400 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2597125 closing signal SIGTERM
W0103 23:03:57.871000 140372742742400 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2597127 closing signal SIGTERM
E0103 23:04:01.008000 140372742742400 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -7) local_rank: 2 (pid: 2597126) of binary: /home/linjl/anaconda3/envs/sana/bin/python
Traceback (most recent call last):
  File "/home/linjl/anaconda3/envs/sana/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_scripts/train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-03_23:03:57
  host      : bme-server
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 2597126)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 2597126
=======================================================



The accelerate test already passes.

Both RAM and virtual memory are sufficient (screenshot omitted).

The environment was set up following this project's official steps.

Hardware: 4x 3090 GPUs.
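
For readers unfamiliar with the `exitcode: -7` line in the report above: a negative exit code means the worker was terminated by the signal with that number, which the launcher summary confirms as SIGBUS. A minimal sketch of that decoding, where `describe_exitcode` is a hypothetical helper and not part of Sana or torchrun:

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Turn a child-process exit code into a readable description."""
    if exitcode < 0:
        # Negative values follow the convention "killed by signal -exitcode".
        signum = -exitcode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {exitcode}"


# Signal numbers are platform specific; on Linux, SIGBUS is 7.
print(describe_exitcode(-int(signal.SIGBUS)))
```

Because the process is killed at the OS level, no Python exception is raised, which is why `error_file` and the traceback field carry no Python stack.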

@Pevernow Pevernow changed the title Received signal 7 (SIGBUS) during training with multiple GPUs [Bug]Received signal 7 (SIGBUS) during training with multiple GPUs Jan 3, 2025
@lawrence-cj
Collaborator

lawrence-cj commented Jan 5, 2025

How much GPU memory does your machine have? Can you launch the training on a single 3090 GPU?

@Pevernow
Author

Pevernow commented Jan 5, 2025

@lawrence-cj
Solved.
The cause was a missing "file_name" field in the JSON of the webdataset, but the program gave no error message pointing to it. I hope the exception handling can be improved.
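
The kind of up-front check being asked for could look roughly like the sketch below. It assumes each sample's metadata is a JSON object and that "file_name" is the only required key (the only one named in this thread); the actual layout Sana expects may differ, and `find_incomplete_records` and `validate_records` are hypothetical names, not existing Sana functions.

```python
import json

REQUIRED_KEYS = ("file_name",)  # assumption: only the key mentioned in this issue


def find_incomplete_records(records, required=REQUIRED_KEYS):
    """Return (index, missing_keys) for every record lacking a required key."""
    problems = []
    for index, record in enumerate(records):
        missing = [key for key in required if key not in record]
        if missing:
            problems.append((index, missing))
    return problems


def validate_records(records, required=REQUIRED_KEYS):
    """Raise a descriptive KeyError before training starts if any record is incomplete."""
    problems = find_incomplete_records(records, required)
    if problems:
        index, missing = problems[0]
        raise KeyError(
            f"{len(problems)} record(s) are incomplete; "
            f"first is record {index}, missing {missing}"
        )


# Illustrative data only; not taken from the reporter's dataset.
sample = json.loads('[{"file_name": "a.jpg", "prompt": "x"}, {"prompt": "y"}]')
print(find_incomplete_records(sample))
```

Running a check like this when the dataset is constructed would turn a silent worker crash into an ordinary Python exception that names the offending record.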

@lawrence-cj
Collaborator

lawrence-cj commented Jan 5, 2025

> The cause was a missing "file_name" field in the JSON of the webdataset, but the program gave no error message pointing to it. I hope the exception handling can be improved.

I'm not sure why there is no error message here. 😂 There should at least be a Python error message, right?

@lawrence-cj lawrence-cj added Answered Answered the question fixed fix a bug labels Jan 5, 2025
@Pevernow
Author

Pevernow commented Jan 5, 2025

> > The cause was a missing "file_name" field in the JSON of the webdataset, but the program gave no error message pointing to it. I hope the exception handling can be improved.
>
> I'm not sure why there is no error message here. 😂 There should at least be a Python error message, right?

@lawrence-cj My guess is that some call triggered a SIGBUS directly, and the signal killed the process before Python had a chance to raise or report an exception.
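
If that guess is right, Python's standard `faulthandler` module is one way to get at least a Python-level stack when such a signal arrives: once enabled, it installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL that dump the traceback before the process dies. A minimal sketch, where `enable_crash_traceback` and the log file naming are assumptions for illustration rather than anything in Sana:

```python
import faulthandler
import os
import tempfile


def enable_crash_traceback(log_dir):
    """Dump Python tracebacks to a per-process file on fatal signals."""
    path = os.path.join(log_dir, f"fault_{os.getpid()}.log")
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Call once near the top of the training script, in every worker process.
log_file = enable_crash_traceback(tempfile.gettempdir())
```

Setting the environment variable `PYTHONFAULTHANDLER=1` before launching has a similar effect without code changes, writing to stderr instead of a file. Whether this would have pinpointed the missing "file_name" here is not certain, but it should show which Python call was active when the signal was received.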
