When precompiling a model with neuron_parallel_compile, the script runs for a while and appears to compile and save some of the graphs in the cache directory, but it then displays the following error on exit (repeated several times for each device):
Compiler status PASS
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/torch_neuronx/parallel_compile/neuron_parallel_compile.py", line 167, in compile_task
    compile_task_helper(compiled_hlo_status, compile_cache, hlos, workdir, dump=dump,
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/torch_neuronx/parallel_compile/neuron_parallel_compile.py", line 98, in compile_task_helper
    status, retry = libneuronxla.neuron_cc_wrapper.compile_cache_entry(
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py", line 148, in compile_cache_entry
    entry.download_hlo(tmp_model_path)
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/libneuronxla/neuron_cc_cache.py", line 167, in download_hlo
    self.cache.download_file(self.hlo_path, dst_path)
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/libneuronxla/neuron_cc_cache.py", line 344, in download_file
    shutil.copyfile(cache_path, dst_path)
  File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/cache_dir_neuron/neuronxcc-2.15.143.0+e39249ad/MODULE_10800808726427433249+6d1be540/model.hlo_module.pb'
"""
Is there anything we can do to avoid this?
The error is reproducible on a trn1n.32xlarge instance using the sft_lora_finetune_llm_compile.sh script from the optimum-neuron repo, on the problem-training-neuron_parallel_compile branch.
Full log attached here: compilation-failure.log
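To help narrow this down, here is a rough diagnostic sketch that lists cache entries with no model.hlo_module.pb file. It assumes the cache layout is the one visible in the traceback path (a neuronxcc-<version> directory containing MODULE_<hash> directories); that layout is only inferred from the error message, and the function name find_entries_missing_hlo is made up for this example. It does not call any Neuron API and only inspects the filesystem.

```python
from pathlib import Path
import sys

# File name taken from the FileNotFoundError in the traceback.
HLO_FILENAME = "model.hlo_module.pb"


def find_entries_missing_hlo(cache_root):
    """Return MODULE_* directories under cache_root that lack the HLO file.

    Assumed layout (inferred from the traceback path, not from documentation):
        <cache_root>/neuronxcc-<version>/MODULE_<hash>/model.hlo_module.pb
    """
    root = Path(cache_root)
    missing = []
    for module_dir in sorted(root.glob("neuronxcc-*/MODULE_*")):
        if module_dir.is_dir() and not (module_dir / HLO_FILENAME).is_file():
            missing.append(module_dir)
    return missing


if __name__ == "__main__":
    # Default is the cache directory that appears in the error message.
    cache_root = sys.argv[1] if len(sys.argv) > 1 else "/home/ubuntu/cache_dir_neuron"
    for entry in find_entries_missing_hlo(cache_root):
        print(f"missing {HLO_FILENAME}: {entry}")
```

Running it right after the failing compile step, with the cache directory as the only argument, should show whether the MODULE_ directory named in the error exists without its HLO file or is absent altogether.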