FileNotFoundError occurs on libneuronxla when precompiling model for training #1097

Open
tengomucho opened this issue Jan 29, 2025 · 1 comment


tengomucho commented Jan 29, 2025

When precompiling a model with neuron_parallel_compile, the script runs for a while and appears to compile and save some of the graphs in the cache directory, but on exit it prints the following error (repeated several times for each device):

Compiler status PASS
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/torch_neuronx/parallel_compile/neuron_parallel_compile.py", line 167, in compile_task
    compile_task_helper(compiled_hlo_status, compile_cache, hlos, workdir, dump=dump,
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/torch_neuronx/parallel_compile/neuron_parallel_compile.py", line 98, in compile_task_helper
    status, retry = libneuronxla.neuron_cc_wrapper.compile_cache_entry(
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py", line 148, in compile_cache_entry
    entry.download_hlo(tmp_model_path)
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/libneuronxla/neuron_cc_cache.py", line 167, in download_hlo
    self.cache.download_file(self.hlo_path, dst_path)
  File "/home/ubuntu/Dev/venv/hf/lib/python3.10/site-packages/libneuronxla/neuron_cc_cache.py", line 344, in download_file
    shutil.copyfile(cache_path, dst_path)
  File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/cache_dir_neuron/neuronxcc-2.15.143.0+e39249ad/MODULE_10800808726427433249+6d1be540/model.hlo_module.pb'
"""

Is there anything we can do to avoid this?

The error is reproducible on a trn1n.32xlarge instance with the sft_lora_finetune_llm_compile.sh script from the optimum-neuron repo, on the problem-training-neuron_parallel_compile branch.
Full log attached here: compilation-failure.log
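
For anyone trying to narrow this down, here is a minimal diagnostic sketch (not a fix) that lists cache entries missing the HLO file named in the traceback. It assumes the cache layout implied by the failing path, i.e. `<cache_root>/<compiler-version>/MODULE_<hash>/model.hlo_module.pb`; the function name `find_incomplete_entries` is hypothetical and not part of libneuronxla, and the script does not explain why the file is absent.

```python
from pathlib import Path

# File name taken from the path in the traceback.
HLO_FILENAME = "model.hlo_module.pb"


def find_incomplete_entries(cache_root):
    """Return MODULE_* directories under cache_root that lack the HLO file.

    Assumes the layout <cache_root>/<compiler-version>/MODULE_<hash>/,
    which is inferred from the traceback and may not hold for every setup.
    """
    root = Path(cache_root)
    missing = []
    for module_dir in sorted(root.glob("*/MODULE_*")):
        if module_dir.is_dir() and not (module_dir / HLO_FILENAME).is_file():
            missing.append(module_dir)
    return missing


if __name__ == "__main__":
    # Cache location taken from the traceback; adjust it for your machine.
    for entry in find_incomplete_entries(Path.home() / "cache_dir_neuron"):
        print(entry)
```

Running it before and after neuron_parallel_compile might show whether the entry named in the error was never fully written or was removed while the workers were still running.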

@aws-rishyraj
Contributor

Hi @tengomucho,

Thanks for opening this issue. We're taking a look on our end.
