How to support multi-device VLLM inference in the GRPO Trainer #2922

Open
0x404 opened this issue Feb 21, 2025 · 0 comments
Labels: ✨ enhancement (New feature or request), 🏋 GRPO (Related to GRPO)

Comments


0x404 commented Feb 21, 2025

if self.accelerator.is_main_process:
    vllm_device = self.args.vllm_device
    if vllm_device == "auto":
        if torch.cuda.device_count() == 1:
            vllm_device = "cuda:0"  # particular case when training with only 1 GPU: share it
        else:
            vllm_device = f"cuda:{self.accelerator.num_processes}"  # take the next GPU idx
    # Check that the requested device is available
    if vllm_device.split(":")[0] == "cuda" and int(vllm_device.split(":")[1]) >= torch.cuda.device_count():
        raise ValueError(
            f"The requested device for vllm ({vllm_device}) is not available. You are likely using vLLM "
            "without restricting the number of GPUs for training. Set the `--num_processes` argument to a "
            "value lower than the number of GPUs available on your machine—typically, reducing it by one "
            f"is sufficient. In your case: `--num_processes {torch.cuda.device_count() - 1}`."
        )
    # Check that the requested device is not also used for training
    if vllm_device in {f"cuda:{idx}" for idx in range(self.accelerator.num_processes)}:
        warnings.warn(
            f"The requested device {vllm_device} is also being used for training. For higher throughput "
            "and to avoid out-of-memory errors, it is recommended to use a dedicated device for vLLM. "
            "If this is intentional, you may ignore this warning but should adjust "
            "`vllm_gpu_memory_utilization` accordingly."
        )

In the current GRPO implementation, vLLM can only run on a single GPU, which becomes a performance bottleneck: in an 8-GPU setup, for example, the other 7 GPUs sit idle while the single vLLM GPU finishes inference, and a single GPU also cannot hold larger models.
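
vLLM itself can already shard a model across several GPUs with tensor parallelism, so the generation side is not the hard part. A minimal standalone sketch (the model name and device count here are illustrative assumptions, not what the trainer does today):

from vllm import LLM, SamplingParams

# Hypothetical sketch: give the rollout model two dedicated GPUs via tensor
# parallelism instead of pinning it to a single `vllm_device`.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # illustrative checkpoint
    tensor_parallel_size=2,            # shard weights and KV cache over 2 GPUs
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a haiku about GPUs."], sampling_params)
print(outputs[0].outputs[0].text)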

How can we enable vLLM to run on multiple GPUs? The only concern is that we would need a way to update the parameters on every vLLM worker each time the model weights are synced:

def _move_model_to_vllm(self):
    with unwrap_model_for_generation(
        self.model, self.accelerator, gather_deepspeed3_params=self.args.ds3_gather_for_generation
    ) as unwrapped_model:
        if is_compiled_module(unwrapped_model):
            unwrapped_model = unwrapped_model._orig_mod
        if is_peft_model(unwrapped_model):
            unwrapped_model.merge_adapter()
            state_dict = unwrapped_model.state_dict()
            # Remove base_model and base_layer prefixes
            state_dict = {
                k.removeprefix("base_model.model.").replace(".base_layer", ""): v for k, v in state_dict.items()
            }
            # Remove values with adapter prefix (example: "_lora")
            state_dict = {k: v for k, v in state_dict.items() if unwrapped_model.prefix not in k}
            # When module to save, remove its prefix and discard the original module
            state_dict = {
                k.replace("modules_to_save.default.", ""): v
                for k, v in state_dict.items()
                if "original_module" not in k
            }
        else:
            state_dict = unwrapped_model.state_dict()
        if self.accelerator.is_main_process:
            llm_model = self.llm.llm_engine.model_executor.driver_worker.model_runner.model
            llm_model.load_weights(state_dict.items())
        # Unmerge the adapter to restore the model to its original state.
        # This must be done after loading weights to ensure they correspond to the merged state.
        if is_peft_model(unwrapped_model):
            unwrapped_model.unmerge_adapter()

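To make this work with a sharded vLLM instance, the load step above would have to reach every tensor-parallel worker rather than only the driver worker. Below is a rough sketch of one possible direction; it assumes a vLLM version that exposes `LLM.collective_rpc` and a pluggable worker class (both exist in recent releases, though the exact wiring may differ), and the `SyncWeightsWorker` class and its `update_weights` method are hypothetical names, not an existing TRL or vLLM API. Shipping full tensors through the RPC is also wasteful; a real implementation would broadcast them through a collective group, but that is omitted to keep the sketch short.

# sync_weights_worker.py -- hypothetical worker class (illustrative only)
from vllm.worker.worker import Worker

class SyncWeightsWorker(Worker):
    def update_weights(self, weights):
        # Runs on every tensor-parallel rank; vLLM's weight loaders shard the
        # incoming full-size tensors onto this rank's partition during load_weights.
        self.model_runner.model.load_weights(weights)

# Trainer side (sketch): create the engine sharded over several GPUs, e.g.
#   self.llm = LLM(model=model_id, tensor_parallel_size=4,
#                  worker_cls="sync_weights_worker.SyncWeightsWorker")
# and in _move_model_to_vllm, replace the driver-worker-only load with:
if self.accelerator.is_main_process:
    self.llm.collective_rpc("update_weights", args=(list(state_dict.items()),))
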
github-actions bot added the 🏋 GRPO (Related to GRPO) and ✨ enhancement (New feature or request) labels on Feb 21, 2025