[Bug] Memory leak in sm and proxy channels on AMD in python #439

Closed
liangyuRain opened this issue Jan 4, 2025 · 4 comments

@liangyuRain

liangyuRain commented Jan 4, 2025

Hi, the following code repeatedly creates and deletes sm or proxy channels. The VRAM% reported by rocm-smi keeps increasing until it reaches 25%, at which point the run fails with RuntimeError: Call to cudaIpcGetMemHandle(&handle, baseDataPtr) failed. mscclpp/src/registered_memory.cc:102 (Cuda failure: invalid argument). Both sm and proxy channels show the problem, and it only appears on ROCm; NVIDIA GPUs are fine.

import cupy as cp
import os

import mscclpp.comm as mscclpp_comm
from mscclpp import (
    ProxyService,
    Transport,
)
from mscclpp_mpi import MpiGroup, mpi_group


def create_group_and_connection(mpi_group: MpiGroup):
    group = mscclpp_comm.CommGroup(mpi_group.comm)
    remote_nghrs = list(range(group.nranks))
    remote_nghrs.remove(group.my_rank)
    connections = group.make_connection(remote_nghrs, Transport.CudaIpc)
    return group, connections


def main():
    # MPI group of 2
    mpi_group = MpiGroup(list(range(2)))
    group, connections = create_group_and_connection(mpi_group)
    assert len(connections) == 1
    memory = cp.empty(1, dtype=cp.int32)
    # proxy_service = ProxyService()

    peer = 1 - group.my_rank
    for x in range(100):
        print(x)
        if group.my_rank == 0:
            os.system("rocm-smi")
        channels = [group.make_sm_channels(memory, connections)[peer] for _ in range(100)]
        # channels = [group.make_proxy_channels(proxy_service, memory, connections)[peer] for _ in range(100)]
        for ch in channels:
            del ch
    

if __name__ == "__main__":
    main()
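
A side note on the `for ch in channels: del ch` loop in the script above: in Python, `del ch` only unbinds the loop variable, so the list still holds every channel until `channels` itself is rebound or deleted. The following is a minimal sketch of that behavior using a hypothetical `FakeChannel` class as a stand-in (it is not part of mscclpp, and no GPU is needed). It is not offered as an explanation of the VRAM growth, since in the script the list is rebound on every outer iteration anyway; it only shows what explicit cleanup could look like if one wanted to rule this out.

```python
import gc
import weakref


class FakeChannel:
    """Hypothetical stand-in for a channel object; not an mscclpp class."""


def count_alive(refs):
    # A weak reference returns None once its target has been freed.
    return sum(1 for r in refs if r() is not None)


channels = [FakeChannel() for _ in range(3)]
refs = [weakref.ref(ch) for ch in channels]

for ch in channels:
    del ch  # unbinds the name `ch` only; the list still references the object

alive_after_loop = count_alive(refs)

# Dropping the list itself releases the last strong references.
del channels
gc.collect()  # not strictly needed on CPython, but harmless
alive_after_del = count_alive(refs)

print(alive_after_loop, alive_after_del)
```

In a reproduction script, the equivalent would be `del channels` (optionally followed by `gc.collect()`) at the end of each outer iteration, under the assumption that the channel objects release their resources when garbage-collected.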
@chhwang
Contributor

chhwang commented Jan 5, 2025

Thank you for the report! We will investigate. Could you let us know your environment details?

@liangyuRain
Author

Sure, platform info:

  • ROCm 6.2.1
  • Tested on MI250 and MI210
  • MSCCLPP commit 863a599
  • CuPy commit 0188dd8b16938fa835bcda797f70f9af2f8b4980

@chhwang
Contributor

chhwang commented Jan 10, 2025

@liangyuRain We confirmed this is not a bug in MSCCL++. In our environment, upgrading ROCm to 6.3.1 resolved the issue. It is unclear whether the issue comes from ROCm or CuPy.

@liangyuRain
Author

I see. Unfortunately, we cannot upgrade to 6.3.1 right now. Thanks for your investigation!
