
Tkurth/extended distributed primitives #273

Merged

Conversation

azrael417
Contributor

Modulus Pull Request

Description

This PR enables gathering of tensors of uneven shapes. This is necessary for integrating Modulus into newer versions of Makani. Some of the routines could be merged with the V-routines for the graph NN code; I haven't done that yet, but I am happy to discuss it.
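
For context, a minimal sketch of the general pattern for gathering unevenly shaped tensors (exchange per-rank sizes, pad to the largest size, all_gather, then slice back). The function name and signature here are illustrative only and may not match the utilities actually added in modulus/distributed/utils.py:

```python
# Illustrative sketch only; the actual helpers in modulus/distributed/utils.py
# may use different names and signatures.
import torch
import torch.distributed as dist


def gather_uneven(tensor: torch.Tensor, dim: int = 0, group=None) -> list:
    """Gather tensors whose size along `dim` differs across ranks."""
    world_size = dist.get_world_size(group=group)

    # 1. Exchange the per-rank sizes along the gather dimension.
    local_size = torch.tensor([tensor.shape[dim]], device=tensor.device)
    size_list = [torch.empty_like(local_size) for _ in range(world_size)]
    dist.all_gather(size_list, local_size, group=group)
    sizes = [int(s.item()) for s in size_list]

    # 2. Pad the local tensor to the maximum size so all_gather sees equal shapes.
    max_size = max(sizes)
    pad_shape = list(tensor.shape)
    pad_shape[dim] = max_size
    padded = torch.zeros(pad_shape, dtype=tensor.dtype, device=tensor.device)
    padded.narrow(dim, 0, tensor.shape[dim]).copy_(tensor)

    # 3. Gather the padded tensors and slice each one back to its true size.
    gathered = [torch.empty_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded, group=group)
    return [g.narrow(dim, 0, s) for g, s in zip(gathered, sizes)]
```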

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

No new dependencies necessary

@azrael417
Contributor Author

Concerning the changelog, how does it need to be updated? Doesn't that also depend on what was merged between this PR and other PRs that came before it but after I forked the branch?

@azrael417 azrael417 requested a review from akshaysubr December 7, 2023 14:34
Signed-off-by: Thorsten Kurth <[email protected]>
@azrael417 azrael417 force-pushed the tkurth/extended-distributed-primitives branch from d7bdb57 to f4fbd15 on December 7, 2023 14:45
Collaborator

@akshaysubr akshaysubr left a comment


Looks good to me. Only had a couple of relatively minor comments.

It would be good to also ensure that all the existing distributed tests pass locally by running `pytest -m multigpu` in the test/ directory, since these tests are not currently covered by CI.

Review threads (all resolved):

  • modulus/distributed/utils.py
  • modulus/distributed/utils.py
  • modulus/distributed/mappings.py
@akshaysubr akshaysubr requested a review from stadlmax December 11, 2023 19:45
@akshaysubr akshaysubr added the "distributed: Distributed and model parallel tools" and "4 - In Review: Currently Under Review" labels on Dec 11, 2023
Collaborator

@stadlmax stadlmax left a comment


Sorry, forgot to approve this earlier. Thanks for addressing the comment w.r.t. unified utilities. LGTM now.

@NickGeneva
Collaborator

/blossom-ci

@azrael417
Contributor Author

azrael417 commented Dec 13, 2023

I ran the multigpu tests but ran into some issues. First, there is an assert that num_gpus == 2 (not >= 2), so these tests fail on my DGX Station with 4 GPUs. Can we relax that criterion a bit?

Working around it with CUDA_VISIBLE_DEVICES, I can run some of the tests, but the meshgraphnet one fails; I think this is not related to this MR, though:

```
models/meshgraphnet/test_meshgraphnet_snmg.py FFF                        [100%]

=================================== FAILURES ===================================
____________________ test_distributed_meshgraphnet[dtype0] _____________________

dtype = torch.float32

    @pytest.mark.multigpu
    @pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16])
    def test_distributed_meshgraphnet(dtype):
        num_gpus = torch.cuda.device_count()
        assert num_gpus >= 2, "Not enough GPUs available for test"
        world_size = num_gpus

>       torch.multiprocessing.spawn(
            run_test_distributed_meshgraphnet,
            args=(world_size, dtype),
            nprocs=world_size,
            start_method="spawn",
        )

models/meshgraphnet/test_meshgraphnet_snmg.py:193:

../../.conda/envs/modulus/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:246: in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
../../.conda/envs/modulus/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:202: in start_processes
    while not context.join():

self = <torch.multiprocessing.spawn.ProcessContext object at 0x7f7f41a258a0>
timeout = None

    def join(self, timeout=None):
        r"""
        Tries to join one or more processes in this spawn context.
        If one of them exited with a non-zero exit status, this function
        kills the remaining processes and raises an exception with the cause
        of the first process exiting.

        Returns ``True`` if all processes have been joined successfully,
        ``False`` if there are more processes that need to be joined.

        Args:
            timeout (float): Wait this long before giving up on waiting.
        """
        # Ensure this function can be called even when we're done.
        if len(self.sentinels) == 0:
            return True

        # Wait for any process to fail or all of them to succeed.
        ready = multiprocessing.connection.wait(
            self.sentinels.keys(),
            timeout=timeout,
        )

        error_index = None
        for sentinel in ready:
            index = self.sentinels.pop(sentinel)
            process = self.processes[index]
            process.join()
            if process.exitcode != 0:
                error_index = index
                break

        # Return if there was no error.
        if error_index is None:
            # Return whether or not all processes have been joined.
            return len(self.sentinels) == 0

        # Assume failure. Terminate processes that are still alive.
        for process in self.processes:
            if process.is_alive():
                process.terminate()
            process.join()
```

These look good:

```
test_multi_gpu_sample.py .                                               [  8%]
distributed/test_autograd.py ....                                        [ 41%]
distributed/test_config.py .                                             [ 50%]
distributed/test_distributed_fft.py .                                    [ 58%]
distributed/test_manager.py ..                                           [ 75%]
```

@akshaysubr
Collaborator

Yeah, the meshgraphnet failure is not related to this MR and is an independent issue. I created a separate issue to track it: #278

@akshaysubr
Collaborator

/blossom-ci

@stadlmax
Collaborator

> I ran the multigpu tests but ran into some issues. First, there is an assert that num_gpus == 2 (not >= 2), so these tests fail on my DGX Station with 4 GPUs. Can we relax that criterion a bit? Working around it with CUDA_VISIBLE_DEVICES, I can run some of the tests, but the meshgraphnet one fails; I think this is not related to this MR, though:

I actually started using things like >= 2 and setting world_size = num_gpus in a few places where I changed things. @akshaysubr, we really should get rid of the assert ... == 2 checks eventually.
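
For reference, the relaxed guard being discussed looks roughly like the sketch below, mirroring the pattern already visible in the test_meshgraphnet_snmg.py traceback above. The test and worker names are hypothetical, not the actual test suite code:

```python
# Sketch of the relaxed multi-GPU test guard; names are illustrative only.
import pytest
import torch
import torch.distributed as dist


def run_test_example(rank: int, world_size: int):
    # Hypothetical per-rank worker: set up a process group, do work, tear down.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    dist.destroy_process_group()


@pytest.mark.multigpu
def test_example_distributed():
    num_gpus = torch.cuda.device_count()
    # Accept any machine with at least two GPUs instead of requiring exactly two.
    assert num_gpus >= 2, "Not enough GPUs available for test"
    world_size = num_gpus  # use every visible GPU rather than a hard-coded pair

    torch.multiprocessing.spawn(
        run_test_example,
        args=(world_size,),
        nprocs=world_size,
        start_method="spawn",
    )
```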

@azrael417 azrael417 merged commit fd80783 into NVIDIA:main Dec 13, 2023