In #1415 networkx seems to fail to complete min cut #1567

Open
crcrpar opened this issue Dec 18, 2024 · 2 comments

@crcrpar
Collaborator

crcrpar commented Dec 18, 2024

🐛 Bug

networkx fails to complete the min cut for an MLP with two torchao.float8 linear layers and GELU in bfloat16: nx.minimum_cut raises NetworkXUnbounded (traceback below).
The script below works when dtype is float32.
If the activation is ReLU, I see a different error.

Traceback (most recent call last):
  File "/opt/pytorch/lightning-thunder/thunder/core/rematerialization.py", line 378, in find_cut
    _, (reachable, non_reachable) = nx.minimum_cut(g, "source", "sink")
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 4", line 3, in argmap_minimum_cut_1
  File "/usr/local/lib/python3.12/dist-packages/networkx/utils/backends.py", line 967, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/maxflow.py", line 454, in minimum_cut
    R = flow_func(flowG, _s, _t, capacity=capacity, value_only=True, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 8", line 3, in argmap_preflow_push_5
  File "/usr/local/lib/python3.12/dist-packages/networkx/utils/backends.py", line 967, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/preflowpush.py", line 422, in preflow_push
    R = preflow_push_impl(G, s, t, capacity, residual, global_relabel_freq, value_only)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/preflowpush.py", line 41, in preflow_push_impl
    detect_unboundedness(R, s, t)
  File "<class 'networkx.utils.decorators.argmap'> compilation 16", line 3, in argmap_detect_unboundedness_13
  File "/usr/local/lib/python3.12/dist-packages/networkx/utils/backends.py", line 967, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/utils.py", line 173, in detect_unboundedness
    raise nx.NetworkXUnbounded(
networkx.exception.NetworkXUnbounded: Infinite capacity path, flow unbounded above.
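
For context on the exception above: a minimum cut is only finite when every source-to-sink path crosses at least one finite-capacity edge. The sketch below is a simplified pure-Python stand-in for that kind of check, not networkx's actual code and not Thunder's graph construction. The function name `has_unbounded_path`, the edge-dictionary layout, and the toy node names are all assumptions made for illustration.

```python
from collections import deque
import math


def has_unbounded_path(edges, source, sink):
    """Return True if `sink` is reachable from `source` using only
    infinite-capacity edges.

    `edges` maps (u, v) -> capacity; math.inf marks an uncapped edge.
    This is a hypothetical helper, not part of networkx or Thunder.
    """
    # Keep only the infinite-capacity edges in an adjacency list.
    adjacency = {}
    for (u, v), capacity in edges.items():
        if capacity == math.inf:
            adjacency.setdefault(u, []).append(v)

    # Breadth-first search from the source over those edges.
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical toy graphs; the intermediate node names are made up.
bounded = {
    ("source", "a"): math.inf,
    ("a", "b"): 4.0,  # finite edge: a cut can be placed here
    ("b", "sink"): math.inf,
}
unbounded = {
    ("source", "a"): math.inf,
    ("a", "b"): math.inf,  # no finite edge anywhere on the path
    ("b", "sink"): math.inf,
}

print(has_unbounded_path(bounded, "source", "sink"))
print(has_unbounded_path(unbounded, "source", "sink"))
```

A check like this on the graph passed to nx.minimum_cut may help narrow down which chain of nodes has no finite-capacity edge between "source" and "sink".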

To Reproduce

Steps to reproduce the behavior:
- Check out #1415, specifically commit 0893fbe.

Code sample

import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training
import thunder
from thunder.tests.make_tensor import make_tensor


def main():
    batch_size, in_features, out_features = 16, 32, 64

    device = torch.device("cuda")
    dtype = torch.bfloat16  # the script works with torch.float32
    bias = True

    # Two-layer MLP with a tanh-approximated GELU in between.
    model = nn.Sequential(
        nn.Linear(in_features, out_features, bias=bias),
        nn.GELU(approximate="tanh"),
        nn.Linear(out_features, out_features, bias=bias),
    ).to(device=device, dtype=dtype)
    # Convert the linear layers to torchao float8 training linears.
    fp8_model = convert_to_float8_training(model)
    x = make_tensor((batch_size, in_features), device=device, dtype=dtype)

    jitted = thunder.jit(fp8_model, executors=[thunder.get_executor("torch"), thunder.get_executor("nvfuser")])
    # Running the jitted model triggers the error.
    actual = jitted(x)


if __name__ == "__main__":
    main()

Error with ReLU --

Expected behavior

Environment

  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

For this MLP with the nvFuser executor, I hit either NVIDIA/Fuser#3498 or this error, depending on whether I apply the DCE implemented in 232328c.

@crcrpar crcrpar self-assigned this Dec 18, 2024
@IvanYashchuk
Collaborator

We used to hit the same error with Thunder recomputation enabled (#1232). @riccardofelluga, since you have hit this problem before, do you have a minimal reproducer for it? Do you know what changes are needed in the trace here?

@riccardofelluga
Collaborator

riccardofelluga commented Jan 13, 2025

Min-cut issues are not easy to nail down, though I've observed that they sometimes arise from duplicates in the trace. Let me look into it.
