🐛 Bug
networkx seems to fail to compute the min cut for an MLP with two torchao.float8 linears and GELU, in bf16.
The script below works when the dtype is float32.
If the activation is ReLU, then I see a different error.
```
Traceback (most recent call last):
  File "/opt/pytorch/lightning-thunder/thunder/core/rematerialization.py", line 378, in find_cut
    _, (reachable, non_reachable) = nx.minimum_cut(g, "source", "sink")
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 4", line 3, in argmap_minimum_cut_1
  File "/usr/local/lib/python3.12/dist-packages/networkx/utils/backends.py", line 967, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/maxflow.py", line 454, in minimum_cut
    R = flow_func(flowG, _s, _t, capacity=capacity, value_only=True, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 8", line 3, in argmap_preflow_push_5
  File "/usr/local/lib/python3.12/dist-packages/networkx/utils/backends.py", line 967, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/preflowpush.py", line 422, in preflow_push
    R = preflow_push_impl(G, s, t, capacity, residual, global_relabel_freq, value_only)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/preflowpush.py", line 41, in preflow_push_impl
    detect_unboundedness(R, s, t)
  File "<class 'networkx.utils.decorators.argmap'> compilation 16", line 3, in argmap_detect_unboundedness_13
  File "/usr/local/lib/python3.12/dist-packages/networkx/utils/backends.py", line 967, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/utils.py", line 173, in detect_unboundedness
    raise nx.NetworkXUnbounded(
networkx.exception.NetworkXUnbounded: Infinite capacity path, flow unbounded above.
```
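For context: networkx raises NetworkXUnbounded whenever some source-to-sink path consists entirely of unbounded edges, and an edge with no capacity attribute (or an infinite one) counts as unbounded. A standalone sketch of that failure mode, independent of how find_cut actually builds its flow graph:

```python
# Minimal, standalone demonstration of the networkx failure mode.
# This is NOT Thunder's actual graph; it only shows the condition under
# which minimum_cut raises NetworkXUnbounded.
import networkx as nx

g = nx.DiGraph()
g.add_edge("source", "a", capacity=float("inf"))
g.add_edge("a", "sink")  # no "capacity" attribute -> treated as infinite

try:
    nx.minimum_cut(g, "source", "sink")
except nx.NetworkXUnbounded as exc:
    print(exc)  # Infinite capacity path, flow unbounded above.
```

So presumably, in the bf16 + float8 case, the flow graph built in find_cut ends up with at least one source-to-sink path on which every edge has infinite (or missing) capacity.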
To Reproduce

Steps to reproduce the behavior:

- Check out #1415, more specifically 0893fbe.

Code sample
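The original code sample was not captured in this copy of the issue; the following is a hypothetical reconstruction of the setup it describes (an MLP with two torchao float8 linears and GELU in bf16, jitted with Thunder). The dimensions, input shape, and use of convert_to_float8_training are assumptions, not the author's actual script.

```python
# Hypothetical reconstruction of the repro, not the original script.
# Assumed: torchao's float8 training conversion and thunder.jit;
# dimensions, bias settings, and input shape are made up.
import torch
import torch.nn as nn
import thunder
from torchao.float8 import convert_to_float8_training

dtype = torch.bfloat16  # reported to work with torch.float32

class MLP(nn.Module):
    def __init__(self, dim: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim, bias=False)
        self.act = nn.GELU()  # nn.ReLU() reportedly gives a different error
        self.fc2 = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

model = MLP().to(device="cuda", dtype=dtype)
convert_to_float8_training(model)  # swaps both nn.Linear modules for Float8Linear
jitted = thunder.jit(model)

x = torch.randn(16, 4096, device="cuda", dtype=dtype, requires_grad=True)
jitted(x).sum().backward()  # forward+backward exercises the rematerialization pass
```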
Environment

- How you installed PyTorch (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
Additional context

For this MLP with the nvfuser executor, I run into either NVIDIA/Fuser#3498 or this issue, depending on whether or not I apply the DCE implemented in 232328c.
We used to hit the same error with Thunder recomputation enabled (#1232). @riccardofelluga, since you hit this problem before, do you have a minimal reproducer for it? Do you know what changes are needed in the trace here?