Unable to run Unittest #2

Open · ly-muc opened this issue Nov 22, 2022 · 1 comment

ly-muc commented Nov 22, 2022

I am currently trying to verify that my installation is correct. To handle multiple nodes, my test script differs from the original script in the following lines.

os.environ['MASTER_ADDR'] = args.masterhost
os.environ['MASTER_PORT'] = '4040'
os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]

dist.init_process_group(backend="cgx",  init_method="env://", rank=self.rank % torch.cuda.device_count())
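
For reference, a minimal sketch of how the process layout maps to the standard Open MPI environment variables (the torch.cuda.set_device call and the print are illustrative assumptions, not lines from the test):

import os
import torch

# Open MPI exports these for every launched process:
global_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])       # rank across all nodes
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])  # rank within this node
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])        # total number of processes

# Assumption: one GPU per process, selected by the node-local rank.
torch.cuda.set_device(local_rank)
print(f"global rank {global_rank}/{world_size}, local rank {local_rank}")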

I execute the test with the following command:

mpirun -np 2 -x PATH --hostfile hostfile --tag-output --allow-run-as-root -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca coll ^hcoll -- python test/test_qmpi.py --masterhost=$MASTER_HOST

However, the test fails with an assertion error when the result is compared with the expected tensor. The error message differs between repeated runs. For example, either the following error occurs:

======================================================================
FAIL: test_compressed_exact (__main__.CGXTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_qmpi.py", line 95, in test_compressed_exact
    self.assertEqual(t, expected, "Parameters. bits {},buffer size: {}".format(q, t.numel()))
AssertionError: Tensors are not equal: tensor([2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
        2., 2.], device='cuda:0', dtype=torch.float16) != tensor([3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
        3., 3.], device='cuda:0', dtype=torch.float16). Parameters. bits 2,buffer size: 128

or this one:

======================================================================
FAIL: test_compressed_exact (__main__.CGXTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_qmpi.py", line 95, in test_compressed_exact
    self.assertEqual(t, expected, "Parameters. bits {},buffer size: {}".format(q, t.numel()))
AssertionError: Tensors are not equal: tensor([2.], device='cuda:0', dtype=torch.float16) != tensor([3.], device='cuda:0', dtype=torch.float16). Parameters. bits 2,buffer size: 1

In the two cases shown, the assertion fails at different steps while iterating over the tensor lengths. Do you have an idea what could cause this?

As far as I understand, the README uses the local rank when dist.init_process_group is called. Does this assume that there is only one node?

Thanks!

ilmarkov (Member) commented

@ly-muc Thank you for filing the issue!
The problem was in the code. It is fixed in the commit and the new release.

The test was only run on a single node, but it should also work in a multi-node setting.
I think it is sufficient to have
dist.init_process_group(backend="cgx", init_method="env://", rank=self.rank). The rank is taken from OMPI_COMM_WORLD_RANK, which is supposed to be the global rank, not the local one.
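
For example, a minimal sketch of a multi-node initialization along those lines (the MASTER_ADDR/MASTER_PORT values are placeholders, and the explicit torch.cuda.set_device call is an assumption about a one-GPU-per-process layout):

import os
import torch
import torch.distributed as dist

# Assumes the import that registers the "cgx" backend has already run.
# Placeholders: point MASTER_ADDR at the node that hosts rank 0.
os.environ.setdefault("MASTER_ADDR", "node0.example.com")
os.environ.setdefault("MASTER_PORT", "4040")
os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]

global_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])       # global rank across all nodes
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])  # rank within this node

torch.cuda.set_device(local_rank)  # assumption: one GPU per process
dist.init_process_group(backend="cgx", init_method="env://", rank=global_rank)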
