-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Neuron and HabanaLabs devices #155
Conversation
Hi @mrgolin , thanks for the pull request! |
@HassanKhadour Thanks for taking a look! |
Copy-pasting the same concepts every time a new accelerator is introduced isn't very scalable, I would argue it's time to abstract these and have specific "hardware providers" implementations. |
+1, abstracting out the accelerator alloc/copy routines would make things maintainable for the individual benchmarks going forward. |
I've added abstraction for various memory types as well as support for Habana Labs AI accelerators. |
@HassanKhadour Can you please share failing check results? |
Hi Michael, src/host_memory.c:68:26: style: Unused variable: host_ctx [unusedVariable] |
The check has been passed, reported wrong status, fixed it |
Thanks! |
@galpress @rajachan @HassanKhadour any comments will be appreciated. |
How is that this PR is still waiting for review after we've been asked to refactor perftest accelerators memory support and at the same time commits are pushed directly to master adding a new memory type (Cuda dma-buf) in an old way, blowing the code, making it unmaintainable (exactly as @gal-pressman and @rajachan mentioned) and making this PR unmergeable? |
Hi mrgolin, We are sorry we weren't in the loop of the review from the beginning, we apologize about the inconvenience that is caused. |
I merely suggested it doesn't make sense that the same code is copy-pasted again and again. |
Gal, I actually did what you suggested (See 457fc46) and will appreciate if you can review this again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I can review the Habana/Neuron bits. I do like the intent of the memory type abstraction commit, so I'd be fine merging it if you get sign offs on the other two commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution.
The rdma_cm is damaged, getting segmentation fault after running with it.
Can you please fix it and address the comments ?
} | ||
pp_free_gpu(ctx); | ||
for (i = 0; i < dereg_counter; i++) { | ||
ctx->memory->free_buffer(ctx->memory, 0, ctx->buf[i], ctx->buff_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are u sure thats its making the exact behaviour for mmap?, before it did munmap for only the first qp, now for all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that current mmap implementation can handle only num_of_qps=1 or mr_per_qp=false cases, see line 190:
ctx->buf[0] = mmap(NULL, size, PROT_WRITE | PROT_READ, MAP_SHARED, fd,
offset);
So dereg_counter should be equal to 1 in that case:
dereg_counter = (user_param->mr_per_qp) ? user_param->num_of_qps : 1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we must make it as the current implementation and refactor that after if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked this again and we are keeping same behavior as before for mmap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How? I don't see perftest limit the number of qp's in case of mmap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it doesn't. It can be used with >1 qps and it doesn't explicitly limit number of allocated buffers.
But please see create_mr code (wasn't changed):
if (create_single_mr(ctx, user_param, 0)) {
fprintf(stderr, "failed to create mr\n");
return 1;
}
/* create the rest if needed, or copy the first one */
for (i = 1; i < user_param->num_of_qps; i++) {
mr_index++;
if (user_param->mr_per_qp) {
if (create_single_mr(ctx, user_param, i)) {
fprintf(stderr, "failed to create mr\n");
goto destroy_mr;
}
} else {
ALLOCATE(ctx->mr[i], struct ibv_mr, 1);
memset(ctx->mr[i], 0, sizeof(struct ibv_mr));
ctx->mr[i] = ctx->mr[0];
// cppcheck-suppress arithOperationsOnVoidPointer
ctx->buf[i] = ctx->buf[0] + (i*BUFF_SIZE(ctx->size, ctx->cycle_buffer));
}
}
If perftest ran w/o --mr_per_qp flag, only a single mr (and one buffer) will be created.
Otherwise for mmap case, both the original and the new code will create several mrs using mapping to same memory. Probably not a desired behavior.
Thanks for your review. We are currently not using rdma_cm, can you please share dmesg and any additional info that might help find the issue. |
The segfault occurs in Client side. But the ctx->memory still isn't initiated in this stage. |
Thanks for specifying the error flow. Fixed. |
Thanks! I fixed a bug in perftest_communication, please report if there are other related issues. |
@HassanKhadour If there are no additional comments can you please merge this PR? |
Hi @mrgolin , I had some additional comments on the code. |
Thanks! Please keep me looped and let me know if I can help anyhow. |
Hi @mrgolin, Appreciate your help in testing the ROCm changes on devices that support ROCm & other vendor devices that supports RDMA, from our side I'm testing on Nvidia devices. Thanks |
the use_cuda with rdma_cm, UD connection and bidirectional is damaged. |
@sshaulnv Can you please share more details? If possible a backtrace or a coredump. |
Sorry for a delayed reply, we've tested with host memory, Habalabs, Neuron, Cuda w/ and w/o dmabuf.
Seems like I've already answered all comments. |
sure. backtrace: |
@sshaulnv you could simply mention that you have pushed a fix for this long existing issue (and this PR is not rebased on..) instead of letting me debug this and wonder how my unrelated changes could cause a segfault in this code. |
I'll rebase my code ones this commit is approved. |
@alex--m sorry about that, I didn't remember making a fix for it. |
Is there any additional blocker from merging this PR? |
Hi @mrgolin Please rebase the code for the last time, we will give it the last testing cycle and merge it very soon. Thanks |
Refactor perftest to allow easy extension for different memory types (mainly for AI accelerators direct memory support). Move existing implementations for DRAM (host memory), mmap, Cuda and ROCm according to the new design. Signed-off-by: Michael Margolin <[email protected]>
Add support for AWS Trainium devices via the AWS Neuron SDK. AWS Neuron is a software stack for running ML applications using AWS's AI hardware accelerators. Perftest support for Neuron allows testing of AI accelerator direct memory usage by IB devices (in similar to CUDA). To support Neuron, configure using --enable-neuron --with-neuron=<path/to/neuron/base>. Run bandwidth tests with Neuron direct by specifying --use_neuron=<core_id> option. Signed-off-by: Daniel Kranzdorf <[email protected]> Signed-off-by: Michael Margolin <[email protected]>
To support Habana Labs, configure using --enable-hl --with-hl=<habanalabs/installation/prefix> Run bandwidth tests with Habana Labs direct by specifying --use_hl <pci_bus_id>. Signed-off-by: Michael Margolin <[email protected]>
Rebased. Thanks. |
#ifdef HAVE_CUDA_DMABUF | ||
if (user_param->use_cuda_dmabuf) { | ||
free(ctx->cuda_buf_dmabuf_fd); | ||
free(ctx->cuda_buf_dmabuf_offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I missed it, but I can't find the memory release of 'dmabuf_offset'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is now a local variable set by allocate_buffer() and passed into ibv_reg_dmabuf_mr().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sshaulnv Is this more clear now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, thanks
@HassanKhadour |
There is this comment from @sshaulnv "Maybe I missed it, but I can't find the memory release of 'dmabuf_offset'" |
Ohh.. sorry forgot to submit. |
@mrgolin Thanks for your contribution! |
Thanks for the review! |
This PR adds abstraction layer for supporting various AI HW accelerators.
Add support for Neuron accelerator device memory usage.
To support Neuron, configure using --enable-neuron --with-neuron=<path/to/neuron/base>.
Run bandwidth tests with Neuron direct by specifying --use_neuron option.
Add support for Habana Labs accelerators device memory usage.
To support Habana Labs, configure using --enable-hl --with-hl=<habanalabs/installation/prefix>
Run bandwidth tests with Habana Labs direct by specifying --use_hl <pci_bus_id>.