
#17215: Add write/read APIs for TTNN tensors allocated on mesh buffer #17513

Merged: 6 commits into main from omilyutin/mesh-tensor-rw on Feb 7, 2025

Conversation

omilyutin-tt (Contributor) commented Feb 3, 2025

Ticket

#17215

Problem description

Tensors allocated on a mesh buffer (aka "mesh tensors") need write and read APIs exposed to TTNN.

What's changed

  • Extended the mesh CQ interface to read/write individual shards, to accommodate TTNN multi-device sharding APIs.
    • Future work includes parallelizing the per-device dispatches internally, within Metal.
  • Added to_device_mesh_tensor and to_host_mesh_tensor, which will be the main APIs used in TTNN to read/write mesh buffer tensors (see the sketch below).
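A hedged sketch of the intended call pattern. Only the two function names come from this PR; the signatures, the MeshDevice argument, and the round-trip helper are assumptions for illustration:

```cpp
#include <memory>

// Hypothetical usage sketch - the signatures below are assumptions, not
// taken from the PR diff; only the names to_device_mesh_tensor and
// to_host_mesh_tensor come from this PR.
Tensor roundtrip_through_mesh(const Tensor& host_tensor, std::shared_ptr<MeshDevice> mesh_device) {
    // Write each shard to the mesh buffer through the mesh command queue.
    Tensor device_tensor = to_device_mesh_tensor(host_tensor, mesh_device);
    // Read the shards back into host storage.
    return to_host_mesh_tensor(device_tensor, /*blocking=*/true);
}
```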


```diff
@@ -378,6 +378,9 @@ struct MultiDeviceStorage {

 using Storage = std::variant<OwnedStorage, DeviceStorage, BorrowedStorage, MultiDeviceHostStorage, MultiDeviceStorage>;

+template <typename T>
+concept OwnedOrBorrowedStorage = std::is_same_v<T, OwnedStorage> || std::is_same_v<T, BorrowedStorage>;
```
Member: consider HostStorage

omilyutin-tt (Contributor Author): There is MultiDeviceHostStorage, unfortunately. I think we need to do a better job of creating a hierarchy here - e.g., HostTensor as a collection of buffers (which can be owned or borrowed) + DeviceTensor that will always be backed by MeshBuffer (eventually). "Owned" vs. "Borrowed" sounds to me like a lower-level concept, an implementation detail of the buffer, not of the whole tensor storage variant.
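A minimal sketch of the hierarchy described above. The names HostBuffer, HostTensorStorage, and DeviceTensorStorage are hypothetical illustrations, not code from this PR:

```cpp
#include <cstddef>
#include <memory>
#include <variant>
#include <vector>

// Hypothetical sketch of the suggested hierarchy: owned vs. borrowed becomes
// a per-buffer detail, while tensor-level storage is just host or device.
struct OwnedBuffer { std::vector<std::byte> data; };                        // storage owns the bytes
struct BorrowedBuffer { std::byte* data = nullptr; std::size_t size = 0; }; // externally owned

using HostBuffer = std::variant<OwnedBuffer, BorrowedBuffer>;

struct HostTensorStorage {
    std::vector<HostBuffer> buffers;  // one per shard for multi-device host tensors
};

struct MeshBuffer;  // stands in for tt::tt_metal::distributed::MeshBuffer

struct DeviceTensorStorage {
    std::shared_ptr<MeshBuffer> buffer;  // always mesh-backed, eventually
};
```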

```diff
     CommandQueue& cq, std::shared_ptr<Buffer> device_buffer, void* host_buffer_data, bool blocking) {
     EnqueueReadBuffer(cq, device_buffer, host_buffer_data, blocking);
 }

 template <typename T>
-inline void read_data_from_device_buffer(std::shared_ptr<Buffer> device_buffer, std::vector<T>& host_buffer) {
+void read_data_from_device_buffer(std::shared_ptr<Buffer> device_buffer, std::vector<T>& host_buffer) {
     ::tt::tt_metal::detail::ReadFromBuffer(device_buffer, host_buffer);
```
Member: thank you

Member: what's the difference between this and EnqueueReadBuffer above?

ayerofieiev-tt (Member) commented Feb 3, 2025: Is this a slow dispatch path? If so, can you please mark the method deprecated, with a comment that it's a slow dispatch path?

Contributor: Slow dispatch is not deprecated. It's actively used for all sorts of bring-up, debugging, and experiments. Marking it deprecated would imply that it needs to be cleaned out of the codebase at some point, which is not the case.

Member: I think TT-NN should not need to care about this.

omilyutin-tt (Contributor Author): +1. TTNN needs to rely on the single API; internally we might fall back to slow dispatch if needed.
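A minimal sketch of that single-API idea. The wrapper and the use_fast_dispatch mode query are hypothetical; EnqueueReadBuffer and detail::ReadFromBuffer appear in the hunk above:

```cpp
// Hypothetical wrapper: TTNN calls one read API, and the dispatch mode stays
// an internal implementation detail. Assumes host_buffer is already sized
// for the read.
template <typename T>
void read_buffer(CommandQueue& cq, std::shared_ptr<Buffer> device_buffer, std::vector<T>& host_buffer) {
    if (use_fast_dispatch()) {  // hypothetical mode query
        // Fast dispatch: goes through the command queue.
        EnqueueReadBuffer(cq, device_buffer, host_buffer.data(), /*blocking=*/true);
    } else {
        // Slow dispatch: direct read, used for bring-up and debugging.
        ::tt::tt_metal::detail::ReadFromBuffer(device_buffer, host_buffer);
    }
}
```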

```cpp
            }
        },
    },
    [](const auto& s) -> owned_buffer::Buffer<T> {
```
Member: good change
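The fragment above appears to be part of a visitor over the Storage variant. A hedged reconstruction of how the new OwnedOrBorrowedStorage concept can collapse the owned and borrowed cases into one constrained lambda (get_host_buffer and extract_buffer are hypothetical names; only the concept itself is from this PR):

```cpp
#include <type_traits>
#include <variant>

// Sketch only - not the PR's exact code.
template <typename T>
owned_buffer::Buffer<T> get_host_buffer(const Storage& storage) {
    return std::visit(
        [](const auto& s) -> owned_buffer::Buffer<T> {
            if constexpr (OwnedOrBorrowedStorage<std::decay_t<decltype(s)>>) {
                return extract_buffer<T>(s);  // hypothetical helper
            } else {
                TT_THROW("Expected owned or borrowed storage");  // tt-metal's error macro
            }
        },
        storage);
}
```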

Resolved review threads: ttnn/cpp/ttnn/tensor/tensor_impl.cpp, tt_metal/distributed/mesh_command_queue.cpp

tt-asaigal (Contributor) left a comment: Will you be adding Python APIs for this next? It would be great to see how the TTNN code remains essentially unchanged when we switch backends.

omilyutin-tt (Contributor Author):

> Will you be adding Python APIs for this next? It would be great to see how the TTNN code remains essentially unchanged when we switch backends.

Most likely I will add a switch to use these mesh-based implementations in the existing .cpu() / to() methods. A bunch of things will fail; I will fix them incrementally as we integrate the mesh* primitives into TTNN. Eventually this will become the new and only code path.

tt-asaigal (Contributor):

> Most likely I will add a switch to use these mesh-based implementations in the existing .cpu() / to() methods.

Yes, a first step would be to add a switch to those top-level APIs, and then see what falls out when we integrate with the functions exposed by core.py. Thanks!
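A hedged sketch of the kind of switch being discussed. The flag, the legacy path, and the signatures are hypothetical; only to_host_mesh_tensor comes from this PR:

```cpp
// Hypothetical: route the existing top-level read path through the new
// mesh-based implementation behind a switch, so TTNN call sites stay unchanged.
template <typename T>
Tensor to_host(const Tensor& tensor, bool blocking) {
    if (use_mesh_tensor_apis()) {                      // hypothetical feature switch
        return to_host_mesh_tensor(tensor, blocking);  // new path from this PR
    }
    return legacy_to_host<T>(tensor, blocking);        // hypothetical existing path
}
```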

cfjchu (Collaborator) left a comment: Overall this looks good, but I need to review your changes and the assumptions you made about the 1:1 mapping of shards to devices.
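A hedged illustration of the 1:1 shard-to-device assumption under review. The method and variable names are hypothetical; the PR only states that the mesh CQ interface was extended to read/write shards:

```cpp
// Hypothetical: write one host shard per device in the mesh, in a fixed
// order. This breaks down if shards and devices are not in 1:1 correspondence.
for (std::size_t i = 0; i < host_shards.size(); ++i) {
    mesh_cq.enqueue_write_shard(mesh_buffer, host_shards[i], device_coords[i]);  // hypothetical API
}
```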

omilyutin-tt force-pushed the omilyutin/mesh-tensor-rw branch from fd642fe to ce98cd9 on February 5, 2025, 04:45
omilyutin-tt force-pushed the omilyutin/mesh-tensor-rw branch from ce98cd9 to 65105aa on February 5, 2025, 22:15
cfjchu (Collaborator): thanks for fixing!

tt-asaigal (Contributor): Latest changes look great, thanks Oleg!

omilyutin-tt merged commit a4deded into main on Feb 7, 2025 - 234 checks passed.
omilyutin-tt deleted the omilyutin/mesh-tensor-rw branch on February 7, 2025, 00:17.