[Profiler] Handle Tensor Sizes/Strides Parsing Error (pytorch#134862)
Summary:
Currently some jobs are encountering the following trace, P1539415198. This suggests that the path that parses tensors is prone to encountering an invalid address. This is possibly occurring because, for some reason, the sizes() and strides() of a Tensor do not have the same number of dimensions. We assume they do when iterating through the shapes in the IValue generator. When browsing some of the tensor implementations, I found that some of the size and stride paths differ, which could be the cause of this issue. Regardless, the profiler should be flexible enough to handle such issues without bringing down the whole main thread.

If the crashes persist, this change will still give us a data point as to where they are occurring, and we can rule out the strides/sizes as the culprit.

Test Plan: This change doesn't break anything in the happy path; it just makes sure the bad path does not exit abruptly. We should use this to debug which events have mismatching dimensions between sizes and strides.

Differential Revision: D62008788

Pull Request resolved: pytorch#134862
Approved by: https://github.com/aaronenyeshi
sraikund16 authored and pytorchmergebot committed Sep 3, 2024
1 parent f05b716 commit 9ffcca7
Showing 3 changed files with 29 additions and 6 deletions.
30 changes: 26 additions & 4 deletions torch/csrc/profiler/collection.cpp
@@ -33,11 +33,18 @@ RawTensorMetadataBase::RawTensorMetadataBase(const at::Tensor& t)
     : data_{t.has_storage() ? t.storage().data() : nullptr},
       dtype_{t.scalar_type()},
       layout_{t.layout()},
-      dim_{static_cast<uint32_t>(t.sizes().size())} {
+      size_dim_{static_cast<uint32_t>(t.sizes().size())},
+      stride_dim_{static_cast<uint32_t>(t.strides().size())} {
   TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
       t.sizes().size() <= std::numeric_limits<uint32_t>::max(),
       "Cannot profile Tensors of size > uint32 max. Got dim: ",
       t.sizes().size());
+  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(
+      t.sizes().size() != t.strides().size(),
+      "Tensor has mismatching sizes and strides. Sizes: ",
+      t.sizes().size(),
+      " Strides: ",
+      t.strides().size());
 }

 RawTensorMetadata::RawTensorMetadata(const at::Tensor& t)
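
A note on the new debug assertion in the hunk above: its condition is `t.sizes().size() != t.strides().size()`, while its message describes a mismatch. An assertion fails when its condition is false, so as written it would appear to fire when the two counts are equal. If the intent is to flag mismatches, the condition would state the invariant (equality) instead. The sketch below illustrates that polarity with the standard `assert` macro as a stand-in for `TORCH_INTERNAL_ASSERT_DEBUG_ONLY`; `dimsMatch` and `checkDims` are hypothetical helpers, not PyTorch code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of PyTorch): the invariant the decoder
// relies on, namely one stride per size.
bool dimsMatch(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  return sizes.size() == strides.size();
}

// An assertion states what must be TRUE and fires when the condition is
// false. To flag a mismatch, assert that the counts are equal.
void checkDims(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  assert(dimsMatch(sizes, strides) &&
         "Tensor has mismatching sizes and strides");
}
```

Calling `checkDims` with two vectors of equal length returns normally; with vectors of different lengths it aborts in builds where `assert` is enabled.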
@@ -181,14 +188,29 @@ auto InputOutputEncoder::getIValueGenerator(const IOType& io_type) {
           ivals_it = ivalues_.begin(),
           io_type]() mutable {
     auto decode_tensor = [&]() -> TensorMetadata {
-      const auto& raw_metadata = *tensor_metadata_it++;
       std::vector<int64_t> sizes;
       std::vector<int64_t> strides;
-      for (C10_UNUSED const auto _ : c10::irange(raw_metadata.dim_)) {
+      if (tensor_metadata_it.exhausted()) {
+        LOG(WARNING)
+            << "Tensor metadata exhausted prematurely. Reported shapes may be inaccurate!";
+        return {RawTensorMetadata(), sizes, strides};
+      }
+      const auto& raw_metadata = *tensor_metadata_it++;
+      for (C10_UNUSED const auto _ : c10::irange(raw_metadata.size_dim_)) {
+        if (tensor_size_strides_it.exhausted()) {
+          LOG(WARNING)
+              << "Expected Tensor Size mismatch with raw Tensor metadata. Reported shapes may be inaccurate!";
+          return {raw_metadata, sizes, strides};
+        }
         sizes.push_back(*tensor_size_strides_it++);
       }
       if (raw_metadata.layout_ == at::kStrided) {
-        for (C10_UNUSED const auto _ : c10::irange(raw_metadata.dim_)) {
+        for (C10_UNUSED const auto _ : c10::irange(raw_metadata.stride_dim_)) {
+          if (tensor_size_strides_it.exhausted()) {
+            LOG(WARNING)
+                << "Expected Tensor Strides mismatch with raw Tensor metadata. Reported shapes may be inaccurate!";
+            return {raw_metadata, sizes, strides};
+          }
           strides.push_back(*tensor_size_strides_it++);
         }
       }
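
The guarded decoding in the getIValueGenerator hunk can be sketched in isolation. The following is a minimal illustration of the pattern, not the profiler's actual code: `RawDims`, `DecodedShape`, and `decodeShape` are hypothetical stand-ins, and a plain index into a `std::vector` replaces the `tensor_size_strides_it` iterator and its `exhausted()` check. The point is that sizes and strides are read with their own recorded counts, and running out of data produces a warning and a partial result rather than a read past the end of the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-in for the per-tensor metadata (assumption).
struct RawDims {
  uint32_t size_dim;    // number of sizes recorded for the tensor
  uint32_t stride_dim;  // number of strides recorded for the tensor
  bool strided;         // only strided layouts carry strides
};

struct DecodedShape {
  std::vector<int64_t> sizes;
  std::vector<int64_t> strides;
  bool complete = true;  // false if the buffer ran out early
};

// Reads size_dim sizes, then stride_dim strides, from `flat` starting at
// `pos`. If the buffer is exhausted, warns and returns what was decoded.
DecodedShape decodeShape(
    const RawDims& raw,
    const std::vector<int64_t>& flat,
    std::size_t& pos) {
  DecodedShape out;
  for (uint32_t i = 0; i < raw.size_dim; ++i) {
    if (pos >= flat.size()) {
      std::cerr << "warning: sizes exhausted early; shapes may be inaccurate\n";
      out.complete = false;
      return out;
    }
    out.sizes.push_back(flat[pos++]);
  }
  if (raw.strided) {
    for (uint32_t i = 0; i < raw.stride_dim; ++i) {
      if (pos >= flat.size()) {
        std::cerr
            << "warning: strides exhausted early; shapes may be inaccurate\n";
        out.complete = false;
        return out;
      }
      out.strides.push_back(flat[pos++]);
    }
  }
  return out;
}
```

For example, decoding `RawDims{2, 3, true}` against a buffer holding only four values yields two sizes, two strides, and `complete == false`, where an unchecked loop would have read past the end of the buffer.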
3 changes: 2 additions & 1 deletion torch/csrc/profiler/collection.h
@@ -47,7 +47,8 @@ struct TORCH_API RawTensorMetadataBase {
   StorageImplData data_;
   c10::ScalarType dtype_{c10::ScalarType::Undefined};
   c10::Layout layout_{c10::Layout::Strided};
-  uint32_t dim_{0};
+  uint32_t size_dim_{0};
+  uint32_t stride_dim_{0};
 };

 // Collected during profiling.
2 changes: 1 addition & 1 deletion torch/csrc/profiler/python/init.cpp
@@ -441,7 +441,7 @@ void initPythonBindings(PyObject* module) {
             return py::reinterpret_borrow<py::object>(
                 torch::autograd::utils::wrap(metadata.dtype_));
           })
-      .def_readonly("dim", &TensorMetadata::dim_)
+      .def_readonly("dim", &TensorMetadata::size_dim_)
       .def_readonly("sizes", &TensorMetadata::sizes_)
       .def_readonly("strides", &TensorMetadata::strides_);
