-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RSDK-9255] Remove previously-made symlinks on reconfigure/shutdown #26
base: main
Are you sure you want to change the base?
Conversation
@@ -103,6 +103,12 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
std::lock_guard<std::mutex> lock(state_lock_); | |||
if (!stopped_) { | |||
stopped_ = true; | |||
|
|||
// Remove any symlinks we've added to our module workspace. | |||
if (!state_->model_name.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is impossible to succeed to configure the module with a missing or empty model_name
, so I think you don't need the check:
auto model_name = attributes->find("model_name");
if (model_name == attributes->end()) {
std::ostringstream buffer;
buffer << service_name
<< ": Required parameter `model_name` not found in configuration";
throw std::invalid_argument(buffer.str());
}
auto* const model_name_string = model_name->second->get<std::string>();
if (!model_name_string || model_name_string->empty()) {
std::ostringstream buffer;
buffer << service_name
<< ": Required non-empty string parameter `model_name` is either not a "
"string "
"or is an empty string";
throw std::invalid_argument(buffer.str());
}
state->model_name = std::move(*model_name_string);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppose reconfigure
never succeeds. Couldn't stop
still be called during shutdown? and in that case, if model_name
is empty, I'd rather not delete the entire $VIAM_MODULE_DATA
directory.
It's possible the check is redundant, but I don't think that's obvious. I'd prefer to leave it in, as a defensive programming measure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's sort of the point of the state
. A call to stop
will never see the result of a failed _reconfigure
, nor will any calls to invoke
or metadata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It is an enforced invariant that the
model_name
field of an observableState
object is non-empty, and checking that here makes it look like it isn't. - Is this logic still needed, since the scrubbing now happens on startup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed it to remove the entire subdirectory of symlinks, so even if there is some way for the model name to be unset, it's a non-issue. You might still be right that there's no way for it to be unset, but now I don't need to worry about whether we missed some case.
|
||
// Remove any symlinks we've added to our module workspace. | ||
if (!state_->model_name.empty()) { | ||
remove_symlink_mlmodel_(state_->model_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could responsibility for removing the symlink be made part of State
destruction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having thought about it, I don't know that it is safe to remove the model symlink here. The stop
call, as written, doesn't currently interrupt any active invocations, it just prevents new ones. So you could have an active inference request in flight. I'm not sure what the triton server would do if the model disappeared in the middle of an inference. I can see a few possibilities:
- It is all fine, because the actual file on disk still exists and only the symlink was removed.
- Triton might see the symlink go away (model removed) and kill the pending inference requests
- Random horrible crashes.
There is another consideration though. The way the state
is managed is that a new one gets created while the old one still exists. So if you push management of the symlink into state, you need to ensure that destruction of the old one doesn't erase the symlink out from under the newly created one.
fa5f6db
to
69af20c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take another look. I think I like this version more!
- Each service gets its own subdirectory within the module's directory, so that if you have 2 GPUs you can have 2 Triton services without them both loading everything.
- When a service reconfigures, it blows away its entire directory. This means we clean everything up if we didn't shut down properly last time. It might also mean that we waste work removing and re-adding a module that didn't change during reconfiguration, but reconfiguration is rare and I suspect no one will mind if that takes extra time.
@@ -424,6 +448,10 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
|
|||
const auto& attributes = state->configuration.attributes(); | |||
|
|||
// If we're reconfiguring and have an old model name symlinked into our module workspace, | |||
// remove it before setting up the new repo. | |||
remove_symlink_mlmodel_(state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we reconfigure but don't change models, I wonder if this will cause Triton to throw out the "old" model and reload the "new" model. Then again, reconfiguring should be rare, so it's not the end of the world if we waste some time in here.
@@ -103,6 +103,12 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
std::lock_guard<std::mutex> lock(state_lock_); | |||
if (!stopped_) { | |||
stopped_ = true; | |||
|
|||
// Remove any symlinks we've added to our module workspace. | |||
if (!state_->model_name.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed it to remove the entire subdirectory of symlinks, so even if there is some way for the model name to be unset, it's a non-issue. You might still be right that there's no way for it to be unset, but now I don't need to worry about whether we missed some case.
@@ -450,7 +473,9 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
// With no model repository path, we try to construct our own by symlinking a single | |||
// model path. | |||
symlink_mlmodel_(*state.get()); | |||
state->model_repo_path = std::move(std::getenv("VIAM_MODULE_DATA")); | |||
std::filesystem::path directory_name = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is a good example of where auto
is entirely justified, since you have repeated std::filesystem::path
twice. There's no ambiguity about the type at all. Saying auto directory_name = std::filesystem::path ...
is all you need.
state->model_repo_path = std::move(std::getenv("VIAM_MODULE_DATA")); | ||
std::filesystem::path directory_name = | ||
std::filesystem::path(std::getenv("VIAM_MODULE_DATA")) / state->configuration.name(); | ||
state->model_repo_path = std::move(directory_name.string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The std::move
here is harmless but also I believe probably not required. The result of director_name.string()
is already a temporary so the rvalue reference form of assignment will be selected.
@@ -424,6 +443,10 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
|
|||
const auto& attributes = state->configuration.attributes(); | |||
|
|||
// If we're reconfiguring and have an old model name symlinked into our module workspace, | |||
// remove it before setting up the new repo. | |||
remove_symlink_mlmodel_(state->configuration.name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function computes the same directory path that is computed below. Maybe just compute it once, here in reconfigure
, and use it in both places?
@@ -103,6 +103,12 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
std::lock_guard<std::mutex> lock(state_lock_); | |||
if (!stopped_) { | |||
stopped_ = true; | |||
|
|||
// Remove any symlinks we've added to our module workspace. | |||
if (!state_->model_name.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It is an enforced invariant that the
model_name
field of an observableState
object is non-empty, and checking that here makes it look like it isn't. - Is this logic still needed, since the scrubbing now happens on startup?
@@ -103,6 +103,10 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
std::lock_guard<std::mutex> lock(state_lock_); | |||
if (!stopped_) { | |||
stopped_ = true; | |||
|
|||
// Remove any symlinks we've added to our module workspace. | |||
remove_symlink_mlmodel_(*state_.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, stop
doesn't wait for any in-progress inferences to complete. It just prevents any new inference requests from starting. So I don't think it is entirely safe to remove the model link here. If you would like to make it so that stop
waits on any pending inference requests to complete, that could probably be done, but it would require coordination between stop
and infer
that doesn't currently exist. There is also an inference cancellation API, which could also form part of the responsibility of stop
, but it would still need to wait for the inference futures to complete. That'd require keeping the inference futures in a data structure of some sort, cancelling them all, and waiting for in-progress calls to infer
to drain.
What should be happening is that the symlink should be removed once there are no more state
objects referencing it. The best way to manage that is to push creation and removal of the symlink into a management class that uses RAII, and then have state
hold a shared_ptr
to it.
std::filesystem::path(std::getenv("VIAM_MODULE_DATA")) / state.model_name; | ||
|
||
std::filesystem::path directory_name = get_module_data_path_(state); | ||
state.model_repo_path = std::move(directory_name.string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The std::move
isn't needed here, the result of .string()
is already a temporary.
@@ -424,6 +448,10 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
|
|||
const auto& attributes = state->configuration.attributes(); | |||
|
|||
// If we're reconfiguring and have an old model name symlinked into our module workspace, | |||
// remove it before setting up the new repo. | |||
remove_symlink_mlmodel_(*state.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same problem here in reconfigure_
as in stop_
. How do you know you aren't yanking the link out from under an inference running under the existing state as you build the new one?
I actually think the place to do this scrubbing may be in Service::Service
in the constructor body, before the state gets initialized (so it would need to come out of the initializer list). Since a resource never changes its name, and you are keying from the name, and this is just to scrub any stale state (which you only need to do once on startup since it is really there to deal with unclean shutdown), it seems to me you only need to do it one time, not each time the service is reconfigured.
// GPUs. Each one gets its own subdirectory within our module data to avoid hitting the | ||
// others. | ||
std::filesystem::path directory_name = | ||
std::filesystem::path(std::getenv("VIAM_MODULE_DATA")) / state.configuration.name(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name
field of the configuration becomes available canonically from Resource::get_resource_name
(see the Service
constructor where we pass configuration.name
up to the parent class constructor, which passes it along up to the Resource
constructor. So you can just write get_resource_name()
here instead of state.configuration.name()
.
@@ -398,6 +415,13 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
std::filesystem::create_directory_symlink(*model_path_string, triton_name); | |||
} | |||
|
|||
static void remove_symlink_mlmodel_(const struct state_& state) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function seems a little misnamed, since it doesn't just remove the model symlink, but everything under the directory.
@@ -340,7 +344,16 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk | |||
return state_; | |||
} | |||
|
|||
static void symlink_mlmodel_(const struct state_& state) { | |||
static inline std::filesystem::path get_module_data_path_(const struct state_& state) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need for this to be inline
- all functions declared in the class body are implicitly inline.
Everything compiles, but I'm still testing it out right now...