Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RSDK-9255] Remove previously-made symlinks on reconfigure/shutdown #26

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

penguinland
Copy link
Collaborator

Everything compiles, but I'm still testing it out right now...

@penguinland penguinland requested a review from acmorrow December 5, 2024 21:02
src/viam_mlmodelservice_triton_impl.cpp Show resolved Hide resolved
@@ -103,6 +103,12 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk
std::lock_guard<std::mutex> lock(state_lock_);
if (!stopped_) {
stopped_ = true;

// Remove any symlinks we've added to our module workspace.
if (!state_->model_name.empty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is impossible to succeed to configure the module with a missing or empty model_name, so I think you don't need the check:

        auto model_name = attributes->find("model_name");
        if (model_name == attributes->end()) {
            std::ostringstream buffer;
            buffer << service_name
                   << ": Required parameter `model_name` not found in configuration";
            throw std::invalid_argument(buffer.str());
        }

        auto* const model_name_string = model_name->second->get<std::string>();
        if (!model_name_string || model_name_string->empty()) {
            std::ostringstream buffer;
            buffer << service_name
                   << ": Required non-empty string parameter `model_name` is either not a "
                      "string "
                      "or is an empty string";
            throw std::invalid_argument(buffer.str());
        }
        state->model_name = std::move(*model_name_string);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose reconfigure never succeeds. Couldn't stop still be called during shutdown? and in that case, if model_name is empty, I'd rather not delete the entire $VIAM_MODULE_DATA directory.

It's possible the check is redundant, but I don't think that's obvious. I'd prefer to leave it in, as a defensive programming measure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sort of the point of the state. A call to stop will never see the result of a failed _reconfigure, nor will any calls to invoke or metadata.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It is an enforced invariant that the model_name field of an observable State object is non-empty, and checking that here makes it look like it isn't.
  • Is this logic still needed, since the scrubbing now happens on startup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to remove the entire subdirectory of symlinks, so even if there is some way for the model name to be unset, it's a non-issue. You might still be right that there's no way for it to be unset, but now I don't need to worry about whether we missed some case.


// Remove any symlinks we've added to our module workspace.
if (!state_->model_name.empty()) {
remove_symlink_mlmodel_(state_->model_name);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could responsibility for removing the symlink be made part of State destruction?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having thought about it, I don't know that it is safe to remove the model symlink here. The stop call, as written, doesn't currently interrupt any active invocations, it just prevents new ones. So you could have an active inference request in flight. I'm not sure what the triton server would do if the model disappeared in the middle of an inference. I can see a few possibilities:

  • It is all fine, because the actual file on disk still exists and only the symlink was removed.
  • Triton might see the symlink go away (model removed) and kill the pending inference requests
  • Random horrible crashes.

There is another consideration though. The way the state is managed is that a new one gets created while the old one still exists. So if you push management of the symlink into state, you need to ensure that destruction of the old one doesn't erase the symlink out from under the newly created one.

Copy link
Collaborator Author

@penguinland penguinland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take another look. I think I like this version more!

  • Each service gets its own subdirectory within the module's directory, so that if you have 2 GPUs you can have 2 Triton services without them both loading everything.
  • When a service reconfigures, it blows away its entire directory. This means we clean everything up if we didn't shut down properly last time. It might also mean that we waste work removing and re-adding a module that didn't change during reconfiguration, but reconfiguration is rare and I suspect no one will mind if that takes extra time.

@@ -424,6 +448,10 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk

const auto& attributes = state->configuration.attributes();

// If we're reconfiguring and have an old model name symlinked into our module workspace,
// remove it before setting up the new repo.
remove_symlink_mlmodel_(state);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we reconfigure but don't change models, I wonder if this will cause Triton to throw out the "old" model and reload the "new" model. Then again, reconfiguring should be rare, so it's not the end of the world if we waste some time in here.

@@ -103,6 +103,12 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk
std::lock_guard<std::mutex> lock(state_lock_);
if (!stopped_) {
stopped_ = true;

// Remove any symlinks we've added to our module workspace.
if (!state_->model_name.empty()) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to remove the entire subdirectory of symlinks, so even if there is some way for the model name to be unset, it's a non-issue. You might still be right that there's no way for it to be unset, but now I don't need to worry about whether we missed some case.

@penguinland penguinland requested a review from acmorrow February 7, 2025 17:00
@@ -450,7 +473,9 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk
// With no model repository path, we try to construct our own by symlinking a single
// model path.
symlink_mlmodel_(*state.get());
state->model_repo_path = std::move(std::getenv("VIAM_MODULE_DATA"));
std::filesystem::path directory_name =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is a good example of where auto is entirely justified, since you have repeated std::filesystem::path twice. There's no ambiguity about the type at all. Saying auto directory_name = std::filesystem::path ... is all you need.

state->model_repo_path = std::move(std::getenv("VIAM_MODULE_DATA"));
std::filesystem::path directory_name =
std::filesystem::path(std::getenv("VIAM_MODULE_DATA")) / state->configuration.name();
state->model_repo_path = std::move(directory_name.string());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The std::move here is harmless but also I believe probably not required. The result of director_name.string() is already a temporary so the rvalue reference form of assignment will be selected.

@@ -424,6 +443,10 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk

const auto& attributes = state->configuration.attributes();

// If we're reconfiguring and have an old model name symlinked into our module workspace,
// remove it before setting up the new repo.
remove_symlink_mlmodel_(state->configuration.name());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function computes the same directory path that is computed below. Maybe just compute it once, here in reconfigure, and use it in both places?

@@ -103,6 +103,12 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk
std::lock_guard<std::mutex> lock(state_lock_);
if (!stopped_) {
stopped_ = true;

// Remove any symlinks we've added to our module workspace.
if (!state_->model_name.empty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It is an enforced invariant that the model_name field of an observable State object is non-empty, and checking that here makes it look like it isn't.
  • Is this logic still needed, since the scrubbing now happens on startup?

@@ -103,6 +103,10 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk
std::lock_guard<std::mutex> lock(state_lock_);
if (!stopped_) {
stopped_ = true;

// Remove any symlinks we've added to our module workspace.
remove_symlink_mlmodel_(*state_.get());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, stop doesn't wait for any in-progress inferences to complete. It just prevents any new inference requests from starting. So I don't think it is entirely safe to remove the model link here. If you would like to make it so that stop waits on any pending inference requests to complete, that could probably be done, but it would require coordination between stop and infer that doesn't currently exist. There is also an inference cancellation API, which could also form part of the responsibility of stop, but it would still need to wait for the inference futures to complete. That'd require keeping the inference futures in a data structure of some sort, cancelling them all, and waiting for in-progress calls to infer to drain.

What should be happening is that the symlink should be removed once there are no more state objects referencing it. The best way to manage that is to push creation and removal of the symlink into a management class that uses RAII, and then have state hold a shared_ptr to it.

std::filesystem::path(std::getenv("VIAM_MODULE_DATA")) / state.model_name;

std::filesystem::path directory_name = get_module_data_path_(state);
state.model_repo_path = std::move(directory_name.string());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The std::move isn't needed here, the result of .string() is already a temporary.

@@ -424,6 +448,10 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk

const auto& attributes = state->configuration.attributes();

// If we're reconfiguring and have an old model name symlinked into our module workspace,
// remove it before setting up the new repo.
remove_symlink_mlmodel_(*state.get());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same problem here in reconfigure_ as in stop_. How do you know you aren't yanking the link out from under an inference running under the existing state as you build the new one?

I actually think the place to do this scrubbing may be in Service::Service in the constructor body, before the state gets initialized (so it would need to come out of the initializer list). Since a resource never changes its name, and you are keying from the name, and this is just to scrub any stale state (which you only need to do once on startup since it is really there to deal with unclean shutdown), it seems to me you only need to do it one time, not each time the service is reconfigured.

// GPUs. Each one gets its own subdirectory within our module data to avoid hitting the
// others.
std::filesystem::path directory_name =
std::filesystem::path(std::getenv("VIAM_MODULE_DATA")) / state.configuration.name();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name field of the configuration becomes available canonically from Resource::get_resource_name (see the Service constructor where we pass configuration.name up to the parent class constructor, which passes it along up to the Resource constructor. So you can just write get_resource_name() here instead of state.configuration.name().

@@ -398,6 +415,13 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk
std::filesystem::create_directory_symlink(*model_path_string, triton_name);
}

static void remove_symlink_mlmodel_(const struct state_& state) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function seems a little misnamed, since it doesn't just remove the model symlink, but everything under the directory.

@@ -340,7 +344,16 @@ class Service : public vsdk::MLModelService, public vsdk::Stoppable, public vsdk
return state_;
}

static void symlink_mlmodel_(const struct state_& state) {
static inline std::filesystem::path get_module_data_path_(const struct state_& state) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need for this to be inline - all functions declared in the class body are implicitly inline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants