RFC/Feature Request: vtorc should make it easy to find audit info related to recoveries #17465

Open
ejortegau opened this issue Jan 6, 2025 · 0 comments
Labels
Needs Triage This issue needs to be correctly labelled and triaged Type: RFC Request For Comment

Feature Description

Context & prior art

Classic Orchestrator exposed, under /web/audit-recovery/uid/<recoveryid>, details about each recovery and the steps taken as part of it. This let an operator easily see whether the recovery had succeeded, along with some information about each step taken:

image

Furthermore, if one had more than one Orchestrator instance (and set them up with the shared MySQL database - not sure how this worked on the raft setup with independent local databases as I never used it), no matter which Orchestrator instance was checked, the same information would be visible.

In vtorc, however, this information is (at least currently) not so clearly surfaced to the user anywhere. The current vtorc web UI only seems to show a list of recoveries with no details on each one:

image

Furthermore, since there is no shared state, and no cluster leader, users need to check each individual vtorc instance to see all the recoveries. This is IMHO a considerable feature gap when it comes to observability.

Feature Request

There should be some easy means to centrally view what recoveries took place and how each step of them went.

RFC

There are multiple ways to try to solve the issue above:

  • Emit clearly distinguishable log entries for recoveries and delegate visualization to the user's logging stack (e.g. Kibana). Most likely no other changes are needed. For example, all log entries of interest related to recoveries could have a prefix like <Problem> Recovery <keyspace>/<shard>: (e.g. DeadPrimary Recovery commerce/80-: ). This is probably the lowest-effort option to implement, but it assumes the existence of log processing & visualization infrastructure.
  • vtorc implements a new web UI endpoint for the information. This comes with some challenges:
    • For users running multiple vtorc instances:
      • The user needs to visit the new web UI for each instance.
      • Or the vtorcs need to know about each other, and the one serving the web request fetches the information from the rest to present a single view. This would imply that, in addition to the UI, vtorc should also expose a recovery details API endpoint that the other vtorcs can query to consolidate information from all of them.
    • Data persistence: vtorc currently uses an sqlite DB which defaults to in-memory storage. Care should be taken to ensure it's kept across vtorc process restarts if the user does not want to lose the recovery details. E.g., use an actual sqlite file instead of in-memory storage and, if running in k8s, use a persistent volume claim for its path.
  • vtadmin adds functionality to query all vtorc instances, aggregates the information about recoveries from all of them, and shows it to the user in a new web UI. This requires registering all vtorc instances somewhere so that vtadmin can find them. I guess that would be the topology? In this scenario, the user should still take care of data persistence for the vtorc DB.
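
To make the first option more concrete, here is a minimal Go sketch of what emitting and later matching such a prefix could look like. It is only an illustration of the `<Problem> Recovery <keyspace>/<shard>: ` format suggested above: the names `RecoveryLogPrefix`, `ParseRecoveryLogLine` and `RecoveryLogEntry` are hypothetical and do not exist in vtorc, and the regular expression's character classes are an assumption about what problem, keyspace and shard names may contain.

```go
package main

import (
	"fmt"
	"regexp"
)

// recoveryPrefixRe matches "<Problem> Recovery <keyspace>/<shard>: <message>".
// The allowed characters per field are an assumption made for this sketch.
var recoveryPrefixRe = regexp.MustCompile(`^(\w+) Recovery ([^/\s]+)/([^:\s]+): (.*)$`)

// RecoveryLogEntry is a hypothetical parsed form of a recovery log line.
type RecoveryLogEntry struct {
	Problem  string
	Keyspace string
	Shard    string
	Message  string
}

// RecoveryLogPrefix builds the prefix proposed in the RFC,
// e.g. "DeadPrimary Recovery commerce/80-: ".
func RecoveryLogPrefix(problem, keyspace, shard string) string {
	return fmt.Sprintf("%s Recovery %s/%s: ", problem, keyspace, shard)
}

// ParseRecoveryLogLine reports whether a log line carries the recovery
// prefix and, if so, returns its fields.
func ParseRecoveryLogLine(line string) (RecoveryLogEntry, bool) {
	m := recoveryPrefixRe.FindStringSubmatch(line)
	if m == nil {
		return RecoveryLogEntry{}, false
	}
	return RecoveryLogEntry{Problem: m[1], Keyspace: m[2], Shard: m[3], Message: m[4]}, true
}

func main() {
	line := RecoveryLogPrefix("DeadPrimary", "commerce", "80-") + "promoting replica"
	entry, ok := ParseRecoveryLogLine(line)
	// prints: true DeadPrimary commerce 80- promoting replica
	fmt.Println(ok, entry.Problem, entry.Keyspace, entry.Shard, entry.Message)
}
```

A fixed, parseable prefix like this is what would let a logging stack filter or group entries per problem and per keyspace/shard without any further changes in vtorc.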

The last two approaches are more feature-rich/complete for the user, as they don't assume the existence of a logging pipeline, but they obviously require more work. Also, perhaps the first approach could be done as a stopgap measure while the second or third one is implemented.

Thoughts?

Use Case(s)

As a Vitess operator, I want to be able to easily check whether vtorc detected and attempted to recover from a problem, and see the steps taken during that recovery.

@ejortegau ejortegau added Type: RFC Request For Comment Needs Triage This issue needs to be correctly labelled and triaged labels Jan 6, 2025