RFC/Feature Request: vtorc should make it easy to find audit info related to recoveries #17465

ejortegau · 2025-01-06T15:31:20Z

Feature Description

Context & prior art

Classical Orchestrator exposed, under /web/audit-recovery/uid/<recoveryid>, details about each recovery and the steps taken as part of it. This allowed an operator to easily see whether the recovery had succeeded, as well as some information about each step taken.

Furthermore, if one ran more than one Orchestrator instance backed by the shared MySQL database, the same information was visible no matter which Orchestrator instance was checked. (I am not sure how this worked in the raft setup with independent local databases, as I never used it.)

In vtorc, however, this information is (at least currently) not clearly surfaced to the user anywhere. The current vtorc web UI only seems to show a list of recoveries, with no details on each one.

Furthermore, since there is no shared state, and no cluster leader, users need to check each individual vtorc instance to see all the recoveries. This is IMHO a considerable feature gap when it comes to observability.

Feature Request

There should be some easy means to centrally view what recoveries took place and how each step of them went.

RFC

There are multiple ways to try to solve the issue above:

1. Emit clearly distinguishable log entries for recoveries and delegate visualization to the user's logging stack (e.g. Kibana). Most likely no other changes are needed. For example, all log entries of interest related to recoveries could have a prefix like <Problem> Recovery <keyspace>/<shard>: (e.g. DeadPrimary Recovery commerce/80-: ). This is probably the lowest-effort option to implement, but it assumes the existence of log processing and visualization infrastructure.
2. vtorc implements a new web UI endpoint for the information. This comes with some challenges:
   - For users running multiple vtorc instances:
     - The user needs to visit the new web UI for each instance.
     - Or the vtorcs need to know about each other, and the one serving the web request fetches the information from the rest to present a single view. This would imply that, in addition to the UI, vtorc should also expose a recovery details API endpoint that the other vtorcs can query to consolidate information from all of them.
   - Data persistence: vtorc currently uses an sqlite DB which defaults to in-memory storage. Care should be taken to ensure it is kept across vtorc process restarts if the user does not want to lose the recovery details. E.g., use an actual sqlite file instead of in-memory storage and, if running in k8s, use a persistent volume claim for its path.
3. vtadmin adds functionality to query all vtorc instances, aggregates the information about recoveries from all of them, and shows it to the user in a new web UI. This requires registering all vtorc instances somewhere so that vtadmin can find them. I guess that would be the topology? In this scenario, the user should still take care of data persistence for the vtorc DB.
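
For the first option, the proposed prefix could come from a small helper so that every recovery-related log line is formatted the same way. The following is a minimal Go sketch of that idea. The function names recoveryLogPrefix and formatRecoveryLine are hypothetical and are not existing vtorc APIs.

```go
package main

import "fmt"

// recoveryLogPrefix builds the "<Problem> Recovery <keyspace>/<shard>: "
// prefix suggested in this RFC. Hypothetical helper, not part of vtorc.
func recoveryLogPrefix(problem, keyspace, shard string) string {
	return fmt.Sprintf("%s Recovery %s/%s: ", problem, keyspace, shard)
}

// formatRecoveryLine prepends the recovery prefix to a free-form description
// of a single recovery step, producing one greppable log line.
func formatRecoveryLine(problem, keyspace, shard, step string) string {
	return recoveryLogPrefix(problem, keyspace, shard) + step
}
```

A call such as formatRecoveryLine("DeadPrimary", "commerce", "80-", "promoting new primary") would yield "DeadPrimary Recovery commerce/80-: promoting new primary" (the step text is only an illustration), which a logging stack can filter on by prefix.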

The last two approaches are more feature-rich/complete for the user, as they do not assume an existing logging pipeline, but they obviously require more work. Also, perhaps the first approach could be done as a stopgap measure while the second or third one is implemented.
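
To make the consolidation step of the second and third approaches more concrete, here is a rough Go sketch of how one process (a vtorc peer or vtadmin) might fetch recovery records from every vtorc instance and merge them into a single view. Everything in it is an assumption for illustration: the Recovery struct and its fields, and the /api/recovery-details path, do not exist in vtorc today. How the list of vtorc URLs is discovered is left out.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"sort"
)

// Recovery is a hypothetical record shape for one recovery.
type Recovery struct {
	ID            string   `json:"id"`
	Instance      string   `json:"instance"` // vtorc instance that ran the recovery
	Problem       string   `json:"problem"`
	Keyspace      string   `json:"keyspace"`
	Shard         string   `json:"shard"`
	StartedAtUnix int64    `json:"started_at_unix"`
	Successful    bool     `json:"successful"`
	Steps         []string `json:"steps"`
}

// fetchRecoveries reads the recoveries known to one vtorc instance from a
// hypothetical JSON endpoint.
func fetchRecoveries(ctx context.Context, client *http.Client, baseURL string) ([]Recovery, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/api/recovery-details", nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var recs []Recovery
	if err := json.NewDecoder(resp.Body).Decode(&recs); err != nil {
		return nil, err
	}
	return recs, nil
}

// mergeRecoveries flattens per-instance results and sorts them newest first.
func mergeRecoveries(perInstance ...[]Recovery) []Recovery {
	var all []Recovery
	for _, recs := range perInstance {
		all = append(all, recs...)
	}
	sort.SliceStable(all, func(i, j int) bool {
		return all[i].StartedAtUnix > all[j].StartedAtUnix
	})
	return all
}

// aggregateRecoveries queries every instance and returns the merged view
// together with the errors of any instances that could not be reached.
func aggregateRecoveries(ctx context.Context, client *http.Client, instances []string) ([]Recovery, []error) {
	var perInstance [][]Recovery
	var errs []error
	for _, baseURL := range instances {
		recs, err := fetchRecoveries(ctx, client, baseURL)
		if err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", baseURL, err))
			continue
		}
		perInstance = append(perInstance, recs)
	}
	return mergeRecoveries(perInstance...), errs
}
```

Returning partial results alongside per-instance errors is a deliberate choice in this sketch: an unreachable vtorc should not hide the recoveries reported by the others.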

Thoughts?

Use Case(s)

As a Vitess operator, I want to be able to easily check whether vtorc detected and attempted to recover a problem, and the steps taken during that recovery.

ejortegau added the labels Type: RFC (Request For Comment) and Needs Triage (This issue needs to be correctly labelled and triaged) on Jan 6, 2025
