Classical Orchestrator exposed, under /web/audit-recovery/uid/<recoveryid>, details about each recovery and the steps taken as part of it. This let an operator easily see whether a recovery had succeeded, along with some information about each step taken.
Furthermore, if one ran more than one Orchestrator instance against the shared MySQL database, the same information was visible no matter which instance was checked. (I'm not sure how this worked on the raft setup with independent local databases, as I never used it.)
In vtorc, however, this information is (at least currently) not clearly surfaced to the user anywhere. The current vtorc web UI only seems to show a list of recoveries, with no details on each one.
Furthermore, since there is no shared state, and no cluster leader, users need to check each individual vtorc instance to see all the recoveries. This is IMHO a considerable feature gap when it comes to observability.
Feature Request
There should be an easy way to centrally view which recoveries took place and how each of their steps went.
RFC
There are multiple ways to try to solve the issue above:
1. Emit clearly distinguishable log entries for recoveries and delegate visualization to the user's logging stack (e.g. Kibana). Most likely no other changes are needed. For example, all log entries of interest related to recoveries could carry a prefix like <Problem> Recovery <keyspace>/<shard>: (e.g. DeadPrimary Recovery commerce/80-: ). This is probably the lowest-effort option to implement, but it assumes the existence of log processing and visualization infrastructure.
2. vtorc implements a new web UI endpoint for the information. This comes with some challenges:
   - For users running multiple vtorc instances:
     - The user needs to visit the new web UI for each instance.
     - Or the vtorcs need to know about each other, and the one serving the web request fetches the information from the rest to present a single view. This would imply that, in addition to the UI, vtorc should also expose a recovery-details API endpoint that the other vtorcs can query to consolidate information from all of them.
   - Data persistence: vtorc currently uses an sqlite DB which defaults to in-memory storage. Users who do not want to lose the recovery details across vtorc process restarts need to make sure the DB is persisted. E.g., use an actual sqlite file instead of in-memory storage and, if running in k8s, use a persistent volume claim for its path.
3. vtadmin adds functionality to query all vtorc instances, aggregates the information about recoveries from all of them, and shows it to the user in a new web UI. This requires registering all vtorc instances somewhere so that vtadmin can find them. I guess that would be the topology? In this scenario, the user should still take care of data persistence for the vtorc DB.
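To make the log-prefix idea from the first option a bit more concrete, here is a minimal Go sketch of how such a prefix could be built and parsed back by a log pipeline. The function names `recoveryLogPrefix` and `parseRecoveryLogLine` are hypothetical and not part of vtorc; only the prefix format itself comes from the proposal above.

```go
package main

import (
	"fmt"
	"strings"
)

// recoveryLogPrefix builds the proposed "<Problem> Recovery <keyspace>/<shard>: "
// prefix. Hypothetical helper, not an existing vtorc function.
func recoveryLogPrefix(problem, keyspace, shard string) string {
	return fmt.Sprintf("%s Recovery %s/%s: ", problem, keyspace, shard)
}

// parseRecoveryLogLine splits a prefixed log line back into its parts, as a
// log processor might do. ok is false if the line does not carry the prefix.
func parseRecoveryLogLine(line string) (problem, keyspace, shard, message string, ok bool) {
	head, rest, found := strings.Cut(line, ": ")
	if !found {
		return "", "", "", "", false
	}
	prob, target, found := strings.Cut(head, " Recovery ")
	if !found || prob == "" {
		return "", "", "", "", false
	}
	ks, sh, found := strings.Cut(target, "/")
	if !found || ks == "" || sh == "" {
		return "", "", "", "", false
	}
	return prob, ks, sh, rest, true
}

func main() {
	line := recoveryLogPrefix("DeadPrimary", "commerce", "80-") + "promoting new primary"
	fmt.Println(line)

	if problem, ks, shard, msg, ok := parseRecoveryLogLine(line); ok {
		fmt.Printf("problem=%s keyspace=%s shard=%s msg=%q\n", problem, ks, shard, msg)
	}
}
```

A fixed, parseable prefix like this would let users filter and group recovery entries by problem, keyspace and shard in whatever logging stack they already run.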
The last two approaches are more feature-rich and complete for the user, since they don't assume an existing logging pipeline, but they obviously require more work. Perhaps the first approach could also be done as a stopgap measure while the second or third one is implemented.
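For the second and third approaches, the consolidation step could look roughly like the Go sketch below: whoever serves the view (a vtorc instance or vtadmin) collects the recoveries reported by each instance and merges them into one list, newest first. The `Recovery` type, its fields, the instance names and `mergeRecoveries` are all assumptions made for illustration. They do not reflect vtorc's actual data model or API, and the fetching of data from each instance is left out.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Recovery is a hypothetical shape for one recovery as reported by a vtorc
// instance. It is not vtorc's real data model.
type Recovery struct {
	ID         string
	Instance   string // vtorc instance that ran the recovery (filled in on merge)
	Problem    string
	Keyspace   string
	Shard      string
	StartedAt  time.Time
	Successful bool
	Steps      []string
}

// mergeRecoveries combines the per-instance lists into a single view,
// ordered newest first, with instance name and ID as tie-breakers.
func mergeRecoveries(perInstance map[string][]Recovery) []Recovery {
	var all []Recovery
	for instance, recs := range perInstance {
		for _, r := range recs {
			r.Instance = instance
			all = append(all, r)
		}
	}
	sort.Slice(all, func(i, j int) bool {
		if !all[i].StartedAt.Equal(all[j].StartedAt) {
			return all[i].StartedAt.After(all[j].StartedAt)
		}
		if all[i].Instance != all[j].Instance {
			return all[i].Instance < all[j].Instance
		}
		return all[i].ID < all[j].ID
	})
	return all
}

// sampleData returns made-up recoveries from two made-up vtorc instances.
func sampleData() map[string][]Recovery {
	base := time.Date(2023, 1, 1, 12, 0, 0, 0, time.UTC)
	return map[string][]Recovery{
		"vtorc-a": {{ID: "r1", Problem: "DeadPrimary", Keyspace: "commerce", Shard: "80-",
			StartedAt: base, Successful: true, Steps: []string{"detected", "promoted replica"}}},
		"vtorc-b": {{ID: "r2", Problem: "DeadPrimary", Keyspace: "commerce", Shard: "-80",
			StartedAt: base.Add(5 * time.Minute), Successful: false, Steps: []string{"detected"}}},
	}
}

func main() {
	for _, r := range mergeRecoveries(sampleData()) {
		fmt.Printf("%s %s %s/%s on %s success=%t steps=%v\n",
			r.StartedAt.Format(time.RFC3339), r.Problem, r.Keyspace, r.Shard,
			r.Instance, r.Successful, r.Steps)
	}
}
```

The merge itself is the easy part. The open questions from the list above remain how the instances are discovered (e.g. via the topology) and how each instance persists its own recovery data.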
Thoughts?
Use Case(s)
As a Vitess operator, I want to be able to easily check whether vtorc detected and attempted to recover from a problem, and see the steps taken during that recovery.