Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve VTOrc config handling to support dynamic variables #17218

Merged
merged 23 commits into from
Dec 3, 2024
Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
b3e4aec
feat: remove deprecated flag
GuptaManan100 Nov 12, 2024
85913d1
feat: remove config file flag and start using viper instead for insta…
GuptaManan100 Nov 12, 2024
0a27ba8
feat: move prevent cross cell promotion as well to viper
GuptaManan100 Nov 12, 2024
c6c89fb
feat: add sqldatafile, snapshot time and reasonable replication log t…
GuptaManan100 Nov 12, 2024
7b86cb3
feat: add audit file location to viper
GuptaManan100 Nov 12, 2024
1ab5cc3
feat: move remaining audit flags to viper
GuptaManan100 Nov 12, 2024
ec2ef87
feat: move remaining configs to viper
GuptaManan100 Nov 12, 2024
2121e16
feat: add api to read config and add a test for dynamic configuration
GuptaManan100 Nov 12, 2024
0420ce5
feat: add summary changes
GuptaManan100 Nov 12, 2024
897ab10
feat: fix vtorc config in local example
GuptaManan100 Nov 13, 2024
d20979b
feat: change way we use config in the e2e tests
GuptaManan100 Nov 21, 2024
d5923cd
feat: add two more fields to viper
GuptaManan100 Nov 21, 2024
7ef3b3b
Merge remote-tracking branch 'upstream/main' into vtorc-config-reload…
GuptaManan100 Nov 21, 2024
6822752
feat: fix flag in config
GuptaManan100 Nov 22, 2024
8b1e52b
test: start adding dynamic config tests in e2e fashion
GuptaManan100 Nov 24, 2024
3a422fa
test: add all the remaining configs to the test
GuptaManan100 Nov 24, 2024
ed96f72
feat: change the keys of the viper configs to match the flag names
GuptaManan100 Nov 25, 2024
9f14d8b
feat: add logging for configuration changes
GuptaManan100 Nov 25, 2024
b63d4f8
feat: fix config in examples and update summary
GuptaManan100 Nov 26, 2024
a939288
Merge remote-tracking branch 'upstream/main' into vtorc-config-reload…
GuptaManan100 Nov 27, 2024
cb16b34
Merge remote-tracking branch 'upstream/main' into vtorc-config-reload…
GuptaManan100 Dec 3, 2024
a9761f5
feat: address review comments
GuptaManan100 Dec 3, 2024
85fbc91
feat: don't use deleted flag in vtorc
GuptaManan100 Dec 3, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 23 additions & 1 deletion changelog/22.0/22.0.0/summary.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
- **[Major Changes](#major-changes)**
- **[RPC Changes](#rpc-changes)**
- **[Prefer not promoting a replica that is currently taking a backup](#reparents-prefer-not-backing-up)**
- **[VTOrc Config File Changes](#vtorc-config-file-changes)**


## <a id="major-changes"/>Major Changes</a>
Expand All @@ -25,4 +26,25 @@ For planned reparents, hosts taking backups with a backup engine other than `bui
valid candidates. This means they will never get promoted - not even if there's no other candidates.

Note that behavior for `builtin` backups remains unchanged: a replica that is currently taking a `builtin` backup will
never be promoted, neither by planned nor by emergency reparents.
never be promoted, neither by planned nor by emergency reparents.

### <a id="vtorc-config-file-changes"/>VTOrc Config File Changes</a>

GuptaManan100 marked this conversation as resolved.
Show resolved Hide resolved
The configuration file for VTOrc has been updated to now support dynamic fields. The old `--config` parameter has been removed. The alternative is to use the `--config-file` parameter. The configuration can now be provided in both json, yaml or any other format that [viper](https://github.com/spf13/viper) supports.

The following fields can be dynamically changed -
1. `instance-poll-time`
2. `prevent-cross-cell-failover`
3. `snapshot-topology-interval`
4. `reasonable-replication-lag`
5. `audit-to-backend`
6. `audit-to-syslog`
7. `audit-purge-duration`
8. `wait-replicas-timeout`
9. `tolerable-replication-lag`
10. `topo-information-refresh-duration`
11. `recovery-poll-duration`
12. `allow-emergency-reparent`
13. `change-tablets-with-errant-gtid-to-drained`

To upgrade to the newer version of the configuration file, first switch to using the flags in your current deployment before upgrading. Then you can switch to using the configuration file in the newer release.
4 changes: 3 additions & 1 deletion examples/common/scripts/vtorc-up.sh
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,9 @@ vtorc \
$TOPOLOGY_FLAGS \
--logtostderr \
--alsologtostderr \
--config="${script_dir}/../vtorc/config.json" \
--config-path="${script_dir}/../vtorc/" \
--config-name="config.yaml" \
--config-type="yml" \
--port $port \
> "${log_dir}/vtorc.out" 2>&1 &

Expand Down
4 changes: 0 additions & 4 deletions examples/common/vtorc/config.json

This file was deleted.

1 change: 1 addition & 0 deletions examples/common/vtorc/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
instance-poll-time: 1s
GuptaManan100 marked this conversation as resolved.
Show resolved Hide resolved
16 changes: 4 additions & 12 deletions go/cmd/vtorc/cli/cli.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ import (
"github.com/spf13/cobra"

"vitess.io/vitess/go/acl"
"vitess.io/vitess/go/viperutil/debug"
"vitess.io/vitess/go/vt/log"
"vitess.io/vitess/go/vt/servenv"
"vitess.io/vitess/go/vt/vtorc/config"
Expand All @@ -29,8 +30,7 @@ import (
)

var (
configFile string
Main = &cobra.Command{
Main = &cobra.Command{
Use: "vtorc",
Short: "VTOrc is the automated fault detection and repair tool in Vitess.",
Example: `vtorc \
Expand All @@ -51,22 +51,16 @@ var (

func run(cmd *cobra.Command, args []string) {
servenv.Init()
config.UpdateConfigValuesFromFlags()
inst.RegisterStats()

log.Info("starting vtorc")
if len(configFile) > 0 {
config.ForceRead(configFile)
} else {
config.Read("/etc/vtorc.conf.json", "conf/vtorc.conf.json", "vtorc.conf.json")
}
if config.Config.AuditToSyslog {
if config.GetAuditToSyslog() {
inst.EnableAuditSyslog()
}
config.MarkConfigurationLoaded()

// Log final config values to debug if something goes wrong.
config.LogConfigValues()
log.Infof("Running with Configuration - %v", debug.AllSettings())
server.StartVTOrcDiscovery()

server.RegisterVTOrcAPIEndpoints()
Expand Down Expand Up @@ -96,7 +90,5 @@ func init() {
servenv.MoveFlagsToCobraCommand(Main)

logic.RegisterFlags(Main.Flags())
config.RegisterFlags(Main.Flags())
acl.RegisterFlags(Main.Flags())
Main.Flags().StringVar(&configFile, "config", "", "config file name")
}
1 change: 0 additions & 1 deletion go/flags/endtoend/vtorc.txt
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,6 @@ Flags:
--catch-sigpipe catch and ignore SIGPIPE on stdout and stderr if specified
--change-tablets-with-errant-gtid-to-drained Whether VTOrc should be changing the type of tablets with errant GTIDs to DRAINED
--clusters_to_watch strings Comma-separated list of keyspaces or keyspace/shards that this instance will monitor and repair. Defaults to all clusters in the topology. Example: "ks1,ks2/-80"
--config string config file name
--config-file string Full path of the config file (with extension) to use. If set, --config-path, --config-type, and --config-name are ignored.
--config-file-not-found-handling ConfigFileNotFoundHandling Behavior when a config file is not found. (Options: error, exit, ignore, warn) (default warn)
--config-name string Name of the config file (without extension) to search for. (default "vtconfig")
Expand Down
1 change: 0 additions & 1 deletion go/test/endtoend/cluster/cluster_process.go
Original file line number Diff line number Diff line change
Expand Up @@ -1301,7 +1301,6 @@ func (cluster *LocalProcessCluster) NewVTOrcProcess(config VTOrcConfiguration) *
VtctlProcess: *base,
LogDir: cluster.TmpDirectory,
Config: config,
WebPort: cluster.GetAndReservePort(),
Port: cluster.GetAndReservePort(),
}
}
Expand Down
57 changes: 35 additions & 22 deletions go/test/endtoend/cluster/vtorc_process.go
Original file line number Diff line number Diff line change
Expand Up @@ -43,20 +43,28 @@ type VTOrcProcess struct {
ExtraArgs []string
ConfigPath string
Config VTOrcConfiguration
WebPort int
NoOverride bool
proc *exec.Cmd
exit chan error
}

type VTOrcConfiguration struct {
Debug bool
ListenAddress string
RecoveryPeriodBlockSeconds int
TopologyRefreshSeconds int `json:",omitempty"`
PreventCrossDataCenterPrimaryFailover bool `json:",omitempty"`
LockShardTimeoutSeconds int `json:",omitempty"`
ReplicationLagQuery string `json:",omitempty"`
FailPrimaryPromotionOnLagMinutes int `json:",omitempty"`
InstancePollTime string `json:"instance-poll-time,omitempty"`
SnapshotTopologyInterval string `json:"snapshot-topology-interval,omitempty"`
PreventCrossCellFailover bool `json:"prevent-cross-cell-failover,omitempty"`
ReasonableReplicationLag string `json:"reasonable-replication-lag,omitempty"`
AuditToBackend bool `json:"audit-to-backend,omitempty"`
AuditToSyslog bool `json:"audit-to-syslog,omitempty"`
AuditPurgeDuration string `json:"audit-purge-duration,omitempty"`
WaitReplicasTimeout string `json:"wait-replicas-timeout,omitempty"`
TolerableReplicationLag string `json:"tolerable-replication-lag,omitempty"`
TopoInformationRefreshDuration string `json:"topo-information-refresh-duration,omitempty"`
RecoveryPollDuration string `json:"recovery-poll-duration,omitempty"`
AllowEmergencyReparent string `json:"allow-emergency-reparent,omitempty"`
ChangeTabletsWithErrantGtidToDrained bool `json:"change-tablets-with-errant-gtid-to-drained,omitempty"`
LockShardTimeoutSeconds int `json:",omitempty"`
ReplicationLagQuery string `json:",omitempty"`
FailPrimaryPromotionOnLagMinutes int `json:",omitempty"`
}

// ToJSONString will marshal this configuration as JSON
Expand All @@ -65,12 +73,12 @@ func (config *VTOrcConfiguration) ToJSONString() string {
return string(b)
}

func (config *VTOrcConfiguration) AddDefaults(webPort int) {
config.Debug = true
if config.RecoveryPeriodBlockSeconds == 0 {
config.RecoveryPeriodBlockSeconds = 1
}
config.ListenAddress = fmt.Sprintf(":%d", webPort)
func (config *VTOrcConfiguration) addValuesToCheckOverride() {
config.InstancePollTime = "10h"
}

func (orc *VTOrcProcess) RewriteConfiguration() error {
return os.WriteFile(orc.ConfigPath, []byte(orc.Config.ToJSONString()), 0644)
}

// Setup starts orc process with required arguements
Expand All @@ -91,7 +99,9 @@ func (orc *VTOrcProcess) Setup() (err error) {
orc.ConfigPath = configFile.Name()

// Add the default configurations and print them out
orc.Config.AddDefaults(orc.WebPort)
if !orc.NoOverride {
orc.Config.addValuesToCheckOverride()
}
log.Errorf("configuration - %v", orc.Config.ToJSONString())
_, err = configFile.WriteString(orc.Config.ToJSONString())
if err != nil {
Expand All @@ -111,15 +121,18 @@ func (orc *VTOrcProcess) Setup() (err error) {
"--topo_implementation", orc.TopoImplementation,
"--topo_global_server_address", orc.TopoGlobalAddress,
"--topo_global_root", orc.TopoGlobalRoot,
"--config", orc.ConfigPath,
"--config-file", orc.ConfigPath,
"--port", fmt.Sprintf("%d", orc.Port),
// This parameter is overriden from the config file, added here to just verify that we indeed use the config file paramter over the flag
"--recovery-period-block-duration", "10h",
"--instance-poll-time", "1s",
// Faster topo information refresh speeds up the tests. This doesn't add any significant load either
"--topo-information-refresh-duration", "3s",
"--bind-address", "127.0.0.1",
)
if !orc.NoOverride {
orc.proc.Args = append(orc.proc.Args,
// This parameter is overriden from the config file. This verifies that we indeed use the flag value over the config file.
"--instance-poll-time", "1s",
// Faster topo information refresh speeds up the tests. This doesn't add any significant load either.
"--topo-information-refresh-duration", "3s",
)
}

if *isCoverage {
orc.proc.Args = append(orc.proc.Args, "--test.coverprofile="+getCoveragePath("orc.out"))
Expand Down
3 changes: 1 addition & 2 deletions go/test/endtoend/vtorc/api/api_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -35,8 +35,7 @@ import (
func TestAPIEndpoints(t *testing.T) {
defer cluster.PanicHandler(t)
utils.SetupVttabletsAndVTOrcs(t, clusterInfo, 2, 1, nil, cluster.VTOrcConfiguration{
PreventCrossDataCenterPrimaryFailover: true,
RecoveryPeriodBlockSeconds: 5,
PreventCrossCellFailover: true,
}, 1, "")
keyspace := &clusterInfo.ClusterInstance.Keyspaces[0]
shard0 := &keyspace.Shards[0]
Expand Down
Loading
Loading