pkg/cmd/roachtest: Add large schema test for backup/restore. #139047
Conversation
Michael, if this looks good to you, I'll see what Faizan thinks. I tried running this last night; backup/restore worked at 10k tables with no problem.
dest := destinationName(c)
uri := `gs://` + backupTestingBucket + `/` + dest + `?AUTH=implicit`
t.L().Printf("Backing up to %s\n", uri)
what exactly is the check box we're seeking to fill? Running a single full backup on a cluster with 10k tables seems straightforward. We could begin pushing the system in a variety of ways:
- run x incremental backups
- run schema changes during the backup process (maybe this workload already does this?)
- run backups with revision history

As an aside, it seems more realistic to run the backup(s) while the main workload is running. Instead of running backups manually after the fact, could we instead enable automatic scheduled backups? And then run a restore after the workload finishes?
https://github.com/msbutler/cockroach/blob/butler-ldr-init-scan-roachtest/pkg/cmd/roachtest/tests/large_schema_benchmark.go#L150
If foundations wants certain variants of the benchmark free of backups, perhaps we should add a different flavor of this test, withBackups, via the args to registerLargeSchemaBenchmark (sketched below).
> As an aside, it seems more realistic to run the backup(s) while the main workload is running. Instead of running backups manually after the fact, could we instead enable automatic scheduled backups? And then run a restore after the workload finishes?

Let's do this and run hourly incrementals, daily fulls, and restore from one of the hourly incrementals.
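Sketch of what that could look like in the test, reusing the collection uri built above; the schedule label, the tpcc database name, and new_db_name are illustrative stand-ins for whatever the workload actually creates:

// Create a schedule: hourly incrementals on top of daily fulls, starting
// immediately, so backups run while the main workload is going.
_, err := conn.Exec(fmt.Sprintf(
	`CREATE SCHEDULE large_schema_backup FOR BACKUP INTO '%s'
	 RECURRING '@hourly' FULL BACKUP '@daily'
	 WITH SCHEDULE OPTIONS first_run = 'now'`, uri))
require.NoError(t, err)

// ... run the main workload to completion ...

// Restore from the latest backup in the collection, which will normally
// be one of the hourly incrementals (RESTORE replays the full plus the
// incremental chain up to it).
_, err = conn.Exec(fmt.Sprintf(
	`RESTORE DATABASE tpcc FROM LATEST IN '%s' WITH new_db_name = 'tpcc_restored'`, uri))
require.NoError(t, err)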
Force-pushed from cbcc726 to d01adb0.
Wrote a new test based on our discussion last week, PTAL?

Informs: cockroachdb#138747
Release note: None
// and multi-region).
func registerLargeSchemaBackupRestores(r registry.Registry) {
	// 10k is probably the upper limit right now, but we haven't tested further.
	for _, scale := range []int{1000, 5000, 10000} {
why do we need 3 new roachtests? couldn't we just have one with 10k?
)

// registerLargeSchemaBackupRestores registers all permutations of
// multi-region large schema benchmarking (for different scales
stale comment?
	spec.GCEVolumeType("pd-ssd"),
	spec.GCEMachineType("n2-standard-8"),
}
testTimeout := 19 * time.Hour
how long does this test actually take to run?
// quickly, on MR the job retention can slow things down. Let's
// minimize how long jobs are kept, so that the creation / ingest
// completes in a reasonable amount of time.
_, err := conn.Exec("SET CLUSTER SETTING jobs.retention_time='1h'")
since this isn't MR, do we need this?
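If it only matters for the MR variants, maybe gate it on that; a rough sketch, where the isMultiRegion flag is hypothetical:

// Sketch: only shorten job retention on multi-region clusters, where
// retained job records were observed to slow down creation / ingest.
if isMultiRegion {
	_, err := conn.Exec("SET CLUSTER SETTING jobs.retention_time='1h'")
	require.NoError(t, err)
}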
_, err = conn.Exec("SET CLUSTER SETTING kv.transaction.internal.max_auto_retries=500") | ||
require.NoError(t, err) | ||
// Create a user that will be used for authentication for the REST | ||
// API calls. |
is this necessary?
// Create all the databases
importConcurrencyLimit := 32
options := tpccOptions{
	WorkloadCmd: "tpccmultidb",
I think you can leave WorkloadCmd empty, as we're not setting up MR TPCC.
testUser := fmt.Sprintf("test-user-%d", i)
_, err = conn.Exec(fmt.Sprintf("CREATE USER %q", testUser))
require.NoError(t, err)
cmd := fmt.Sprintf("%s %q", grantPreamble, testUser)
does this grant command create N new descriptors or does it modify N descriptors? Any way to verify that we're indeed conducting N descriptor changes?
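For what it's worth, privileges are stored on the object's descriptor, so a grant across N tables should modify (version-bump) N existing descriptors rather than create new ones. One rough way to sanity-check from the test, assuming the bracketed SHOW source works here as it does for other SHOW statements:

// Count the grant rows recorded for this user; this should scale with
// the number of objects the GRANT touched.
var grants int
err = conn.QueryRow(fmt.Sprintf(
	`SELECT count(*) FROM [SHOW GRANTS FOR %q]`, testUser)).Scan(&grants)
require.NoError(t, err)
t.L().Printf("%s has %d grant rows", testUser, grants)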
spec.WorkloadNode(),
spec.WorkloadNodeCPU(8),
spec.VolumeSize(800),
spec.GCEVolumeType("pd-ssd"),
how big is this cluster once set up? do we need pd-ssds, or can we use cheap local SSDs (375 GB per node)?
setupTPCC(ctx, t, t.L(), c, options)

// The backup testing bucket is single-region, because that's cheaper.
// So skip this test in multi-region clusters, to avoid the expense of
stale comment about multi region clusters?
	require.NoError(t, err)
}

//time.Sleep(600 * time.Second)
stale comment
Informs: #138747
Release note: None