
pkg/cmd/roachtest: Add large schema test for backup/restore. #139047

Open · wants to merge 3 commits into master
Conversation

benbardin (Collaborator):

Informs: #138747
Release note: None

@benbardin benbardin requested a review from msbutler January 14, 2025 17:19
@cockroach-teamcity (Member):

This change is Reviewable

@benbardin (Collaborator, Author):

Michael, if this looks good to you, I'll see what Faizan thinks.

I tried running this last night; backup/restore worked at 10k tables with no problem.


dest := destinationName(c)
uri := `gs://` + backupTestingBucket + `/` + dest + `?AUTH=implicit`
t.L().Printf("Backing up to %s\n", uri)
Collaborator:

what exactly is the checkbox we're seeking to fill? Running a single full backup on a cluster with 10k tables seems straightforward. We could begin pushing the system in a variety of ways:

  • run x incremental backups
  • run schema changes during the backup process (maybe this workload already does this?)
  • run backups with revision history

As an aside, it seems more realistic to run the backup(s) while the main workload is running. Instead of running backups manually after the fact, could we instead enable automatic scheduled backups? And then run a restore after the workload finishes?
https://github.com/msbutler/cockroach/blob/butler-ldr-init-scan-roachtest/pkg/cmd/roachtest/tests/large_schema_benchmark.go#L150

If foundations wants certain variants of the benchmark free of backups, perhaps we should add a different flavor of this test withBackups via the args to registerLargeSchemaBenchmark.

@alicia-l2 (Jan 15, 2025):

> As an aside, it seems more realistic to run the backup(s) while the main workload is running. Instead of running backups manually after the fact, could we instead enable automatic scheduled backups? And then run a restore after the workload finishes?

Let's do this and run hourly incrementals, daily fulls, and restore from one of the hourly incrementals
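A minimal sketch of that cadence in CockroachDB SQL (the schedule label and bucket path below are placeholders, not values from this PR):

```sql
-- Daily full backups with hourly incrementals layered on top,
-- plus revision history, writing to a GCS bucket with implicit auth.
CREATE SCHEDULE IF NOT EXISTS large_schema_backup
  FOR BACKUP INTO 'gs://<backup-testing-bucket>/<dest>?AUTH=implicit'
  WITH revision_history
  RECURRING '@hourly'
  FULL BACKUP '@daily'
  WITH SCHEDULE OPTIONS first_run = 'now';
```

The restore step after the workload finishes could then read from the most recent chain, e.g. `RESTORE FROM LATEST IN 'gs://<backup-testing-bucket>/<dest>?AUTH=implicit'`.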

@benbardin benbardin force-pushed the restore10k branch 2 times, most recently from cbcc726 to d01adb0, on January 21, 2025 at 20:04
@benbardin (Collaborator, Author):

Wrote a new test based on our discussion last week, PTAL?

@benbardin benbardin changed the title from "pkg/cmd/roachtest: Add Backup/Restore to large schema test" to "pkg/cmd/roachtest: Add large schema test for backup/restore." on Jan 21, 2025
// and multi-region).
func registerLargeSchemaBackupRestores(r registry.Registry) {
// 10k is probably the upper limit right now, but we haven't tested further.
for _, scale := range []int{1000, 5000, 10000} {
Collaborator:

why do we need 3 new roachtests? couldn't we just have one with 10k?

)

// registerLargeSchemaBackupRestores registers all permutations of
// multi-region large schema benchmarking (for different scales
Collaborator:

stale comment?

spec.GCEVolumeType("pd-ssd"),
spec.GCEMachineType("n2-standard-8"),
}
testTimeout := 19 * time.Hour
Collaborator:

how long does this test actually take to run?

// quickly, on MR the job retention can slow things down. Let's
// minimize how long jobs are kept, so that the creation / ingest
// completes in a reasonable amount of time.
_, err := conn.Exec("SET CLUSTER SETTING jobs.retention_time='1h'")
Collaborator:

since this isn't MR, do we need this?

_, err = conn.Exec("SET CLUSTER SETTING kv.transaction.internal.max_auto_retries=500")
require.NoError(t, err)
// Create a user that will be used for authentication for the REST
// API calls.
Collaborator:

is this necessary?

// Create all the databases
importConcurrencyLimit := 32
options := tpccOptions{
WorkloadCmd: "tpccmultidb",
Collaborator:

I think you can leave WorkloadCmd empty, as we're not setting up MR TPCC

testUser := fmt.Sprintf("test-user-%d", i)
_, err = conn.Exec(fmt.Sprintf("CREATE USER %q", testUser))
require.NoError(t, err)
cmd := fmt.Sprintf("%s %q", grantPreamble, testUser)
Collaborator:

does this grant command create N new descriptors or does it modify N descriptors? Any way to verify that we're indeed conducting N descriptor changes?
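One possible way to spot-check this, assuming the hidden `crdb_internal_mvcc_timestamp` system column on `system.descriptor` (available in recent CockroachDB versions); the timestamp placeholder is something the test would capture, not a value from this PR:

```sql
-- Take a logical timestamp before issuing the GRANT ...
SELECT cluster_logical_timestamp();

-- ... then, after the GRANT, count descriptors rewritten since that point.
-- If the GRANT touched N descriptors, this should return N.
SELECT count(*)
  FROM system.descriptor
 WHERE crdb_internal_mvcc_timestamp > <timestamp captured above>;
```

This only shows that descriptors were rewritten, not what changed; a `SHOW GRANTS` per table would confirm the privilege itself landed.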

spec.WorkloadNode(),
spec.WorkloadNodeCPU(8),
spec.VolumeSize(800),
spec.GCEVolumeType("pd-ssd"),
Collaborator:

how big is this cluster once set up? Do we need pd-ssds, or can we use cheap local SSDs (375 GB per node)?

setupTPCC(ctx, t, t.L(), c, options)

// The backup testing bucket is single-region, because that's cheaper.
// So skip this test in multi-region clusters, to avoid the expense of
Collaborator:

stale comment about multi region clusters?

require.NoError(t, err)
}

//time.Sleep(600 * time.Second)
Collaborator:

stale comment
