Always allocate regions to different sleds #7382

Merged: 11 commits merged into oxidecomputer:main from the proper_region_redundancy branch on Jan 23, 2025

Conversation

@jmpesp (Contributor) commented on Jan 22, 2025:

When performing region allocation again for a particular volume in order to increase its redundancy, the region allocation query was not excluding sleds already used by the existing allocation from the candidates for the new region (note that this only matters when the allocation strategy mandates distinct sleds!). As a result, region replacement and region snapshot replacement could allocate the replacement region onto the same sled as an existing region.

This commit fixes the region allocation query, and then fixes the fallout: many tests of region replacement or region snapshot replacement were not creating enough simulated sled agents, and therefore could not satisfy the fixed region allocation query. The extra_sled_agents parameter added in #7353 was used, along with changing the DiskTestBuilder to simulate one zpool on each of many sleds.

Changing these replacement tests revealed the next problem: because all the simulated sleds share the same IP (::1) and the same range of ports for the fake Crucible agents, each region allocation would have the same target, and the volume construction request would look like:

"target": [
  "[::1]:1100",
  "[::1]:1100",
  "[::1]:1100"
]

This is very confusing for the replacement routines!

We can't change the simulated sleds' IPs: they need to bind to localhost to listen for requests. Instead, this commit adds the idea of a "sled index", which is used to give each sled its own unique range of Crucible agent ports.

This led straight to the next problem: the simulated Pantry assumes that there is only one simulated sled-agent, and that it has all the storage! This is no longer true in a post-#7353 world. Snapshots could not complete because the simulated Pantry would only find one of the three regions when searching the first simulated sled-agent's storage.

The real Pantry would construct a Crucible Upstairs and delegate snapshot requests to it, so this commit also adds a simulated Crucible Upstairs that is aware of each simulated sled agent's Storage and can therefore fulfill snapshot requests. Centralizing the place where all simulated sled agents (and Pantry servers!) are created makes this easy: the simulated Upstairs can be aware of all of the Storage objects, and the simulated Pantry can use that simulated Upstairs to take fake snapshots.
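A rough sketch of the shape this takes (all type and function names below are hypothetical stand-ins, not the actual code in this PR): the simulated Upstairs holds a handle to every simulated sled agent's storage and searches all of them when asked to take a snapshot, instead of only looking at the first sled agent.

    use std::collections::HashSet;
    use std::sync::{Arc, Mutex};

    // Stand-in for the simulated sled agent's Storage.
    #[derive(Default)]
    struct Storage {
        regions: Mutex<HashSet<String>>,
        snapshots: Mutex<Vec<(String, String)>>, // (region id, snapshot id)
    }

    // Every simulated sled agent (and simulated Pantry) is handed the same
    // instance, so a snapshot request can find a region no matter which sled
    // it was allocated to.
    #[derive(Default)]
    struct SimulatedUpstairs {
        storage: Mutex<Vec<Arc<Storage>>>,
    }

    impl SimulatedUpstairs {
        fn register_storage(&self, storage: Arc<Storage>) {
            self.storage.lock().unwrap().push(storage);
        }

        // Take a fake snapshot of each region in a volume, searching every
        // simulated sled agent's storage rather than only the first one.
        fn snapshot(&self, region_ids: &[String], snapshot_id: &str) {
            for region_id in region_ids {
                for storage in self.storage.lock().unwrap().iter() {
                    if storage.regions.lock().unwrap().contains(region_id) {
                        storage.snapshots.lock().unwrap().push((
                            region_id.clone(),
                            snapshot_id.to_string(),
                        ));
                    }
                }
            }
        }
    }

    fn main() {
        let upstairs = Arc::new(SimulatedUpstairs::default());

        // Three simulated sleds, each holding one region of the volume.
        let region_ids = ["r0", "r1", "r2"].map(|s| s.to_string());
        for region_id in &region_ids {
            let storage = Arc::new(Storage::default());
            storage.regions.lock().unwrap().insert(region_id.clone());
            upstairs.register_storage(storage);
        }

        upstairs.snapshot(&region_ids, "snap-1");
    }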

Fixes oxidecomputer/crucible#1594.

@jgallagher (Contributor) left a comment:

LGTM, just some nits and questions.

     GoneCheckError(#[source] Error),

     /// The retry loop progenitor operation saw a permanent client error
-    #[error("permanent error")]
+    #[error("permanent error: {0}")]
@jgallagher (Contributor):

Nit - this (and the same change a few lines up) is not right. Whatever is printing or logging errors should print the full source chain using something like anyhow's # formatting or slog-error-chain, at which point this will result in double printing of messages:

permanent error: some error: some error
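A minimal standalone illustration of that double printing, using thiserror and anyhow (per the formatting mentioned above); the error types here are stand-ins, not the PR's actual enum:

    use thiserror::Error;

    #[derive(Debug, Error)]
    #[error("some error")]
    struct Inner;

    #[derive(Debug, Error)]
    #[error("permanent error: {0}")] // message already embeds the source's text
    struct Permanent(#[source] Inner);

    fn main() {
        let err = anyhow::Error::from(Permanent(Inner));
        // anyhow's alternate ("#") formatting appends the source chain, so the
        // source's message shows up twice:
        //   permanent error: some error: some error
        println!("{err:#}");
        // With #[error("permanent error")] and no "{0}", this would print:
        //   permanent error: some error
    }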

@jmpesp (Contributor Author):

fixed in d508812

INNER JOIN
sled ON (sled.id = physical_disk.sled_id)
WHERE
zpool.id = ANY(SELECT pool_id FROM existing_zpools)
@jgallagher (Contributor):

TIL = ANY is equivalent to IN. (I would have written this as zpool.id IN (SELECT ...).)

"
existing_sleds AS (
SELECT
sled.id
@jgallagher (Contributor):

Could this be zpool.sled_id and then avoid the joins against physical_disk/sled?

@jmpesp (Contributor Author):

TIL that zpool has that field! 703e3c7

@@ -9,7 +9,7 @@ use syn::{parse_macro_input, ItemFn, Token};
 #[derive(Debug, PartialEq, Eq, Hash)]
 pub(crate) enum NexusTestArg {
     ServerUnderTest(syn::Path),
-    ExtraSledAgents(usize),
+    ExtraSledAgents(u16),
@jgallagher (Contributor):

This seems fine but out of curiosity - why the change from usize?

Edit: From later in the PR, it looks like it's because this is used to calculate a port number? I think I'd be inclined to keep this as a usize and map index -> port at the point where that's needed. But this isn't a big deal either way.

@jmpesp (Contributor Author):

I figured changing the type here would cut down on the conversions and checked arithmetic later, and also 65536 seemed like a more reasonable limit for the number of simulated sled agents :)

@@ -73,8 +73,11 @@ async fn test_sleds_list(cptestctx: &ControlPlaneTestContext) {
     log,
     addr,
     sa_id,
+    // Index starts at 2
@jgallagher (Contributor):

Might be worth expanding this slightly: "Index starts at 2 because our nexus_test config created 2 sled agents already" or something like that.

@jmpesp (Contributor Author):

👍 afe335d


for sled_agent in cptestctx.all_sled_agents() {
    let zpool_id =
        TypedUuid::from_untyped_uuid(db_read_only_dataset.pool_id);
@jgallagher (Contributor):

Nit - can we use the specific kind of Uuid here?

Suggested change:
-    TypedUuid::from_untyped_uuid(db_read_only_dataset.pool_id);
+    ZpoolUuid::from_untyped_uuid(db_read_only_dataset.pool_id);

@jmpesp (Contributor Author):

good idea, done in 474e085

@@ -207,7 +174,7 @@ impl SledAgent {
     updates: UpdateManager::new(config.updates.clone()),
     nexus_address,
     nexus_client,
-    disk_id_to_region_ids: Mutex::new(HashMap::new()),
+    simulated_upstairs: simulated_upstairs.clone(),
@jgallagher (Contributor):

Nit - can we drop this .clone()?

Suggested change:
-    simulated_upstairs: simulated_upstairs.clone(),
+    simulated_upstairs,

@jmpesp (Contributor Author):

yep - 7922496

self.next_crucible_port += 100;
assert!(self.next_crucible_port < 1000);
@jgallagher (Contributor):

I'm missing something here. In new() I see

next_crucible_port: (sled_index + 1) * 1000,

which looks like this assertion would always fail (since next_crucible_port starts at N-thousand for the Nth sled-agent)?

Similarly confused question: the comment says "10 per dataset", but next_crucible_port looks like it's incremented independently of datasets. Is it 10 per dataset or 10 total?

@jmpesp (Contributor Author):

I think both of us are missing something here! This worked for me when I ran the full test suite before opening this PR, but now I'm seeing

thread 'test_omdb_env_settings' panicked at sled-agent/src/sim/storage.rs:1604:9:
assertion failed: self.next_crucible_port < 1000

What should be asserted is that the next port minus the starting port is less than the range, which is done in d440dc0. That commit also changes the ranges so that there can be a maximum of 20 datasets/agents per sled (and therefore 50 regions/snapshots per agent), and it also clarifies the comment.
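For illustration, a sketch of the corrected bookkeeping (names are hypothetical and the constants are taken from this discussion rather than the exact code in d440dc0):

    // Each simulated sled owns the 1000-port block starting at
    // (sled_index + 1) * 1000; each fake Crucible agent takes 50 ports from
    // that block (so at most 20 agents/datasets per sled).
    const PORTS_PER_SLED: u16 = 1000;
    const PORTS_PER_AGENT: u16 = 50;

    struct CruciblePorts {
        start_port: u16,
        next_port: u16,
    }

    impl CruciblePorts {
        fn new(sled_index: u16) -> Self {
            let start = (sled_index + 1) * PORTS_PER_SLED;
            Self { start_port: start, next_port: start }
        }

        // Base port for the next simulated Crucible agent on this sled.
        fn next_agent_base(&mut self) -> u16 {
            // Assert on the offset within this sled's block; the old check
            // compared the absolute port against 1000, which cannot hold once
            // the base already starts at 1000 or above.
            assert!(self.next_port - self.start_port < PORTS_PER_SLED);
            let port = self.next_port;
            self.next_port += PORTS_PER_AGENT;
            port
        }
    }

    fn main() {
        let mut ports = CruciblePorts::new(1); // the second simulated sled
        assert_eq!(ports.next_agent_base(), 2000);
        assert_eq!(ports.next_agent_base(), 2050);
    }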

@jgallagher (Contributor):

Ahh, that makes sense. Thanks, the new comment is very helpful.

volume_construction_request: &VolumeConstructionRequest,
) {
let mut inner = self.inner.lock().unwrap();
// XXX what if existing entry? assert?
@jgallagher (Contributor):

Can callers "remap" IDs to new VCRs? If not, seems reasonable to either assert or return a Result, I think? (I guess a variant of this question would be: is it reasonable for a caller to call this multiple times with the same ID -> VCR mapping, and if so should that be okay?)

@jmpesp (Contributor Author):

Actually yeah, VCR changes happen all the time due to read-only parent removal and the various replacements. Updated this comment in da4aca6.
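A tiny sketch of the resulting overwrite semantics (the types and names below are stand-ins for the real Uuid and Crucible VolumeConstructionRequest, not the PR's code): remapping an existing ID simply replaces the old VCR, which is what read-only parent removal and the various replacements rely on.

    use std::collections::HashMap;
    use std::sync::Mutex;

    // Stand-in types for illustration only.
    type VolumeId = u32;
    type VolumeConstructionRequest = String;

    #[derive(Default)]
    struct VcrMap {
        inner: Mutex<HashMap<VolumeId, VolumeConstructionRequest>>,
    }

    impl VcrMap {
        fn map_id_to_vcr(&self, id: VolumeId, vcr: VolumeConstructionRequest) {
            // Overwriting an existing entry is intentional: a volume's VCR is
            // rewritten during read-only parent removal and during the various
            // replacements, so callers remap the same ID to the latest VCR.
            self.inner.lock().unwrap().insert(id, vcr);
        }
    }

    fn main() {
        let map = VcrMap::default();
        map.map_id_to_vcr(1, "original vcr".to_string());
        map.map_id_to_vcr(1, "vcr after region replacement".to_string());
        assert_eq!(
            map.inner.lock().unwrap().get(&1).unwrap().as_str(),
            "vcr after region replacement"
        );
    }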

@jmpesp enabled auto-merge (squash) January 23, 2025 15:21
@jmpesp merged commit 288eaf4 into oxidecomputer:main Jan 23, 2025
16 checks passed
@jmpesp deleted the proper_region_redundancy branch January 23, 2025 22:10
Successfully merging this pull request may close these issues.

Region allocation for replacement can select a replacement from the same sled as existing regions.