Sled Agent x Falcon: Use VMMs for Sled Agent testing #5226

smklein · 2024-03-07T23:03:45Z

TL;DR

It's time to introduce a test wrapper for Sled Agent tests to execute within a VMM.

Summary

Sled Agent tests originally used mocks to interface with the OS, intercepting calls to the system. Later, to a limited degree, they moved to fakes (see: #2422) that simulate the system. However, these tests still require a significant amount of plumbing, test-only interfaces, and constraints to execute correctly.

We'd benefit significantly from using a combination of falcon and nextest features to wrap "the ability to run your code in the context of a new, isolated VMM".

Background

Example job using falcon in CI: https://github.com/oxidecomputer/omicron/blob/commprobe/.github/buildomat/jobs/a4x2-deploy.sh
Falcon APIs to ...
- Spin up node: https://github.com/oxidecomputer/falcon/blob/1260f16acf6a021ee90696694badc8c735b16204/examples/solo/src/main.rs#L11-L15
- Mount test binaries: https://github.com/oxidecomputer/falcon/blob/1260f16acf6a021ee90696694badc8c735b16204/lib/src/lib.rs#L462-L469
- Run commands within new VM: https://github.com/oxidecomputer/falcon/blob/1260f16acf6a021ee90696694badc8c735b16204/lib/src/lib.rs#L595-L599
Nextest issue for per-test target runners (nextest-rs/nextest#1358), though we could use a regular target runner (https://nexte.st/book/target-runners.html) for now

Goal

Here's what would be a really nice end-state:

You write some code within Sled Agent which manipulates "global state" on your sled (e.g., managing disks, launching zones, manipulating dump devices, etc, -- whatever!)
In the same file where you want to write the code to perform these actions, you write a test like the following:

// Some function poking at global state, that you want to test.
pub async fn manage_system_state() { ... }

#[cfg(all(test, target_os = "illumos", feature = "vmm-test"))]
mod test {
  use super::*;

  #[vmm_test(config = default)]
  async fn my_test() {
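     let zones = std::process::Command::new("zoneadm")
         .arg("list")
         .output()
         .expect("failed to run zoneadm");
     // `stdout` is raw bytes, so convert it before printing.
     println!(
         "My own, test-specific set of zones in my VMM: {}",
         String::from_utf8_lossy(&zones.stdout),
     );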

     // Use your test code to manipulate the state of the system.
     manage_system_state().await;
  }

  #[vmm_test(config = default)]
  async fn my_other_test() {
     // Run in a separate VMM - no worry about conflicting with the state of "my_test".
     ...
  }
}

To run these tests, you should be able to run cargo nextest run, pointing specifically to this test target, and it should be possible to run them with a pfexec invocation, so the test runner can successfully launch VMMs.
- This exact command could be invoked via cargo xtask, and itself added to CI.
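As a rough illustration of the invocation described above, the following sketch builds (without running) the command line that a cargo xtask subcommand might spawn. The helper names `vmm_test_command` and `render` are hypothetical, and wrapping the whole cargo invocation in pfexec is only one assumed option; a nextest target runner could apply pfexec more narrowly.

```rust
use std::process::Command;

/// Builds, but does not run, the command a hypothetical xtask could use to
/// launch the VMM-backed tests with elevated privileges.
///
/// `package` and `feature` are assumptions supplied by the caller: the crate
/// holding the tests and the Cargo feature that gates them.
pub fn vmm_test_command(package: &str, feature: &str) -> Command {
    // pfexec executes the rest of the command line with the privileges
    // granted by the caller's illumos profile.
    let mut cmd = Command::new("pfexec");
    cmd.args(["cargo", "nextest", "run"])
        .args(["--package", package])
        .args(["--features", feature]);
    cmd
}

/// Renders the command as one string, e.g. to log it before spawning.
pub fn render(cmd: &Command) -> String {
    let mut parts = vec![cmd.get_program().to_string_lossy().into_owned()];
    parts.extend(cmd.get_args().map(|a| a.to_string_lossy().into_owned()));
    parts.join(" ")
}
```

Calling `vmm_test_command("omicron-sled-agent", "vmm-test")` and then `.status()` on the result would start the run; keeping construction separate from execution makes the command line easy to check in a unit test.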

Tasks

Create an attribute macro for vmm_tests, which spins up a node, mounts test binaries, and runs commands within the new VM.
- Extend this command with "config" options, to allow tests to specify "what their machine looks like". This should largely translate to calling into Falcon's API, though it would be nice to set some reasonable single-sled defaults.
- Extend this command to grab logs and other debug information from tests, so we can inspect system state on test failure.
- Ensure this test runner destroys VMMs on cleanup
- Consider optimizing this runner to "revert system state before the test started" if we want to re-use it between multiple tests.
Ensure that any tests using this framework are adequately labelled. For example, we could mark the tests as "ignored" to ensure that the vanilla cargo nextest run invocation is not broken when executed without adequate permissions.
Use or work around nextest-rs/nextest#1358 ("Add support for per-test target runners") to invoke pfexec from nextest, granting adequate permissions to the specific tests wanting to launch VMMs
Ensure these tests are run on the "lab environment" in CI
Migrate Sled Agent tests to use this framework. Good targets include the StorageManager, ServiceManager, and ZoneBundler tests, though there are many more viable candidates.
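One way to approach the "destroy VMMs on cleanup" and "grab logs on failure" tasks is an RAII guard that owns the VMM for the duration of a test. The sketch below is an assumption-laden illustration, not Falcon's API: the `Vmm` trait, `VmmGuard`, and `FakeVmm` are all hypothetical names, and a real implementation would wrap whatever handle Falcon returns.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a VMM handle; a real one would wrap Falcon.
pub trait Vmm {
    /// Gathers logs or other debug output from the guest.
    fn collect_logs(&mut self) -> Vec<String>;
    /// Tears the VMM down and releases its resources.
    fn destroy(&mut self);
}

/// Owns a VMM for one test and tears it down when dropped.
pub struct VmmGuard<V: Vmm> {
    vmm: V,
}

impl<V: Vmm> VmmGuard<V> {
    pub fn new(vmm: V) -> Self {
        Self { vmm }
    }

    /// Gives the test body access to the underlying VMM.
    pub fn vmm(&mut self) -> &mut V {
        &mut self.vmm
    }
}

impl<V: Vmm> Drop for VmmGuard<V> {
    fn drop(&mut self) {
        // Only collect logs when the test is unwinding from a panic,
        // i.e. when it is failing.
        if std::thread::panicking() {
            for line in self.vmm.collect_logs() {
                eprintln!("[vmm log] {line}");
            }
        }
        // Always destroy the VMM, whether the test passed or failed.
        self.vmm.destroy();
    }
}

/// In-memory fake that exercises the guard without launching anything.
pub struct FakeVmm {
    pub destroyed: Arc<AtomicBool>,
}

impl Vmm for FakeVmm {
    fn collect_logs(&mut self) -> Vec<String> {
        vec!["fake console output".to_string()]
    }

    fn destroy(&mut self) {
        self.destroyed.store(true, Ordering::SeqCst);
    }
}
```

Because cleanup lives in `Drop`, it runs on both the success path and the panic path, so a failing assertion in a test body would not leak a VMM. It does not cover a test process that is killed outright, which would need a separate sweep.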


karencfv · 2024-03-07T23:33:34Z

Oooohhhh nice! This will help with #4835

smklein · 2024-03-08T22:20:13Z

Chatted with @andrewjstone a little bit about this. We're thinking that "while a test attribute macro might be cool", it also would probably make more sense for tests to have a bit better "explicit control" over VMMs. That would let us do things like maybe "write test code that can cause reboots, and watch what happens".
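To make the "explicit control" idea concrete, here is a minimal sketch of a test written against a VMM handle that the test itself holds, rather than one hidden behind an attribute macro. Everything here is hypothetical: the `TestVmm` trait, the `state_survives_reboot` helper, and the `FakeGuest` with its toy command set exist only for illustration and are not part of Falcon or Omicron.

```rust
/// Hypothetical handle a test could hold to drive a VMM explicitly.
pub trait TestVmm {
    /// Runs a command in the guest and returns its stdout.
    fn exec(&mut self, cmd: &str) -> Result<String, String>;
    /// Reboots the guest and waits for it to come back.
    fn reboot(&mut self) -> Result<(), String>;
}

/// Test logic: change some state, reboot, and report whether the state
/// read back after the reboot matches what was read before it.
pub fn state_survives_reboot<V: TestVmm>(
    vmm: &mut V,
    write_cmd: &str,
    read_cmd: &str,
) -> Result<bool, String> {
    vmm.exec(write_cmd)?;
    let before = vmm.exec(read_cmd)?;
    vmm.reboot()?;
    let after = vmm.exec(read_cmd)?;
    Ok(before == after)
}

/// Toy guest: "persistent" state survives a reboot, "volatile" does not.
#[derive(Default)]
pub struct FakeGuest {
    persistent: String,
    volatile: String,
    pub boots: u32,
}

impl TestVmm for FakeGuest {
    fn exec(&mut self, cmd: &str) -> Result<String, String> {
        match cmd.split_once(' ') {
            Some(("set-persistent", v)) => {
                self.persistent = v.to_string();
                Ok(String::new())
            }
            Some(("set-volatile", v)) => {
                self.volatile = v.to_string();
                Ok(String::new())
            }
            None if cmd == "get-persistent" => Ok(self.persistent.clone()),
            None if cmd == "get-volatile" => Ok(self.volatile.clone()),
            _ => Err(format!("unknown command: {cmd}")),
        }
    }

    fn reboot(&mut self) -> Result<(), String> {
        // A reboot wipes volatile state and bumps the boot counter.
        self.volatile.clear();
        self.boots += 1;
        Ok(())
    }
}
```

Writing the test logic against a trait keeps the reboot step visible in the test body, and the same function could later run against a real VMM-backed implementation.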

smklein added the "Testing & Analysis" (Tests & Analyzers) and "Sled Agent" (Related to the Per-Sled Configuration and Management) labels Mar 7, 2024

smklein mentioned this issue Mar 7, 2024

[sled agent] Fakes are better than mocks; get rid of mocks #2422

