feat: Add self-managed debug tool #31282

SangJunBak · 2025-02-04T04:50:47Z

Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:

Collect pod logs
Collect info similar to kubectl get all -o wide
Up next will be getting events, describe output, rest of TODOs, and tests

I hope to merge all changes in a stack but decided to create a PR for early feedback.

Motivation

This PR adds a known-desirable feature.

https://github.com/MaterializeInc/database-issues/issues/8908

Tips for reviewer

Currently want to leave it all in one file and separate things out later
Curious if the amount of code for something like this is normal!
To quickly test, run cargo run --bin mz-self-managed-debug -- --kubernetes-context mzcloud-staging-us-east-1-0 --kubernetes-namespaces mz-balancer

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

src/self-managed-debug/Cargo.toml

src/self-managed-debug/src/main.rs

antiguru

Left some comments inline. Thanks for taking this on!

A high-level question I have is why we don't format the data as JSON (or another serialization framework). This would reduce complexity here as a lot of the types already have serde implementations, and we'd not accidentally lose information on the way.

We could then have a second binary that takes the serialized format and pretty-prints it for human consumption. (Or, send both files because it allows users to more easily verify what information we include.)

antiguru · 2025-02-04T10:08:26Z

src/self-managed-debug/Cargo.toml

+[package]
+name = "mz-self-managed-debug"
+description = "Debug tool for self-managed Materialize."
+version = "0.130.1"


You probably want the current version here and elsewhere (right now it's 0.133.0-dev.0, but you'll want to update it prior to merging.)

Also, to update it automatically, adjust line 41 and 61 in bin/bump-version so that the version gets updated automatically. The script takes care of calling bin/bazel gen to sync the bazel files.

Acc ik Marta mentioned we'd most likely want to decouple the tool from LTS versions/materialize https://materializeinc.slack.com/archives/C07PN7KSB0T/p1737567670003019?thread_ts=1737566344.547089&cid=C07PN7KSB0T, so we might want to make this just version 0.1.0. I think it's very doable that we can have the tool work for any version of materialize.

src/self-managed-debug/Cargo.toml

src/self-managed-debug/src/main.rs

antiguru · 2025-02-04T10:16:02Z

src/self-managed-debug/src/main.rs

+        {
+            let pod_name = pod.metadata.name.clone().unwrap_or_default();
+            eprintln!("Failed to process pod {}: {}", pod_name, e);
+        }


This might not do what you want: the ? operator returns from the enclosing function, not the block.

Ah but if we're referring to the usage of ? in lines 132-168, wouldn't it return from the block given it's an async block?

src/self-managed-debug/src/main.rs

alex-hunt-materialize

+1 to everything Moritz said.

src/self-managed-debug/src/main.rs

SangJunBak · 2025-02-04T16:50:59Z

After talking to the cloud team, decided to remove dump_k8s_get_all and can add it back later if we need to

Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can: - Collect pod logs - Collect info similar to `kubectl get all -o wide` - Up next will be getting events, describe output, and rest of TODOs.

Decided high level, human readable logs aren't useful for the cloud team.

- Rename Kubernetes-related CLI arguments to shortform - Make namespace argument required - Add error handling for non-existent namespaces

- Add timestamp-based directory structure for log files - Separate previous and current pod logs into distinct files - Introduce `Context` struct to track debug tool shared state

- Updates file path generation to use simplified directory structure for future zipping

- Removes unnecessary return values - Add s`Display` implementation for `K8sResourceType` - Improve logging and error handling for empty log/event scenarios

SangJunBak · 2025-02-10T06:56:59Z

src/self-managed-debug/src/main.rs

+    if !output.status.success() {
+        return Err(anyhow::anyhow!(
+            "{}",
+            String::from_utf8_lossy(&output.stderr)
+        ));
+    }


I noticed when we usually get an auth error here, per resource type, we get an error message like:

Failed to dump kubectl describe for deployments: The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile. E0210 01:49:49.975938 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255" The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile. E0210 01:49:50.491841 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255" The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile. E0210 01:49:50.934489 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255" The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile. E0210 01:49:51.425720 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255" The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile. E0210 01:49:51.899735 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255" Unable to connect to the server: getting credentials: exec: executable aws failed with exit code 255

wondering if it's worth pointing out, if we should truncate this, or if we should output this to an error log file?

That message is useful for debugging permissions in AWS. We should definitely show that to the user.

alex-hunt-materialize

We're still missing the most important thing I want: the object yaml.

alex-hunt-materialize · 2025-02-10T15:48:49Z

src/self-managed-debug/src/main.rs

+        Ok(client) => {
+            for namespace in context.args.k8s_namespaces.clone() {
+                let _ = match dump_k8s_pod_logs(&context, client.clone(), &namespace).await {
+                    Ok(_) => Some(()),


If we aren't doing anything with the output, we don't need to return an option.

if let Err(e) = match dump_k8s_pod_logs(&context, client.clone(), &namespace).await { // print the error }

alex-hunt-materialize · 2025-02-10T15:52:23Z

src/self-managed-debug/src/main.rs

+                            namespace, e
+                        );
+                        if let Some(_) = e.downcast_ref::<kube::Error>() {
+                            eprintln!("Ensure that {} exists.", namespace);


There are other reasons that you might get a kube error. If you are trying to handle the case that the pod doesn't exist, you can handle that in your match.

Err(kube::Error::Api(e)) if e.code == 404 => { // pod doesn't exist }

I'm not sure we care if the pod doesn't exist here, though.

alex-hunt-materialize · 2025-02-10T15:53:36Z

src/self-managed-debug/src/main.rs

+                    }
+                };
+
+                let _ = match dump_k8s_events(&context, client.clone(), &namespace).await {


See above about if let Err(e) instead.

alex-hunt-materialize · 2025-02-10T15:55:15Z

src/self-managed-debug/src/main.rs

+
+    for handle in describe_handles {
+        let _ = handle.await;
+    }


If you're just going to wait in order, you might as well just await immediately. You probably want join_all(describe_handles) instead.

alex-hunt-materialize · 2025-02-10T17:08:03Z

src/self-managed-debug/src/main.rs

+        .await
+        {
+            eprintln!("Failed to export {}: {}", &current_logs_file_name, e);
+        }


There is a lot of duplication here. Maybe pull this all out into a separate function that is called for both the previous and current logs?

alex-hunt-materialize · 2025-02-10T17:13:06Z

src/self-managed-debug/src/main.rs

+        }
+        .await
+        {
+            if let Some(kube::Error::Api(e)) = e.downcast_ref::<kube::Error>() {


This would probably be nicer with a three branch match statement:

match result.await { Ok(_) => {}, Err(kube::Error::Api(e)) if e.code == 400 => { eprintln!("No previous logs available for pod {}", pod_name); }, Err(e) => { eprintln!("Failed to export {}: {}", &previous_logs_file_name, e); }, }

I'm also surprised it's a 400, rather than a 404. It sucks that they aren't more specific here.

alex-hunt-materialize · 2025-02-10T17:18:41Z

src/self-managed-debug/src/main.rs

+    let file_path = format_resource_path(context.start_time, namespace, &K8sResourceType::Event);
+    let file_name = format!("{}/events.yaml", file_path);
+    create_dir_all(&file_path)?;
+    let mut file = File::options().append(true).create(true).open(&file_name)?;


Do we want append here? There shouldn't already be any file, and if there is, appending to any existing data would be surprising.

Ah at this stage, I assume there isn't already a file. But I added append(true) such that serde_yaml::to_writer(&mut file, &event)?; appends per event, which seems like what we want! Lemme know if there's a better way to achieve what I want tho

alex-hunt-materialize · 2025-02-10T17:21:49Z

src/self-managed-debug/src/main.rs

+    }
+
+    let file_path = format_resource_path(context.start_time, namespace, resource_type);
+    let file_name = format!("{}/describe.txt", file_path);


It's not super important, but for cross-platform stuff, you should probably use PathBuf and Path instead of formatting Strings.

SangJunBak requested a review from a team February 4, 2025 04:50

SangJunBak requested a review from ParkMyCar as a code owner February 4, 2025 04:50

SangJunBak commented Feb 4, 2025

View reviewed changes

src/self-managed-debug/Cargo.toml Show resolved Hide resolved

SangJunBak commented Feb 4, 2025

View reviewed changes

src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved

SangJunBak force-pushed the jun/#8908/debug-tool branch from 63632d2 to a6465db Compare February 4, 2025 04:54

SangJunBak removed the request for review from ParkMyCar February 4, 2025 05:11

SangJunBak force-pushed the jun/#8908/debug-tool branch from a6465db to 2460e24 Compare February 4, 2025 05:44

SangJunBak requested review from a team, aljoscha and morsapaes as code owners February 4, 2025 05:44

SangJunBak changed the base branch from lts-v0.130 to main February 4, 2025 05:45

SangJunBak removed request for a team, aljoscha and morsapaes February 4, 2025 05:45

antiguru reviewed Feb 4, 2025

View reviewed changes

alex-hunt-materialize reviewed Feb 4, 2025

View reviewed changes

SangJunBak added 3 commits February 7, 2025 17:59

feat: Add self-managed debug tool

8826a87

Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can: - Collect pod logs - Collect info similar to `kubectl get all -o wide` - Up next will be getting events, describe output, and rest of TODOs.

Clean up version and imports

6c75e40

Remove get_all command

1e186a4

Decided high level, human readable logs aren't useful for the cloud team.

SangJunBak added 2 commits February 7, 2025 18:00

Improve UI of tool

9288755

- Rename Kubernetes-related CLI arguments to shortform - Make namespace argument required - Add error handling for non-existent namespaces

Enhance log collection with improved file naming and organization

7dbfff0

- Add timestamp-based directory structure for log files - Separate previous and current pod logs into distinct files - Introduce `Context` struct to track debug tool shared state

SangJunBak force-pushed the jun/#8908/debug-tool branch from e9d47e9 to 7dbfff0 Compare February 7, 2025 23:02

SangJunBak added 3 commits February 9, 2025 16:18

Add Kubernetes events collection to self-managed debug tool

115fd52

- Updates file path generation to use simplified directory structure for future zipping

Refactor self-managed debug tool log and event collection

e5f6da5

- Removes unnecessary return values - Add s`Display` implementation for `K8sResourceType` - Improve logging and error handling for empty log/event scenarios

Dump kubectl describe output

1e4db41

SangJunBak requested review from alex-hunt-materialize and antiguru February 10, 2025 06:51

SangJunBak commented Feb 10, 2025

View reviewed changes

alex-hunt-materialize reviewed Feb 10, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add self-managed debug tool #31282

feat: Add self-managed debug tool #31282

SangJunBak commented Feb 4, 2025 •

edited

Loading

antiguru left a comment

antiguru Feb 4, 2025

SangJunBak Feb 6, 2025

antiguru Feb 4, 2025

SangJunBak Feb 4, 2025

alex-hunt-materialize left a comment

SangJunBak commented Feb 4, 2025

SangJunBak Feb 10, 2025

alex-hunt-materialize Feb 10, 2025

alex-hunt-materialize left a comment

alex-hunt-materialize Feb 10, 2025

alex-hunt-materialize Feb 10, 2025

alex-hunt-materialize Feb 10, 2025

alex-hunt-materialize Feb 10, 2025

alex-hunt-materialize Feb 10, 2025

alex-hunt-materialize Feb 10, 2025

alex-hunt-materialize Feb 10, 2025

SangJunBak Feb 10, 2025 •

edited

Loading

alex-hunt-materialize Feb 10, 2025

feat: Add self-managed debug tool #31282

Are you sure you want to change the base?

feat: Add self-managed debug tool #31282

Conversation

SangJunBak commented Feb 4, 2025 • edited Loading

Motivation

Tips for reviewer

Checklist

antiguru left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alex-hunt-materialize left a comment

Choose a reason for hiding this comment

SangJunBak commented Feb 4, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alex-hunt-materialize left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SangJunBak Feb 10, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SangJunBak commented Feb 4, 2025 •

edited

Loading

SangJunBak Feb 10, 2025 •

edited

Loading