Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add self-managed debug tool #31282

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Conversation

SangJunBak
Copy link
Contributor

@SangJunBak SangJunBak commented Feb 4, 2025

Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:

  • Collect pod logs
  • Collect info similar to kubectl get all -o wide
  • Up next will be getting events, describe output, rest of TODOs, and tests

I hope to merge all changes in a stack but decided to create a PR for early feedback.

Motivation

  • This PR adds a known-desirable feature.

https://github.com/MaterializeInc/database-issues/issues/8908

Tips for reviewer

  • Currently want to leave it all in one file and separate things out later
  • Curious if the amount of code for something like this is normal!
  • To quickly test, run cargo run --bin mz-self-managed-debug -- --kubernetes-context mzcloud-staging-us-east-1-0 --kubernetes-namespaces mz-balancer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak requested a review from a team February 4, 2025 04:50
@SangJunBak SangJunBak requested a review from ParkMyCar as a code owner February 4, 2025 04:50
@SangJunBak SangJunBak force-pushed the jun/#8908/debug-tool branch from 63632d2 to a6465db Compare February 4, 2025 04:54
@SangJunBak SangJunBak removed the request for review from ParkMyCar February 4, 2025 05:11
@SangJunBak SangJunBak force-pushed the jun/#8908/debug-tool branch from a6465db to 2460e24 Compare February 4, 2025 05:44
@SangJunBak SangJunBak requested review from a team, aljoscha and morsapaes as code owners February 4, 2025 05:44
@SangJunBak SangJunBak changed the base branch from lts-v0.130 to main February 4, 2025 05:45
@SangJunBak SangJunBak removed request for a team, aljoscha and morsapaes February 4, 2025 05:45
Copy link
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments inline. Thanks for taking this on!

A high-level question I have is why we don't format the data as JSON (or another serialization framework). This would reduce complexity here as a lot of the types already have serde implementations, and we'd not accidentally lose information on the way.

We could then have a second binary that takes the serialized format and pretty-prints it for human consumption. (Or, send both files because it allows users to more easily verify what information we include.)

[package]
name = "mz-self-managed-debug"
description = "Debug tool for self-managed Materialize."
version = "0.130.1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want the current version here and elsewhere (right now it's 0.133.0-dev.0, but you'll want to update it prior to merging.)

Also, to update it automatically, adjust line 41 and 61 in bin/bump-version so that the version gets updated automatically. The script takes care of calling bin/bazel gen to sync the bazel files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acc ik Marta mentioned we'd most likely want to decouple the tool from LTS versions/materialize https://materializeinc.slack.com/archives/C07PN7KSB0T/p1737567670003019?thread_ts=1737566344.547089&cid=C07PN7KSB0T, so we might want to make this just version 0.1.0. I think it's very doable that we can have the tool work for any version of materialize.

src/self-managed-debug/Cargo.toml Show resolved Hide resolved
src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
Comment on lines 171 to 162
{
let pod_name = pod.metadata.name.clone().unwrap_or_default();
eprintln!("Failed to process pod {}: {}", pod_name, e);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not do what you want: the ? operator returns from the enclosing function, not the block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah but if we're referring to the usage of ? in lines 132-168, wouldn't it return from the block given it's an async block?

src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@alex-hunt-materialize alex-hunt-materialize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to everything Moritz said.

src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
src/self-managed-debug/src/main.rs Show resolved Hide resolved
src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
src/self-managed-debug/src/main.rs Outdated Show resolved Hide resolved
@SangJunBak
Copy link
Contributor Author

After talking to the cloud team, decided to remove dump_k8s_get_all and can add it back later if we need to

Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:
- Collect pod logs
- Collect info similar to `kubectl get all -o wide`
- Up next will be getting events, describe output, and rest of TODOs.
Decided high level, human readable logs aren't useful for the cloud team.
- Rename Kubernetes-related CLI arguments to shortform
- Make namespace argument required
- Add error handling for non-existent namespaces
- Add timestamp-based directory structure for log files
- Separate previous and current pod logs into distinct files
- Introduce `Context` struct to track debug tool shared state
@SangJunBak SangJunBak force-pushed the jun/#8908/debug-tool branch from e9d47e9 to 7dbfff0 Compare February 7, 2025 23:02
- Updates file path generation to use simplified directory structure for future zipping
- Removes unnecessary return values
- Add s`Display` implementation for `K8sResourceType`
- Improve logging and error handling for empty log/event scenarios
Comment on lines +314 to +319
if !output.status.success() {
return Err(anyhow::anyhow!(
"{}",
String::from_utf8_lossy(&output.stderr)
));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed when we usually get an auth error here, per resource type, we get an error message like:

Failed to dump kubectl describe for deployments: 
The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:49.975938   34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"

The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:50.491841   34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"

The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:50.934489   34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"

The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:51.425720   34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"

The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:51.899735   34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"
Unable to connect to the server: getting credentials: exec: executable aws failed with exit code 255

wondering if it's worth pointing out, if we should truncate this, or if we should output this to an error log file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That message is useful for debugging permissions in AWS. We should definitely show that to the user.

Copy link
Contributor

@alex-hunt-materialize alex-hunt-materialize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still missing the most important thing I want: the object yaml.

Ok(client) => {
for namespace in context.args.k8s_namespaces.clone() {
let _ = match dump_k8s_pod_logs(&context, client.clone(), &namespace).await {
Ok(_) => Some(()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we aren't doing anything with the output, we don't need to return an option.

if let Err(e) = match dump_k8s_pod_logs(&context, client.clone(), &namespace).await {
    // print the error
}

namespace, e
);
if let Some(_) = e.downcast_ref::<kube::Error>() {
eprintln!("Ensure that {} exists.", namespace);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other reasons that you might get a kube error. If you are trying to handle the case that the pod doesn't exist, you can handle that in your match.

Err(kube::Error::Api(e)) if e.code == 404 => {
    // pod doesn't exist
}

I'm not sure we care if the pod doesn't exist here, though.

}
};

let _ = match dump_k8s_events(&context, client.clone(), &namespace).await {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above about if let Err(e) instead.


for handle in describe_handles {
let _ = handle.await;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're just going to wait in order, you might as well just await immediately. You probably want join_all(describe_handles) instead.

.await
{
eprintln!("Failed to export {}: {}", &current_logs_file_name, e);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of duplication here. Maybe pull this all out into a separate function that is called for both the previous and current logs?

}
.await
{
if let Some(kube::Error::Api(e)) = e.downcast_ref::<kube::Error>() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably be nicer with a three branch match statement:

match result.await {
    Ok(_) => {},
    Err(kube::Error::Api(e)) if e.code == 400 => {
        eprintln!("No previous logs available for pod {}", pod_name);
    },
    Err(e) => {
        eprintln!("Failed to export {}: {}", &previous_logs_file_name, e);
    },
}

I'm also surprised it's a 400, rather than a 404. It sucks that they aren't more specific here.

let file_path = format_resource_path(context.start_time, namespace, &K8sResourceType::Event);
let file_name = format!("{}/events.yaml", file_path);
create_dir_all(&file_path)?;
let mut file = File::options().append(true).create(true).open(&file_name)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want append here? There shouldn't already be any file, and if there is, appending to any existing data would be surprising.

Copy link
Contributor Author

@SangJunBak SangJunBak Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah at this stage, I assume there isn't already a file. But I added append(true) such that serde_yaml::to_writer(&mut file, &event)?; appends per event, which seems like what we want! Lemme know if there's a better way to achieve what I want tho

}

let file_path = format_resource_path(context.start_time, namespace, resource_type);
let file_name = format!("{}/describe.txt", file_path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super important, but for cross-platform stuff, you should probably use PathBuf and Path instead of formatting Strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants