feat: Add self-managed debug tool #31282
base: main
Conversation
Force-pushed 63632d2 to a6465db
Force-pushed a6465db to 2460e24
Left some comments inline. Thanks for taking this on!
A high-level question I have is why we don't format the data as JSON (or another serialization framework). This would reduce complexity here as a lot of the types already have serde implementations, and we'd not accidentally lose information on the way.
We could then have a second binary that takes the serialized format and pretty-prints it for human consumption. (Or, send both files because it allows users to more easily verify what information we include.)
src/self-managed-debug/Cargo.toml
Outdated
```toml
[package]
name = "mz-self-managed-debug"
description = "Debug tool for self-managed Materialize."
version = "0.130.1"
```
You probably want the current version here and elsewhere (right now it's 0.133.0-dev.0, but you'll want to update it prior to merging). Also, adjust lines 41 and 61 in `bin/bump-version` so that the version gets updated automatically; the script takes care of calling `bin/bazel gen` to sync the Bazel files.
Actually, I think Marta mentioned we'd most likely want to decouple the tool's versioning from LTS versions of Materialize (https://materializeinc.slack.com/archives/C07PN7KSB0T/p1737567670003019?thread_ts=1737566344.547089&cid=C07PN7KSB0T), so we might want to make this just version 0.1.0. I think it's very doable for the tool to work with any version of Materialize.
src/self-managed-debug/src/main.rs
Outdated
```rust
{
    let pod_name = pod.metadata.name.clone().unwrap_or_default();
    eprintln!("Failed to process pod {}: {}", pod_name, e);
}
```
This might not do what you want: the `?` operator returns from the enclosing function, not the block.
Ah, but if we're referring to the usage of `?` in lines 132-168, wouldn't it return from the block, given it's an async block?
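For what it's worth, `?` propagates out of the innermost function-like scope it appears in. A std-only sketch using a closure (the same scoping applies to `async` blocks: there, `?` resolves the block's future with the error rather than returning from the enclosing function):

```rust
fn process() -> Result<String, String> {
    // `?` inside the closure returns from the closure, not from
    // `process`, so the outer function keeps running afterwards.
    let inner: Result<i32, String> = (|| {
        let n: i32 = "not a number"
            .parse()
            .map_err(|_| "parse failed".to_string())?;
        Ok(n)
    })();
    match inner {
        Ok(n) => Ok(format!("got {}", n)),
        Err(e) => Ok(format!("recovered from: {}", e)),
    }
}

fn main() {
    // `process` itself succeeds even though the inner `?` fired.
    assert_eq!(process().unwrap(), "recovered from: parse failed");
    println!("ok");
}
```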
+1 to everything Moritz said.
After talking to the cloud team, we decided to remove
Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:
- Collect pod logs
- Collect info similar to `kubectl get all -o wide`
Up next: collecting events, describe output, and the rest of the TODOs.
Decided that high-level, human-readable logs aren't useful for the cloud team.
- Rename Kubernetes-related CLI arguments to shortform
- Make namespace argument required
- Add error handling for non-existent namespaces
- Add timestamp-based directory structure for log files
- Separate previous and current pod logs into distinct files
- Introduce `Context` struct to track debug tool shared state
Force-pushed e9d47e9 to 7dbfff0
- Updates file path generation to use a simplified directory structure for future zipping
- Removes unnecessary return values
- Adds a `Display` implementation for `K8sResourceType`
- Improves logging and error handling for empty log/event scenarios
```rust
if !output.status.success() {
    return Err(anyhow::anyhow!(
        "{}",
        String::from_utf8_lossy(&output.stderr)
    ));
}
```
I noticed that when we get an auth error here, we get an error message like this per resource type:
```
Failed to dump kubectl describe for deployments:
The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:49.975938 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"
The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:50.491841 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"
The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:50.934489 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"
The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:51.425720 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"
The SSO session associated with this profile has expired or is otherwise invalid. To refresh this SSO session run aws sso login with the corresponding profile.
E0210 01:49:51.899735 34995 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://26380169F507ACFBF59BE93A7134F6B1.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": getting credentials: exec: executable aws failed with exit code 255"
Unable to connect to the server: getting credentials: exec: executable aws failed with exit code 255
```
Wondering if it's worth pointing this out, whether we should truncate it, or whether we should write it to an error log file?
That message is useful for debugging permissions in AWS. We should definitely show that to the user.
We're still missing the most important thing I want: the object yaml.
```rust
Ok(client) => {
    for namespace in context.args.k8s_namespaces.clone() {
        let _ = match dump_k8s_pod_logs(&context, client.clone(), &namespace).await {
            Ok(_) => Some(()),
```
If we aren't doing anything with the output, we don't need to return an option.
```rust
if let Err(e) = dump_k8s_pod_logs(&context, client.clone(), &namespace).await {
    // print the error
}
```
```rust
        namespace, e
    );
    if let Some(_) = e.downcast_ref::<kube::Error>() {
        eprintln!("Ensure that {} exists.", namespace);
```
There are other reasons that you might get a kube error. If you are trying to handle the case that the pod doesn't exist, you can handle that in your match.
```rust
Err(kube::Error::Api(e)) if e.code == 404 => {
    // pod doesn't exist
}
```
I'm not sure we care if the pod doesn't exist here, though.
```rust
    }
};

let _ = match dump_k8s_events(&context, client.clone(), &namespace).await {
```
See above about using `if let Err(e)` instead.
```rust
for handle in describe_handles {
    let _ = handle.await;
}
```
If you're just going to wait in order, you might as well await immediately. You probably want `join_all(describe_handles)` instead.
```rust
.await
{
    eprintln!("Failed to export {}: {}", &current_logs_file_name, e);
}
```
There is a lot of duplication here. Maybe pull this all out into a separate function that is called for both the previous and current logs?
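One possible shape for that deduplication (a std-only sketch with hypothetical names): a single helper that exports one log stream, called once for the previous logs and once for the current ones.

```rust
use std::fs::{create_dir_all, File};
use std::io::Write;
use std::path::Path;

// Hypothetical helper: write one pod log stream (previous or current)
// to `<dir>/<file_name>`, creating the directory as needed.
fn export_pod_logs(dir: &Path, file_name: &str, lines: &[String]) -> std::io::Result<()> {
    create_dir_all(dir)?;
    let mut file = File::create(dir.join(file_name))?;
    for line in lines {
        writeln!(file, "{}", line)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("mz-debug-example");
    // Same helper for both variants instead of two duplicated blocks.
    export_pod_logs(&dir, "previous.log", &["old line".into()])?;
    export_pod_logs(&dir, "current.log", &["new line".into()])?;
    Ok(())
}
```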
```rust
}
.await
{
    if let Some(kube::Error::Api(e)) = e.downcast_ref::<kube::Error>() {
```
This would probably be nicer with a three branch match statement:
```rust
match result.await {
    Ok(_) => {},
    Err(kube::Error::Api(e)) if e.code == 400 => {
        eprintln!("No previous logs available for pod {}", pod_name);
    },
    Err(e) => {
        eprintln!("Failed to export {}: {}", &previous_logs_file_name, e);
    },
}
```
I'm also surprised it's a 400, rather than a 404. It sucks that they aren't more specific here.
```rust
let file_path = format_resource_path(context.start_time, namespace, &K8sResourceType::Event);
let file_name = format!("{}/events.yaml", file_path);
create_dir_all(&file_path)?;
let mut file = File::options().append(true).create(true).open(&file_name)?;
```
Do we want append here? There shouldn't already be any file, and if there is, appending to any existing data would be surprising.
Ah, at this stage I assume there isn't already a file. But I added `append(true)` so that `serde_yaml::to_writer(&mut file, &event)?;` appends per event, which seems like what we want! Let me know if there's a better way to achieve this, though.
```rust
let file_path = format_resource_path(context.start_time, namespace, resource_type);
let file_name = format!("{}/describe.txt", file_path);
```
It's not super important, but for cross-platform stuff, you should probably use PathBuf and Path instead of formatting Strings.
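For example, a std-only sketch of the same path construction with `Path::join` (hypothetical directory names), which keeps the separator correct on every platform:

```rust
use std::path::PathBuf;

fn main() {
    // Build the path from components instead of formatting strings,
    // so the platform's path separator is used automatically.
    let file_path = PathBuf::from("2024-01-01T00-00-00")
        .join("mz-balancer")
        .join("describe");
    let file_name = file_path.join("describe.txt");
    assert!(file_name.ends_with("describe/describe.txt"));
    println!("{}", file_name.display());
}
```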
Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:
- Collect pod logs
- Collect info similar to `kubectl get all -o wide`
I hope to merge all changes in a stack but decided to create a PR for early feedback.
Motivation
https://github.com/MaterializeInc/database-issues/issues/8908
Tips for reviewer
```
cargo run --bin mz-self-managed-debug -- --kubernetes-context mzcloud-staging-us-east-1-0 --kubernetes-namespaces mz-balancer
```
Checklist
If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.