-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add self-managed debug tool #31282
base: main
Are you sure you want to change the base?
Conversation
Ok(file_name) | ||
} | ||
|
||
fn format_duration(duration: Duration) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if there's a better function I can reuse here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to get a TimeDelta
, which has a Display
implementation.
63632d2
to
a6465db
Compare
Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can: - Collect pod logs - Collect info similar to `kubectl get all -o wide` - Up next will be getting events, describe output, and rest of TODOs.
a6465db
to
2460e24
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments inline. Thanks for taking this on!
A high-level question I have is why we don't format the data as JSON (or another serialization framework). This would reduce complexity here as a lot of the types already have serde implementations, and we'd not accidentally lose information on the way.
We could then have a second binary that takes the serialized format and pretty-prints it for human consumption. (Or, send both files because it allows users to more easily verify what information we include.)
[package] | ||
name = "mz-self-managed-debug" | ||
description = "Debug tool for self-managed Materialize." | ||
version = "0.130.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably want the current version here and elsewhere (right now it's 0.133.0-dev.0
, but you'll want to update it prior to merging.)
Also, to update it automatically, adjust line 41 and 61 in bin/bump-version
so that the version gets updated automatically. The script takes care of calling bin/bazel gen
to sync the bazel files.
use std::fmt::Debug; | ||
|
||
use std::process; | ||
|
||
use std::sync::LazyLock; | ||
|
||
use chrono::{Duration, Utc}; | ||
use clap::Parser; | ||
use k8s_openapi::api::apps::v1::{Deployment, ReplicaSet, StatefulSet}; | ||
use k8s_openapi::api::networking::v1::{NetworkPolicy, NetworkPolicyPeer, NetworkPolicyPort}; | ||
use k8s_openapi::apimachinery::pkg::util::intstr::IntOrString; | ||
use kube::config::KubeConfigOptions; | ||
use kube::{Client, Config}; | ||
use tabled::{Style, Table, Tabled}; | ||
|
||
use mz_build_info::{build_info, BuildInfo}; | ||
|
||
use mz_ore::cli::{self, CliConfig}; | ||
|
||
use k8s_openapi::api::core::v1::{ContainerStatus, Pod, Service}; | ||
|
||
use kube::api::{Api, ListParams, LogParams, ObjectMeta}; | ||
use mz_ore::error::ErrorExt; | ||
|
||
use std::fs::File; | ||
use std::io::Write; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Structure uses as:
use std::...;
use other-crate::...;
use crate::...;
I.e., three sections: std, other crates, crate
, separated by a newline.
/** | ||
* Creates a k8s client given a context. If no context is provided, the default context is used. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** | |
* Creates a k8s client given a context. If no context is provided, the default context is used. | |
*/ | |
/// Creates a k8s client given a context. If no context is provided, the default context is used. |
/** | ||
* Write k8s pod logs to a file per pod as mz-pod-logs.<namespace>.<pod-name>.log. | ||
* Returns a list of file names on success. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** | |
* Write k8s pod logs to a file per pod as mz-pod-logs.<namespace>.<pod-name>.log. | |
* Returns a list of file names on success. | |
*/ | |
/// Write k8s pod logs to a file per pod as `mz-pod-logs.<namespace>.<pod-name>.log`. | |
/// Returns a list of file names on success. |
{ | ||
let pod_name = pod.metadata.name.clone().unwrap_or_default(); | ||
eprintln!("Failed to process pod {}: {}", pod_name, e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not do what you want: the ?
operator returns from the enclosing function, not the block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah but if we're referring to the usage of ?
in lines 132-168, wouldn't it return from the block given it's an async block?
writeln!( | ||
file, | ||
"NetworkPolicies:\n{}\n", | ||
Table::new(netpol_info).with(Style::psql()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider writing json (or something else structured) instead of formatting as a table? If it's json, we could always format it later, but the table doesn't allow us easily read the contents in a program.
Ok(file_name) | ||
} | ||
|
||
fn format_duration(duration: Duration) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to get a TimeDelta
, which has a Display
implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to everything Moritz said.
} | ||
|
||
// // NetworkPolicies | ||
let netpol_info: Vec<NetworkPolicyInfo> = network_policies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're effectively duplicating all the information in the yaml, or you are removing important details. I don't see much point in a lot of this formatting stuff, when the info is there in the yaml.
Maybe just list the names of the network policies, and they can look at the yaml for the details.
{ | ||
Ok(logs) => logs, | ||
Err(_) => { | ||
// If we get a bad request error, try without the previous flag. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we probably want to just always try to grab both the current and previous logs.
|
||
for line in logs.lines() { | ||
writeln!(file, "{}", line)?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just write the whole thing in one go? You aren't doing anything with the split lines.
let client = create_k8s_client(args.kubernetes_context.clone()).await?; | ||
// TODO: Make namespaces mandatory | ||
// TODO: Print a warning if namespace doesn't exist | ||
for namespace in args.kubernetes_namespaces { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to store things in an organized directory structure. Maybe something like:
materialize-debug-{iso_datetime}/{namespace}/{resource_type}/{resource_name}.yaml
or, in the case of logs:
materialize-debug-{iso_datetime}/{namespace}/logs/{pod_name}.{current_or_previous}.log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, even if we're zipping at the end!
for pod in &pod_list.items { | ||
if let Err(e) = async { | ||
let pod_name = pod.metadata.name.clone().unwrap_or_default(); | ||
let file_name = format!("mz-pod-logs.{}.{}.log", namespace, pod_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above about directory structure.
ports: String, | ||
#[tabled(rename = "AGE")] | ||
age: String, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't need most of this info, but we may want to know if the service has endpoints (ie: is it ready).
file, | ||
"Pods:\n{}\n", | ||
Table::new(pod_info).with(Style::psql()) | ||
)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These structs might be useful for printing things out for a human running the tool, but I really just want the yaml serialized object in its entirety.
After talking to the cloud team, decided to remove |
Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:
kubectl get all -o wide
I hope to merge all changes in a stack but decided to create a PR for early feedback.
Motivation
https://github.com/MaterializeInc/database-issues/issues/8908
Tips for reviewer
cargo run --bin mz-self-managed-debug -- --kubernetes-context mzcloud-staging-us-east-1-0 --kubernetes-namespaces mz-balancer
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.