Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add self-managed debug tool #31282

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

SangJunBak
Copy link
Contributor

@SangJunBak SangJunBak commented Feb 4, 2025

Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:

  • Collect pod logs
  • Collect info similar to kubectl get all -o wide
  • Up next will be getting events, describe output, rest of TODOs, and tests

I hope to merge all changes in a stack but decided to create a PR for early feedback.

Motivation

  • This PR adds a known-desirable feature.

https://github.com/MaterializeInc/database-issues/issues/8908

Tips for reviewer

  • Currently want to leave it all in one file and separate things out later
  • Curious if the amount of code for something like this is normal!
  • To quickly test, run cargo run --bin mz-self-managed-debug -- --kubernetes-context mzcloud-staging-us-east-1-0 --kubernetes-namespaces mz-balancer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak requested a review from a team February 4, 2025 04:50
@SangJunBak SangJunBak requested a review from ParkMyCar as a code owner February 4, 2025 04:50
Ok(file_name)
}

fn format_duration(duration: Duration) -> String {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if there's a better function I can reuse here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to get a TimeDelta, which has a Display implementation.

@SangJunBak SangJunBak force-pushed the jun/#8908/debug-tool branch from 63632d2 to a6465db Compare February 4, 2025 04:54
@SangJunBak SangJunBak removed the request for review from ParkMyCar February 4, 2025 05:11
Introduces a new CLI tool for debugging self-managed Materialize deployments in Kubernetes. The tool can:
- Collect pod logs
- Collect info similar to `kubectl get all -o wide`
- Up next will be getting events, describe output, and rest of TODOs.
@SangJunBak SangJunBak force-pushed the jun/#8908/debug-tool branch from a6465db to 2460e24 Compare February 4, 2025 05:44
@SangJunBak SangJunBak requested review from a team, aljoscha and morsapaes as code owners February 4, 2025 05:44
@SangJunBak SangJunBak changed the base branch from lts-v0.130 to main February 4, 2025 05:45
@SangJunBak SangJunBak removed request for a team, aljoscha and morsapaes February 4, 2025 05:45
Copy link
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments inline. Thanks for taking this on!

A high-level question I have is why we don't format the data as JSON (or another serialization framework). This would reduce complexity here as a lot of the types already have serde implementations, and we'd not accidentally lose information on the way.

We could then have a second binary that takes the serialized format and pretty-prints it for human consumption. (Or, send both files because it allows users to more easily verify what information we include.)

[package]
name = "mz-self-managed-debug"
description = "Debug tool for self-managed Materialize."
version = "0.130.1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want the current version here and elsewhere (right now it's 0.133.0-dev.0, but you'll want to update it prior to merging.)

Also, to update it automatically, adjust line 41 and 61 in bin/bump-version so that the version gets updated automatically. The script takes care of calling bin/bazel gen to sync the bazel files.

src/self-managed-debug/Cargo.toml Show resolved Hide resolved
Comment on lines +12 to +37
use std::fmt::Debug;

use std::process;

use std::sync::LazyLock;

use chrono::{Duration, Utc};
use clap::Parser;
use k8s_openapi::api::apps::v1::{Deployment, ReplicaSet, StatefulSet};
use k8s_openapi::api::networking::v1::{NetworkPolicy, NetworkPolicyPeer, NetworkPolicyPort};
use k8s_openapi::apimachinery::pkg::util::intstr::IntOrString;
use kube::config::KubeConfigOptions;
use kube::{Client, Config};
use tabled::{Style, Table, Tabled};

use mz_build_info::{build_info, BuildInfo};

use mz_ore::cli::{self, CliConfig};

use k8s_openapi::api::core::v1::{ContainerStatus, Pod, Service};

use kube::api::{Api, ListParams, LogParams, ObjectMeta};
use mz_ore::error::ErrorExt;

use std::fs::File;
use std::io::Write;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Structure uses as:

use std::...;

use other-crate::...;

use crate::...;

I.e., three sections: std, other crates, crate, separated by a newline.

Comment on lines +101 to +103
/**
* Creates a k8s client given a context. If no context is provided, the default context is used.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
* Creates a k8s client given a context. If no context is provided, the default context is used.
*/
/// Creates a k8s client given a context. If no context is provided, the default context is used.

Comment on lines +117 to +120
/**
* Write k8s pod logs to a file per pod as mz-pod-logs.<namespace>.<pod-name>.log.
* Returns a list of file names on success.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
* Write k8s pod logs to a file per pod as mz-pod-logs.<namespace>.<pod-name>.log.
* Returns a list of file names on success.
*/
/// Write k8s pod logs to a file per pod as `mz-pod-logs.<namespace>.<pod-name>.log`.
/// Returns a list of file names on success.

Comment on lines +171 to +174
{
let pod_name = pod.metadata.name.clone().unwrap_or_default();
eprintln!("Failed to process pod {}: {}", pod_name, e);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not do what you want: the ? operator returns from the enclosing function, not the block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah but if we're referring to the usage of ? in lines 132-168, wouldn't it return from the block given it's an async block?

writeln!(
file,
"NetworkPolicies:\n{}\n",
Table::new(netpol_info).with(Style::psql())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider writing json (or something else structured) instead of formatting as a table? If it's json, we could always format it later, but the table doesn't allow us easily read the contents in a program.

Ok(file_name)
}

fn format_duration(duration: Duration) -> String {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to get a TimeDelta, which has a Display implementation.

Copy link
Contributor

@alex-hunt-materialize alex-hunt-materialize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to everything Moritz said.

}

// // NetworkPolicies
let netpol_info: Vec<NetworkPolicyInfo> = network_policies
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're effectively duplicating all the information in the yaml, or you are removing important details. I don't see much point in a lot of this formatting stuff, when the info is there in the yaml.

Maybe just list the names of the network policies, and they can look at the yaml for the details.

{
Ok(logs) => logs,
Err(_) => {
// If we get a bad request error, try without the previous flag.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want to just always try to grab both the current and previous logs.


for line in logs.lines() {
writeln!(file, "{}", line)?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just write the whole thing in one go? You aren't doing anything with the split lines.

let client = create_k8s_client(args.kubernetes_context.clone()).await?;
// TODO: Make namespaces mandatory
// TODO: Print a warning if namespace doesn't exist
for namespace in args.kubernetes_namespaces {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to store things in an organized directory structure. Maybe something like:
materialize-debug-{iso_datetime}/{namespace}/{resource_type}/{resource_name}.yaml

or, in the case of logs:
materialize-debug-{iso_datetime}/{namespace}/logs/{pod_name}.{current_or_previous}.log

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, even if we're zipping at the end!

for pod in &pod_list.items {
if let Err(e) = async {
let pod_name = pod.metadata.name.clone().unwrap_or_default();
let file_name = format!("mz-pod-logs.{}.{}.log", namespace, pod_name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above about directory structure.

ports: String,
#[tabled(rename = "AGE")]
age: String,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need most of this info, but we may want to know if the service has endpoints (ie: is it ready).

file,
"Pods:\n{}\n",
Table::new(pod_info).with(Style::psql())
)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These structs might be useful for printing things out for a human running the tool, but I really just want the yaml serialized object in its entirety.

@SangJunBak
Copy link
Contributor Author

After talking to the cloud team, decided to remove dump_k8s_get_all and can add it back later if we need to

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants