Logging and Events #49

Open
alexlovelltroy opened this issue Aug 23, 2024 · 1 comment
Labels
Partner Objective: A broadly scoped objective that is important to a partner

Comments

alexlovelltroy commented Aug 23, 2024

Troubleshooting CSM systems and other HPC systems has taught us several lessons that we would like OpenCHAMI to benefit from. The goal of a logging and event system is not to surface all possible information for analysis. It is to help system administrators diagnose and remediate problems when they occur and to assess long-term trends.

Logging and Troubleshooting contexts

Troubleshooting happens in several different contexts in an HPC system. Logging and events in the system need to support these contexts, which may overlap.

  • Job Context: HPC systems exist to run jobs. Remediation at this level is urgent and important.
  • Node Context: When a node behaves differently from its peers, addressing the variance is important, but not urgent unless it interferes with Jobs. Troubleshooting why a node isn’t booting is included here.
  • System Context: System-wide issues that are not tied to the functioning of a single compute node are commonly precursors to Job-related issues. Troubleshooting them falls into the urgent and important quadrant.
  • Control Plane Context: The management system itself must be more resilient to failures than any individual node or job. Troubleshooting problems in this context should be neither important nor urgent. However, left unaddressed long enough, they will escalate and impact Jobs.
  • Analytical Context: When addressing performance and behavior issues that are only clear with large datasets over time, the analytical toolset is different from the immediate troubleshooting toolset.
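
As a rough illustration of how the contexts above could be encoded for triage, here is a minimal Python sketch. The names `Context`, `Priority`, and `triage` are hypothetical and not part of OpenCHAMI. The urgency and importance values are taken from the list above; the Analytical Context is left unset because the list does not assign it a priority.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Context(Enum):
    JOB = "job"
    NODE = "node"
    SYSTEM = "system"
    CONTROL_PLANE = "control_plane"
    ANALYTICAL = "analytical"


@dataclass(frozen=True)
class Priority:
    urgent: Optional[bool]     # None means the list above does not say
    important: Optional[bool]


# Priorities as described in the context list (hypothetical encoding).
PRIORITIES = {
    Context.JOB: Priority(urgent=True, important=True),
    Context.NODE: Priority(urgent=False, important=True),
    Context.SYSTEM: Priority(urgent=True, important=True),
    Context.CONTROL_PLANE: Priority(urgent=False, important=False),
    Context.ANALYTICAL: Priority(urgent=None, important=None),
}


def triage(context: Context, affects_jobs: bool = False) -> Priority:
    """Return the priority for a problem, escalating once Jobs are affected."""
    if affects_jobs:
        return PRIORITIES[Context.JOB]
    return PRIORITIES[context]
```

The `affects_jobs` flag reflects the overlap between contexts: a node or control plane problem takes on Job priority once it interferes with Jobs.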

Structured Logging

Standard UNIX logging relies on messages that are emitted by programs, often to the controlling shell of the process. These messages may have an internal structure, but there is no single format that all possible log messages can follow. As such, many log analysis tools have extensive customization options to identify patterns in logs and extract structured information.
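
For contrast with free-form messages, the following is a minimal sketch of structured logging using only the Python standard library. It is an illustration, not an OpenCHAMI format: the field names (`context`, `node`, `job_id`), the service name, and the node name are all assumptions.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical field names, supplied through the `extra=` argument.
        for key in ("context", "node", "job_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def make_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger("example-boot-service", buffer)
log.warning("node failed to boot", extra={"context": "node", "node": "x1000c0s0b0n0"})

# Any consumer can parse the line without a custom pattern.
record = json.loads(buffer.getvalue())
```

Because every line is a self-describing object, a tool can filter on `record["node"]` or `record["context"]` directly instead of relying on per-service regular expressions.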

Log aggregation is necessary for some contexts, but local logs can be even more powerful for troubleshooting. Supporting both makes diagnosis easier.
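
One way to support both local logs and aggregation is to attach two handlers to the same logger. The sketch below assumes a hypothetical `ListHandler` that stands in for a real log shipper; an actual deployment would forward records to whatever aggregation system a site has chosen.

```python
import logging
import os
import tempfile


class ListHandler(logging.Handler):
    """Stand-in for a log shipper: collects formatted records in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.shipped = []

    def emit(self, record: logging.LogRecord) -> None:
        self.shipped.append(self.format(record))


def make_dual_logger(name: str, local_path: str, shipper: logging.Handler) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    logger.addHandler(logging.FileHandler(local_path))  # local log on the host
    logger.addHandler(shipper)                           # copy sent for aggregation
    return logger


local_path = os.path.join(tempfile.mkdtemp(), "service.log")
shipper = ListHandler()
log = make_dual_logger("example-dual", local_path, shipper)
log.info("lease renewed")

with open(local_path) as f:
    local_lines = f.read().splitlines()
```

The local file stays readable on the host even if the aggregation path is down, which is often exactly when troubleshooting is needed.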

The OpenCHAMI community will develop and maintain a set of logging and metrics standards that apply to all OpenCHAMI services and support troubleshooting in the relevant contexts. These standards must be independent of technology choices.

The OpenCHAMI community will develop and maintain standards for infrastructure to support logging and metrics at various scales, as well as conformance tests that allow sites to validate that a solution meets OpenCHAMI specifications.
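
A conformance test could be as simple as checking each emitted log line against a list of required fields. The sketch below is an assumption about what such a check might look like; the `REQUIRED_FIELDS` set and the `check_line` helper are hypothetical, and the real requirements would come from the standards once they exist.

```python
import json

# Hypothetical required fields; the real set would be defined by the standard.
REQUIRED_FIELDS = {"time": str, "level": str, "service": str, "message": str}


def check_line(line: str) -> list:
    """Return the problems found in one log line (empty list means conformant)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems


good = '{"time": "2024-08-23T12:00:00Z", "level": "INFO", "service": "example", "message": "ok"}'
bad = "node failed to boot"
```

A check like this depends only on the log output, not on the language or library a service uses, which keeps it independent of technology choices.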

alexlovelltroy converted this from a draft issue Aug 23, 2024
alexlovelltroy added the Partner Objective label Aug 23, 2024
alexlovelltroy moved this to In Progress in Roadmap Project Sep 19, 2024
alexlovelltroy commented

This is related to #7

Projects
Status: In Progress