💡 [REQUEST] - Better logging and tracing #475
Comments
Traces have the concept of 'span events' to log structured data. For example, the number of good and bad events of a given SLI computation could be saved as span events for automatic correlation with the API request itself. More details in the OpenTelemetry documentation: https://hatch.pypa.io/dev/config/environment/advanced/
The OpenTelemetry documentation details how to instrument Python code with traces, spans, events, links and attributes: https://opentelemetry.io/docs/languages/python/instrumentation/

Next steps:
Automatic instrumentation works well out of the box, and provides good granularity as long as the Python packages are themselves instrumented. For example, wrapping the API server with:

    opentelemetry-instrument \
        --traces_exporter console,otlp \
        --service_name slo-generator \
        --exporter_otlp_traces_endpoint "localhost:4317" \
        --exporter_otlp_traces_insecure true \
        slo-generator api --target=run_compute --signature-type=http -c samples/config.yaml

exports the resulting spans to Cloud Trace.

Mix automatic and manual instrumentation for more granularity?
TODO:
Summary
Following up on #441, it appears some of the telemetry required to troubleshoot random issues might be missing. Take this opportunity to rethink the metrics/logs/traces collected by the SLO Generator?
Basic Example
I am a huge fan of Chapter 4 of the excellent Zero to Production in Rust, which is entirely about telemetry. The author starts with basic logging, then attaches a request ID to every log entry (so he can correlate entries that show up in random order in the logging service), and ultimately moves to traces to track individual requests (getting the context automatically, without adding it explicitly). I feel the same principle can be applied to each request to the SLO Generator API, or to each request to a backend/exporter. Traces could replace or extend the existing logs, and make troubleshooting much easier without having to enable the (very verbose) debug mode with `DEBUG=1`.

This is a great opportunity to migrate to a vendor-agnostic stack like OpenTelemetry for metrics, logs and traces, with all of this data exported to stdout/stderr and/or to the OpenTelemetry Collector over the OpenTelemetry Protocol (OTLP). On GCP, Cloud Run supports sidecars for such a model, and the OpenTelemetry Collector can easily export to Cloud Operations.
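For reference, the intermediate step described above (a request ID attached to every log entry) can be sketched with the Python standard library alone. This is an illustration only, not how the SLO Generator currently logs; the logger name, the log format, and the `handle_request` function are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being processed.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


# Log to an in-memory stream so the sketch is self-contained; a real service
# would log to stdout/stderr.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("slo_generator.example")  # hypothetical name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(payload: list) -> str:
    """Hypothetical handler: every entry logged inside shares one request ID."""
    token = request_id_var.set(uuid.uuid4().hex)
    try:
        logger.info("request received")
        logger.info("computed %d SLO report(s)", len(payload))
        return request_id_var.get()
    finally:
        request_id_var.reset(token)


request_id = handle_request(["slo-a", "slo-b"])
print(stream.getvalue())
```

Traces remove the need for this plumbing: the trace and span IDs play the role of the request ID and are propagated by the instrumentation rather than by hand.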
Screenshots
No response
Drawbacks
Might require a significant rework, as well as the approval of existing users who rely on the logs themselves or on log-based metrics extracted from the log entries with regular expressions.
Unresolved questions
No response