Add SDK span telemetry metrics #1631
Conversation
Related #1580
With this implementation, for example the first Batching Span Processor would have `batching_span_processor/0`
as `otel.sdk.component.name`, the second one `batching_span_processor/1` and so on.
These values will therefore be reused in the case of an application restart.
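To illustrate the numbering scheme described in this note, here is a minimal Go sketch (not code from this PR or any SDK; the function name and structure are made up) of how a per-type instance counter could produce such `otel.sdk.component.name` values:

```go
package main

import (
	"fmt"
	"sync"
)

// componentCounters tracks how many instances of each component type
// have been created in this process so far.
var (
	mu                sync.Mutex
	componentCounters = map[string]int{}
)

// nextComponentName returns "batching_span_processor/0",
// "batching_span_processor/1", ... for successive instances of a type.
// The counters live only in memory, so numbering starts again at 0 after
// an application restart, which is why the values above get reused.
func nextComponentName(componentType string) string {
	mu.Lock()
	defer mu.Unlock()
	n := componentCounters[componentType]
	componentCounters[componentType] = n + 1
	return fmt.Sprintf("%s/%d", componentType, n)
}

func main() {
	fmt.Println(nextComponentName("batching_span_processor")) // batching_span_processor/0
	fmt.Println(nextComponentName("batching_span_processor")) // batching_span_processor/1
}
```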
Is there some information to tell that the application restarted? (e.g. PID + start_time)
we have an uptime metric for this:

> ### Metric: `process.uptime`
@@ -34,6 +36,44 @@ Attributes used by non-OTLP exporters to represent OpenTelemetry Scope's concept
| <a id="otel-scope-name" href="#otel-scope-name">`otel.scope.name`</a> | string | The name of the instrumentation scope - (`InstrumentationScope.Name` in OTLP). | `io.opentelemetry.contrib.mongodb` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| <a id="otel-scope-version" href="#otel-scope-version">`otel.scope.version`</a> | string | The version of the instrumentation scope - (`InstrumentationScope.Version` in OTLP). | `1.0.0` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |

## OTel SDK Telemetry Attributes

Attributes used for OpenTelemetry SDK self-monitoring
Do we allow each language implementation to have additional attributes that are language-specific?
I don't see a reason why implementations shouldn't be allowed to add additional attributes. I would expect this to be the general case for all semconv metrics? Metrics are aggregatable, so they can be analyzed and presented as if those additional attributes weren't present.
There are two caveats I can think of:
- The metrics are recommended to be enabled by default. Therefore they must have a very, very low cardinality to justify this and not cause too much overhead. So depending on the cardinality of the additional attributes, they should probably be opt-in.
- The attributes might conflict with future additions to the spec, so you'll end up with breaking changes. So it's best to use some language-specific attribute naming.
attributes:
  - ref: otel.sdk.component.type
  - ref: otel.sdk.component.name
  - ref: error.type
What about retries? E.g. the first attempt failed, the second attempt succeeded.
We don't record intermediate results on logical metrics; we could report another layer like `otel.sdk.span.exporter.attempts` (or let HTTP/gRPC metric instrumentation do its thing).
model/otel/metrics.yaml (outdated)
- id: metric.otel.sdk.span.processor.spans_processed
  type: metric
  metric_name: otel.sdk.span.processor.spans_processed
could we do
metric_name: otel.sdk.processor.span.count
instead of
metric_name: otel.sdk.span.processor.spans_processed
to avoid repeating `span`?
`otel.sdk.processor.span.count` doesn't specify what kind of spans it counts. What about `otel.sdk.processor.span.processed.count`, just like `otel.sdk.exporter.span.exported.count`?
I'd also switch to `otel.sdk.processor.span.queue.capacity` and `otel.sdk.processor.span.queue.size` then, to keep the namespaces consistent.
sounds great!
display_name: OTel SDK Telemetry Attributes
brief: Attributes used for OpenTelemetry SDK self-monitoring
attributes:
  - id: otel.sdk.component.type
do we need to repeat `otel.sdk` everywhere? can we do `otel`? it's pretty obvious it's about the SDK and we usually omit obvious things in attribute and metric names.
If we just stick to `otel`, there is a chance the collector could reuse some of the attributes and metrics.
The OTel SDK batching span processor (defined by the spec) for example is different from the collector batch processor.
SDK and collector have different concepts and specifications, and therefore evolve differently. That's why I think it causes more confusion to try to combine those instead of accepting bits of duplication and keeping them separated. See also the "Prior Work" section of the PR description.
To give a concrete example, imagine we add an `otel.component.cpu_usage` metric to quantify the overhead of a component. You now have in a collector:
- a collector batch span processor processing incoming OTLP data
- the OTel SDK monitoring the collector itself, exporting the monitoring data (e.g. spans about collector components) via an SDK batching span processor

You now encounter the `otel.component.cpu_usage` metric with `otel.component.type=batch_span_processor`. Which of the processors does it correspond to? This won't happen if you use `otel.sdk` and `otel.collector` namespaces.
So to summarize: because SDK and collector use similar names to talk about different things, I think it makes sense to use the `sdk` namespace.
Do I remember correctly that the collector uses `otelcol` as a metric namespace? Would we change it to `otel.collector`?
My main motivation for this proposal is:

> do we need to repeat otel.sdk everywhere? can we do otel? it's pretty obvious it's about SDK and we usually omit obvious things in attribute and metric names.

If we use `otel` for the OTel SDK (resource attributes along with component names should make it obvious that it's reported by the SDK) and `otelcol` for the collector, then we keep SDK metrics nice and short and there is no ambiguity.
> that collector uses otelcol as a metric namespace?

I think so, yes.

> would we change it to otel.collector

I think we could, but I also think it isn't necessary. With my reasoning I just wanted to make sure that we use separate namespaces for collector and SDK to avoid confusion.
Placing SDK metrics directly in `otel.*` is fine from a maintenance / evolution perspective imo, as long as collector stuff always uses `otel.otelcol` (or whatever namespace is selected).
From a usability perspective I prefer having metric/attribute names that are precise and self-explanatory in isolation. From that perspective, looking at just `otel.span.processor.processed.count` doesn't make it obvious whether this is an SDK metric or a collector metric; you'd need to dig deeper into the definition. In contrast, `otel.sdk.span.processor.processed.count` is unambiguous.
The usability perspective is just a matter of taste, so I'm okay with removing the `.sdk.` namespace from all metrics and attributes if you disagree here.
You convinced me on the metric names, but I feel we don't need to have distinct attribute names for collector and SDK.
E.g. `otel.component.type` sounds perfectly fine and the metric name would give it all the necessary context. WDYT?
instrument: counter
unit: "{span}"
attributes:
  - ref: otel.sdk.component.type
I think we should include recommended `server.address` and `server.port` attributes on exporter metrics. It's good to know where you are sending data to.
Those would not apply to all exporters (e.g. stdout). My thinking is that we should encourage using protocol-level instrumentation (e.g. http/gRPC) for details like this.
Kind of agree with @dashpole here. I don't think this belongs in this metric.
Nonetheless, I think it would make sense to add `exporter.request.*` metrics to track request stats (e.g. bytes sent, response codes, server details). However, I don't think that this should happen in this PR, but rather in a separate, follow-up PR. It is an enhancement to gain more fine-grained insights in addition to the metrics in this PR, but doesn't have an impact on them.
> Those would not apply to all exporters (e.g. stdout).

Not a problem, just add them with requirement level `recommended: when applicable`. We do include these attributes on logical operations across semconv, so they do belong here.
Added in 830edfb.
However, for attribute references for metrics there is no schema-compliant `recommended: when applicable` IINM.
I added `note: recommended when applicable` instead; please let me know if this is the correct approach.
model/otel/metrics.yaml (outdated)
type: metric
metric_name: otel.sdk.span.created_count
stability: development
brief: "The number of spans which have been created"
A question came up when I was implementing this: Should this include non-recording spans? Right now, non-recording spans are essentially no-op spans. Adding instrumentation to them might have performance implications, since the overhead of non-recording spans is currently close to zero.
Good question.
I think it is very valuable to have a way of computing the effective sampling rate. This implies that we need the number of unrecorded spans, because the number of recorded unsampled spans is only a subset of the total number of unsampled spans.
I think we should however do this by adding a separate sampling-result metric (using the tri-state `sample_result` attribute suggested here). This means unrecorded spans don't need to track their liveness or end, and we can still easily compute the effective sampling rate.
Alternatively, we could add back the `created_count` metric and enforce the tri-state sampled attribute from this comment.
For `live_count` and `ended_count` we could either:
- Disallow them to be tracked for unrecorded spans: I think this would lead to confusion due to the mismatch when looking at the aggregated `created_count` and `ended_count` metrics.
- Force `live_count` and `ended_count` to be tracked for unrecorded spans and omit `created_count`: This would mean we have the overhead of tracking two metrics instead of one, and `TracerProvider`s wouldn't be able to return a simple no-op span, but one which tracks the `end()` call exactly once.

That's why I'm thinking adding a separate sampler metric is the best compromise. However, I think we should do this in a separate PR for sampler metrics. WDYT?
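For concreteness, here is a rough Go sketch of that separate sampling-result idea; the metric name, attribute name, and tri-state values below are placeholders for illustration, not names proposed in this PR:

```go
package selfmon

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// samplingResult is incremented once per started span, including
// non-recording ones, so no per-span liveness/end tracking is required.
var samplingResult, _ = otel.Meter("otel-sdk-self-monitoring").Int64Counter(
	"otel.sdk.span.sampling_result", // placeholder metric name
	metric.WithUnit("{span}"),
	metric.WithDescription("Sampling decisions by tri-state result"),
)

// RecordSamplingResult records one sampling decision; result would be one of
// the tri-state values, e.g. "drop", "record_only", "record_and_sample"
// (placeholder values). The effective sampling rate can then be computed as
// record_and_sample / (drop + record_only + record_and_sample).
func RecordSamplingResult(ctx context.Context, result string) {
	samplingResult.Add(ctx, 1,
		metric.WithAttributes(attribute.String("otel.sdk.span.sampling.result", result))) // placeholder attribute
}
```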
Raised this at the Go SIG today. I think we should include metrics for non-recording spans to start, since they are very useful. When we implement this, we can benchmark the actual implications of this decision. But since most instrumentation libraries that record a span also make a metric observation for each request, this shouldn't be a huge deal. If it turns out to be bad, we can revisit this.
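Since the plan is to benchmark rather than guess, a micro-benchmark along these lines (hypothetical metric and attribute names, not code from the Go SIG) could quantify the per-span cost of counting non-recording spans:

```go
package selfmon

import (
	"context"
	"testing"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// BenchmarkNonRecordingSpanCounter measures the single counter increment a
// non-recording span would add on start, which is the overhead discussed above.
func BenchmarkNonRecordingSpanCounter(b *testing.B) {
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(sdkmetric.NewManualReader()))
	meter := provider.Meter("bench")
	started, err := meter.Int64Counter("otel.sdk.span.started", metric.WithUnit("{span}")) // placeholder name
	if err != nil {
		b.Fatal(err)
	}
	opts := metric.WithAttributes(attribute.Bool("span.sampled", false)) // placeholder attribute
	ctx := context.Background()
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		started.Add(ctx, 1, opts)
	}
}
```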
> I think we should include metrics for non-recording spans to start, since they are very useful

Could you elaborate more? I'm not sure if you mean any of my proposed solutions or a different one.
And what purpose are you thinking of with "they are very useful"?
- to get the effective sampling rate
- to detect span leaks (missing `end()` calls) even for unrecorded spans
- something else?
Both getting the sampling rate and detecting span leaks. I like your proposed solution.
So you mean we change the spec and enforce this proposal?

> Force live_count and ended_count to be tracked for unrecorded spans and omit created_count: This would mean we have the overhead of tracking two metrics instead of one and TracerProviders wouldn't be able to return a simple no-op span, but one which tracks the end() call exactly once

So we say implementations MUST track `live`/`ended` instead of MAY track for unrecorded spans?
In the current state of my PR it is optional for implementations to track these for unrecorded spans.
I would prefer SHOULD for now. If there are languages where it really isn't feasible, then they should probably omit the non-sampled metric entirely. But that kind of language really belongs in the SDK specification, rather than in the convention.
I completed the Go prototype of the proposed semantic conventions: open-telemetry/opentelemetry-go#6153
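This is not the prototype's actual code, but a minimal Go sketch of how a wrapping exporter could record the exported-count metric discussed in this PR; the scope name and the way `error.type` is derived are illustrative assumptions:

```go
package selfmon

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// instrumentedExporter wraps a SpanExporter and counts finished exports.
type instrumentedExporter struct {
	sdktrace.SpanExporter
	exported metric.Int64Counter
	attrs    []attribute.KeyValue
}

// NewInstrumentedExporter wraps delegate so that every ExportSpans call is counted.
func NewInstrumentedExporter(delegate sdktrace.SpanExporter, componentType, componentName string) (sdktrace.SpanExporter, error) {
	exported, err := otel.Meter("otel-sdk-self-monitoring").Int64Counter( // illustrative scope name
		"otel.sdk.exporter.span.exported.count",
		metric.WithUnit("{span}"),
		metric.WithDescription("The number of spans for which the export has finished, either successful or failed"),
	)
	if err != nil {
		return nil, err
	}
	return &instrumentedExporter{
		SpanExporter: delegate,
		exported:     exported,
		attrs: []attribute.KeyValue{
			attribute.String("otel.sdk.component.type", componentType),
			attribute.String("otel.sdk.component.name", componentName),
		},
	}, nil
}

func (e *instrumentedExporter) ExportSpans(ctx context.Context, spans []sdktrace.ReadOnlySpan) error {
	err := e.SpanExporter.ExportSpans(ctx, spans)
	attrs := e.attrs
	if err != nil {
		// For failed exports, error.type carries the failure cause,
		// e.g. "timeout", "rejected", or a language-specific type name.
		attrs = append(attrs[:len(attrs):len(attrs)], attribute.String("error.type", "timeout")) // example value; derive from err in practice
	}
	e.exported.Add(ctx, int64(len(spans)), metric.WithAttributes(attrs...))
	return err
}
```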
stability: development
brief: "The number of spans for which the export has finished, either successful or failed"
note: |
  For successful exports, `error.type` must be empty. For failed exports, `error.type` must contain the failure cause.
I think we should be a bit more prescriptive here and provide example values. When I was implementing this, it wasn't clear how granular I should be. Should this just always be "rejected" if the backend returned an error code? Or should it be more specific, like a gRPC status code: "deadline_exceeded" or "invalid_argument".
Personally, I prefer a more restrictive set of values for the error, like "rejected", "dropped", "timeout", but this metric will be much more useful for users if exporters use consistent values for this.
IMO the main value of the metric here comes from detecting whether there is an error or not, rather than having the same failure reasons across languages. The metrics are about detecting failures; for analyzing/mitigating failures you'd usually dive deeper (e.g. inspect the logs of the failing services).
I think for example `http.server.request.duration` is a suitable comparison here. That metric also doesn't prescribe values for `error.type`, even though HTTP protocol errors are well defined (though there are very many).
Describing concrete `error.type` values would be even harder, because they will highly depend on the exporter protocol. Also, where to stop? E.g. `timeout` or `dns-timeout` / `response-timeout`?
That's why I would say: just use what makes sense for your language and what you'd use when writing e.g. an HTTP instrumentation.
That seems reasonable. An enum would definitely be too restrictive, and I agree SDKs need the ability to deviate when it makes sense. WDYT about providing example values as an anchor for style and granularity? It just seems annoying to not be able to group (or to have to write a complex query) just because one language chose "timeout" and another chose "Timeout".
Sure, we can add more example attribute values: `rejected` is already there, we can add `timeout`.
With `dropped` I don't know how this fits in here: when an error occurs, data is dropped. But "dropped" isn't the cause, it's the result of what happens with the data.
Any other ones you can think of? Maybe just `500` and `java.net.UnknownHostException` from the HTTP `error.type` examples, to also have some protocol / language specific examples?
Yeah, dropped probably doesn't make sense in this context (didn't think it through entirely). `java.net.UnknownHostException` is probably a good one to include as well. It paints the correct picture: use a common error string when possible, otherwise use a language-specific one.
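As a rough illustration of that guidance (not something prescribed by the conventions), an error-to-`error.type` mapping in Go might use a small set of common strings where one clearly applies and fall back to a language-specific type name otherwise; the helper name and the chosen cases are assumptions:

```go
package selfmon

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// errorTypeFor maps an export error to an error.type value: a common string
// such as "timeout" where it clearly applies, otherwise the Go type name
// (the counterpart of java.net.UnknownHostException in the examples above).
// A protocol-aware exporter could additionally map server errors to "rejected".
func errorTypeFor(err error) string {
	var netErr net.Error
	switch {
	case errors.Is(err, context.DeadlineExceeded),
		errors.As(err, &netErr) && netErr.Timeout():
		return "timeout"
	default:
		// Language-specific fallback, e.g. "*net.DNSError".
		return fmt.Sprintf("%T", err)
	}
}
```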
Changes
With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.
We checked the SDK implementations; it seems like only the Java SDK currently has some health metrics implemented.
This PR took some inspiration from those and is intended to improve and therefore supersede them.
I'd like to start out with just span-related metrics to keep the PR and discussions simpler here, but would follow up with similar PRs for logs and metrics based on the discussion results on this PR.
Prior work
This PR can be seen as a follow up to the closed OTEP 259:
So we have kind of gone full circle: the discussion started with just SDK metrics (only for exporters), moved to an approach unifying the metrics across SDK exporters and the collector, and ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).
In my opinion, it is a good thing to separate the collector and SDK self-metrics:
Existing Metrics in Java SDK
For reference, here is what the existing health metrics currently look like in the Java SDK:
Batch Span Processor metrics
- `queueSize`, value is the current size of the queue
  - `spanProcessorType`=`BatchSpanProcessor` (there was a former `ExecutorServiceSpanProcessor` which has been removed)
  - the attributes do not distinguish instances when multiple `BatchSpanProcessor` instances are used
- `processedSpans`, value is the number of spans submitted to the Processor
  - `spanProcessorType`=`BatchSpanProcessor`
  - `dropped` (`boolean`), `true` for the number of spans which could not be processed due to a full queue

The SDK also implements pretty much the same metrics for the `BatchLogRecordProcessor`, just with `span` replaced everywhere by `log`.
Exporter metrics
Exporter metrics are the same for spans, metrics and logs. They are distinguishable based on a `type` attribute. Also the metric names are dependent on a "name" and "transport" defined by the exporter. For OTLP those are:
- `exporterName`=`otlp`
- `transport` is one of `grpc`, `http` (= protobuf) or `http-json`

The transport is used just for the instrumentation scope name: `io.opentelemetry.exporters.<exporterName>-<transport>`
Based on that, the following metrics are exposed:
- Counter `<exporterName>.exporter.seen`: The number of records (spans, metrics or logs) submitted to the exporter
  - `type`: one of `span`, `metric` or `log`
- Counter `<exporterName>.exporter.exported`: The number of records (spans, metrics or logs) actually exported (or failed)
  - `type`: one of `span`, `metric` or `log`
  - `success` (boolean): `false` for exporter failures

Merge requirement checklist
[chore]