Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: OpenTelemetry #111

Open
1 task done
niklastreml opened this issue Feb 21, 2024 · 8 comments
Open
1 task done

Feature: OpenTelemetry #111

niklastreml opened this issue Feb 21, 2024 · 8 comments
Assignees
Labels
request Indicates a feature request

Comments

@niklastreml
Copy link
Collaborator

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The prometheus format is not sufficient for exporting metrics from the traceroute check, aside from the amount of hops. To provide more complex data, it might be necessary to support a second data format like opentelemetry.

Solution Description

Implement the OTEL library and inject it into the checks, so they can write their own traces. Traceroute can use this to collect more detailed data about how every single invocation of the check behaves, essentially allowing a user to visualize how packets move from sparrow to their target

Who can address the issue?

No response

Additional Context

https://github.com/open-telemetry/opentelemetry-go
https://opentelemetry.io/docs/languages/go/
https://www.jaegertracing.io/docs/1.54/client-libraries/#go
https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/traces/

@niklastreml niklastreml added the request Indicates a feature request label Feb 21, 2024
@lvlcn-t lvlcn-t changed the title Feature: OpenTelemtry Feature: OpenTelemetry Feb 21, 2024
@niklastreml
Copy link
Collaborator Author

niklastreml commented Feb 21, 2024

There might be a way to build traceroute metrics with prometheus, but I'm not sure if I like it.
We can create a sparrow_traceroute_hop metric that looks sort of like this:

sparrow_traceroute_hop{target="somedomain.com" n=2 hopAddr="10.0.0.1"} 152
  • target is what the user wants to check in the config
  • hopAddr is the specific address where the packet failed
  • n declares that hopAddr is the nth hop on the way to target
  • the actual metric is the time in milliseconds it took to get to that hop and back

The following traceroute cli output would map to prometheus metrics like so:

$ traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  MyRouter (68.87.192.226)         0.345 ms
 2  68.87.192.227 (68.87.192.227)    9.291 ms
 3  dns.google (8.8.8.8)             9.231 ms
sparrow_traceroute_hop{target="8.8.8.8" n=1 hopAddr="68.87.192.226"} 0.345
sparrow_traceroute_hop{target="8.8.8.8" n=2 hopAddr="68.87.192.227"} 9.291
sparrow_traceroute_hop{target="8.8.8.8" n=3 hopAddr="8.8.8.8"} 9.231

I'm not sure how easy it is to create a nice visualization from this though. Let me know what your thought on this are @lvlcn-t @puffitos @y-eight

@y-eight
Copy link
Member

y-eight commented Feb 21, 2024

I think it is not a good idea. Especially when the route is changing often. The metrics do not provide a dependency to each other over time.

Even if we would clear the last state, the issue is transported to the next layer (the Prometheus).

Update: I am not 100% sure, it might be fine for a start. Clearing the state and expose the last traceroute as metrics. Especially it is easy to implement in contrast to OTEL. In the end we need to get feedback from the experts.

@niklastreml niklastreml mentioned this issue Feb 21, 2024
2 tasks
@niklastreml
Copy link
Collaborator Author

thats what I thought too. We would be generating a lot of metrics with different labels with this. In the worst case we'd get maxHops * targets metrics, which isn't exactly great

@puffitos
Copy link
Collaborator

Let's forget Prometheus for this one. The metrics will explode if we go that way. Let's try to make some simple API response available and then concentrate on offering OTEL traces, which can be visualised in the the preferred monitoring tool, which is grafana.

@lvlcn-t
Copy link
Collaborator

lvlcn-t commented Feb 21, 2024

@niklastreml Imo we shouldn't use prometheus metrics for something they're not intended/hardly capable of. If the traceroute check was the only check that would implement otel traces then we could consider building this workaround for prometheus metrics but since the other checks will also write their own traces we should focus on implementing OTel.

@puffitos I don't know about forgetting prometheus completely for the traceroute check. I think we should still offer the amount of hops as prometheus metric for the traceroute check.

@lvlcn-t
Copy link
Collaborator

lvlcn-t commented Feb 21, 2024

When implementing the traces for the dns check, we should keep this in mind: #81 (comment)

@puffitos
Copy link
Collaborator

@lvlcn-t sorry, I communicated my intention incorrectly; we shouldn't use prometheus metrics for each trace, as that would make the metrics' cardinality rise quickly. We can use the metrics for number of hops to target, totalDuration, meanDuration and so on, if those metrics make any sense in a trace route scenario. I haven't used the check in a while, so I'm not exactly sure what's critical information and what's not.

@lvlcn-t
Copy link
Collaborator

lvlcn-t commented Apr 13, 2024

I've started to implement the first step of completing this issue in feat/otel-provider. I've added a otel provider setup depending on the user's (startup) configuration.
To be able to keep the provider selection dynamic, I'm using the OpenTelemetry Protocol (OTLP) to communicate with the collector. This way, the user can choose the provider and the collector to use.

The following configuration options will be available for the user (naming is not final):

# Configures the telemetry exporter.
telemetry:
  # The telemetry exporter to use.
  # Options:
  # grpc: Exports telemetry using OTLP via gRPC.
  # http: Exports telemetry using OTLP via HTTP.
  # stdout: Prints telemetry to stdout.
  # noop | "": Disables telemetry.
  exporter: grpc
  # The address to export telemetry to.
  url: localhost:4317
  # The token to use for authentication.
  # If the exporter does not require a token, this can be left empty.
  token: ""
  # The path to the tls certificate to use.
  # To disable tls, either set this to an empty string or set it to insecure.
  certPath: ""

Since the OpenTelemetry Protocol (OTLP) is a standard, the user can choose any collector that supports this protocol. The user can also choose to use the stdout exporter to see the traces in the console (for debugging purposes) or the noop exporter to disable the telemetry. If the user chooses an external collector, the user can also provide a bearer token for authentication. The user can also provide a path to a tls certificate to use for secure communication.

Next I'm wondering if I should already open the PR even though the actual traces are not yet implemented. This way we can split the issue into smaller parts and everyone can instrument one part of the codebase. What do you think?

This was referenced Aug 5, 2024
@niklastreml niklastreml self-assigned this Aug 7, 2024
@niklastreml niklastreml modified the milestones: 0.4, 0.4.0 Aug 7, 2024
@lvlcn-t lvlcn-t self-assigned this Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
request Indicates a feature request
Projects
None yet
Development

No branches or pull requests

4 participants