feat: Add open telemetry metrics and traces #3404

Merged
21 commits merged on Feb 7, 2025

Conversation

@nasdf (Member) commented Jan 23, 2025

Relevant issue(s)

Resolves #293
Resolves #74

Description

This PR adds OpenTelemetry metrics and tracing. Telemetry is only enabled if the telemetry build flag is set.

Our default telemetry configuration uses the HTTP exporter, but we can expand it to gRPC and console exporters if needed.

The exporters are configured through the standardized OpenTelemetry environment variables found here:

https://pkg.go.dev/go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp
https://pkg.go.dev/go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp
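As a rough illustration of how those environment variables interact, the sketch below mimics the precedence rule documented for the OTLP HTTP trace exporter: a signal-specific endpoint variable wins over the generic one, and the generic one gets `/v1/traces` appended. The helper `tracesEndpoint` is hypothetical and uses only the standard library; the real resolution happens inside the exporter packages linked above, not in DefraDB code.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tracesEndpoint is a hypothetical helper (not part of DefraDB or the
// OpenTelemetry SDK) that sketches the documented precedence between the
// generic and the trace-specific OTLP endpoint variables.
// getenv is injected so the logic can be exercised without touching the
// real process environment.
func tracesEndpoint(getenv func(string) string) string {
	// The signal-specific variable takes precedence and is used as-is.
	if v := getenv("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"); v != "" {
		return v
	}
	// The generic variable is a base URL; the trace path is appended.
	if v := getenv("OTEL_EXPORTER_OTLP_ENDPOINT"); v != "" {
		return strings.TrimRight(v, "/") + "/v1/traces"
	}
	// Nothing set: the exporter would fall back to its built-in default.
	return ""
}

func main() {
	fmt.Println(tracesEndpoint(os.Getenv))
}
```

With `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318`, as in the manual test in this description, the sketch resolves to the collector's `/v1/traces` path on port 4318.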

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

Manually testing with Jaeger

docker run --rm --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 5778:5778 \
  -p 9411:9411 \
  jaegertracing/jaeger:2.2.0

DEFRA_KEYRING_SECRET=secret \
 OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
 go run -tags telemetry ./cmd/defradb start

Trace dashboard
[Screenshot: Jaeger UI]

Detailed trace example
[Screenshot: Jaeger UI, GetCollections trace (3240f64), DefraDB]

Nested trace example
[Screenshot: Jaeger UI, ExecRequest trace (b146bd9), DefraDB]

Specify the platform(s) on which this was tested:

  • MacOS

@nasdf nasdf self-assigned this Jan 23, 2025
@@ -39,7 +39,8 @@ import (
 )

 var (
-	log = corelog.NewLogger("db")
+	log    = corelog.NewLogger("db")
+	tracer = otel.Tracer("github.com/sourcenetwork/defradb/internal/db")
Contributor

thought: I think I agree with John's suggestion of extracting this out to a thin package so we can avoid scattering references throughout the code base to the same 3rd party package.

If we ever swap it out we'll probably still need to change the func signatures (I wouldn't put much thought into designing them), but at least everything will be in the same place, and it makes adding/standardising any middleware much easier.

Member Author

Update: the metrics were all moved to the internal metric package and put behind a build flag telemetry.

@@ -212,6 +213,9 @@ func (db *DB) AddPolicy(
 	ctx context.Context,
 	policy string,
 ) (client.AddPolicyResult, error) {
+	ctx, span := tracer.Start(ctx, "AddPolicy")
Contributor

suggestion: I do not know about performance/C-GO, but we can probably avoid having to manually specify "AddPolicy" by using the runtime package and stuff like runtime.CallersFrames instead.
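A minimal sketch of that idea, assuming the span name should be the short name of the calling function: `callerName` and the `DB` type below are hypothetical illustrations, not the code that ended up in this PR. It relies only on `runtime.Callers` and `runtime.CallersFrames` from the standard library.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// callerName returns the short name of the function that called it.
// It is a hypothetical helper showing how a span name could be derived
// at runtime instead of being passed as a string literal.
func callerName() string {
	pcs := make([]uintptr, 4)
	// skip=2 skips runtime.Callers itself and callerName.
	n := runtime.Callers(2, pcs)
	if n == 0 {
		return "unknown"
	}
	// CallersFrames accounts for inlined functions when resolving frames.
	frame, _ := runtime.CallersFrames(pcs[:n]).Next()
	name := frame.Function // fully qualified, e.g. "main.(*DB).AddPolicy"
	if i := strings.LastIndex(name, "."); i >= 0 {
		name = name[i+1:]
	}
	return name
}

// DB is a stand-in type used only for this illustration.
type DB struct{}

// AddPolicy returns the span name a tracer wrapper could compute for it.
func (db *DB) AddPolicy() string {
	return callerName()
}

func main() {
	fmt.Println((&DB{}).AddPolicy())
}
```

The stack walk runs on every call, so whether the cost is acceptable on hot paths would need measuring, which matches the performance caveat raised in the suggestion.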

Member Author

I can look into that.

Member Author

Update: I was able to get this working. Thanks for the suggestion!

Member

Maybe I am missing something, but I still see the manual specification.

Member Author

The preview above is outdated. Should be fixed in the actual code.

codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 84.58150% with 35 lines in your changes missing coverage. Please review.

Project coverage is 78.34%. Comparing base (d6b003b) to head (4c4f59d).
Report is 1 commits behind head on develop.

Files with missing lines          Patch %   Lines
cli/start.go                      35.71%    9 Missing ⚠️
internal/telemetry/otel.go        83.33%    6 Missing and 3 partials ⚠️
internal/telemetry/telemetry.go   40.00%    6 Missing and 3 partials ⚠️
internal/db/collection.go         85.71%    3 Missing ⚠️
internal/db/collection_delete.go  0.00%     3 Missing ⚠️
internal/telemetry/noop.go        71.43%    2 Missing ⚠️
Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #3404      +/-   ##
===========================================
+ Coverage    78.27%   78.34%   +0.07%     
===========================================
  Files          393      395       +2     
  Lines        36113    36286     +173     
===========================================
+ Hits         28266    28426     +160     
- Misses        6187     6194       +7     
- Partials      1660     1666       +6     
Flag       Coverage Δ
all-tests  78.34% <84.58%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines        Coverage Δ
cli/config.go                   80.00% <ø> (ø)
internal/db/collection_get.go   80.65% <100.00%> (-4.10%) ⬇️
internal/db/collection_index.go 87.82% <100.00%> (+0.28%) ⬆️
internal/db/collection_update.go 74.80% <100.00%> (+0.61%) ⬆️
internal/db/db.go               67.98% <100.00%> (+1.18%) ⬆️
internal/db/p2p_replicator.go   62.33% <100.00%> (+0.60%) ⬆️
internal/db/p2p_schema_root.go  80.92% <100.00%> (-0.78%) ⬇️
internal/db/request.go          87.88% <100.00%> (+5.02%) ⬆️
internal/db/schema.go           84.81% <100.00%> (ø)
internal/db/store.go            73.89% <100.00%> (+8.70%) ⬆️
... and 8 more

... and 15 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@nasdf nasdf changed the title from "feat: OpenTelemetry [DO NOT REVIEW]" to "feat: Add open telemetry metrics and traces" Jan 29, 2025
@nasdf nasdf marked this pull request as ready for review January 29, 2025 21:36
@nasdf nasdf requested review from a team and AndrewSisley January 29, 2025 21:37
Contributor

@AndrewSisley AndrewSisley left a comment

praise: This looks really good considering my natural aversion to it, good job at minimising its impact on readability.

I have not really reviewed where you have added the tracer (i.e. all the tracer.Start calls), as I would prefer to not have any by default and instead add them in when we really feel the need to (like debug logs) - I will let someone else review those.

todo: Please add a job to the CI test matrix so that we have one job that executes the tests with the tracer enabled.

@@ -46,7 +49,10 @@ func NewParser() (*parser, error) {
 	return p, nil
 }

-func (p *parser) BuildRequestAST(request string) (*ast.Document, error) {
+func (p *parser) BuildRequestAST(ctx context.Context, request string) (*ast.Document, error) {
+	_, span := tracer.Start(ctx)
Contributor

question: Why are you not assigning the return value to ctx here? It looks like a bug-in-waiting. Did the compiler/linter complain?

Member Author

Yes, it was a linter warning because the context is only used for creating the span. The reason for adding the context to the function signature is that it allows the creation of nested spans by using context values.

Contributor

Ah okay, thanks for the answer.

I think I have a slight preference for assigning it with a //nolint:foo comment, as that would be less risky in case we later use ctx in this function, but it is a bit ugly so please don't feel any pressure to change it if you aren't certain in your agreement :)

Member

@shahzadlone shahzadlone Feb 4, 2025

I do have a slightly stronger preference to leave it as it was and not do the linter suppression. Sorry, Keenan.

Member Author

I ended up reverting the change since more of us preferred the original.

cli/start.go Outdated
if !cfg.GetBool("no-telemetry") {
	err := metric.ConfigureTelemetry(cmd.Context())
	if err != nil {
		log.ErrorE("failed to configure telemetry", err)
Member

thought: Perhaps define in cli/errors.go

(thought because it's a log).

Member

@shahzadlone shahzadlone left a comment

Good stuff, just left some comments.

Please do add the CI test with the tracer enabled. It doesn't have to be a matrix; just one workflow is good. You can do something like the test-view or test-encryption job in .github/workflows/test-coverage.yml.

@@ -21,6 +21,7 @@ defradb start [flags]
--max-txn-retries int Specify the maximum number of retries per transaction (default 5)
--no-encryption Skip generating an encryption key. Encryption at rest will be disabled. WARNING: This cannot be undone.
--no-p2p Disable the peer-to-peer network synchronization system
--no-telemetry telemetry Disables telemetry reporting. Telemetry is only enabled in builds that use the telemetry flag.
Collaborator

question: Is telemetry next to --no-telemetry supposed to be there?

Member Author

good catch! should be fixed now

- name: Test coverage & save coverage report in an artifact
  uses: ./.github/composites/test-coverage-with-artifact
  with:
    coverage-artifact-name: "coverage_encryption"
Member

issue: This typo is very problematic as it will fail all reports due to duplicate artifact name (or overwrite which was the old behavior). Please fix the name to something relevant.

Member Author

nice catch! should be fixed now


//go:build telemetry

package metric
Member

nitpick: Feel free to rename the package to telemetry at this point.

Member Author

done

@@ -50,7 +50,7 @@ func NewParser() (*parser, error) {
 }

 func (p *parser) BuildRequestAST(ctx context.Context, request string) (*ast.Document, error) {
-	_, span := tracer.Start(ctx)
+	ctx, span := tracer.Start(ctx) //nolint:ineffassign,staticcheck
Member

todo: Why do you need to do this? 1) ctx is not used anywhere and 2) it is an ineffectual assignment.

I like what you had before.

Same for all other instances

Collaborator

He changed it because Andy asked him to, I think. I'm also not a fan of this change.

Member

Didn't seem like a strong preference by him, I commented on the original thread now.

Contributor

@AndrewSisley AndrewSisley left a comment

LGTM, thanks Keenan!

@nasdf nasdf merged commit 34eab3f into sourcenetwork:develop Feb 7, 2025
43 of 44 checks passed
Successfully merging this pull request may close these issues.

[Epic] Add Observability (metrics) Tracing system
4 participants