WIP: adding observability #17
base: main
Conversation
Hey @jabolina, first and foremost much thanks and props for taking the time to do this! Really really appreciate the effort and kindness 😊 Observability is definitely a valuable and needed feature for this project. That being said, I think there are a couple of things to talk about now:
Context
As you touch on being able to traverse multiple services in order to evaluate potential problems, these might help in your overall perception of the project, and are also relevant to this PR:
I'd also like your thoughts, if you have any, on these two considerations. Mainly on 2., which we've not really made up our minds about yet.
Moving forward
OpenTelemetry
I'm only slightly familiar with OpenTelemetry (and have been meaning to study it at a deeper level), so take these considerations w/ a grain of salt. In our context, tracing exists at two levels of boundaries:
OK, when?
Being pragmatic, in order to fully implement an OpenTelemetry API within this project, there are a few things that IMHO need to be in place so we can leverage its full capacity:
As for our priority, there's currently a focus on improving the overall authorization design (#18), as it is a more urgent matter, and it is the most time-consuming task of this first Milestone. Ofc if we're not considering documentation... sigh 😪
So, to sum up, I can say that I'm available to start on 1. and 2. by next week. How does that sound? Do you think I'm missing something? Have any thoughts about what I've talked about here? All thoughts are welcome :-)
ayay, sorry for taking so long to answer 😅 😅
I think having these 3 points already well defined before evolving this (in code) will make the process easier. In this PR I tried to stay generic over any tool/lib used, but that is the hard path; working on something already pre-defined would make it easier.
Currently, where I work they developed something similar for the event communication and tracing, and I took some inspiration from it while thinking about this. I really like the message structure, which makes it easier to handle versioning, to carry metadata around, and to identify each step inside the whole message lifetime within the system. About the message bus, I used RabbitMQ in the past and it handled everything needed. But I think it will vary with the future use cases for this project, since it will work as a template for other projects, for example:
Maybe none of this really matters and the project that uses this template should be responsible for handling these specific cases.
So, about the PR: since this is not a critical nor essential feature at this time, I think we could start with some definitions about what is needed and what will be required for tracing the available features, once we have both a well defined message structure and broker. What do you think? If you prefer, to keep everything organized, I can close this PR and start a doc on the discussions?
This is still early in the process for tracing, but I think it is a good time to open a PR just to see if this makes sense. Some days ago I came across the repository, and since I wanted to do something related to observability I started this.
Proposal
Since this project will be used as a template for other projects, and one of the available features is the event-driven architecture, I thought this could be a nice feature to have. When a message traverses multiple services, failures are hard to debug and the root cause is hard to identify; for this purpose, this (early) feature adds the possibility to trace distributed events using OpenTelemetry and to export the collected information to Jaeger. To expose some metrics for monitoring the service, a Prometheus exporter is also included.
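As a rough sketch, the wiring could look something like the block below, using the opentelemetry SDK with the Jaeger thrift exporter; the package layout, service name and ports here are just one possible setup, not necessarily what this PR ends up with:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Register a tracer provider that ships finished spans to a local Jaeger agent.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(JaegerExporter(agent_host_name="localhost", agent_port=6831))
)

tracer = trace.get_tracer("service-template")

with tracer.start_as_current_span("handle-event"):
    ...  # the traced block: timing and errors are recorded on the span
```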
With the final implementation it will be possible to see which services an event has touched, the time spent in every step, and to easily identify failures. In a distributed environment with multiple services, this should be a nice feature to have. The tracing feature is the main focus of the implementation; the gathering of metrics with Prometheus can still be enhanced (for example, the possibility to add labels and expose more features), but I think it is already usable. Some examples are added at the end.
Current state
In the current state, I think some errors could arise when tracing a block that spawns multiple threads, since the span context must be propagated between the threads. Using coroutines could be another source of trouble: since coroutines run on a single thread, a single span context will exist, and this could lead to spans being associated with the wrong traces.
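As a sketch of what that propagation could look like with the opentelemetry context API (the span names and the worker helper are only illustrative, not the actual helpers in this PR):

```python
import threading

from opentelemetry import context, trace

tracer = trace.get_tracer(__name__)


def worker(parent_ctx):
    # Re-attach the captured context so the child span joins the parent trace
    # instead of starting a fresh one in this thread.
    token = context.attach(parent_ctx)
    try:
        with tracer.start_as_current_span("child-block"):
            ...  # work done in the spawned thread
    finally:
        context.detach(token)


with tracer.start_as_current_span("root-block"):
    # The current context does not cross thread boundaries on its own,
    # so it has to be captured and handed to the new thread explicitly.
    thread = threading.Thread(target=worker, args=(context.get_current(),))
    thread.start()
    thread.join()
```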
Using the current implementation, if we have the following scenario:
The traces will be collected correctly: even with the second thread spawned by an already child thread, the context can be identified across threads, even though I'm not really sure the second thread identifies its parent correctly. But on the following scenario,
everything explodes: since the context changes on every call, the traces could be wrong.
At this moment, a stack is being used to control each span, which represents a synchronous sequence of function calls well; with more development this structure may need to be changed. I did not try to use coroutines to verify the behavior, but I guess the same problem remains. I think the majority of use cases need this context propagation to work correctly.
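As a rough illustration of the stack idea (hypothetical names, not the actual implementation in this PR):

```python
class SpanStack:
    """Keeps the currently open spans of a synchronous call chain.

    The span pushed last is the parent of the next span started, which mirrors
    nested function calls nicely but breaks down once several threads or
    coroutines push and pop onto the same stack.
    """

    def __init__(self):
        self._spans = []

    def push(self, span):
        self._spans.append(span)

    def pop(self):
        return self._spans.pop()

    def current(self):
        return self._spans[-1] if self._spans else None
```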
There probably exist more scenarios where the traces are not collected correctly, so in the current state this is not ready to be used and trusted. Also, the opentelemetry libraries available for Python are receiving commits frequently, some of the docs are not up to date, and some research is needed during development.
Examples
A simple example was written to verify how things are working. This example traces something similar to the first drawing above, where a block is spawned in a separate thread. This spawned thread will spawn a separate block that sleeps for 1 second and makes a blocking call to calculate the Fibonacci of n.
To verify the metric collection, a counter was added to the Fibonacci calls (since it is recursive, it is more interesting), and a time measurement was added to the root block. At the end, the collected metrics are printed to stdout. The code is:
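A rough sketch of such an example is below; it uses prometheus_client directly for the counter and the timing, which is an assumption, so the actual code in the PR may differ:

```python
import threading
import time

from opentelemetry import context, trace
from opentelemetry.sdk.trace import TracerProvider
from prometheus_client import Counter, Summary, generate_latest

# The Jaeger exporter would be wired as in the earlier sketch; a bare SDK
# provider is enough to get recording spans for this example.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

FIB_CALLS = Counter("fibonacci_calls_total", "recursive calls made while computing fibonacci(n)")
ROOT_TIME = Summary("root_block_seconds", "wall-clock time spent in the root block")


def fibonacci(n: int) -> int:
    FIB_CALLS.inc()
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


def spawned_block(parent_ctx):
    # Re-attach the parent context so these spans join the root trace.
    token = context.attach(parent_ctx)
    try:
        with tracer.start_as_current_span("spawned-block"):
            with tracer.start_as_current_span("sleep"):
                time.sleep(1)
            with tracer.start_as_current_span("fibonacci"):
                fibonacci(20)
    finally:
        context.detach(token)


@ROOT_TIME.time()
def root_block():
    with tracer.start_as_current_span("root-block"):
        worker = threading.Thread(target=spawned_block, args=(context.get_current(),))
        worker.start()
        worker.join()


root_block()
print(generate_latest().decode())
```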
The collected metrics printed to stdout are:
The collected traces, which can be seen on the Jaeger dashboard:
To use this, the Jaeger collector needs to be exposed. This can be done using the available docker image:
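For example, something along these lines (the exposed ports depend on how the exporter is configured; 6831/udp is the agent port and 16686 the UI):

```
docker run -d --name jaeger \
  -p 6831:6831/udp \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
```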
Next steps
To avoid the need to have a Jaeger container running locally, create a NoopTracer where nothing is really traced, selected based on the configured environment. For this test version nothing is configurable, so adding a configurable client is also needed. After that, the focus is on solving the context propagation between threads/coroutines and adding some tests to validate it.
I started this only for fun, but if you guys think it is something nice to have I could work on it in my spare time (classes and work consume most of my time) and try to fix everything needed for a working version. Since Python is not my native language, some things will probably need changes.
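A sketch of what that environment-driven setup could look like; the environment variable names are only placeholders, and opentelemetry's default tracer is already effectively a no-op when no SDK provider is configured:

```python
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter


def configure_tracing() -> None:
    """Only wire a real exporter when tracing is explicitly enabled.

    When tracing is disabled, the API's default (non-recording) tracer is kept,
    so no Jaeger collector is required and spans cost almost nothing.
    """
    if os.getenv("TRACING_ENABLED", "false").lower() != "true":
        return  # keep the default no-op behaviour

    provider = TracerProvider()
    provider.add_span_processor(
        BatchSpanProcessor(
            JaegerExporter(
                agent_host_name=os.getenv("JAEGER_AGENT_HOST", "localhost"),
                agent_port=int(os.getenv("JAEGER_AGENT_PORT", "6831")),
            )
        )
    )
    trace.set_tracer_provider(provider)
```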
Tools used
I chose OpenTelemetry simply because they are part of the CNCF and are trying to set standards for observability and telemetry. Jaeger for collecting the data was chosen for the same reason: it is an open source tool and also part of the CNCF. Another option that works really well for this scenario is Datadog, but that is a paid solution.
For metrics, Prometheus also seems the natural option, following the same logic as for the other tools.