
WIP: adding observability #17

Open · wants to merge 3 commits into main

Conversation


@jabolina jabolina commented Mar 27, 2021

This is still early in the process for tracing, but I think it is a good time to open a PR just to see if this makes sense. A few days ago I came across the repository, and since I wanted to do something related to observability, I started this.

Proposal

Since this project will be used as a template for other projects, and one of the available features is the event-driven architecture, I thought this could be a nice feature to have. When a message traverses multiple services, failures are hard to debug and the root cause is hard to identify. For that purpose, this (early) feature adds the possibility to trace distributed events using OpenTelemetry and to export the collected information to Jaeger. For exposing some metrics to monitor the service, a Prometheus exporter is also included.

With the final implementation it will be possible to see which services an event touched, the time spent in every step, and to easily identify failures. In a distributed environment with multiple services this should be a nice feature to have. The tracing feature is the main focus of the implementation; the metrics gathering with Prometheus can still be enhanced (for example, the possibility to add labels and expose more features), but I think it is already usable. Some examples are added at the end.
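
To make the proposal concrete, the wiring behind the tracing side is roughly like the sketch below. This is a simplified sketch, not the exact PR code, and the module names depend on the opentelemetry-python version (older releases call the processor BatchExportSpanProcessor, for example):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Send finished spans in batches to a local Jaeger agent (UDP port 6831).
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(JaegerExporter(agent_host_name="localhost", agent_port=6831))
)

tracer = trace.get_tracer(__name__)

# Every block wrapped like this shows up as a span in the Jaeger dashboard.
with tracer.start_as_current_span("base") as span:
    span.set_attribute("nth", "3")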

Current state

In the current state I think some errors could arise when tracing a block that spawns multiple threads, since the span context must be propagated between the threads. Using coroutines could be another problem: since coroutines run on a single thread, a single span context will exist, and maybe this could lead to spans being associated with the wrong traces.

Using the current implementation, if we have the following scenario:

caller           parent           childA          childB
   |                |               |               |
   | --- Thread --->|               |               |
   |                | --Blocking--> |               |
   |                |               |               |
   |                |               | ---Thread---> |
   |                |               |               |
   |                |               | <----Done---- |
   |                |               |               |
   |                | <----Done---- |               |
   | <----Done----- |               |               |
   |                |               |               |
   V                V               V               V

The traces are collected correctly: even with the second thread being spawned by an already-child thread, the context can be identified between threads, though I'm not really sure the second thread identifies its parent correctly. But in the following scenario everything explodes:

caller           parent           childA          childB
   |                |               |               |
   | --- Thread --->|               |               |
   |                | ---Thread---> |               |
   |                |               |               |
   |                | -------------Thread---------> |
   |                |               |               |
   |                | <------------Done------------ |
   |                |               |               |
   |                | <----Done---- |               |
   | <----Done----- |               |               |
   |                |               |               |
   V                V               V               V

Since the context changes on every call, the traces could be wrong. At this moment a stack is being used to control each span, which represents a synchronous sequence of function calls well; with more development this structure may need to be changed. I did not try coroutines to verify the behavior, but I guess the same problem remains. I think the majority of use cases need this context propagation to work correctly.
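
For the thread case, one way out that I still need to try is to capture the span context on the caller and re-attach it inside the spawned thread. A minimal sketch of the idea using the opentelemetry-python context API (not what the PR does today; everything besides the API calls is only illustrative):

import threading

from opentelemetry import context, trace

tracer = trace.get_tracer(__name__)


def worker(parent_ctx):
    # Re-attach the captured context so spans started here become children
    # of the caller's active span instead of new roots.
    token = context.attach(parent_ctx)
    try:
        with tracer.start_as_current_span("childB"):
            pass  # the real work would go here
    finally:
        context.detach(token)


with tracer.start_as_current_span("parent"):
    # Threads do not inherit the active span by themselves, so capture the
    # context here and hand it over explicitly.
    ctx = context.get_current()
    t = threading.Thread(target=worker, args=(ctx,))
    t.start()
    t.join()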

There are probably more scenarios where the traces are not collected correctly, so in the current state this is not ready to be used and trusted. Also, the OpenTelemetry libraries available for Python are receiving commits frequently, some of the docs are not up to date, and some research is needed during development.

Examples

A simple example was written to verify how things are working. This example traces something similar to the first drawing above, where a block is spawned in a separate thread. This spawned thread spawns another block that sleeps for 1 second, followed by a blocking call that calculates the Fibonacci of n.

To verify the metric collection, a counter was added to the Fibonacci calls (since it is recursive, it is more interesting), and a time measurement was added to the root block. At the end, the collected metrics are printed to stdout. The code is:

import threading
import time

from src.observability.decorator import default_tracer, trace, count, default_measurer, observe


@trace("identity")
def identity_fibo():
    # Spawn a short-lived thread that tags the current span with the thread
    # id and sleeps, mimicking the "Thread" hop in the first drawing above.
    def lazy():
        default_tracer.add_property("identify", threading.get_ident())
        time.sleep(1)
    t = threading.Thread(target=lazy)
    t.start()
    t.join()


@count(name="fibonacci", description="Count how many times Fibonacci was called")
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)


@trace("fibo")
def middleware(n):
    default_tracer.add_property("nth", str(n))
    identity_fibo()
    return fib(n)


@trace("base")
@observe(name="fibonacci_caller", description="Time took by the caller to execute 10 fibonacci calls")
def traced_function(n):
    for i in range(10):
        multiplier = i + 1
        default_tracer.add_property("idx-" + str(multiplier), str(middleware(n * multiplier)))


if __name__ == '__main__':
    # Run two traced callers concurrently and print the collected Prometheus
    # metrics at the end.
    threads = [threading.Thread(target=traced_function, args=(3,)) for _ in range(2)]
    [t.start() for t in threads]
    [t.join() for t in threads]

    print(default_measurer.export().decode('utf-8'))

The collected metrics printed to stdout are:

# HELP fibonacci_caller Time took by the caller to execute 10 fibonacci calls
# TYPE fibonacci_caller summary
fibonacci_caller_count 2.0
fibonacci_caller_sum 96.8708688889983
# HELP fibonacci_caller_created Time took by the caller to execute 10 fibonacci calls
# TYPE fibonacci_caller_created gauge
fibonacci_caller_created 1.6168688051154459e+09
# HELP fibonacci_total Count how many times Fibonacci was called
# TYPE fibonacci_total counter
fibonacci_total 7.049132e+06
# HELP fibonacci_created Count how many times Fibonacci was called
# TYPE fibonacci_created gauge
fibonacci_created 1.6168688061167545e+09
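
The shape of that output (the _total suffix on the counter, the summary's _count/_sum pair, the _created gauges) is what prometheus_client produces, so the count/observe decorators can be thin wrappers around it. A rough sketch of the idea, not the exact PR code:

import functools

from prometheus_client import Counter, Summary, generate_latest


def count(name, description):
    counter = Counter(name, description)  # exposed as <name>_total

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            counter.inc()
            return fn(*args, **kwargs)
        return wrapper
    return decorator


def observe(name, description):
    summary = Summary(name, description)  # exposed as <name>_count / <name>_sum

    def decorator(fn):
        # Summary.time() measures the wall-clock duration of each call.
        return summary.time()(fn)
    return decorator


def export():
    # Text exposition format as bytes, like default_measurer.export() above.
    return generate_latest()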

The collected traces can be seen on the Jaeger dashboard:

[screenshot: Jaeger dashboard showing the collected traces]

To use this, the Jaeger collector needs to be exposed. This can be done using the available Docker image:

$ docker run -p 16686:16686 -p 6831:6831/udp -p 14250:14250 jaegertracing/all-in-one

Next steps

To avoid the need to run a Jaeger container locally, create a NoopTracer where nothing is really traced, selected based on the configured environment. In this test version nothing is configurable, so adding a configurable client is also needed. Then the focus is on solving the context propagation between threads/coroutines and adding some tests to validate it.
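
Something along these lines is what I have in mind for the no-op path. Just a sketch: the TRACING_BACKEND variable and the factory argument are made-up names, nothing here exists in the PR yet:

import os
from contextlib import contextmanager


class NoopTracer:
    """Same surface as the real tracer, but records nothing."""

    @contextmanager
    def span(self, name):
        yield self

    def add_property(self, key, value):
        pass


def build_tracer(real_tracer_factory):
    # TRACING_BACKEND is an assumed environment variable: only build the
    # Jaeger-backed tracer when it is explicitly requested.
    if os.getenv("TRACING_BACKEND") == "jaeger":
        return real_tracer_factory()
    return NoopTracer()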

I started this only for fun, but if you think it is something nice to have I could work on it in my spare time (classes and work consume most of it) and try to fix everything needed to get to a working version. Since Python is not my native language, some things will probably need changes.

Tools used

I chose OpenTelemetry because it is part of the CNCF and is trying to set standards for observability and telemetry. Jaeger for collecting the data was chosen because it is an open-source tool and also part of the CNCF; another option that works really well for this scenario is Datadog, but that is a paid solution.

For metrics, Prometheus also seems the natural option, following the same logic as for the other tools.

@ruiconti
Contributor

ruiconti commented Mar 30, 2021

Hey @jabolina, first and foremost much thanks and props for taking the time to do this! Really really appreciate the effort and kindness 😊

Observability is definitely a valuable and needed feature for this project. jaeger looks really promising and I'm honestly impressed with that cascade visualization of an execution trace.

That being said, I think there are a couple of things to talk about now:

  1. add a little bit of context that we haven't been able to document yet, and
  2. discuss how we can move forward with this proposal

Context

Since you touch on being able to traverse multiple services in order to investigate eventual problems, these points might help your overall perception of the project and are also relevant to this PR:

  1. Communication standards: The way in which two or more different services communicate is meant to be only through an external message bus. Here, events and commands are types of message.
  2. Message bus status quo: Currently there're no internal/external buses in use, because this project was forked out of a specific service that didn't use them. 😕 However, there's a concrete implementation of an internal bus here and we plan to get them back soon enough (Improvements on interaction of presentation and service layer #6). As for the external message bus, we're considering using a centralized message broker (in this case, RabbitMQ is at the top of the list).
  3. Concurrent execution: We're using uvicorn (w/ uvloop), an ASGI server that enables us to leverage Python's asyncio in a performant manner. This also means we agree not to use Threads (as they're rather useless due to GIL enforcement) and to implement concurrent execution only with asyncio's coroutine primitives; see the sketch right after this list for how span context behaves with coroutines.
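
For what it's worth, opentelemetry-python keeps the active span in a contextvar and asyncio copies the contextvars context into every task, so a chain like the one below should parent spans correctly out of the box. A quick sketch, not something wired into this project yet:

import asyncio

from opentelemetry import trace

tracer = trace.get_tracer(__name__)


async def child(name):
    # Runs in its own task, but inherits the "parent" span through the
    # contextvars copy made when the task was created.
    with tracer.start_as_current_span(name):
        await asyncio.sleep(0.1)


async def main():
    with tracer.start_as_current_span("parent"):
        await asyncio.gather(child("childA"), child("childB"))


asyncio.run(main())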

I'd also like your thoughts, if you have any, on these two considerations. Mainly on 2., which we've not really made up our minds about yet.

Moving forward

OpenTelemetry

I'm only slightly familiar with OpenTelemetry (and have been meaning to study it at a deeper level), so take these considerations w/ a grain of salt. In our context, tracing exists at two levels of boundaries:

  • Within a service (aggregate): IMHO we are able to implicitly embed context propagation within our internal message bus abstractions. That way, we can effectively trace a chain of events raised inside a service boundary.
  • Outside a service: Since we're most likely to adopt a message broker, I'm confident that we can successfully trace all context spans by carrying them in the message headers (rough sketch below).
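
Roughly what I mean by carrying the context in the headers. A sketch only: the send/handle callables stand in for whatever broker client and handler we end up with, and the propagate module path depends on the opentelemetry-python version:

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)


def publish(send, payload):
    # `send` is whatever broker-client call actually ships the message.
    headers = {}
    with tracer.start_as_current_span("publish"):
        # Serialize the active span context (e.g. a W3C traceparent entry)
        # into the message headers so the consumer can continue the trace.
        inject(headers)
    send(payload, headers)


def consume(handle, payload, headers):
    # Rebuild the producer's context from the headers and parent the
    # consumer span on it, so both ends show up in the same trace.
    ctx = extract(headers)
    with tracer.start_as_current_span("consume", context=ctx):
        handle(payload)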

OK, when?

Being pragmatic, in order to fully implement an OpenTelemetry API within this project, there are a few things that IMHO need to be in place so we can leverage its full capacity:

  1. Implementation of a broker for a common external bus, e.g. roughly a RabbitMQ container + publisher + consumers (no issue opened yet)
  2. Structuring of docker-compose & repo for multiple services under one repo (Structure improvements #11)
  3. Addition of another service to the template (no issue opened yet)

As to our priority, there's currently a focus on improving the overall authorization design (#18), as it is a more urgent matter. But this is the most time-consuming task of this first milestone. Ofc if we're not considering documentation... sigh 😪

So, to sum up, I can say that I'm available to start on 1. and 2. by next week.

How does that sound? Do you think I'm missing something? Have any thoughts about what I've talked about here? All thoughts are welcome :-)

@jabolina
Author

ayay, sorry for taking so long to answer 😅 😅 I think having these 3 points well defined before evolving this (in code) will make the process easier. In this PR I tried to stay generic with regard to any tool/lib used, but that is the hard path; if we work on something already pre-defined, like uvloop, the message broker and the message structures, we should be able to use some of the otel auto-instrumentation tools.

Where I work they developed something similar for the event communication and tracing, and I took some inspiration from it while thinking about this. I really like the message structure, which makes it easier to do versioning, carry metadata around, and identify a step inside the whole message lifetime in the system.

About the message bus, I used RabbitMQ in the past and it handled everything needed. But I think it will depend on the future use cases for this project, since it will work as a template for other projects, for example:

  • Is the data transported through the broker sensitive? Can we have both at-rest and in-flight encryption for these cases?
  • Do we need to handle distributed transactions?
  • Do we need exactly-once message delivery?
  • Will it be a managed instance? If we need to handle it ourselves, how complex is it?

Maybe none of this really matters and the project that uses this template should be responsible for handling these specific cases.
Maybe starting with something that is not complex at this time, with only a generic interface not dependent on the transport guarantees, will be better, and as the needs arise we start adding features.

So about the PR: since this is not a critical nor essential feature at this time, I think we could start with some definitions about what is needed and what will be required for tracing the available features, once we have both a well-defined message structure and broker, and uvloop is already in use. With this I can also have some more time to learn about open-telemetry features and tools 😆.

What do you think? If you prefer, to keep everything organized, I can close this PR and start a doc in the discussions.
