Best practises for collecting metrics and statistics #1542

thomaseizinger · 2022-03-02T03:27:06Z

thomaseizinger
Mar 2, 2022

This discussion is meant to serve as a place where we can collect ideas around how to best collect metrics. I'll create a separate thread for each problem to not clutter things too much. I believe the current state of affairs is worth improving.

I think it will be beneficial for overall maintenance if we can push as much as possible into prometheus instead of requiring another daemon to collect those metrics.

- For example, avoiding database queries for metrics
  - allows us to reduce the code in our application that we need to maintain
  - improves the performance of the application
- Avoiding a dedicated program for collecting metrics gives us more flexibility in what we want to monitor. If aggregation of values (such as closed position size) happens in a dedicated program, then changing our dashboard may require changing and redeploying an application. If, on the contrary, we define metrics in the actual binary, we have more flexibility in which queries we write and can change those without changing the code.

thomaseizinger · 2022-03-02T03:32:38Z

thomaseizinger
Mar 2, 2022
Author

Time series vs. accumulated metrics

Problem definition

Metrics, as prometheus defines them, are time-series data: they record the value of something at specific points in time and persist that value. An easy-to-understand example is the temperature measured by a sensor. This value changes over time (it can go up and down, so it is best modeled by a gauge) and, most importantly, historical values don't affect future values.

Looking at our current dashboard, most of what we are interested in doesn't naturally fit this definition. For example, the total open position size is an accumulated value that changes with every CFD. Technically it fits the definition of a gauge. However, metrics are reset to 0 on startup, so the gauge loses its accumulated value whenever the process restarts.

Possible solutions

Query the database

We can either query the database from within the maker binary or, if we decide to go for something like postgres, query the DB externally. This allows us to define an accumulated metric that can be set to the original value on startup.

Use prometheus queries to accumulate data

At least for counters, we can use the increase() function to accumulate the value of a counter even across restarts. This takes care of resets to 0 by measuring how much the counter value increased, and it also allows you to compute a total. See more here: siimon/prom-client#364

Unfortunately, this does not work for gauges as per the documentation.
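To make the reset handling concrete, here is a minimal sketch of the idea behind increase() on a counter: any drop between two consecutive samples is treated as a restart from 0. This is a simplification written for illustration, not Prometheus' actual implementation (which also extrapolates to the edges of the queried range), and the sample values are made up.

```python
from typing import Sequence


def increase_with_resets(samples: Sequence[float]) -> float:
    """Total growth of a counter across samples, treating any drop as a reset to 0."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted from 0, so everything it shows now is new growth.
            total += current
    return total


# Hypothetical scraped values; the process restarts after the third sample.
samples = [0, 5, 12, 3, 9]
print(increase_with_resets(samples))  # 21.0
```

Note that anything counted between the last scrape before a restart and the restart itself is lost, which is one reason the result is an estimate rather than an exact total.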

Initialize the gauge on startup with the correct value from the DB

This should work but feels hacky. Metrics are by design in-memory in prometheus so there should be a way of handling these resets in a general way.
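A rough sketch of what this could look like, assuming a hypothetical `cfds` table with `quantity` and `state` columns. The `Gauge` class is a tiny in-memory stand-in rather than a real metrics client, and SQLite is used only so the example runs on its own.

```python
import sqlite3


class Gauge:
    """Minimal in-memory stand-in for a metrics gauge (not a real client library)."""

    def __init__(self) -> None:
        self.value = 0.0  # like any in-memory metric, starts at 0 on every startup

    def set(self, value: float) -> None:
        self.value = value


def load_open_position_size(conn: sqlite3.Connection) -> float:
    """Sum the quantity of all open CFDs; the schema is an assumption."""
    row = conn.execute(
        "SELECT COALESCE(SUM(quantity), 0) FROM cfds WHERE state = 'open'"
    ).fetchone()
    return float(row[0])


# Stand-in for the application's database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cfds (id INTEGER PRIMARY KEY, quantity REAL NOT NULL, state TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO cfds (quantity, state) VALUES (?, ?)",
    [(100.0, "open"), (50.0, "closed"), (25.0, "open")],
)

# On startup: seed the gauge from the DB instead of leaving it at 0.
open_position_size = Gauge()
open_position_size.set(load_open_position_size(conn))
print(open_position_size.value)  # 125.0
```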

Use two counters (total position size and closed position size)

These two counters could use increase() to deal with resets, and the total open position size would then simply be the difference between the two.
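As a sketch of the two-counter idea, with made-up sample values and the same simplified reset handling (a drop means the counter restarted from 0):

```python
from typing import Sequence


def increase_with_resets(samples: Sequence[float]) -> float:
    """Total growth of a counter across samples, treating any drop as a reset to 0."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total


def open_position_size(opened: Sequence[float], closed: Sequence[float]) -> float:
    """Open size = everything ever opened minus everything ever closed."""
    return increase_with_resets(opened) - increase_with_resets(closed)


# Hypothetical scrapes of both counters; the process restarts after the third sample.
opened = [0, 100, 250, 50]  # 250 before the restart, 50 after
closed = [0, 0, 100, 0]     # 100 before the restart, nothing after
print(open_position_size(opened, closed))  # 200.0
```

Both counters only ever go up, which is what lets the reset handling work; the gauge-like value is derived at query time instead of being stored.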

2 replies

thomaseizinger Mar 2, 2022
Author

This comment is interesting:

That also seems like it would require setting your Prometheus retention period to "forever".

That's a bit of a different use case, and out of scope for Prometheus. If you want a perfect count of how many times something has happened ever, logs are usually the appropriate solution. Prometheus works over arbitrary time periods of a specified duration, not unbounded time periods.

thomaseizinger Mar 2, 2022
Author

As documented in this talk, we may be able to use sum on a gauge to mask resets.

However, the important thing to consider here is that prometheus will always operate only on the specified time-scale. Does this mean that if we choose too short a time-scale, we may not see the aggregate of all values? I am not sure about this and will test it out!
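The effect of the time-scale can at least be illustrated for a counter. The sketch below only looks at samples inside the chosen window, which is the assumption being illustrated; timestamps and values are made up, and it does not answer whether sum on a gauge behaves the same way.

```python
from typing import Sequence, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, counter value)


def increase_in_window(samples: Sequence[Sample], start: int, end: int) -> float:
    """Growth of a counter using only the samples that fall inside [start, end]."""
    values = [value for timestamp, value in samples if start <= timestamp <= end]
    total = 0.0
    for previous, current in zip(values, values[1:]):
        # A drop is treated as a restart from 0.
        total += current - previous if current >= previous else current
    return total


samples = [(0, 0), (60, 40), (120, 70), (180, 70), (240, 90)]
print(increase_in_window(samples, 0, 240))    # 90.0, the whole history
print(increase_in_window(samples, 180, 240))  # 20.0, only the last minute
```

With the short window, the 70 units accumulated earlier are simply not part of the result, so a "total" computed this way depends on how far back the query reaches (and on how long the samples are retained).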

thomaseizinger · 2022-03-02T08:13:31Z

thomaseizinger
Mar 2, 2022
Author

I think it will be beneficial for overall maintenance if we can push as much as possible into prometheus instead of requiring another daemon to collect those metrics.

* For example, avoiding database queries for metrics
  * allows us to reduce the code in our application that we need to maintain
  * improves the performance of the application

* Avoiding a dedicated program for collecting metrics gives us more flexibility in what we want to monitor. If aggregation of values (such as closed position size) happens in a dedicated program, then changing our dashboard may require changing and redeploying an application. If, on the contrary, we define metrics in the actual binary, we have more flexibility in which queries we write and can change those without changing the code.

After watching https://www.youtube.com/watch?v=67Ulrq6DxwA, it became clear to me that this isn't really a possible way forward. The speaker mentioned like 10 times that metrics are inaccurate by design and anything that needs accuracy should use logs (or something else).

In general, it seems like prometheus by itself is not designed for reporting on aspects of the domain like the closed position size. I wonder if there is something better that we can plug into grafana?

Otherwise, having a daemon that separately queries a shared database (like postgres) would be an option that allows us to reduce the load on the actual system.

3 replies

thomaseizinger Mar 2, 2022
Author

Grafana has a Postgres data source: https://grafana.com/docs/grafana/latest/datasources/postgres/

This means we can either query the database directly in case #1545 is successful or have a dedicated reporting database that we write data to.

The advantage of the latter would be that we don't have to put knowledge about our internal data model into SQL queries in grafana. However, the same may also be achieved with reporting views. We would need to evaluate the performance of that.
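A reporting view could look roughly like the sketch below. The `cfds` table, its columns, and the view name `report_position_size` are hypothetical, and SQLite stands in for postgres only so the example is runnable; the point is that the dashboard query names the view and never touches the internal tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Internal data model (assumed for illustration).
    CREATE TABLE cfds (
        id INTEGER PRIMARY KEY,
        quantity REAL NOT NULL,
        state TEXT NOT NULL
    );
    INSERT INTO cfds (quantity, state)
    VALUES (100.0, 'open'), (50.0, 'closed'), (25.0, 'closed');

    -- Reporting view: the only thing a dashboard query needs to know about.
    CREATE VIEW report_position_size AS
        SELECT state, SUM(quantity) AS total_quantity
        FROM cfds
        GROUP BY state;
    """
)

# What a dashboard panel would run.
rows = dict(conn.execute("SELECT state, total_quantity FROM report_position_size"))
print(rows)
```

If the internal schema changes, only the view definition has to follow; the view is still computed from the live tables on every query, which is the performance question raised above.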

bonomat Mar 2, 2022
Maintainer

After watching youtube.com/watch?v=67Ulrq6DxwA, it became clear to me that this isn't really a possible way forward. The speaker mentioned like 10 times that metrics are inaccurate by design and anything that needs accuracy should use logs (or something else).

That's an interesting point I haven't heard before.

re-db:
yes, I think for proper reporting querying the db is a good idea which I thought I had mentioned before - way better than what's in use now with iterating over all CFDs and collecting data 😬
Whether queries, views or a separate reporting db is the way to go, I don't know. My gut feeling says that a separate db might be some code overhead but can give us a better query performance than having to parse the JSON blob in our db for every other query.

thomaseizinger Mar 2, 2022
Author

After watching youtube.com/watch?v=67Ulrq6DxwA, it became clear to me that this isn't really a possible way forward. The speaker mentioned like 10 times that metrics are inaccurate by design and anything that needs accuracy should use logs (or something else).

That's an interesting point I haven't heard before.

The gist of the video is that metrics are always sampled, i.e. they represent data at points in time, and thus there is always a form of extrapolation going on depending on the time window you are looking at. This is only true if the metrics are captured in real time though (like the number of requests etc).

re-db:
yes, I think for proper reporting querying the db is a good idea which I thought I had mentioned before - way better than what's in use now with iterating over all CFDs and collecting data 😬
Whether queries, views or a separate reporting db is the way to go, I don't know. My gut feeling says that a separate db might be some code overhead but can give us a better query performance than having to parse the JSON blob in our db for every other query.

I don't recall us talking about it before but two minds having the same idea independently is definitely a form of validation. I am also tending towards a separate reporting DB. It should be much more efficient to store the reporting data in exactly the form it is needed.
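As a rough sketch of that idea: the application writes one row per event into a reporting table that already has the shape the dashboard needs, so the dashboard query stays trivial. The table `closed_positions`, its columns, and the helper `record_closed_position` are hypothetical, and SQLite stands in for whatever reporting DB we would pick.

```python
import sqlite3


def record_closed_position(reporting: sqlite3.Connection, closed_on: str, quantity: float) -> None:
    """Called by the application whenever a position is closed (hypothetical hook)."""
    reporting.execute(
        "INSERT INTO closed_positions (closed_on, quantity) VALUES (?, ?)",
        (closed_on, quantity),
    )


# Stand-in for the separate reporting database.
reporting = sqlite3.connect(":memory:")
reporting.execute(
    "CREATE TABLE closed_positions (closed_on TEXT NOT NULL, quantity REAL NOT NULL)"
)

record_closed_position(reporting, "2022-03-01", 50.0)
record_closed_position(reporting, "2022-03-01", 25.0)
record_closed_position(reporting, "2022-03-02", 100.0)

# The dashboard query needs no knowledge of the application's internal model.
daily = reporting.execute(
    "SELECT closed_on, SUM(quantity) FROM closed_positions "
    "GROUP BY closed_on ORDER BY closed_on"
).fetchall()
print(daily)  # [('2022-03-01', 75.0), ('2022-03-02', 100.0)]
```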
