digipost-micrometer-prometheus

Common application tag MeterFilter

To include the name of your application as a common tag in all reported metrics, configure this as early as possible, before any meters are bound to your MeterRegistry, e.g. where you create the registry:

MeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
MeterFilters.tryIncludeApplicationNameCommonTag().ifPresentOrElse(
        registry.config()::meterFilter,
        () -> LOG.warn("Unable to include application common tag in MeterRegistry"));

This will try to determine the name of your application by resolving the value of the Implementation-Title key in the MANIFEST.MF of the JAR file containing the class which started your application. You also have the option to provide a class yourself instead of relying on it being discovered automatically. The class should then be located in the JAR whose MANIFEST.MF contains the Implementation-Title you want to use as your application name.

The example above logs a warning if this discovery mechanism fails to resolve your application name. You may handle this however you prefer, e.g. by throwing an exception instead of just logging.
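For instance, failing fast at startup could look like this (a sketch reusing only the calls shown above):

MeterFilters.tryIncludeApplicationNameCommonTag().ifPresentOrElse(
        registry.config()::meterFilter,
        () -> { throw new IllegalStateException("Unable to resolve application name for the common tag"); });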

You can also skip all this automatic discovery, and just supply the name of your application when configuring the filter:

MeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
registry.config().meterFilter(MeterFilters.includeApplicationNameCommonTag("my-application"));

See PR #34 for more examples on how to configure the filter.

Micrometer metrics

Usage with a MeterRegistry:

new ApplicationInfoMetrics().bindTo(meterRegistry);

Application metric with data from MANIFEST.MF or environment variables.

This is what is expected to exist in the manifest, or as key/value environment variables:

Build-Jdk-Spec: 12
Git-Build-Time: 2019-12-19T22:52:05+0100
Git-Build-Version: 1.2.3
Git-Commit: ffb9099

Running on Java 11, this will create the following metric in Prometheus:

# HELP app_info General build and runtime information about the application. This is a static value
# TYPE app_info gauge
app_info{application="my-application",buildNumber="ffb9099",buildTime="2019-12-19T22:52:05+0100",buildVersion="1.2.3",javaBuildVersion="12",javaVersion="11"} 1.0

(Note, application="my-application" will only be included if you configured the "Common application tag MeterFilter" described previously.)

The following metric will be created if no values are present in the manifest or environment variables:

# HELP app_info General build and runtime information about the application. This is a static value
# TYPE app_info gauge
app_info{javaVersion="11"} 1.0

Simple Prometheus server

The SimplePrometheusServer takes a logging method in its constructor, which is used to log when the server is up.

To start the server you need your instance of PrometheusMeterRegistry and a port. The Prometheus metrics will then be available at host:port/metrics.

new SimplePrometheusServer(LOG::info)
        .startMetricsServer(prometheusRegistry, 9610);
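Putting the pieces so far together, a minimal setup could look like this (a sketch that only combines the calls shown above; the port is arbitrary):

PrometheusMeterRegistry prometheusRegistry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

// optional: common application tag, as described earlier
MeterFilters.tryIncludeApplicationNameCommonTag().ifPresentOrElse(
        prometheusRegistry.config()::meterFilter,
        () -> LOG.warn("Unable to include application common tag in MeterRegistry"));

// static build and runtime information (app_info)
new ApplicationInfoMetrics().bindTo(prometheusRegistry);

// expose everything on host:9610/metrics
new SimplePrometheusServer(LOG::info)
        .startMetricsServer(prometheusRegistry, 9610);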

TimedThirdPartyCall

With TimedThirdPartyCall you can wrap your code to get metrics on the call, with extended functionality on top of what Micrometer's Timed gives you.

An example:

BiFunction<MyResponse, Optional<RuntimeException>, AppStatus> warnOnSituation =
        (response, possibleException) -> possibleException.isPresent() || "ERROR_SITUATION".equals(response.data) ? AppStatus.WARN : AppStatus.OK;

TimedThirdPartyCall<MyResponse> getStuff = TimedThirdPartyCallDescriptor
        .create("ExternalService", "getStuff", prometheusRegistry)
        .callResponseStatus(warnOnSituation);

getStuff.call(() -> new MyResponse("ERROR_SITUATION"));

This will produce a number of metrics:

app_third_party_call_total{name="ExternalService_getStuff", status="OK"} 0.0
app_third_party_call_total{name="ExternalService_getStuff", status="WARN"} 1.0
app_third_party_call_total{name="ExternalService_getStuff", status="FAILED"} 0.0
app_third_party_call_seconds_count{name="ExternalService_getStuff"} 1.0
app_third_party_call_seconds_sum{name="ExternalService_getStuff"} 6.6018E-5
app_third_party_call_seconds_max{name="ExternalService_getStuff"} 6.6018E-5

The idea is that Timed only counts executions overall. What we want in addition is finer granularity, to create better alerts in our alerting rig. By specifying a function by which we classify a call as OK/WARN/FAILED, we can exclude error situations we want to ignore from alerts that react to FAILED or to a percentage of FAILED/TOTAL.

You can also use a simple exception-mapper function for a binary OK/FAILED:

TimedThirdPartyCall<String> getStuff = TimedThirdPartyCallDescriptor
        .create("ExternalService", "getStuff", prometheusRegistry)
        .exceptionAsFailure();

String result = getStuff.call(() -> "OK");
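If the wrapped call throws, the invocation is counted with status FAILED. A small sketch (assuming, for illustration, that the exception is propagated to the caller after being recorded):

try {
    getStuff.call(() -> { throw new IllegalStateException("downstream error"); });
} catch (IllegalStateException e) {
    // the failed call above is counted as status="FAILED"; handle or rethrow as appropriate
}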

For timing void functions you can use NoResultTimedThirdPartyCall, acquired by invoking the noResult() method:

NoResultTimedThirdPartyCall voidFunction = TimedThirdPartyCallDescriptor
        .create("ExternalService", "voidFunction", prometheusRegistry)
        .noResult()  // allows timing void function calls
        .exceptionAsFailure();

voidFunction.call(() -> {});

You can also define percentiles (the defaults are 0.5, 0.95 and 0.99):

TimedThirdPartyCallDescriptor
        .create("ExternalService", "getStuff", prometheusRegistry)
        .callResponseStatus(warnOnSituation, 0.95, 0.99);
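With percentiles configured, the scrape output also contains quantile-labelled series, roughly along these lines (an illustration only; the exact series and values will depend on your setup):

app_third_party_call_seconds{name="ExternalService_getStuff", quantile="0.95"} 6.6018E-5
app_third_party_call_seconds{name="ExternalService_getStuff", quantile="0.99"} 6.6018E-5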

MetricsUpdater

Often you want to have metrics that might be slow to get. Examples of this are counting rows in a Postgres database or collecting stats from a keystore. Typically you want some kind of worker thread that updates this value on a regular basis. But how do you know that your worker thread is not stuck? For this you can use the MetricsUpdater class. Create an instance of it and specify the number of threads you want, then register a runnable at an interval.

metricsUpdater.registerAsyncUpdate("count-table", Duration.ofMinutes(10), () -> {
    //Slow count of huge database
});
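A slightly fuller sketch, where the background worker updates a gauge (the MetricsUpdater constructor arguments are an assumption based on the description above, and countRowsInHugeTable() is a hypothetical helper):

AtomicLong tableRows = meterRegistry.gauge("app_table_rows", new AtomicLong(0));

// number of worker threads is given when creating the instance (exact constructor shape is an assumption)
MetricsUpdater metricsUpdater = new MetricsUpdater(meterRegistry, 1);

metricsUpdater.registerAsyncUpdate("count-table", Duration.ofMinutes(10), () ->
        tableRows.set(countRowsInHugeTable())); // slow query, refreshed every 10 minutes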

You can then alert if this update is stale:

- alert: AsyncUpdateScrapeErrors
  expr: app_async_update_scrape_errors > 0
  for: 2m

EventLogger

Sometimes you want to have metrics for some event that happens in your application, and sometimes you want some kind of alert or warning when such events occur at a given rate. This implementation is a way to achieve that in a generic way.

Your application needs to implement the interface AppBusinessEvent. We usually do that with an enum, so that we have easy access to the instance of the event. You can see a complete implementation of this in AppBusinessEventLoggerTest. You can also use the interface AppSensorEvent to add a multiplier score (severity) to an event.

EventLogger eventLogger = new AppBusinessEventLogger(meterRegistry);
eventLogger.log(MyBusinessEvents.VIOLATION_WITH_WARN);

This should produce a prometheus scrape output like this:

# HELP app_business_events_1min_warn_thresholds
# TYPE app_business_events_1min_warn_thresholds gauge
app_business_events_1min_warn_thresholds{name="VIOLATION_WITH_WARN",} 5.0
# HELP app_business_events_total
# TYPE app_business_events_total counter
app_business_events_total{name="VIOLATION_WITH_WARN",} 1.0

You can then use the gauge app_business_events_1min_warn_thresholds to register alerts with your system:

  - alert: MyEvents
    expr: >
      sum by (job,name) (increase(app_business_events_total[5m]))
      >=
      max by (job,name) (app_business_events_1min_warn_thresholds) * 5
    labels:
      severity: warning
    annotations:
      summary: 'High event-count for `{{ $labels.name }}`'
      description: >
        Job: `{{ $labels.job }}`, event: `{{ $labels.name }}`, has a 5m count of `{{ $value | printf "%.1f" }}`

The nice thing here is that by doing the sum by (job, name) you compare only the metrics with the same name. For this example that is VIOLATION_WITH_WARN, which is your unique event name in the system.

LogbackLoggerMetrics

Log-event metrics for a specified Logback logger, with dimensions for level and logger name.

LogbackLoggerMetrics.forRootLogger().bindTo(meterRegistry);
//or
LogbackLoggerMetrics.forLogger("my.logger.name").bindTo(meterRegistry);

This will produce the following prometheus scrape output:

# HELP logback_logger_events_total
# TYPE logback_logger_events_total counter
logback_logger_events_total{application="my-application",level="warn",logger="ROOT",} 0.0
logback_logger_events_total{application="my-application",level="error",logger="ROOT",} 0.0
logback_logger_events_total{application="my-application",level="trace",logger="ROOT",} 0.0
logback_logger_events_total{application="my-application",level="info",logger="ROOT",} 18.0
logback_logger_events_total{application="my-application",level="debug",logger="ROOT",} 0.0

Excluding logger from metric

If you for some reason don't want log events from a specific logger to be included in the metric, it can be excluded:

LogbackLoggerMetrics.forRootLogger()
    .excludeLogger("ignored.logger.name")
    .bindTo(prometheusRegistry);

Thresholds

Metrics for log-level thresholds can also be created with the methods warnThreshold5min and errorThreshold5min.

Example:

LogbackLoggerMetrics.forLogger(Logger.ROOT_LOGGER_NAME)
        .warnThreshold5min(10)
        .errorThreshold5min(5)
        .bindTo(prometheusRegistry);

This will produce the following metrics during prometheus scraping, in addition to the metrics above:

# HELP logger_events_5min_threshold
# TYPE logger_events_5min_threshold gauge
logger_events_5min_threshold{application="my-application",level="warn",logger="ROOT",} 10.0
logger_events_5min_threshold{application="my-application",level="error",logger="ROOT",} 5.0

These metrics can be used for alerting in combination with the metrics above. Prometheus expression:

sum by (job,level,logger) (increase(logback_logger_events_total[5m]))
>=
max by (job,level,logger) (logger_events_5min_threshold)