NIFI-14077 Add ProcessGroup Performance Metrics to Prometheus #9577

Open
wants to merge 1 commit into
base: main

Conversation

esecules
Contributor

@esecules esecules commented Dec 11, 2024

Summary

NIFI-14077

  • Add ProcessGroup Performance Metrics to Prometheus if performance metric collection is available
  • Add total task duration metric to each process group, not just the root process group
  • Add allowable values to the API docs for the producer and includedRegistries parameters so users can see what is available to query without reading the source code.

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

(no new deps)

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

…TAL_TASK_DURATION to ProcessGroups, add allowable values to the documentation of the flow metrics API.
@esecules
Contributor Author

esecules commented Dec 11, 2024

I suspect the macOS system test is flaky, so I am rerunning it on my fork: https://github.com/esecules/nifi/actions/runs/12285436758?pr=3

EDIT:
The run passed on my fork; hopefully it does the same below (thanks for the speedy retry @exceptionfactory!)

@exceptionfactory
Contributor

I suspect the macOS system test is flaky, so I am rerunning it on my fork: https://github.com/esecules/nifi/actions/runs/12285436758?pr=3

EDIT: The run passed on my fork; hopefully it does the same below (thanks for the speedy retry @exceptionfactory!)

You're welcome, I have recently noticed a couple of flaky tests in the system-tests workflow.

@esecules
Contributor Author

esecules commented Dec 12, 2024

Looks like the retry still failed 😔; however, it passes in my fork and on my machine (a MacBook).

Do these tests run in parallel, so that they might interfere with each other?

Or might the tests be running on an underpowered server, where the 60-second timeout isn't sufficient for GitHub Actions? GitHub's macOS runners have only 3 cores, while Ubuntu runners have 4. I am assuming apache/nifi uses the default runners.

@esecules
Contributor Author

Looks like system tests passed on retry! Thanks for all the retries.

Contributor

@exceptionfactory exceptionfactory left a comment

Thanks for proposing these changes @esecules. I'm not completely sure about introducing the additional gauges because the processing performance information can be disabled. However, it looks like the null handling accounts for that fact, so this approach may be sufficient. I noted a few minor recommendations.

Comment on lines +570 to +571
description = "The producer for flow file metrics. Each producer may have its own output format. " +
"Allowed values: [prometheus, json]",
Contributor

Instead of indicating the allowed values in the description, it should be possible to provide a schema that indicates the allowed values.

@@ -275,5 +275,35 @@ public NiFiMetricsRegistry() {
.help("Provenance repository free space in bytes")
.labelNames("instance", "component_type", "component_name", "component_id", "parent_id", "repo_identifier")
.register(registry));

nameToGaugeMap.put("PROCESSING_PERF_CPU_MILLIS", Gauge.build()
.name("nifi_processing_cpu_duration")
Contributor

Although it is longer, including performance in the name would provide better alignment with the source of the information.

Suggested change
-.name("nifi_processing_cpu_duration")
+.name("nifi_processing_performance_cpu_duration")


nameToGaugeMap.put("PROCESSING_PERF_CPU_MILLIS", Gauge.build()
.name("nifi_processing_cpu_duration")
.help("Estimated cpu time (in milliseconds) used by this component")
Contributor

Suggested change
-.help("Estimated cpu time (in milliseconds) used by this component")
+.help("Estimated CPU time (in milliseconds) used by this component")


nameToGaugeMap.put("PROCESSING_PERF_GC_MILLIS", Gauge.build()
.name("nifi_processing_gc_duration")
.help("Estimated gc time (in milliseconds) used by this component")
Contributor

Suggested change
-.help("Estimated gc time (in milliseconds) used by this component")
+.help("Estimated garbage collection time (in milliseconds) used by this component")

@esecules
Contributor Author

Thanks for proposing these changes @esecules. I'm not completely sure about introducing the additional gauges because the processing performance information can be disabled. However, it looks like the null handling accounts for that fact, so this approach may be sufficient. I noted a few minor recommendations.

I'll check if I already added test coverage for when the performance feature is disabled and I'll add it if it's missing.
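
For illustration, the disabled case discussed above could be guarded and exercised roughly as follows. This is only a sketch and not the code from this PR: `PerformanceSample` and `PerformanceGaugeSketch` are hypothetical names, and a plain map stands in for the Prometheus gauges. The gauge names are the ones shown in the diff above, before any rename.

```java
import java.util.Map;

/** Hypothetical stand-in for the optional per-component performance data. */
record PerformanceSample(long cpuMillis, long gcMillis) { }

/** Illustrative only: a plain map stands in for the Prometheus gauges. */
final class PerformanceGaugeSketch {

    private PerformanceGaugeSketch() { }

    /**
     * Copies performance values into the gauge map. When performance
     * collection is disabled the sample is assumed to be null, and the
     * gauges are left untouched.
     */
    static void update(Map<String, Double> gauges, PerformanceSample sample) {
        if (sample == null) {
            return; // collection disabled: nothing to report
        }
        gauges.put("nifi_processing_cpu_duration", (double) sample.cpuMillis());
        gauges.put("nifi_processing_gc_duration", (double) sample.gcMillis());
    }
}
```

A test for the disabled case would pass `null` and check that no gauge was written, then pass a populated sample and check that both values appear.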
