
Test Cases for JMX -> Prom Exporter Regexps #14155

Merged
merged 47 commits into from
Oct 23, 2024

Conversation

suddendust
Contributor

@suddendust suddendust commented Oct 3, 2024

Instructions:

This PR adds test cases verifying that all gauges/timers/meters for each Pinot component are exported with the right names and labels. Since most of these metrics are already in use, the tests validate that they keep getting exported exactly as they are today (to maintain backward compatibility).

Pinot currently uses the JMX->Prom exporter to export metrics to Prometheus. Normally, this runs as a Java agent in the JVM. In this case, however, the tests use four different config files (server.yml, broker.yml, controller.yml and minion.yml). Loading the same agent into the same JVM multiple times isn't allowed, so I had to copy the agent's code to some degree and start up the HTTP servers manually.

Edit: The tests detected that the following metrics were not being exported correctly:

  • pinot_server_luceneIndexingDelayMs
  • pinot_server_luceneIndexingDelayDocs
  • pinot_server_realtimeRowsSanitized: This metric is exported right now, but none of the useful labels (table, tableType, topic or partition) are exported, because it is matched by a catch-all rule that encodes very little information.
pinot_server_realtimeRowsSanitized_Count{app="pinot", apps_kubernetes_io_pod_index="0", cluster_name="pinot", component="pinot-server", component_name="server-default-tenant-1", controller_revision_hash="pinot-server-default-tenant-1-197ef0f2-dcfc7c4c4", heritage="StarTreeOperator", instance="10.168.39.97:8080", job="kubernetes-pods", kubernetes_namespace="managed", kubernetes_pod_name="pinot-server-default-tenant-1-197ef0f2-0", profile_id="197ef0f2", statefulset_kubernetes_io_pod_name="pinot-server-default-tenant-1-197ef0f2-0"}

Compare it to realtimeRowsConsumed, which has table, tableType, topic and partition:

pinot_server_realtimeRowsConsumed_Count{app="pinot", apps_kubernetes_io_pod_index="0", cluster_name="pinot", component="pinot-server", component_name="server-default-tenant-1", controller_revision_hash="pinot-server-default-tenant-1-197ef0f2-dcfc7c4c4", heritage="StarTreeOperator", instance="10.168.39.97:8080", job="kubernetes-pods", kubernetes_namespace="managed", kubernetes_pod_name="pinot-server-default-tenant-1-197ef0f2-0", partition="0", profile_id="197ef0f2", statefulset_kubernetes_io_pod_name="pinot-server-default-tenant-1-197ef0f2-0", table="f_express_order_v2", tableType="REALTIME", topic="stg-confluentproto-midas-express-order-fact"}

The new regex exports this metric correctly.

I have added regexps to server.yml to export these metrics.
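For reference, JMX exporter rules are regexes matched against the flattened MBean name, with capture groups mapped onto the metric name and labels. A sketch of what such a rule could look like for realtimeRowsSanitized, assuming the clientId layout tableNameWithType-partition-topic described above (the MBean naming and capture groups here are illustrative assumptions, not the exact contents of server.yml):

```yaml
rules:
# Illustrative rule: extract table, tableType, partition and topic from the
# consumer clientId embedded in the metric name.
- pattern: "\"org.apache.pinot.common.metrics\"<type=\"ServerMetrics\", name=\"pinot.server.(\\w+)_(OFFLINE|REALTIME)-([0-9]+)-(.+)-realtimeRowsSanitized\"><>(\\w+)"
  name: "pinot_server_realtimeRowsSanitized_$5"
  labels:
    table: "$1"
    tableType: "$2"
    partition: "$3"
    topic: "$4"
```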

@codecov-commenter

codecov-commenter commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.82%. Comparing base (59551e4) to head (2fcd382).
Report is 1215 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14155      +/-   ##
============================================
+ Coverage     61.75%   63.82%   +2.07%     
- Complexity      207     1536    +1329     
============================================
  Files          2436     2626     +190     
  Lines        133233   144721   +11488     
  Branches      20636    22147    +1511     
============================================
+ Hits          82274    92371   +10097     
- Misses        44911    45537     +626     
- Partials       6048     6813     +765     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.77% <ø> (+2.06%) ⬆️
java-21 63.72% <ø> (+2.09%) ⬆️
skip-bytebuffers-false 63.81% <ø> (+2.06%) ⬆️
skip-bytebuffers-true 63.68% <ø> (+35.95%) ⬆️
temurin 63.82% <ø> (+2.07%) ⬆️
unittests 63.82% <ø> (+2.07%) ⬆️
unittests1 55.59% <ø> (+8.70%) ⬆️
unittests2 34.29% <ø> (+6.56%) ⬆️

Flags with carried forward coverage won't be shown.


Contributor

@Jackie-Jiang Jackie-Jiang left a comment


Thanks for adding the test!

Right now all tests are hard-coded and it won't be able to capture newly added metrics automatically. Instead, can we loop over the enums to ensure all the metrics are tested, which is also future proof?

@suddendust
Contributor Author

suddendust commented Oct 3, 2024

can we loop over the enums to ensure all the metrics are tested, which is also future proof?

Yes, that's the ideal way to do it. However, exported metric names are not standardized, so it's not possible to derive them from the enum names (we should standardize them going forward). Further, metrics accept different kinds of arguments for labelling. For example, some accept rawTableName, some accept tableNameWithType and some accept clientId (tableNameWithType-partition-topic). Determining these would have to be done on a case-by-case basis.
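To make the three labelling variants concrete, here is a minimal, self-contained sketch. The clientId layout (tableNameWithType-partition-topic) comes from the description above; the class and helper names are hypothetical, not Pinot code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the three labelling styles described above.
public class MetricLabelStyles {

  // Some metrics are keyed by rawTableName only.
  static Map<String, String> fromRawTableName(String rawTableName) {
    return Map.of("table", rawTableName);
  }

  // Some are keyed by tableNameWithType, e.g. "myTable_REALTIME".
  static Map<String, String> fromTableNameWithType(String tableNameWithType) {
    int idx = tableNameWithType.lastIndexOf('_');
    return Map.of("table", tableNameWithType.substring(0, idx),
                  "tableType", tableNameWithType.substring(idx + 1));
  }

  // Some are keyed by clientId: "<tableNameWithType>-<partition>-<topic>".
  // Split with limit 3 so a topic containing '-' stays intact.
  static Map<String, String> fromClientId(String clientId) {
    String[] parts = clientId.split("-", 3);
    Map<String, String> labels = new HashMap<>(fromTableNameWithType(parts[0]));
    labels.put("partition", parts[1]);
    labels.put("topic", parts[2]);
    return labels;
  }

  public static void main(String[] args) {
    // {table=myTable, tableType=REALTIME, partition=3, topic=myTopic} (map order may vary)
    System.out.println(fromClientId("myTable_REALTIME-3-myTopic"));
  }
}
```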

@suddendust
Contributor Author

suddendust commented Oct 3, 2024

it won't be able to capture newly added metrics automatically

For this, I have added a check in each test case that asserts on the count of exported metrics. Any newly added metric would cause this check to fail. It's not foolproof, but it does provide a basic safety net. Perhaps I can strengthen it further with a check that, for each enum value, there is an exported metric containing the enum string in some form.
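A self-contained sketch of that strengthened check; the enum values and exported names here are made up for illustration, and the real test would iterate ServerMeter etc. instead:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: report enum values with no matching exported metric.
public class EnumCoverageCheck {

  // Stand-in for an enum like ServerMeter.
  enum Meter { REALTIME_ROWS_CONSUMED, REALTIME_ROWS_SANITIZED }

  // Compare case-insensitively with underscores stripped, since exported
  // names use camelCase while enum constants use UPPER_SNAKE_CASE.
  static List<Meter> missing(List<String> exportedNames) {
    return Arrays.stream(Meter.values())
        .filter(m -> exportedNames.stream()
            .noneMatch(n -> n.toLowerCase(Locale.ROOT)
                .contains(m.name().replace("_", "").toLowerCase(Locale.ROOT))))
        .toList();
  }

  public static void main(String[] args) {
    List<String> exported = List.of("pinot_server_realtimeRowsConsumed_Count");
    // REALTIME_ROWS_SANITIZED has no matching exported metric
    System.out.println(missing(exported)); // prints "[REALTIME_ROWS_SANITIZED]"
  }
}
```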

@gortiz
Contributor

gortiz commented Oct 7, 2024

Where is the agent being added?

Contributor

@soumitra-st soumitra-st left a comment


LGTM

Comment on lines 103 to 128
public void setupTest()
    throws Exception {
  // read test configuration
  JsonNode jsonNode = JsonUtils.DEFAULT_READER.readTree(loadResourceAsString("metrics/testConfig.json"));
  _exporterConfigsParentDir = jsonNode.get(CONFIG_KEY_JMX_EXPORTER_PARENT_DIR).textValue();
  _pinotComponentToConfigFileMap.put(PinotComponent.CONTROLLER,
      jsonNode.get(CONFIG_KEY_CONTROLLER_CONFIG_FILE_NAME).textValue());
  _pinotComponentToConfigFileMap.put(PinotComponent.SERVER,
      jsonNode.get(CONFIG_KEY_SERVER_CONFIG_FILE_NAME).textValue());
  _pinotComponentToConfigFileMap.put(PinotComponent.BROKER,
      jsonNode.get(CONFIG_KEY_BROKER_CONFIG_FILE_NAME).textValue());
  _pinotComponentToConfigFileMap.put(PinotComponent.MINION,
      jsonNode.get(CONFIG_KEY_MINION_CONFIG_FILE_NAME).textValue());

  String pinotMetricsFactory = jsonNode.get(CONFIG_KEY_PINOT_METRICS_FACTORY).textValue();
  switch (pinotMetricsFactory) {
    case "YammerMetricsFactory":
      _pinotMetricsFactory = new YammerMetricsFactory();
      break;
    case "DropwizardMetricsFactory":
      _pinotMetricsFactory = new DropwizardMetricsFactory();
      break;
    default:
      throw new IllegalArgumentException("Unknown metrics factory specified in test config: " + pinotMetricsFactory
          + ", supported ones are: YammerMetricsFactory and DropwizardMetricsFactory");
  }
Contributor


I don't understand this code. Assuming in the future we test both Yammer and Dropwizard... how do you expect we should write the tests? It looks like it is reading some config from testConfig.json, but it is not possible to read two different files, so... why is this better than forcing Yammer in the code?

Contributor


Just to be clear. Due to the limitations of TestNG, what I would expect is to have 1 test class per component (broker, controller, etc) and per metric registry (Yammer, Dropwizard, etc).

In order to have so, I would expect PinotPrometheusMetricsTest to be abstract and have two abstract methods:

  1. PinotComponent getComponent(), as explained in another comment.
  2. PinotMetricsFactory getPinotMetricsFactory(), that returns the registry to be used.

Then for each component we may have another abstract class (like BrokerPrometheusMetricsTest) that inherits PinotPrometheusMetricsTest, implements getComponent() and includes all test methods; and finally two non-abstract classes, YammerBrokerPrometheusMetricsTest and DropwizardBrokerPrometheusMetricsTest, which extend BrokerPrometheusMetricsTest and only implement getPinotMetricsFactory().

Something like:

classDiagram
  class PinotPrometheusMetricsTest {
  }
  class BrokerPrometheusMetricsTest {
    getComponent
  }
  class DropwizardBrokerPrometheusMetricsTest {
    getPinotMetricsFactory
  }
  class YammerBrokerPrometheusMetricsTest {
    getPinotMetricsFactory
  }
  class ServerPrometheusMetricsTest {
    getComponent
  }
  class DropwizardServerPrometheusMetricsTest {
    getPinotMetricsFactory
  }
  class YammerServerPrometheusMetricsTest {
    getPinotMetricsFactory
  }

  PinotPrometheusMetricsTest <|-- BrokerPrometheusMetricsTest
  BrokerPrometheusMetricsTest <|-- DropwizardBrokerPrometheusMetricsTest
  BrokerPrometheusMetricsTest <|-- YammerBrokerPrometheusMetricsTest

  PinotPrometheusMetricsTest <|-- ServerPrometheusMetricsTest
  ServerPrometheusMetricsTest <|-- DropwizardServerPrometheusMetricsTest
  ServerPrometheusMetricsTest <|-- YammerServerPrometheusMetricsTest
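The proposed hierarchy could be sketched in Java as follows. The class names follow the diagram; the return types and bodies are illustrative stubs (strings standing in for the real PinotComponent and PinotMetricsFactory types):

```java
// Illustrative skeleton of the proposed test hierarchy.
abstract class PinotPrometheusMetricsTest {
  abstract String getComponent();
  abstract String getPinotMetricsFactory();
}

abstract class BrokerPrometheusMetricsTest extends PinotPrometheusMetricsTest {
  @Override String getComponent() { return "broker"; }
  // ...all broker test methods live here...
}

class YammerBrokerPrometheusMetricsTest extends BrokerPrometheusMetricsTest {
  @Override String getPinotMetricsFactory() { return "Yammer"; }
}

class DropwizardBrokerPrometheusMetricsTest extends BrokerPrometheusMetricsTest {
  @Override String getPinotMetricsFactory() { return "Dropwizard"; }
}

public class TestHierarchySketch {
  public static void main(String[] args) {
    // Each concrete class combines one component with one metrics factory.
    System.out.println(new YammerBrokerPrometheusMetricsTest().getComponent()); // prints "broker"
  }
}
```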

Contributor Author

@suddendust suddendust Oct 22, 2024


Thanks for the detailed comment, have restructured the code as discussed.

@@ -107,6 +107,10 @@
</plugins>
</build>
<dependencies>
<dependency>
Collaborator


Should the dependencies added here be in test scope ?

Contributor Author


It's already in test scope in the parent pom:

     <!-- JMX exporter-->
      <dependency>
        <groupId>io.prometheus.jmx</groupId>
        <artifactId>jmx_prometheus_javaagent</artifactId>
        <version>0.19.0</version>
        <scope>test</scope>
      </dependency>


private String _exporterConfigsParentDir;

private final Map<PinotComponent, String> _pinotComponentToConfigFileMap = new HashMap<>();
Contributor


This attribute is only used to do _pinotComponentToConfigFileMap.get(pinotComponent), where pinotComponent is defined by getPinotComponent(), which is an abstract method of this class. Instead I suggest to create an abstract getPrometheusConfigFile() we can implement on each test. For example, if yammer and dropwizard end up having different config files (which seems to be the solution for the future), we could override the method on each test class to return the specific prometheus file for each one.

Contributor Author


This is addressed, thanks!

Contributor

@gortiz gortiz left a comment


Apart from the (non blocker) comment I've just added, LGTM

//all exported server metrics have this prefix
private static final String EXPORTED_METRIC_PREFIX = "pinot_server_";

private static final List<ServerMeter> METERS_ACCEPTING_CLIENT_ID =
Collaborator


This is not DRY. If a new metric is added, then it has to be added to one of these lists? I am sure that will be missed.
Couple of options:

  • Add this metadata to ServerMeter etc
  • Check that every metric defined is in one of the lists.

Contributor Author


Any new metric that'll be added will have tableNameWithType supplied to it. Here's the relevant piece of code for ServerMeter:

    } else {
      if (METERS_ACCEPTING_CLIENT_ID.contains(serverMeter)) {
        addMeterWithLabels(serverMeter, CLIENT_ID);
        assertMeterExportedCorrectly(serverMeter.getMeterName(),
            ExportedLabels.PARTITION_TABLENAME_TABLETYPE_KAFKATOPIC);
      } else if (METERS_ACCEPTING_RAW_TABLE_NAMES.contains(serverMeter)) {
        addMeterWithLabels(serverMeter, ExportedLabelValues.TABLENAME);
        assertMeterExportedCorrectly(serverMeter.getMeterName(), ExportedLabels.TABLENAME);
      } else {
        //we pass tableNameWithType to all remaining meters
        addMeterWithLabels(serverMeter, TABLE_NAME_WITH_TYPE);
        assertMeterExportedCorrectly(serverMeter.getMeterName(), ExportedLabels.TABLENAME_TABLETYPE);
      }

We explicitly have to create lists like METERS_ACCEPTING_CLIENT_ID and METERS_ACCEPTING_RAW_TABLE_NAMES because this is how these metrics are used in the code right now. Ideally, we should be using tableNameWithType everywhere instead of rawTableName. However, to maintain backward compatibility with metrics as they are currently used, I have to filter them explicitly.

Now suppose a new metric is added for which the user expects partition as one of the exported labels. If they do not add an explicit check for it in the test, it will fall into the else block (with ExportedLabels.TABLENAME_TABLETYPE) and the test will pass, ensuring that at least the table and tableType labels are present. We'll need to document that if you want some special label, you must add an explicit test case for it; otherwise you'll only get table and tableType.



/**
* Disabling tests as Pinot currently uses Yammer and these tests fail for {@link DropwizardMetricsFactory}
Collaborator


Remove DropWizard tests instead ?

Contributor Author


Since we have two metric registries right now, we need tests for both. The only reason they're disabled is that they fail for Dropwizard. I have kept them in to make it explicit that we want these tests, but they are disabled because the Dropwizard code path seems broken.

@Jackie-Jiang Jackie-Jiang merged commit 09e8812 into apache:master Oct 23, 2024
22 of 23 checks passed
6 participants