Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update datasource metadata queries #1322

Merged
merged 3 commits into from
Oct 16, 2024

Conversation

shreyabiradar07
Copy link
Contributor

@shreyabiradar07 shreyabiradar07 commented Oct 4, 2024

Description

This PR has the following changes:

  • Updates NAMESPACE_QUERY, WORKLOAD_INFO_QUERY, CONTAINER_INFO_QUERY dsmetadata queries to provide the average namespace, workload and container status over the past 15 days.

  • Also includes namespace label in CONTAINER_INFO query to perform the join operation on (pod, namespace) with the pod to handle the case where same workloads are present in different namespaces.

  • Handles an edge case - wherein for workload type job with no containers was earlier resulting in NullPointerException, which is now fixed.

Fixes # (1316)

Type of change

  • Bug fix
  • Enhancement
  • New feature
  • Docs update
  • Breaking change (What changes might users need to make in their application due to this PR?)
  • Requires DB changes

How has this been tested?

Please describe the tests that were run to verify your changes and steps to reproduce. Please specify any test configuration required.

  • New Test X
  • Functional testsuite

Tested by running local monitoring demos on ResourceHub cluster - ./local_monitoring_demo.sh -c openshift -i quay.io/shbirada/update_dsmetadata_queries:v1

Fixed the NullPointerException seen before:

Error while converting DataSourceMetadata Object to KruizeDSMetadataEntry table due to Cannot invoke "java.util.HashMap.values()" because the return value of "com.autotune.common.data.dataSourceMetadata.DataSourceWorkload.getDataSourceContainerHashMap()" is null

java.lang.NullPointerException: Cannot invoke "java.util.HashMap.values()" because the return value of "com.autotune.common.data.dataSourceMetadata.DataSourceWorkload.getDataSourceContainerHashMap()" is null

pod logs after handling the edge case: https://privatebin.corp.redhat.com/?87922ed8fdb9df55#D6FBWxRYbKHTYEsTEcts1D3aNGtmb3jgEe3eXhf5FU2F

pod logs with Exception - https://privatebin.corp.redhat.com/?527498b097f364ac#F93EzwiZG121QPcc73FKk8NU9bCTqzLMgSsbPsHuoFVS

Test Configuration

  • Kubernetes clusters tested on: ResourceHub OpenShift cluster

Checklist 🎯

  • Followed coding guidelines
  • Comments added
  • Dependent changes merged
  • Documentation updated
  • Tests added or updated

Additional information

docker image - quay.io/shbirada/update_dsmetadata_queries:v1

@shreyabiradar07 shreyabiradar07 self-assigned this Oct 4, 2024
@shreyabiradar07 shreyabiradar07 added bug Something isn't working local_monitoring labels Oct 4, 2024
@shreyabiradar07 shreyabiradar07 marked this pull request as ready for review October 4, 2024 08:11
@shreyabiradar07 shreyabiradar07 added this to the Kruize 0.0.26 Release milestone Oct 4, 2024
Copy link
Contributor

@kusumachalasani kusumachalasani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query changes looks good.

As discussed, please update the conditions on when the containers can be null for a workload.

Copy link
Contributor

@kusumachalasani kusumachalasani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we have CONTAINER_INFO_QUERY which has both container and workload details I feel it is redundant to run WORKLOAD_INFO_QUERY and also we can avoid running an extra for loop for workload -

for (DataSourceWorkload dataSourceWorkload : dataSourceNamespace.getDataSourceWorkloadHashMap().values()) {

This can help in reducing the time taken to get the datasource details and also avoids redundancy.

Copy link
Contributor

@kusumachalasani kusumachalasani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query changes LGTM!

@dinogun
Copy link
Contributor

dinogun commented Oct 8, 2024

@shreyabiradar07 Please resolve the conflicts

@dinogun
Copy link
Contributor

dinogun commented Oct 15, 2024

@shreyabiradar07 Can you please rebase on top of mvp_demo and test

@shreyabiradar07 shreyabiradar07 force-pushed the update-dsmetadata-queries branch from dfb7392 to f31d0d3 Compare October 15, 2024 13:45
@shreyabiradar07
Copy link
Contributor Author

Local monitoring demo response : https://privatebin.corp.redhat.com/?6fb5876b4ae8e63f#AoXBYDV9BTfX15vYDexamQYrnbHpWiMBkuq7BAZzaAzX

Encountering the following assertion error for rest_apis/test_list_metadata.py local monitoring tests:

assert [<ValidationError: "'containers' is a required property">, <ValidationError: "'containers' is a required property">, <ValidationError: ''>] == ''

Failed test logs : https://privatebin.corp.redhat.com/?cd5f4646e2d1f0e3#41XkaVvktuBJXmQDm4F11WTvfPXq4Rh7DRN164LyGUhQ

@@ -64,7 +64,7 @@
}
}
},
"required": ["workload_name", "workload_type", "containers"]
"required": ["workload_name", "workload_type"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is containers removed?

Copy link
Contributor Author

@shreyabiradar07 shreyabiradar07 Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below assertion error is encountered in edge cases where there are no containers for a job workload type as "containers" is a required field:

assert [<ValidationError: "'containers' is a required property">, <ValidationError: "'containers' is a required property">, <ValidationError: ''>] == ''

Following are the few examples for edge cases observed after updating the datasource metadata queries to gather metric data for last 15 days:

"collect-profiles-28807155": {
                  "workload_name": "collect-profiles-28807155",
                  "workload_type": "job"
                },
"image-pruner-28802880": {
                  "workload_name": "image-pruner-28802880",
                  "workload_type": "job"
                }

Failed test logs : https://privatebin.corp.redhat.com/?cd5f4646e2d1f0e3#41XkaVvktuBJXmQDm4F11WTvfPXq4Rh7DRN164LyGUhQ

@shreyabiradar07
Copy link
Contributor Author

As we have CONTAINER_INFO_QUERY which has both container and workload details I feel it is redundant to run WORKLOAD_INFO_QUERY and also we can avoid running an extra for loop for workload -

for (DataSourceWorkload dataSourceWorkload : dataSourceNamespace.getDataSourceWorkloadHashMap().values()) {

This can help in reducing the time taken to get the datasource details and also avoids redundancy.

As suggested created issue #1345 to work on the enhancement of performance and queries, will work on the same in the next release

@shreyabiradar07
Copy link
Contributor Author

The query changes looks good.

As discussed, please update the conditions on when the containers can be null for a workload.

Summarizing the steps followed to capture conditions when the containers can be null for a workload.

Tried importing metadata for 3 scenarios:

  • before running a job
  • when the job is in running state
  • after completion of job

but couldn't capture the case - where containers are null

Next steps:

  • Delete the existing job.
  • After 15 mins, run the workload query and check if the workload related to job exists.
  • Run the container query.

after 15m & 30m job data for both workload and container queries was present

container=null was observed specifically for namespace openshift-operator-lifecycle-manager , workload type job for example: 'collect-profiles-28795995

@kusumachalasani pointed out that collect-profiles jobs are very short lived (its ~ 4sec duration). So, the container metrics related to it doesn't have enough time to get it exported to prometheus because of which container details are null.

Copy link
Contributor

@chandrams chandrams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dinogun dinogun merged commit 6f88eea into kruize:mvp_demo Oct 16, 2024
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working local_monitoring
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants